Ever since I got a BlackBerry 8900 with its 3.2-megapixel camera, I’ve been busy taking photos, randomly at times, and uploading them to my Facebook account to share with the 2,000 or so of my closest friends. Apparently I’m just one of the millions of people who together upload nearly 220 million images to Facebook every week.
In a blog post today, Facebook shares some secrets of its photo infrastructure, which is based on its core innovation, Haystack. First let me give you some fun facts about Facebook Photos, which will help you understand why what they’ve done is so impressive.
- Facebook users have uploaded more than 15 billion photos to date, making it the biggest photo-sharing site on the web.
- For each uploaded photo, Facebook generates and stores four images of different sizes, which translates into a total of 60 billion images and 1.5 petabytes of storage.
- Facebook adds 220 million new photos per week, or roughly 25 terabytes of additional storage. (A quick sanity check of these numbers follows this list.)
- At peak, Facebook serves 550,000 images per second. (See more in this video.)
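Those figures hang together: my own back-of-the-envelope arithmetic (not Facebook’s) puts the average stored image at roughly 25 KB, which also lines up with the quoted weekly growth.

```python
# Back-of-the-envelope check of the numbers above (the figures come from the post,
# the per-image average is my own arithmetic).
TOTAL_PHOTOS = 15e9                      # photos uploaded to date
SIZES_PER_PHOTO = 4                      # Facebook stores four sizes per upload
TOTAL_IMAGES = TOTAL_PHOTOS * SIZES_PER_PHOTO        # ~60 billion images
TOTAL_STORAGE_BYTES = 1.5e15             # ~1.5 petabytes

avg_image_bytes = TOTAL_STORAGE_BYTES / TOTAL_IMAGES
print(f"Average stored image: ~{avg_image_bytes / 1e3:.0f} KB")      # ~25 KB

WEEKLY_UPLOADS = 220e6
weekly_growth = WEEKLY_UPLOADS * SIZES_PER_PHOTO * avg_image_bytes
print(f"Weekly growth: ~{weekly_growth / 1e12:.0f} TB")              # ~22 TB, near the quoted 25 TB
```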
Growth at such speeds made it almost impossible for Facebook to solve the scaling problem by throwing more hardware at it; they needed a more creative solution. Enter Doug Beaver, Peter Vajgel and Jason Sobel, three Facebook engineers who came up with the idea of the Haystack Photo Infrastructure.
“What we needed was something that was fast and had the ability to back up data really fast,” said Beaver in an interview earlier today. The concept they came up with was pretty simple and yet very powerful. “Think of Haystack as a service that runs on another file system,” explained Beaver. It is a system that does only one thing, photos, and does it very well. From the Facebook blog post:
The new photo infrastructure merges the photo serving tier and storage tier into one physical tier. It implements a HTTP based photo server, which stores photos in a generic object store called Haystack. The main requirement for the new tier was to eliminate any unnecessary metadata overhead for photo read operations, so that each read I/O operation was only reading actual photo data (instead of filesystem metadata).
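To make that concrete, here is a minimal sketch, my own illustration rather than Facebook’s code, of an HTTP photo server that answers requests straight out of one big store file. The haystack.store filename and the photo_index entries are invented for the example; the point is that a request costs one in-memory lookup plus one read of the photo bytes, with no filesystem metadata in the path.

```python
# Toy sketch of an HTTP photo server that reads straight out of one large store
# file using an in-memory index. My own illustration, not Facebook's code.
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

STORE_PATH = "haystack.store"                 # one big append-only file (invented name)
photo_index = {"/photo/12345": (0, 24_576)}   # URL path -> (byte offset, size); example entry
store_fd = os.open(STORE_PATH, os.O_RDONLY)

class PhotoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        entry = photo_index.get(self.path)
        if entry is None:
            self.send_error(404)
            return
        offset, size = entry
        data = os.pread(store_fd, size, offset)   # one read: the photo bytes, nothing else
        self.send_response(200)
        self.send_header("Content-Type", "image/jpeg")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("", 8080), PhotoHandler).serve_forever()
```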
The Haystack infrastructure consists of commodity servers. Again, from the post:
Haystack is deployed on top of commodity storage blades. The typical hardware configuration of a 2U storage blade is: 2 x quad-core CPUs, 16GB – 32GB memory, hardware raid controller with 256MB – 512MB of NVRAM cache and 12+ 1TB SATA drives
Typically when you upload photos to a photo-sharing site, each image is stored as a separate file and as a result has its own filesystem metadata, overhead that gets magnified many times over when there are millions of files. This imposes severe limitations, so most sites end up relying on content delivery networks to serve photos, a very costly proposition. By comparison, the Haystack object store sits on top of those commodity storage blades. Each photo is akin to a needle and carries a small amount of information (its identity and location) with it. (Finding a photo is akin to finding a needle in a haystack, hence the name of the system.)
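Here is a rough sketch of what writing such a needle could look like; the header layout and file name are invented for illustration and are not Facebook’s actual on-disk format. The idea is simply that each photo is appended to one large store file along with its identity, and the offset where its bytes landed is remembered so it can be found again.

```python
# Sketch of appending a photo as a "needle": identity + size header, then the bytes.
# The >QI header layout and the haystack.store name are invented for this illustration.
import struct

NEEDLE_HEADER = struct.Struct(">QI")   # 8-byte photo key, 4-byte data size (big-endian)

def append_needle(store, photo_key: int, photo_bytes: bytes) -> tuple[int, int]:
    """Append one needle to the open store file; return (data offset, size) for the index."""
    store.write(NEEDLE_HEADER.pack(photo_key, len(photo_bytes)))
    data_offset = store.tell()          # where the photo bytes begin
    store.write(photo_bytes)
    return data_offset, len(photo_bytes)

# Usage: remember where each photo landed so it can later be read back in one I/O.
with open("haystack.store", "ab") as store:
    location = append_needle(store, 12345, b"\xff\xd8 ...jpeg bytes...")
```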
That information is in turn used to build an index file, and a copy of the index is kept in the system’s memory, making it very easy to locate and serve files at lightning-fast speeds. This rather simple-sounding approach means the system needs only about a third of the I/O operations typically required, and therefore only about a third of the hardware resources, which translates into tremendous cost savings for Facebook, especially considering how fast they’re growing.
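Continuing the sketch above (again my own illustration with an invented record layout, not Facebook’s format): the (key, offset, size) triples can be written to a compact index file, and loading that file into a dictionary at startup is what makes serving a photo cost a single disk read, instead of the several metadata lookups a file-per-photo layout would require.

```python
# Sketch of the index: fixed-size (key, offset, size) records in a side file,
# loaded into memory at startup so each photo read costs exactly one disk I/O.
import os
import struct

INDEX_RECORD = struct.Struct(">QQI")   # 8-byte key, 8-byte data offset, 4-byte size (invented layout)

def write_index_entry(index_file, photo_key: int, offset: int, size: int) -> None:
    index_file.write(INDEX_RECORD.pack(photo_key, offset, size))

def load_index(path: str) -> dict[int, tuple[int, int]]:
    """Read the whole index file into a dict: photo key -> (offset, size)."""
    index = {}
    with open(path, "rb") as f:
        while chunk := f.read(INDEX_RECORD.size):
            key, offset, size = INDEX_RECORD.unpack(chunk)
            index[key] = (offset, size)
    return index

def read_photo(store_fd: int, index: dict[int, tuple[int, int]], photo_key: int) -> bytes:
    offset, size = index[photo_key]          # in-memory lookup, no disk access
    return os.pread(store_fd, size, offset)  # single read returns only the photo bytes
```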
Next time I upload a photo, I will be sure to remember that.
Photo courtesy of Flickr.
Even more impressive is how fast you can key through Facebook photos. The next few photos appear to be pre-loaded.
Now I wish they’d start displaying some of the EXIF metadata that digital cameras store in the image files. It’s odd that even now they don’t extract the photo’s timestamp or location.
Guess they forgot to mention how much it costs them to store each photo.
Nice article, Om, with the Haystack details.
However, I’m surprised that if they really have 15B photos with 4 copies of each (60B images), it comes to only 1.5 petabytes of storage… I have heard that companies like Shutterfly (which has far fewer images, less than a fifth as many) uses more than 5 to 6 petabytes of storage.
It could come down to the size of the images they store; Facebook might be keeping them at a much lower resolution. But that also means they are going to have a real tough time monetizing all those pictures, since they might not be of print-worthy quality. I’m sure traffic through all those pics with advertising is worth something, but if people upload 220M images a week… I see a missed goldmine of a monetization opportunity.
@Raghu They definitely shrink them a lot and they even do it locally so as not to incur the bandwidth costs of uploading full high-res images. I typically upload pictures to flickr and Facebook at the same time and the FB upload takes a small fraction of the time.
Here’s a post I wrote a while back comparing Facebook photos and flickr:
http://blog.agrawals.org/2007/10/10/the-power-of-the-social-graph/
With regard to monetization, I think they know their audience. We don’t believe in print. 🙂 The last time I printed a picture was years ago. I’ve got pictures online, pictures on my iPhone, pictures for a screensaver. I love pictures.
How Facebook Serves Up Its 15 Billion Photos: a techie must-read. See http://mashable.com/2009/04/30/facebook-photo-sharing/ and the details here: http://www.facebook.com/FacebookEngineering#/note.php?note_id=76191543919&ref=mf
If anyone is interested in easier photo uploads, the guys at http://www.tagle.it are running a beta trial for a photo uploader that sends photos (and tag metadata) to Facebook, Flickr, Picasa and other sharing sites. The interface could use some work, but a big plus is that it keeps track of what was uploaded.
See http://www.tagle.it/photouploader