Updated with Statement from Amazon: Amazon’s S3 cloud storage service went offline this morning for an extended period of time — the second big outage at the service this year. In February, Amazon suffered a major outage that knocked many of its customers offline.
It was no different this time around. I first learned about today’s outage when avatars and photos (stored on S3) used by Twinkle, a Twitter client for iPhone, vanished.
My big hope was that it would come back soon, but popular S3 clients such as SmugMug were offline for more than eight hours — an awfully long time for Amazon’s Web Services division to bring back the service. As our sister blog, WebWorkerDaily, points out:
With two relatively serious outages in the space of 6 months, some will be asking the question of why depend on S3? The answer is simple: the rates are hard to beat, especially for service that doesn’t require any sysadmin budget.
That said, the outage shows that cloud computing still has a long road ahead when it comes to reliability. NASDAQ, Activision, Business Objects and Hasbro are some of the large companies using Amazon’s S3 Web Services. But even as cloud computing starts to gain traction with companies like these and most of our business and communication activities are shifting online, web services are still fragile, in part because we are still using technologies built for a much less strenuous web.
Update: Antonio Rodriguez, founder of Tabblo, now part of HP, asks the pertinent $64,000 question on his blog:
…if AWS is using Amazon.com’s excess capacity, why has S3 been down for most of the day, rendering most of the profile images and other assets of Web 2.0 tapestry completely inaccessible while at the same time I can’t manage to find even a single 404 on Amazon.com? Wouldn’t they be using the same infrastructure for their store that they sell to the rest of us?
Update #2: Building an offline redundancy for Amazon S3 could be a big opportunity, Dave Winer says.
Update #3: A reader sent me an email and asked these two questions:
- Is the system designed to be fault tolerant? If yes, then how did it go down? After all they must have massive arrays and mirrors of their storage infrastructure.
- Is this a hardware failure or a software/design problem?
Random Thought: The S3 outage points to a bigger issue. The cloud has many points of failure: routers crashing, cables getting accidentally cut, load balancers getting misconfigured, or simply bad code.
Update/Statement from Amazon in response to our questions:
As a distributed system, the different components of S3 need to be aware of the state of each other. For example, this awareness makes it possible for the system to decide which redundant physical storage server to route a request to.
We experienced a problem with those internal system communications, leaving the components unable to interact properly, and customers unable to successfully process requests. After exploring several alternatives, the team determined it had to take the service offline to restore proper communication and then bring service online again.
These are sophisticated systems and it generally takes a while to get to root cause in such a situation—we will be providing our customers with more information when we’ve fully investigated the incident. We’re proud of our operational performance in operating S3 for almost 2.5 years, and our customers have generally been pleased with the reliability and performance of the service. But any downtime is unacceptable and we won’t be satisfied until it is perfect.
Amazon S3 is used heavily by a number of services behind Amazon’s retail websites. Those services were impacted, but the retail website did not show noticeable problems because it mostly uses cached data.
73 thoughts on “S3 Outage Highlights Fragility of Web Services”
Good report Om.
Maybe the problem isn’t only with the webservices. The problem may be with those who hire an extremely cheap service expecting to have the same uptime of a more expensive service.
I saw Scoble interview with a guy with HP research and he pointed out that to make cheaper and greener data centers we should review the way we work on SLAs. Why have an extremely reliable (and expensive) platform when you are not storing and processing extremely vital data? Twitter profile pictures are not so vital, are they?
I work on a small startup that is moving part of its databases and backups to Amazon. For the price, we don’t expect a 100% SLA at all. For the parts of our applications where we need all kinds of redundancies in place, we know we will keep paying more.
Sao Paulo – Brazil
Amazon never said that they shared the AWS infrastructure with Amazon e-commerce; the man said that they used the same design (on his high availability blog).
The industry needs not only off-line mirroring of S3 type services, but third party certification and a pool of insurers for business continuity ratings and policies.
Did you just give away a business idea 🙂
Also, I think the point Antonio is making is the “legend” part of the story. It would be worth asking them the tough question. 🙂
So, sometimes, your “real” hosting platform goes down. It just happens…
You complain, and you get a (usually tiny) refund. I think the real $64,000 question is:
Do you get better availability with AWS or your hosted platform?
(I don’t use AWS, but am interested in the responses)
I like these types of writeups; just from the title I knew it’s one by Om and hence worth reading.
To answer Antonio Rodriguez’s question: At CloudCamp a few weeks/months ago I asked the Amazon WS guy this exact question. His natural answer, which is what we all know and fear, is that even though S3/EC2 are the result of work on Amazon.com’s own infrastructure, they are in fact separate services which Amazon doesn’t yet entrust with its own web assets.
Regardless of this I believe these services have a relatively great uptime and that they will improve. The community is reacting to AWS current position by using them only for background computational tasks and hosting stuff like avatars/images. SmugMug took a risk and got burned. When they get better we’ll see entire deployments moving over.
Interesting times. I am sure that every IT department has experienced serious outages ranging from 4 to 8 hours or greater, and at what cost. SLAs will become imperative to future SaaS-based technologies, and how we measure SLAs will also change. I believe that Amazon is just touching the tip of the iceberg with S3 and that they will continue to build the redundancy, improve the architecture and moderate SLAs to support the future of web computing.
In working with CIOs around the world, we have been talking about the power of smart technologies like S3 and of companies like Salesforce.com and OpSource that are creating real value for their customers. Given the history of Amazon, I would expect them to minimize the downtimes and provide an overarching architecture to support not only the future growth but also the reliability that Amazon has proven over the last decade.
McNealy and Ellison were the proponents of Network is the Computer – now called cloud computing. Grove said it would never work. Do not remember Steve ever talking about it.
The issue is of trust and reliability- localized storage/computing is generally better. Having said that the market will be there.
I think Bezos is having a lot of fun with Amazon. He is the number 1 entrepreneur, even ahead of the guy who started eBay. Take that, Google and Yahoo boys.
I think you might be onto something, but I think McNealy was way ahead of these guys. I think the subtle difference is that now “network is the corporation”
I think cloud computing evolution is still in the “first inning.”
Flavio has a good point. Are people not getting what they pay for here?
After all, the core of the internet is fully redundant and if the connection is that important you can have dual connections to it from your office (at a price).
So the Amazon problem looks to be one of 2 things.
1. The system has been built on the cheap to meet a very low price point with an SLA to match. Yet people have signed up to this service knowing that there is no SLA.
2. Amazon have built this thing. It is not their core business and as such, they really are not the best people to build and deliver enterprise class services (yet). They do not have enough experience. And building your own large platform does not automatically qualify you to build large systems for other people.
S3 is NOT an enterprise application. It is NOT designed for VERY important data.
If it was, it would carry an SLA to match.
I think it would be really interesting to compare uptime versus level of control and cost.
For example, if you pay for and support your own data-centres, can you achieve better uptime than what you get from S3 and EC2? Does the fact that you “control” the data-centres allow you to get back online faster than Amazon can fix S3 or EC2?
Note that your costs in going it alone are more than just data-centre space, hardware and admin, because you’ll have to architect your own storage solution as well, designing in fault tolerance, scaling, etc.
The reason Amazon.com didn’t have 404’s is the same reason WordPress.com didn’t have 404’s… they built their site the right way… redundant. Twitter and friends don’t have that level of redundancy.
I did a little research on it last night in a blog post (click my name to visit my blog… won’t spam it here).
I’m a little disappointed to see someone raise that question, as anyone in the business should be well aware of cost/benefits in terms of infrastructure and how to be redundant.
Grids don’t count as redundancy. They are just a higher reliability system. 2 separate grids can be considered redundant.
I don’t think that is really the $64,000 question; redundancy fails as well. Look at Microsoft, eBay and Google: they have all built entirely redundant platforms but they still suffer downtime. This is an inevitable fact in the hosting world and not something people should be concerned about. The things people should be concerned about are A) how much downtime and B) how your provider responds.
Web services on the cloud are perfectly reliable these days; compare them to the physical machines you are on. Our physical gadgets have tons of downtime but I don’t see anyone shunning their Dell laptop because of it.
This is scary for anyone counting on Amazon for their hosting without redundancy. Also, their support for AWS is terrible. When you call them it takes you to the Amazon.com support center and they have no clue about AWS. It’s a lot like Google Apps (from a support point of view). No direct line, and email support is very poor.
I think the redundancy point is key. You can’t rely on S3 without having a failover option. Maybe this will boost business for Nirvanix as a backup option. Although at that point does it make sense to build your own system?
@Ross, the issue with owning my own system is that it’s a single point of failure for only my own site, but with S3 or The Planet or Rackspace or any other cloud or host having problems, it’s a single point of failure for multiple systems, which makes the impact larger. Given that, should the site using a host have a Plan B in case the host fails, or should the host have a completely separate and redundant system to minimize a worst case scenario? That’s pretty costly.
While I don’t have hard data to support this, I would bet that far more S3 customers were not affected by this outage than were. It strikes me that even though S3 has been around now for a few years, it’s still the relatively aggressive early-adopters (like SmugMug and Twitter) that have pushed large volumes of primary, online storage into S3.
For what I’m guessing is the silent majority, S3 *is the backup*. At Wikispaces we use S3 in this manner, encrypting and backing up our data in real time. If our already-replicated local storage gets smashed by Godzilla, we can fall back to S3. This outage was a non-issue, our backup systems just caught back up when S3 came online.
Om and company, care to gather some polling data? I’d want to know volume of data stored by your readers and whether it’s online (S3 is primary storage) or offline (S3 is backup / archival storage).
Two points about the outage (continuing on the point made by jbyers)
a) Companies should reflect on how they use cloud storage: as a CDN to deliver content, as a primary datastore, or as a secondary datastore/backup. If it is more of a secondary datastore/backup, then this outage shouldn’t have affected you, since most software accessing the cloud should have resiliency built in (we did, and did not experience the outage; check out the blog to see how we handled it:
On the other hand, if you are using S3 as a CDN or primary store,…
b) Always have a plan B. This may not always be cost-effective, but decide which primary gold services need to be always working and devise appropriate safeguards to protect your business.
There are a lot of better services out there such as Nirvanix, but it just seems like this is happening way too much with S3; these guys gotta get it together. Maybe they should just stick to selling baby diapers and shoes, not storage.
Moving to a “cloud” solution or SaaS has great benefits and we are seeing some of the downside. Outages can happen and do happen, but we need to have realistic expectations. S3 will not be up 100% of the time, and what businesses need to ask themselves is “How much pain can we take if our primary service provider goes down?” and whether avoiding that pain is worth the cost of a secondary solution.
To me if s3 goes down and I can’t see twitter profile images then that isn’t a big deal. If I want that level of service then maybe we should turn twitter into a pay model so they can afford that. However in the case of Scribd this is a much more serious problem as it rendered their site completely useless. But this isn’t really an s3/cloud issue. If the RackSpace data center in Dallas goes down so does 37signals, they could replicate themselves in another data center but this isn’t cost effective. If anyone’s primary service provider goes down they will be in trouble whether they are in a cloud or on a dedicated cluster in a data center.
For full disclosure, I am with Nirvanix. This is no place for an ad, but we know you will be pleased with our Storage Delivery Network. To make the process of file and folder migration easier, we have made a software tool you may download here: http://developer.nirvanix.com/files/folders/applications/entry886.aspx
When I first started using S3 I thought it was the most amazing thing ever. Within 24 hours, I had my entire web imaging system (heavy user contribution) using s3. However, I wish amazon’s .NET methods had included some setup in web.config that would have specified a local cache folder should their service ever go down. I’m programming this myself now for collarfree.com, and using a getFileLink() method around any s3 hosted file. It checks s3 status every 5 minutes. If s3 is down, it uses the local cache for the next 5 minutes.
Essentially, it adds our own local servers to the ‘cloud’. I’m surprised s3 didn’t provide this feature in the first place.
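For readers who want to picture the approach, here is a minimal sketch of that kind of wrapper. It is written in Python rather than the commenter's .NET, and it is not their actual code: the class name, the `probe` health check, the injectable clock and the example URLs are all assumptions made for illustration. It also assumes the local cache folder is kept populated by some other process.

```python
import time
from typing import Callable, Optional

CHECK_INTERVAL_SECONDS = 5 * 60  # "checks s3 status every 5 minutes"


class FileLinkResolver:
    """Hands out S3 links while S3 looks healthy, local-cache links otherwise.

    All names here are hypothetical; this is a sketch, not a real S3 client.
    """

    def __init__(
        self,
        s3_base_url: str,
        local_base_url: str,
        probe: Callable[[], bool],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.s3_base_url = s3_base_url.rstrip("/")
        self.local_base_url = local_base_url.rstrip("/")
        self.probe = probe  # assumed health check: returns True if S3 answers
        self.clock = clock  # injectable so the interval can be tested
        self._s3_up = True
        self._last_check: Optional[float] = None  # None forces a first probe

    def _refresh_status(self) -> None:
        """Re-probe S3 only if the last check is at least 5 minutes old."""
        now = self.clock()
        due = (
            self._last_check is None
            or now - self._last_check >= CHECK_INTERVAL_SECONDS
        )
        if due:
            try:
                self._s3_up = bool(self.probe())
            except Exception:
                # Any error from the probe is treated as "S3 is down".
                self._s3_up = False
            self._last_check = now

    def get_file_link(self, key: str) -> str:
        """Return the URL to embed for one stored file."""
        self._refresh_status()
        base = self.s3_base_url if self._s3_up else self.local_base_url
        return f"{base}/{key.lstrip('/')}"
```

A page template would then call something like `resolver.get_file_link("avatars/42.jpg")` everywhere it previously hard-coded an S3 URL. The trade-off is that a status change can go unnoticed for up to five minutes in either direction, which is the price of not probing S3 on every request.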
We all know that outages will happen and will continue to happen. This is not unique to Cloud Computing or traditional hosting. I should know, as I work at GoGrid (http://www.gogrid.com), which provides Cloud Infrastructure. GoGrid’s parent company is ServePath (a managed services hosting provider), so we have years of traditional hosting experience.
I won’t get on my soapbox to say that any product is better than another; there are fine distinctions between all of them. What really matters is that you choose your platform (e.g., dedicated or cloud) carefully and that it completes and even complements your business needs. Cloud Computing will continue to evolve and as it does, standards (e.g., for mirroring or failovers across different providers) will emerge. But you need to look beyond the technology and at the Support, the SLA (if any exists) and the ancillary services as well.
Thanks Om for this article as it points to the growing pains in the industry, but again, this is no different than a company hosted traditionally experiencing a fiber cut or power outage. The proper thing that IT managers or developers need to think about is how to cleverly architect their offerings in a way to minimize single points of failure. There will be tools or companies popping up all over that will help with this. It may be a bit blurry for a while so the important thing is to continue communicating and keeping this discussion open to the community.
Technology Evangelist – GoGrid.com
One point that this emphasizes is that stuff happens.
I read that one or two of the affected companies chose to put up a cute embedded flash game since none of the features of their application were working.
I commonly hear from customers that they count on their service provider to monitor their online systems, which is sort of like grading their own work in school.
My feeling, as an employee of a company that provides remotely hosted website monitoring services, is that trust is good, but it is important to independently verify what levels of service are being delivered to your customers using a service like AlertSite.
I just posted some thoughts on “Cloud Availability” at http://mukulblog.blogspot.com/2008/07/cloud-availability.html . Love to hear your comments on the same.
Do you think that Amazon’s services have had less or more downtime over the year compared to an average enterprise solution?
Despite what many pundits have to say, reliability issues will not be the downfall of cloud computing. Using cloud computing does not mean neglecting to architect solutions that meet your business requirements, including reliability requirements.
I wrote more about this idea here:
Cloud Computing and Reliability