Amazon’s S3 cloud-based storage service went down earlier this morning, according to numerous tips we’ve received. The outage has affected many companies, including Twitter. According to our tipsters, the service went down around 4:30 a.m., and is showing a 500 Internal Server Error message.
Amazon Web Services forums are full of people chatting about the outage. One poster on the forum summed up the situation nicely, saying, “The s3 service is great but this just proves you can’t rely on it, this is a major issue especially since it’s been down for so long. Way to go Amazon.”
This outage, one of the first large-scale problems to hit Amazon, shows that a lot of work needs to be done before we can completely rely on the cloud. As I have often said, we are running the 21st century web on infrastructure that was dreamed up in the 1990s, long before the web’s current scale. Still, that doesn’t take away my long-standing enthusiasm for Amazon’s web services strategy.
We will keep you posted. Meanwhile, let us know how you have been impacted and what you are doing to build the redundancy of your web service.
Nick Carr has his take on the situation. “Given that entire businesses run on S3 and related services, Amazon has a particularly heavy responsibility not only to fix the problem quickly but to explain it fully,” he writes. I agree with him, and hopefully Amazon will do the needful. Amazon says it has fixed it, but there seem to be continuing problems with the service, as the forum indicates.
It’s back up now. We get most of our traffic from India and unfortunately for us, this happened during near peak hours – 6 in the evening. We use AWS for images, but the system defaults to our internal server when it fails. We had been thinking of doing away with the fail-over given how well AWS worked, but of course, that won’t happen anytime soon now.
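For readers wondering what that kind of fail-over looks like, here is a minimal sketch. It is only an illustration, not the commenter's actual setup: `fetch_with_fallback`, `s3_down`, and `internal_server` are hypothetical names, and the two stub functions stand in for real calls to S3 and to an internal image server.

```python
from typing import Callable


def fetch_with_fallback(
    key: str,
    primary: Callable[[str], bytes],
    fallback: Callable[[str], bytes],
) -> bytes:
    """Try the primary store first; on any error, serve from the fallback."""
    try:
        return primary(key)
    except Exception:
        # Any failure from the primary store (e.g. a 500 error) triggers the fallback.
        return fallback(key)


def s3_down(key: str) -> bytes:
    # Hypothetical stub simulating the outage.
    raise RuntimeError("500 Internal Server Error")


def internal_server(key: str) -> bytes:
    # Hypothetical stub for the internal image server.
    return b"internal:" + key.encode()


image = fetch_with_fallback("logo.png", s3_down, internal_server)
```

Passing the two stores in as plain functions keeps the sketch independent of any particular storage client; a real version would likely also want timeouts and logging, which are left out here.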
Someone check if Rackspace went down today or not. It appears that “downtime trouble” follows Twitter wherever they go!
@Adnan,
That is funny. I am betting that Twitter people will not admit their own shortcomings and how badly their system is architected. It is always the hosting company which is to blame.
digg here – http://digg.com/hardware/Amazon_S3_World_s_most_reliable_web_service_is_DOWN
We’ve gotten so good at reducing adoption friction that we’ll see a lot of this kind of thing. It just isn’t possible to plan for it.
More on my blog:
http://smoothspan.wordpress.com/2008/02/15/google-reports-iphone-usage-50x-other-handsets-amazon-s3-goes-down-low-friction-has-a-cost/
Best,
BW
“…Amazon will do the needful.”
Om, you did not just use that word…needful.
I use JungleDisk to back up my iPhoto library to Amazon S3 nightly. No data was lost (on my end) but I did notice that JungleDisk had to back up the entire iPhoto library and not just the new files.
I’m not happy this outage happened, but we may be better off for it as an industry. There’s so much hype about the possibilities of the cloud right now that we’re overlooking some of the service-level requirements that it may or may not meet. Amazon could inadvertently become a test case that will be studied by other enterprises who are considering moving their infrastructure over.
One of our clients’ sites was down for a while, due to this outage. Seems to be back up. They did say that other than this, the service has been great. We are working on an upcoming project and are pretty sure we are going to use AWS…Definitely going to do more diligence on this and see what the explanation is for it. I look forward to seeing the reason.
Matt
We are only one major outage away from certain marquee clients swearing off sole reliance on SAAS. This happened to a client, a mid-sized automotive auction, that had, with my help, knit together a network of dealers, contractors, and agents into a system with a zero-install, zero-hosting footprint.
UNTIL:
There were four accounts that were mashed up…the usual suspects, and one of them went dark. We did some pinging (here is a good business idea for a bright Web20 person, third party app monitoring and governance) and isolated the guilty party.
In spite of being punked, fingered, whatever, the slackers who ran the service were very rude and unforthcoming. That’s another problem: who are you going to deal with when these hosted services go down? I’m not so sure it would have been any better if it had been SalesForce that crapped out.
Long and short of it: we have a business community that is used to local control, we consultants want to deliver apps as a service – we will need to ally ourselves with the providers of these services to come up with a game plan…but try and get one of the stars to cough up a retainer!
Most of the startup SaaS guys laugh when I propose a contract to consult on packaging and policies for reliability for the SMB end users.
But this is exactly what they should want, guys like me who beat the bushes for them.
Amazon’s SLA for S3 is 99.9% uptime during a billing month. That’s 0.723 hours of allowable downtime.
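The exact allowance depends on the length of the billing month, so here is a quick check of the arithmetic. This is a rough sketch, assuming allowable downtime is simply the non-uptime fraction times the hours in the month; the function name is made up for illustration, and this is not Amazon's official method for computing SLA credits.

```python
def allowed_downtime_hours(uptime_pct: float, days_in_month: int) -> float:
    """Hours of downtime permitted in a month at the given uptime percentage."""
    hours_in_month = days_in_month * 24
    return hours_in_month * (100.0 - uptime_pct) / 100.0


# Print the allowance for the possible month lengths at 99.9% uptime.
for days in (28, 30, 31):
    hours = allowed_downtime_hours(99.9, days)
    print(f"{days} days: {hours:.3f} hours ({hours * 60:.1f} minutes)")
```

For a 30-day month this works out to about 0.72 hours (roughly 43 minutes), and for a 31-day month about 0.744 hours, so the figure quoted above is in the right range.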
See the “Justin Etheredge Offers Preview of LINQ to [Amazon] SimpleDB” topic of http://oakleafblog.blogspot.com/2008/02/linq-and-entity-framework-posts-for_11.html.
–rj
Cloud-based storage is getting a lot of heat today, and since it’s web-centric, any amount of downtime is unacceptable. The situation today should not put cloud storage in a bad light; other companies such as Nirvanix have storage nodes around the world with no single point of failure, which helps in avoiding situations like today’s. If you’re relying on a single point for critical data, you’ve got a major problem.
I also advised the auto auction that they should invest in the VSAT data services that only charge for rent of the equipment, and any fail-over data transmission, but they balked at the cost.
I told them no matter how reliable (and generally, hosted services are more reliable than a mid-sized business’s owned plant), one local loop for data was no way to run a business. They ran their auction, live, cashier functions and all, on SAAS.
Eventually, their link did go down, and it had nothing to do with the SAAS providers. Now, they have bonded SDSL from two carriers that can split when one goes down.
So many ways to fail.
Sure – it’s a bummer when a cloud based storage system fails. In the same way that it’s awful when the power goes out. But claims that this sort of outage will harm the ascendancy of cloud computing are akin to claims that power cuts make more likely a return to gaslights and steam powered manufacturing.
Wow. Just wow. Everyone out there jumping up and down just needs to relax. Go outside, call your mother, step away from the computer, go to the gym, read a book (and not on Kindle). I’m prompted to write this in light of the recent BlackBerry outage. Again, an outage of a few hours gets coverage all over the web and on TV as well. I couldn’t believe the BlackBerry outage was covered in depth on CNBC.
Frankly we’re all lucky this stuff even works at all. Go hug your kids or the person to your left.
In our early beta version, we are using some of Amazon’s web services (namely S3 and SimpleDB) – but have been considering using our own storage and database instead. With today’s outage I’m not sure if AWS is a great strategy for us.
We don’t have huge amounts of data to store like some companies (smugmug comes to mind), so using AWS was mostly for the peace of mind that we would be able to scale quickly after our beta goes public and all of Digg’s users abandon them for us. We have a meeting tomorrow to take a closer look at our strategy for handling lots of new traffic in a short period of time, and I have to say that it doesn’t seem likely that Amazon will be included in the party.
Despite the few hours downtime, it’s still one of the best available and reliable web services, to date…
Hmmm, I have back episodes of my podcast stored on S3, so this is a disservice to potential new subscribers. I hope Amazon fixes this soon!
http://soundsgoodpodcast.com
If you ever made an effort to read the slides from SmugMug’s chief Don MacAskill (he removed the PDF from the site, so you can only get it from web.archive.org here http://web.archive.org/web/20070406174427/http://blogs.smugmug.com/don/files/ETech-SmugMug-Amazon-2007.pdf or the same link, shorter, http://tinyurl.com/33t27f ), they start with a nice photo of an Amazon data center after a major fire. They then go on, here and there, to state that the author’s company does NOT count on Amazon’s 100% reliability and does NOT advise others to do so.
@A.T.
Actually, if you bothered to read my slides, let alone my blog posts and other coverage, you’d know that:
What I did say is that no service, hardware, or software we’ve ever used is 100% and that Amazon is no different. Depend on it, fine. I do. But expect miracles? That’s just stupid.
Sorry about the slides being missing, that was an accident. They’ve been restored.
It affected me a little bit – one of my subcontractors relies on AWS for file hosting, and so it was a temporary problem for me.
That said, everything goes down. It is incumbent on you to not rely on one service, period. You wouldn’t rely on one spindle of a hard drive; you’d back up. Having multiple options is not only prudent but required, especially when using third parties, as everything will fail at some point and nothing, nothing is going to be 100% uptime, even internal systems you own completely yourself. That’s a very false sense of security.
I like AWS and still would recommend it. Now, if this becomes a habit, then, maybe that might change.