73 thoughts on “S3 Outage Highlights Fragility of Web Services”

  1. Good report Om.

    Maybe the problem isn’t only with the webservices. The problem may be with those who hire an extremely cheap service expecting to have the same uptime of a more expensive service.

    I saw Scoble’s interview with a guy from HP research, and he pointed out that to make cheaper and greener data centers we should review the way we work on SLAs. Why have an extremely reliable (and expensive) platform when you are not storing and processing extremely vital data? Twitter profile pictures are not so vital, are they?

    I work at a small startup that is moving part of its databases and backups to Amazon. For the price, we don’t expect a 100% SLA at all. For the part of our applications where we need all kinds of redundancies in place, we know we will keep paying more.

    Sao Paulo – Brazil

  2. Amazon never said that they shared the AWS infrastructure with Amazon e-commerce; the man said that they used the same design (on his high availability blog).

    The industry needs not only off-line mirroring of S3 type services, but third party certification and a pool of insurers for business continuity ratings and policies.

  3. @Alan

    The industry needs not only off-line mirroring of S3 type services, but third party certification and a pool of insurers for business continuity ratings and policies.

    Did you just give away a business idea 🙂

  4. So, sometimes, your “real” hosting platform goes down. It just happens…

    You complain, and you get a (usually tiny) refund. I think the real $64,000 question is:

    Do you get better availability with AWS or your hosted platform?

    (I don’t use AWS, but am interested in the responses)

  5. I like these types of writeups; just from the title I knew it’s one by Om and hence worth reading.

    To answer Antonio Rodrigez’s question: At CloudCamp a few weeks/months ago I asked the Amazon WS guy this exact question. His natural answer, which is what we all know and fear, is that even though S3/EC2 are the result of work on Amazon.com’s own infrastructure, they are in fact separate services which Amazon doesn’t yet entrust with its own web assets.

    Regardless of this, I believe these services have relatively great uptime and that they will improve. The community is reacting to AWS’s current position by using them only for background computational tasks and hosting stuff like avatars/images. SmugMug took a risk and got burned. When they get better we’ll see entire deployments moving over.

  6. Interesting times, and I am sure that every IT department has experienced serious outages ranging from 4 – 8 hours or greater, and at what cost. SLAs will become imperative to future SaaS-based technologies, and how we measure SLAs will also change. I believe that Amazon is just touching the tip of the iceberg with S3 and they will continue to build the redundancy, improve the architecture and moderate SLAs to support the future of web computing.

    In working with CIOs around the world, we have been talking about the power of smart technologies like S3 and other companies like Salesforce.com and OpSource, where they are creating real value for their customers. Given the history of Amazon, I would expect them to minimize the downtimes and provide an overarching architecture to support not only the future growth but the reliability that Amazon has proven over the last decade.

    Lonnie Wills
    Blog: http://saasevolution.blogspot.com/

  7. Om,
    McNealy and Ellison were the proponents of Network is the Computer – now called cloud computing. Grove said it would never work. Do not remember Steve ever talking about it.

    The issue is one of trust and reliability; localized storage/computing is generally better. Having said that, the market will be there.

    I think Bezos is having a lot of fun with Amazon. He is the number 1 entrepreneur, even ahead of the guy who started eBay. Take that, Google and Yahoo boys

  8. @petabro,

    I think you might be onto something, but I think McNealy was way ahead of these guys. I think the subtle difference is that now “network is the corporation”

    I think cloud computing evolution is still in the “first inning.”

  9. Flavio has a good point. Are people not getting what they pay for here?

    After all, the core of the internet is fully redundant and if the connection is that important you can have dual connections to it from your office (at a price).

    So the Amazon problem looks to be one of 2 things.

    1. The system has been built on the cheap to meet a very low price point with an SLA to match. Yet people have signed up to this service knowing that there is no SLA.

    2. Amazon have built this thing. It is not their core business and as such, they really are not the best people to build and deliver enterprise class services (yet). They do not have enough experience. And building your own large platform does not automatically qualify you to build large systems for other people.

    S3 is NOT an enterprise application. It is NOT designed for VERY important data.

    If it was, it would carry an SLA to match.

  10. I think it would be really interesting to compare uptime versus level of control and cost.

    For example, if you pay for and support your own data-centres, can you achieve better uptime than what you get from S3 and EC2? Does the fact that you “control” the data-centres allow you to get back online faster than Amazon can fix S3 or EC2?

    Note that your costs in going it alone are more than just data-centre space, hardware and admin, because you’ll have to architect your own storage solution as well, designing in fault tolerance, scaling, etc.

  11. The reason Amazon.com didn’t have 404’s is the same reason WordPress.com didn’t have 404’s… they built their site the right way… redundant. Twitter and friends don’t have that level of redundancy.

    I did a little research on it last night in a blog post (click my name to visit my blog… won’t spam it here).

    I’m a little disappointed to see someone raise that question, as anyone in the business should be well aware of cost/benefits in terms of infrastructure and how to be redundant.

    Grids don’t count as redundancy. They are just a higher reliability system. 2 separate grids can be considered redundant.

  12. Om,

    I don’t think that is really the $64,000 question; redundancy fails as well. Look at Microsoft/eBay/Google: they have all built entirely redundant platforms but they still suffer downtime. This is an inevitable fact in the hosting world and not something people should be concerned about. The things people should be concerned about are A) how much downtime and B) how does your provider respond.

    Web services on the cloud are perfectly reliable these days; compare them to the physical machines you are on. Our physical gadgets have tons of downtime but I don’t see anyone shunning their Dell laptop because of it.

  13. This is scary for anyone counting on Amazon for their hosting without redundancy. Also, their support for AWS is terrible. When you call them it takes you to the Amazon.com support center and they have no clue about AWS. It’s a lot like Google Apps (from a support point of view). No direct line, and email support is very poor.

  14. I think the redundancy point is key. You can’t rely on S3 without having a failover option. Maybe this will boost business for Nirvanix as a backup option. Although at that point does it make sense to build your own system?

    @Ross, the issue with owning my own system is that it’s a single point of failure for only my own site, but with S3 or The Planet or Rackspace or any other cloud or host having problems, it’s a single point of failure for multiple systems, which makes the impact larger. Given that, should the site using a host have a Plan B in case the host fails, or should the host have a completely separate and redundant system to minimize a worst case scenario? That’s pretty costly.

  15. While I don’t have hard data to support this, I would bet that far more S3 customers were not affected by this outage than were. It strikes me that even though S3 has been around now for a few years, it’s still the relatively aggressive early-adopters (like SmugMug and Twitter) that have pushed large volumes of primary, online storage into S3.

    For what I’m guessing is the silent majority, S3 *is the backup*. At Wikispaces we use S3 in this manner, encrypting and backing up our data in real time. If our already-replicated local storage gets smashed by Godzilla, we can fall back to S3. This outage was a non-issue; our backup systems just caught back up when S3 came online.

    Om and company, care to gather some polling data? I’d want to know volume of data stored by your readers and whether it’s online (S3 is primary storage) or offline (S3 is backup / archival storage).
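The "backup just caught back up" behaviour described in this comment can be sketched roughly as below. This is only an illustration under stated assumptions, not how Wikispaces actually does it: `BackupQueue` and the `upload` callable are hypothetical names, and the real upload call to S3 is left to the caller.

```python
from collections import deque
from typing import Callable, Deque, Tuple


class BackupQueue:
    """Hypothetical sketch: hold backup writes while the remote store is down.

    `upload` is an assumed caller-supplied function that sends one object to
    the remote store and raises an exception if the store is unreachable.
    """

    def __init__(self, upload: Callable[[str, bytes], None]) -> None:
        self._upload = upload
        self._pending: Deque[Tuple[str, bytes]] = deque()

    def enqueue(self, key: str, data: bytes) -> None:
        # Writes are queued locally first, so an outage never blocks the app.
        self._pending.append((key, data))

    def flush(self) -> int:
        """Try to send queued items in order; stop at the first failure."""
        sent = 0
        while self._pending:
            key, data = self._pending[0]
            try:
                self._upload(key, data)
            except Exception:
                break  # store is still down; keep the item and retry later
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)
```

Calling `flush()` on a timer means nothing is lost during an outage, and the queue drains in order once the store comes back.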

  16. Two points about outage (continuing on the point made by jbyers)
    a) Companies should reflect on how they use cloud storage – as a CDN to deliver content, as a primary datastore, or as a secondary datastore/backup. If it is more of a secondary datastore/backup, then this outage shouldn’t have affected them, since most software accessing the cloud should have resiliency built in (we did, and did not experience the outage; check out the blog to see how we handled it:


    On the other hand, if you are using S3 as a CDN or primary store,…
    b) Always have a plan B. This may not always be cost-effective, but decide which primary “gold” services need to be always working and devise appropriate safeguards to protect your business.


  17. There are a lot of better services out there, such as Nirvanix, but it just seems like this is happening way too much with S3; these guys gotta get it together. Maybe they should just stick to selling baby diapers and shoes, not storage.

  18. Stacey,

    Moving to a “cloud” solution or SaaS has great benefits and we are seeing some of the downside. Outages can happen and do happen, but we need to have realistic expectations. S3 will not be up 100% of the time, and what businesses need to ask themselves is “How much pain can we take if our primary service provider goes down?” and whether that pain is worth buying a secondary solution.

    To me if s3 goes down and I can’t see twitter profile images then that isn’t a big deal. If I want that level of service then maybe we should turn twitter into a pay model so they can afford that. However in the case of Scribd this is a much more serious problem as it rendered their site completely useless. But this isn’t really an s3/cloud issue. If the RackSpace data center in Dallas goes down so does 37signals, they could replicate themselves in another data center but this isn’t cost effective. If anyone’s primary service provider goes down they will be in trouble whether they are in a cloud or on a dedicated cluster in a data center.
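To put a number on "not up 100% of the time", here is a small sketch that converts an availability percentage into hours of downtime per year, and an outage length back into a percentage. The figures used are purely illustrative and are not Amazon's actual numbers; the function names are made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


def availability_percent(outage_hours: float,
                         period_hours: float = HOURS_PER_YEAR) -> float:
    """Availability percentage over a period, given total outage hours."""
    if period_hours <= 0 or not 0.0 <= outage_hours <= period_hours:
        raise ValueError("outage must fit inside a positive period")
    return 100.0 * (1.0 - outage_hours / period_hours)


if __name__ == "__main__":
    # Illustrative only: a 99.9% target allows roughly 8.76 hours per year.
    print(round(downtime_hours_per_year(99.9), 2))
    # A single 8-hour outage in a year works out to about 99.91%.
    print(round(availability_percent(8), 2))
```

Seen this way, the "how much pain can we take" question becomes a comparison between the cost of those hours and the cost of a secondary provider.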

  19. For full disclosure, I am with Nirvanix. This is no place for an ad, but we know you will be pleased with our Storage Delivery Network. To make the process of file and folder migration easier, we have made a software tool you may download on our website and begin the automated

  20. Pingback: Bezos
  21. When I first started using S3 I thought it was the most amazing thing ever. Within 24 hours, I had my entire web imaging system (heavy user contribution) using S3. However, I wish Amazon’s .NET methods had included some setup in web.config that would have specified a local cache folder should their service ever go down. I’m programming this myself now for collarfree.com, wrapping a getFileLink() method around any S3-hosted file. It checks S3 status every 5 minutes. If S3 is down, it uses the local cache for the next 5 minutes.

    Essentially, it adds our own local servers to the ‘cloud’. I’m surprised s3 didn’t provide this feature in the first place.
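The approach in this comment (check status every 5 minutes, serve from a local cache while S3 is down) could look roughly like the sketch below. It is written in Python rather than .NET and is not the commenter's actual code: `FallbackLinker`, `is_remote_up`, and the example URLs are assumptions made for illustration, and the health probe itself is left to the caller.

```python
import time
from typing import Callable, Optional


class FallbackLinker:
    """Hypothetical sketch: link to the remote store, or a local cache if it is down."""

    def __init__(self,
                 is_remote_up: Callable[[], bool],   # assumed caller-supplied probe
                 remote_base: str,
                 local_base: str,
                 check_interval: float = 300.0,      # 5 minutes, as in the comment
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._is_remote_up = is_remote_up
        self._remote_base = remote_base.rstrip("/")
        self._local_base = local_base.rstrip("/")
        self._check_interval = check_interval
        self._clock = clock
        self._last_check: Optional[float] = None
        self._remote_ok = True

    def _refresh_status(self) -> None:
        # Probe at most once per interval; reuse the cached answer otherwise.
        now = self._clock()
        if self._last_check is None or now - self._last_check >= self._check_interval:
            try:
                self._remote_ok = bool(self._is_remote_up())
            except Exception:
                self._remote_ok = False  # a failing probe counts as "down"
            self._last_check = now

    def get_file_link(self, key: str) -> str:
        """Return the remote URL when healthy, else the local cache URL."""
        self._refresh_status()
        base = self._remote_base if self._remote_ok else self._local_base
        return f"{base}/{key.lstrip('/')}"
```

A page template would call `linker.get_file_link("avatars/42.png")` wherever it used to hard-code the bucket URL. The trade-off is that links can point at a dead store for up to one interval after an outage begins.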

  22. We all know that outages will happen and will continue to happen. This is not unique to Cloud Computing or traditional hosting. I should know, as I work at GoGrid (http://www.gogrid.com), which provides Cloud Infrastructure. GoGrid’s parent company is ServePath (a managed services hosting provider), so we have years of traditional hosting experience.

    I won’t get on my soapbox to say that any product is better than another; there are fine distinctions between all of them. What really matters is that you choose your platform (e.g., dedicated or cloud) carefully and that it completes and even complements your business needs. Cloud Computing will continue to evolve and, as it does, standards (e.g., for mirroring or failovers across different providers) will emerge. But you need to look beyond the technology and at the Support, the SLA (if any exists) and the ancillary services as well.

    Thanks Om for this article as it points to the growing pains in the industry, but again, this is no different than a company hosted traditionally experiencing a fiber cut or power outage. The proper thing that IT managers or developers need to think about is how to cleverly architect their offerings in a way to minimize single points of failure. There will be tools or companies popping up all over that will help with this. It may be a bit blurry for a while so the important thing is to continue communicating and keeping this discussion open to the community.

    Michael Sheehan
    Technology Evangelist – GoGrid.com

  23. One point that this emphasizes is that stuff happens.

    I read that one or two of the affected companies chose to put up a cute embedded flash game since none of the features of their application were working.

    I commonly hear from customers that they count on their service provider to monitor their online systems, which is sort of like grading their own work in school.

    My feeling, as an employee of a company that provides remotely hosted website monitoring services, is that trust is good, but it is important to independently verify what levels of service are being delivered to your customers using a service like AlertSite.

    Ken Godskind

  24. Do you think that the Amazon services have had less or more downtime over the year when compared to an average enterprise solution?
