48 thoughts on “The Web 2.0 Hit By Outages”

  1. Yeah, Typepad has basically shut down the business2blog today, and set it back to last Friday. This is extremely frustrating and completely unacceptable.

    Six Apart may be a “Web 2.0” company, but it is facing some very Web 1.0 problems. Sort of reminds me of eBay’s early outages. Assuming that Six Apart is built on cheap off-the-shelf servers, I guess now we’ll see whether cheap scales.

    In the meantime, anyone out there know which hosted blogging service has the most industrial-strength offering, or is this pretty much what we are stuck with?

  2. Great bit of information. There are two sets of numbers.

    Whenever average seek time is high, it means one of two things: the product is designed poorly, or the vendor does not have enough bandwidth (this one is easy to fix: $$$). On the other hand, if availability is below 99%, then the service most probably has a lot of bottlenecks in its design. Best advice: look for an alternate vendor.
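
    To put the 99% figure in perspective, here is a minimal sketch that converts an availability percentage into the downtime it permits per year. The function name and the sample percentages are illustrative assumptions, not numbers from any vendor.

```python
# Convert an availability percentage into the yearly downtime it allows.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Sample levels chosen for illustration only.
for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% available -> {allowed_downtime_hours(pct):.1f} hours down per year")
```

    At 99% that works out to roughly 87.6 hours a year, which is why anything below that line tends to feel like a design problem rather than bad luck.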

  3. I would offer the problem isn’t that growth wasn’t considered from the start, rather that it hasn’t been *reconsidered* by some of these services frequently enough.

    Dare Obasanjo from Microsoft talks about how even the big guys have these issues from the position of a company that launched MSN Spaces and grew it to 3x LiveJournal in 1 year:

    “The fact is that everyone has scalability issues, no one can deal with their service going from zero to a few million users without revisiting almost every aspect of their design and architecture.”

  4. I was going to mention the del.icio.us outage too — they said it would only be an hour, but it was several times that long. Something about moving servers or something — but it made me realize how much I use the damn thing.

  5. Typepad lost a few entries I had written as well, bigtime suckage.

    There’s a difference between a service being down for a little while and them actually losing your data. Absolutely ridiculous.

    Anil, you’re probably reading this, what are you guys going to do about the data loss?

  6. There are 2 reasons why I run my own blogging server (I use Blojsom): 1) because I am picky about where MY essays reside — my posts are mine and I don’t like them being somewhere else, and 2) because I have control of the information, etc. — if something goes wrong, well, it is my own fault.


  7. Jason says “The fact is that everyone has scalability issues, no one can deal with their service going from zero to a few million users without revisiting almost every aspect of their design and architecture.” Sorry, I have to disagree. There are a lot of load and stress testing tools to check your scalability. Rewrite many times before you launch the product not afterwards.

  8. Well, to be fair, Earthlink DSL for California was down all of yesterday. I’d rather have my browser give me a 404 error at a web 2.0 site than just show me the admin panel of my modem for 16 hours.

  9. “There are a lot of load and stress testing tools to check your scalability. Rewrite many times before you launch the product not afterwards.”

    Sure, I can simulate 10,000,000 users using my product and break it, but it may take me 5 years to get that customer base. Why spend the time NOW to worry about something that *may* happen later? That takes time away from me improving my product in ways that benefit my customers NOW.

    Anyhow, I suspect we’ll just disagree eternally.

  10. First of all, I think it can be confirmed that typepad is not running backups that are 2 days old, but rather from 12/9.

    The other thing that I am finding especially irritating is that they should have some kind of banner notifying people viewing the blogs that they are looking at content that is not current. I had far too many emails this morning saying “hey your blog is a week old”.

    It is interesting to consider that the weak link in web 2.0 is in fact the hosting providers. When typepad goes down nobody can post which means memeorandum can’t accurately portray information amplitude.

    I’m done with Typepad, the performance has really sucked for the last 3 months but it’s another thing altogether when they essentially go offline and potentially lose my content. It’s been said that everyone has scalability problems, but the corollary to that is that scalability problems have been solved before so why are we still having them in these consumer services?

  11. jeff

    you bring up good points, but these problems are not exclusive to typepad. i know how frustrating this can be, but i think most other hosted services are going to have these issues as well.

    have you checked out squarespace.com?

  12. Pingback: POP! PR Jots
  13. The weak link in the whole web is the hosting providers, period. Redundancy is expensive, yes, but common sense needs to kick in at some point. It appears that they have one disk storage sub-system that went down last night. ONE???? How about an inexpensive clustering solution? How about a warm spare with a two-hour old snapshot? Not even talking about multiple data centers, which anyone who’s serious about uptime is going to be looking at. With the fees they’ve refunded already from the problems in November, they could probably purchase a mid-range SAN solution. I don’t pretend to know SA’s business metrics, but this has got to be costing them big time, both in dollars and in trust. Architecting a seriously reliable hosting model is HARD, believe me, but the paybacks are worth it.

  14. Seems you brought up a fairly timely topic recently, Om.

    I think the key in the capacity planning debate can be resolved through each individual service provider analyzing the following factors:

    1) Customer expectations
    What level of service reliability is expected by your customers? This is driven by factors such as whether they pay for the service, how much they rely on it throughout the day, etc.

    2) Financial position and strategy
    Some smaller companies may like to stay lean and not risk unnecessary infrastructure investments

    3) What damage will outages do to customers’ brand loyalty

    4) If the site is down, do you lose revenue for every minute down? (eCommerce sites)

    There really is no black and white answer to this, but anyone would have to admit, frequent or extended service outages are totally unacceptable to most users.

    Of course, there are mathematical models and other somewhat complex exercises available from operations management science to use in these scenarios, but my guess is they are rarely relied upon.

  15. Hmmm I am seeing a lot of complaining and criticizing going on in here. As someone who has actually had to run both an ISP and all the servers for said ISP, I can tell you the issue of scalability is an extremely complex problem to address.

    I see that someone here has the idea that simulating a “few million users” is something that is supposedly easy or simple. Easy is a highly subjective term depending on what kind of resources, time, and most importantly support from management one has. It is possible to break any web service or site if you throw a large enough load at it, period.

    What you are trying to do instead is run a system that has its load evenly balanced between all the parts that comprise it. Too much dependence on one part’s performance and it doesn’t matter how good your other parts are if something breaks. See, the big issue is that these services are complex entities and usually all it takes is just one piece breaking to bring the rest crashing down. You could have a disk failure, or your DB software could tank, or the DB could not be able to talk to the SAN, or your web server could have problems talking to its backend. Or your load balancer could go tits up. Or your ingress router could not be able to handle the sustained packets per second of traffic. Or a backup process could be blocking write access to a critical file.

    All of these problems would result in the same kind of problem in the end: site unavailable. Safeguarding against any one of the potential problems is possible with time and money and people. Making sure all components are safeguarded is harder and more involved obviously.
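
    The "one piece breaking brings the rest down" point can be sketched with a little arithmetic: if every component has to be up for the site to be up, and failures are assumed to be independent, the site's availability is the product of the parts. The component names below come from the list above; the 99.9% figures are made-up placeholders, and real failures are often correlated, so treat this as a rough illustration only.

```python
from math import prod


def chain_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes failures are independent, which is a simplification.
    """
    return prod(component_availabilities)


# Hypothetical per-component availabilities, for illustration only.
components = {
    "disk": 0.999,
    "database": 0.999,
    "SAN link": 0.999,
    "web server": 0.999,
    "load balancer": 0.999,
    "ingress router": 0.999,
}

site = chain_availability(components.values())
print(f"Whole site: {site:.4%} available")
```

    Six parts that are each individually quite reliable still add up to a site that is noticeably less reliable than any one of them, which is why safeguarding every component is so much more work than safeguarding one.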

    However the biggest issue is that testing failure modes is a massive pain in the ass. What are you going to do to simulate say 10 million customers hitting your site? Typically, the amount of equipment required to do testing on that scale costs as much or sometimes even more than the system you are implementing! Hell just look at how expensive SmartBits test equipment is for massive traffic loading.

    This doesn’t even go into the issues of whether or not a test was performed in conditions that would happen in the real world. For example, your server might take 10 million customers hitting its front page, but what if that was 10 million people hitting the site and all looking for something different? Totally changes what subsystems are stressed and to what levels.

    Typically, the best many folks can do is build it as best as they think they can and then throw it out into the real world and see what happens and then do tweaks to it as they learn how the system reacts to different inputs. The number of people out there who understand how to build a website capable of scaling to massive loads is small because it all depends on the kind of site and the services it provides and relies on and one has to become a subject matter expert in all the parts of the site so as to best design it.

  16. Just to be clear, despite displaying cached blog pages, there is no data lost on TypePad. We’ll be republishing the pages to bring them current now that the service is back up.

  17. Have things really not improved that much since eBay was having all of their scalability issues five or so years ago? That’s kind of surprising to me.

  18. The biggest issue in my mind is whether Web 2.0 companies are going to survive their success or not. SixApart has known it had reliability issues for months, and it’s had funding to hire some top people to build a world-class infrastructure. Has it done so?

    Maybe… and maybe the company needs to hire its own Meg Whitman, someone who has experience running a much bigger company, to help them grow SixApart?

  19. Putting up my URL at this time is ironic – it’s on 6A TypePad. It is true that I can reach TypePad. It is not true to say that means sites have been updated when you hit ‘View Website.’ 6A explains this and that’s fair enough.

    The issue for me is this has been the case since around 0600 CET – that’s 9pm PT I think. Yet there was virtually no online coverage of the issue until around 1700 CET when The Register posted a quick thingy on it. A number of us in Europe were left scratching our heads with little clue as to what has been going on. We still don’t know.

    On the so-called blogosphere? Zippo. Nada. Nix. Nothing.

    MSM? Zippo again.

    So tell me this. Just how influential is this media? Really. Truly. Honestly.

    And on scalability – you’re right Om. I saw the red light when Canter and Ismail started talking about Structured Blogging a few days ago. In hindsight, I wish I’d listened a little more closely to the ‘oh-oh’ antennae.

    Anyone want to speculate how much this has shifted any ‘tide’ towards OSS back towards MSFT?

  20. Current Issues with TypePad Posted by Michael Sippey in On Typepad website

    “During routine maintenance of our network and storage systems last night, we experienced an issue with our primary disk system where data from published blogs are stored.”

    I’m guessing here, but it looks like the data store is not distributed. Every time something happens to that primary disk system, typepad most probably will go down. This also might lead to scalability issues down the road if they acquire more customers.

  21. I have a modest proposal.

    Each time you, Om, or other prominent bloggers use that frickin’ empty phrase “Web 2.0,” you should donate $20 to a non-denominational, non-partisan charity. No limits. Use the phrase 4 times in one blog entry, owe $80. (but, in a moment of kindness towards you, I won’t hold you responsible for mentions in resyndication and such)

    The outages have absolutely *NOTHING* to do with any Web 2.0’ish, Web 1.9’ish, Web 2.1713’ish etc. A few *very visible* companies have had outages recently.

    I know, I know, that makes for a much less juicy title: “Many popular companies have recently experienced outages”… but it makes me retch a whole lot less than reading yet another “Web 2.0” headline.

    I mean, seriously, what next? “Web 2.0 causes divorces.” “Web 2.0 responsible for illiteracy in Namibia.” “Politicians ignore Web 2.0 issues in recent debate.”

    For crying out loud… can we just talk about companies on their own merits and stop trying to classify and rename and illogically group stuff?

    MUCH thanks in advance!!!

  22. “Why spend the time NOW to worry about something that may happen later? That takes time away from me improving my product in ways that benefit my customers NOW”

    Because it may well kill your business. Same way you plan cashflow. Typepad not taking scalability seriously and having a bulky and inefficient app, not to mention lacking the skills required to run a large-scale hosted service, will lose them more and more customers in the long run.

    As I mentioned in the other post, startup businesses have to be tight with server and co-lo providers because it is an essential part of any web business.

    Also, getting to 99.9999% reliability is more of an issue of skill rather than money. Do you think Microsoft solved the Spaces problems by throwing lots of money at it, or did they use their experience in running large-scale infrastructure to sort the problems out?

    Feedlounge is such a good example of this, poorly tested and poorly planned and now almost a year later still no product.

    Benchmark your applications early and plan your expansion based on that. Using ‘ab’ I can measure how many resources an app requires and plan based on that in less than an hour: keep pushing up the number of concurrent connections till your server maxes out, then work out how many hits on average come from each user per day and work out your peak times to arrive at a total number of hits per minute. Divide this by your ab results and you get an idea of how many servers you need. Use round-robin DNS, spread out your databases and replicate, then replicate again to ‘warm’ servers. If your app is maxing out a server with only 10 concurrent hits then it is time to re-evaluate your architecture and how your application has been put together. Response times should be
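
    The back-of-the-envelope calculation described in this comment might look something like the sketch below. Every number is a hypothetical placeholder: the per-server figure is assumed to come from your own benchmark run (for example the requests-per-second result that ab reports), and the user counts and peak share are assumptions you would replace with your own traffic data.

```python
import math


def servers_needed(users, hits_per_user_per_day, peak_share, peak_minutes,
                   max_hits_per_minute_per_server):
    """Estimate server count from peak traffic and a benchmarked per-server limit.

    peak_share is the fraction of a day's hits that land in the busiest
    window, and peak_minutes is the length of that window.
    """
    daily_hits = users * hits_per_user_per_day
    peak_hits_per_minute = daily_hits * peak_share / peak_minutes
    return math.ceil(peak_hits_per_minute / max_hits_per_minute_per_server)


# Hypothetical inputs: 100,000 users making 20 hits a day, a quarter of
# which arrive in the busiest hour, on servers that benchmarked at
# 50 requests per second (3,000 hits per minute) before maxing out.
estimate = servers_needed(
    users=100_000,
    hits_per_user_per_day=20,
    peak_share=0.25,
    peak_minutes=60,
    max_hits_per_minute_per_server=50 * 60,
)
print(f"Estimated servers needed at peak: {estimate}")
```

    This leaves no headroom for failures or growth, so in practice you would probably pad the result rather than run right at the benchmarked limit.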

  23. Pingback: My Stuff
  24. I wonder how many people here bitching about outage are on free accounts? And how many have never had to reboot a live system because something got snarled up? Stuff happens, people – get over yourselves. No-one died, fer gossakes.

  25. “Because it may well kill your business.”

    No, it won’t. Online services *never* go out of business because of system scalability issues. They go out of business ALL THE TIME from not getting enough customers and not making money.

    “It crippled me since just about all of my bookmarks are there.”

    Would you listen to yourself? “Crippled” because your social bookmarks are down for a few hours???

  26. Pingback: Betaflow.com
  27. Why would anyone in business and fed up with the TP service choose to move it to what amounts to another ISP? Surely the better alternative would be to think about this as an opportunity to evaluate the landscape and think about their aspirations around this medium.

  28. I was going to mention the del.icio.us outage too — they said it would only be an hour, but it was several times that long. Something about moving servers or something — but it made me realize how much I use the damn thing.

  29. One of the issues Web 2.0 companies have in building solutions on the cheap is that they don’t plan for real scalability of their infrastructure. Which means that they melt down whenever their traffic/audience grows faster than their ability to add servers/gears.
