December 29, 2010

What K-O'ed Skype Last Week

Skype, the Internet telephony service, went on the blink last week, stranding millions who use it for their communication needs. It took more than a day for the service to be restored. In conversation, CEO Tony Bates told me that the problem might lie with some errant Windows clients. Well, make that many errant Windows clients! Today Skype’s Chief Information Officer, Lars Rabbe, offers more details in a blog post.

In a nutshell, Skype says a bug in its Windows client software crashed a large number of clients, including many that were acting as supernodes. That overloaded the remaining supernodes and set off a chain reaction of problems.

On Wednesday, December 22, a cluster of support servers responsible for offline instant messaging became overloaded. As a result of this overload, some Skype clients received delayed responses from the overloaded servers. Because of a bug identified in a version of the Skype for Windows client (version 5.0.0.152), the delayed responses from the overloaded servers were not properly processed, causing Windows clients running the affected version to crash.

Around 50 percent of all Skype users globally were running the 5.0.0.152 version of Skype for Windows, and the crashes caused approximately 40 percent of those clients to fail. These clients included 25–30 percent of the publicly available supernodes, which also failed as a result of this problem.
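To put those two percentages together, here is a back-of-the-envelope sketch in Python. It assumes the 40 percent failure rate applies only to the half of users on the affected version, as the passage above says; the function name is made up for illustration and is not anything from Skype.

```python
def crashed_share(on_buggy_version: float, crash_rate: float) -> float:
    """Fraction of ALL clients that crashed.

    on_buggy_version: share of all users running the affected build (0..1)
    crash_rate: share of those affected clients that actually failed (0..1)
    """
    return on_buggy_version * crash_rate


# Figures quoted above: ~50% of users on 5.0.0.152, ~40% of those crashed.
share = crashed_share(0.50, 0.40)
print(f"Roughly {share:.0%} of all Skype clients crashed")
```

In other words, about one in five Skype clients worldwide went down, which is a lot of machines to lose at once in a network that leans on its users' computers.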

I wonder if some of these problems were brought on by recently introduced aggressive “forced updates” which have not gone down well with some users. Voxeo CEO Jonathan Taylor offered up the theory that buggy software that was pushed on to Windows users was to blame.

If you had the latest Skype for Windows (version 5.0.0.156), older versions of Skype Windows (4.0 versions), Skype for Mac, Skype for iPhone, Skype on your TV, and Skype Connect or Skype Manager for enterprises, you were not initially affected by this problem. However, with 25–30 percent of Skype’s supernodes going down, it quickly became a network-wide problem.

A supernode is important to the P2P network because it takes on additional responsibilities compared to regular nodes, acting like a directory, supporting other Skype clients and establishing connections between them by creating local clusters of several hundred peer nodes per each supernode.

Once a supernode has failed, even when restarted, it takes some time to become available as a resource to the P2P network again. As a result, the P2P network was left with 25–30 percent fewer supernodes than normal. This caused a disproportionate load on the remaining available supernodes. A significant proportion of users were also restarting crashed Windows clients at this time. This massively increased the load as they reconnected to the peer-to-peer cloud.
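Why "disproportionate"? A rough Python sketch of the arithmetic, assuming (my simplification, not Skype's) that load is spread evenly across whatever supernodes remain. The `demand_growth` parameter is a hypothetical knob for the extra traffic from restarting clients; Skype gave no figure for it, so it defaults to no growth.

```python
def load_multiplier(lost_fraction: float, demand_growth: float = 1.0) -> float:
    """Load on each surviving supernode relative to normal.

    lost_fraction: share of supernodes that are unavailable (0 <= x < 1)
    demand_growth: total demand relative to normal (1.0 = unchanged);
                   a hypothetical parameter, not a number from Skype.
    """
    if not 0.0 <= lost_fraction < 1.0:
        raise ValueError("lost_fraction must be in [0, 1)")
    return demand_growth / (1.0 - lost_fraction)


# The 25-30 percent range quoted above, with demand held constant.
for lost in (0.25, 0.30):
    print(f"{lost:.0%} of supernodes lost -> "
          f"{load_multiplier(lost):.2f}x load on each survivor")
```

Even with demand held flat, losing 25 to 30 percent of supernodes pushes each survivor to roughly a third or more above its normal load. Add the wave of reconnecting clients on top of that and it is easy to see how the rest of the network tipped over.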

In order to deal with the problem, Skype essentially introduced “thousands of instances” of the Skype software into its P2P network and created temporary supernodes. The biggest lessons learned from this, Rabbe writes:

More investments in their infrastructure so that the system becomes and stays reliable.
More rigorous testing procedures that don’t let buggy software out into the market.

This is not the first time Skype’s systems have come under pressure because of software bugs. In August 2007, Skype had software problems as well, which in turn caused a flood of log-in requests and crashed the network.

12 thoughts on this post

don says:

December 29, 2010 at 8:01 am

One more reason NOT to use Windows.

muppets says:

December 29, 2010 at 8:14 am

oh come on, you can’t blame windows for this one. It was a skype problem with some badly written skype code for the windows problem.
So put down your Apple / Nix soap box and shut up.

1. Om Malik says:
  
  December 29, 2010 at 8:44 am
  
  Hmmm…. how am I on my Apple soapbox. It seems you want to read what you want to read. The article clearly states that it is a problem with Skype.
  
  1. aep528 says:
    
    December 29, 2010 at 11:19 am
    
    Do you really not understand he was responding to the post above?
    
  2. don says:
    
    December 29, 2010 at 2:49 pm
    
    And they both are rude.
    
    aep528: muppets should have clicked the Reply button – IF that was the comment he meant to shut up (rude).
    
    fwiw: Apple was not referenced in my comment, was it?
    
Peter says:

December 29, 2010 at 8:34 am

No, it’s a Skype problem. They rely on their end users to provide the computing power necessary (instead of running centralized servers), and bill you for the privilege when you want to use premium services. A brilliant business model.

1. Om Malik says:
  
  December 29, 2010 at 8:45 am
  
  +1 to that. And there is a reason why they don’t want to draw attention to the forced updates that seem to have pushed out buggy software so quickly.
  
  1. gzino says:
    
    December 29, 2010 at 12:20 pm
    
    True but their architecture is arguably more resilient and redundant than any major SP (though maybe their software dev/QA/update processes need some tuning).
    
    Question: do we know if the temporary supernodes were regular, user-owned computers, just promoted, or if they were in fact Skype owned/rented nodes?
    
Aswath Rao says:

December 29, 2010 at 5:46 pm

A buggy Windows client may be the instigator, and the forced upgrade could have exacerbated the problem. But the lack of some operational procedures is more glaring:
1. ensuring that the population of supernodes is diverse (not same OS, not same version of app)
2. protecting overloaded supernodes from new additions to the network
3. preventing new nodes from being added, which would increase signalling traffic between the supernodes
