Skype, the Internet telephony service, went on the blink last week, stranding millions who rely on it for their communication needs. It took more than a day for the service to be restored. In conversation, CEO Tony Bates told me that the problem might lie with some errant Windows clients. Well, make that many errant Windows clients! Today Skype’s Chief Information Officer, Lars Rabbe, offers more details in a blog post.
In a nutshell, Skype says it was a bug in its Windows client software that led to the overloading of certain supernodes, which crashed and thus caused a chain reaction of problems.
On Wednesday, December 22, a cluster of support servers responsible for offline instant messaging became overloaded. As a result of this overload, some Skype clients received delayed responses from the overloaded servers. Because of a bug identified in a version of the Skype for Windows client (version 5.0.0.152), the delayed responses from the overloaded servers were not properly processed, causing Windows clients running the affected version to crash.
Around 50 percent of all Skype users globally were running the affected 5.0.0.152 version of Skype for Windows, and the crashes caused approximately 40 percent of those clients to fail. These clients included 25–30 percent of the publicly available supernodes, which also failed as a result of this problem.
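As a rough back-of-the-envelope check on those figures, the short sketch below multiplies the two percentages together. It assumes that the 40 percent failure rate applies only to clients on the affected version and treats Skype's rounded numbers as exact; the variable names are illustrative.

```python
# Figures quoted by Skype; treating them as exact point values is an assumption.
share_on_affected_version = 0.50   # "around 50 percent" of all users
crash_rate_on_that_version = 0.40  # "approximately 40 percent" of those clients

# Assumption: the 40 percent applies only to clients on the affected version.
share_of_all_clients_crashed = share_on_affected_version * crash_rate_on_that_version

print(f"Roughly {share_of_all_clients_crashed:.0%} of all Skype clients crashed")
```

On those assumptions, about one in five Skype clients worldwide went down at roughly the same time.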
I wonder if some of these problems were brought on by recently introduced aggressive “forced updates” which have not gone down well with some users. Voxeo CEO Jonathan Taylor offered up the theory that buggy software that was pushed on to Windows users was to blame.
If you had the latest Skype for Windows (version 5.0.0.156), older versions of Skype for Windows (4.0 versions), Skype for Mac, Skype for iPhone, Skype on your TV, or Skype Connect or Skype Manager for enterprises, you were not initially affected by this problem. However, with 25–30 percent of Skype’s supernodes going down, it quickly became a network-wide problem.
A supernode is important to the P2P network because it takes on additional responsibilities compared to regular nodes. It acts like a directory, supporting other Skype clients and establishing connections between them by creating local clusters of several hundred peer nodes per supernode.
Once a supernode has failed, even when restarted, it takes some time to become available as a resource to the P2P network again. As a result, the P2P network was left with 25–30 percent fewer supernodes than normal. This caused a disproportionate load on the remaining available supernodes. A significant proportion of users were also restarting crashed Windows clients at this time. This massively increased the load as they reconnected to the peer-to-peer cloud.
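To get a feel for what "disproportionate load" means here, the sketch below estimates how much extra work each surviving supernode had to absorb. It is a simplification, not Skype's own model: it assumes total demand stays constant and is spread evenly across the remaining supernodes, and the function name `load_multiplier` is made up for illustration.

```python
def load_multiplier(failed_fraction: float) -> float:
    """Load per surviving supernode, relative to normal.

    Assumes total demand is unchanged and shared evenly among survivors.
    """
    if not 0.0 <= failed_fraction < 1.0:
        raise ValueError("failed_fraction must be in [0, 1)")
    return 1.0 / (1.0 - failed_fraction)


# The 25-30 percent range is the share of supernodes Skype says it lost.
for failed in (0.25, 0.30):
    print(f"{failed:.0%} of supernodes lost -> "
          f"{load_multiplier(failed):.2f}x load per surviving supernode")
```

Even under these generous assumptions, each remaining supernode carries roughly a third to two-fifths more load than usual. The real figure would have been higher, since restarted clients were reconnecting at the same time.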
To deal with the problem, Skype essentially introduced “thousands of instances” of the Skype software into its P2P network and created temporary supernodes. The biggest lessons learned from this, Rabbe writes, are:
- More investments in their infrastructure so that the system becomes and stays reliable.
- More rigorous testing procedures that don’t let buggy software out into the market.

This is not the first time Skype’s systems have come under pressure because of software bugs. In August 2007, Skype had software problems as well, which in turn caused a flood of log-in requests and crashed the network.