Skype’s Heartbeat Blog has an explanation for the 30-hour outage that plagued the eBay-owned (EBAY) voice company last week. A quick overview:
1. Microsoft issued Windows updates on Thursday, Aug. 16th.
2. Millions installed those patches, rebooted, and tried to log into the Skype network — pretty much all at the same time.
3. Combined with a lack of P2P resources, the flood of log-in requests put the Skype network under extreme stress.
4. This, in turn, exposed an unseen software bug “within the network resource allocation algorithm which prevented the self-healing function from working quickly.”
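To make step 3 concrete, here is a rough sketch of why simultaneous log-ins hurt and how spreading them out helps. It is a toy simulation, not Skype's actual client logic: the function name, the client count and the ten-minute window are all made up for illustration, and nothing in Skype's statement says their software does or does not do this.

```python
import random
from collections import Counter


def peak_logins_per_second(num_clients, spread_seconds, seed=0):
    """Return the busiest one-second bucket of log-in attempts.

    Hypothetical model: every client comes back online at time zero and
    waits a random whole number of seconds in [0, spread_seconds) before
    trying to log in. A spread of 0 means nobody waits at all.
    """
    if spread_seconds <= 0:
        # Everyone hits the log-in servers in the same second.
        return num_clients
    rng = random.Random(seed)  # seeded so the run is repeatable
    buckets = Counter(rng.randrange(spread_seconds) for _ in range(num_clients))
    return max(buckets.values())


if __name__ == "__main__":
    clients = 10_000  # illustrative number, not a real Skype figure
    print("no delay:     ", peak_logins_per_second(clients, 0))
    print("10 min jitter:", peak_logins_per_second(clients, 600))
```

With no delay the peak equals the whole client population; with a random delay spread over ten minutes the peak drops to a small fraction of that. Whether something like this would have been enough here is a separate question, since Skype also blames a bug in its self-healing routine.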
OK, it sounds credible, but do you buy it? Skype Journal has some questions, namely whether the bug fix has been propagated. What, they ask, is preventing this from happening again? After all, Microsoft (MSFT) routinely issues patches. Brough Turner, chief technology officer of NMS Communications, alludes to this in his most recent post.
Experts have pointed out that Skype generates a lot of traffic between log-in servers and supernodes. Maybe the supernodes went down during the “patches” as well. Someone who seems to be familiar with the Skype network architecture left a comment earlier that explains the relationship between the 50-odd authentication servers and the supernodes, and also points to a weak link. (Full explanation is here.)
I was reading the first lines with “Collected Explanations, Courtesy of Skype” and laughing because I was thinking, “yeah, right.” So whether you are really buying into this is a good question. This instance here is not. Thanks for all the news here and the sound analysis.
A couple of comments:
Microsoft actually issues the patches on Patch Tuesday (late in the evening), the 2nd Tuesday of the month. So they went out late on Aug. 14. (Wednesday morning I found two of my WinXP PCs had been rebooted after the update.)
The Windows Update procedure, for those who have provided the appropriate permissions, automatically updates users’ PCs around the world; if necessary, as was the case here, the procedure ends with an automatic PC reboot. My experience is that the process takes about two days to reach everybody who has registered for automatic Windows updates (a procedure which one should follow for security reasons).
At some point late Wednesday or early Thursday the August update had reached too many PCs at once, and the Skype infrastructure could not cope with all of them trying to log back in.
While they have provided this information, a little more on what they have done to address the “lack of peer-to-peer network resources” mentioned in their statement would help PR-wise.
A word of ominous caution: the next Microsoft Patch Tuesday occurs on 9/11.
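For anyone who wants to check those dates, here is a quick sketch of the "2nd Tuesday of the month" rule. The helper name is my own, and the year 2007 is an assumption on my part, since nobody in the thread states it.

```python
import calendar
from datetime import date


def patch_tuesday(year, month):
    """Return the second Tuesday of the given month."""
    first_weekday = date(year, month, 1).weekday()  # Monday == 0
    # Days from the 1st of the month to the first Tuesday (0 if the 1st is one).
    days_until_tuesday = (calendar.TUESDAY - first_weekday) % 7
    return date(year, month, 1 + days_until_tuesday + 7)


if __name__ == "__main__":
    print(patch_tuesday(2007, 8))  # 2007-08-14
    print(patch_tuesday(2007, 9))  # 2007-09-11
```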
“within the network resource allocation algorithm which prevented the self-healing function from working quickly.”
wow, these guys sound like Bear Stearns explaining how their quant fund lost an f*** load of investor money.
Windows is right to upgrade its s/w, and in doing so it helped expose one of the scandalous secrets of Skype/eBay: they use users’ computers as network nodes.
Sorry, but I don’t get it. Granted I don’t know the numbers (so please correct me!).
How many of these millions of users are paying for Skype vs. free service?
Complaining because their free service was unavailable for 30 hrs… Sheesh.
Give it a rest…
So it’s Microsoft’s fault that everyone rebooted because of patches and tried to log back in? I guess, if you live in Cracksmokin’ville.
Capacity planning choked on this. This is a Skype issue, not a Microsoft issue. I have no love for Microsoft, but let’s solve the issue – buggy Skype code.
Everyone seems to be blaming Microsoft or George Bush on this one. I wish I knew why…
I guess they weren’t doing enough unit testing
Pure CYA.
This type of “blame someone else” runs through eBay corporate. I’m sure they’re “directing” (soft hand of course) the PR for Skype and “suggesting” how to deflect responsibility.
Pfft. Me bitter? Nah…
Skype accepted that final responsibility lay on their shoulders – I don’t think they are blame shifting.