Dec 29, 2010 15:02 GMT

Last week, as many of you may be aware, Skype suffered a major outage lasting up to a day. While it wasn't a total shutdown of the service, many users had problems connecting. Skype has now provided a detailed explanation of why this happened and what it is doing to prevent it from happening again.

As is often the case, the massive outage was due to a combination of problems. The way the system is set up made it prone to a domino effect, where the initial issue led to another, which led to yet another, with few ways of stopping the progression.

The initial cause was an overload of some servers responsible for offline messages. Because of the overload, messages sent to the clients were delayed.

For most client versions, this wasn't a problem, but a bug in Skype for Windows 5.0.0.152 led to crashes when the delayed messages were received.

The bug only affected this particular version of the Skype client. Skype for other platforms and the older Skype 4.0 for Windows didn't have this problem, and neither did the newer Skype 5.0.0.156, to which few users had upgraded.

As it happens, Skype 5.0.0.152 was the most popular client at the time, with about 50 percent of users running it. Of those, 40 percent were affected by the bug, which works out to about 20 percent of all Skype users.
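The 20 percent figure follows from multiplying the two shares quoted above. A minimal sketch of that arithmetic, where the function and variable names are illustrative and not taken from Skype's explanation:

```python
def affected_share(version_share, crash_rate):
    """Fraction of all users affected, given the share of users on the
    buggy version and the fraction of those users who hit the bug."""
    return version_share * crash_rate

# Figures quoted in the article: about 50 percent of users ran
# 5.0.0.152, and 40 percent of those were affected.
share = affected_share(version_share=0.50, crash_rate=0.40)
print(f"{share:.0%} of all Skype users affected")
```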

This wouldn't have been such a big issue if it weren't for Skype's peer-to-peer infrastructure. Skype uses p2p for communications, meaning that clients actually 'talk' directly to each other without using a server. This method enables Skype to offer free VoIP services, but it is also a weakness.

Even though the system is p2p, some clients act as "supernodes," which coordinate the actions of hundreds of clients. While Skype shut down the overloaded servers, effectively removing the cause of the crash, 25 to 30 percent of supernodes were taken down by the crashes affecting the Windows clients.

As a result, the remaining supernodes had to handle all of the communications. This was made worse by the fact that users kept restarting the crashing clients, which led to a huge increase in traffic.

Skype says that traffic to the remaining supernodes was about 100 times higher than normal. Supernodes have a built-in mechanism that shuts them down to keep them from taking up too many resources on the client machine.

With the load increase, more supernodes shut down, leading to even more strain on the remaining ones. Eventually, most of the supernodes became unavailable, effectively preventing users from connecting.
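The domino effect described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Skype's actual mechanism: the capacities, the fixed total load, and the rule that load is split evenly among live supernodes are all hypothetical. A node shuts itself down when its share of the load exceeds its capacity, and the survivors then have to absorb the difference.

```python
def simulate_cascade(capacities, total_load):
    """Return the number of live supernodes at each round.

    Assumed model: the load is split evenly among live nodes, and any
    node whose share exceeds its capacity shuts itself down.
    """
    alive = list(capacities)
    history = [len(alive)]
    while alive:
        per_node = total_load / len(alive)
        survivors = [c for c in alive if c >= per_node]
        if len(survivors) == len(alive):
            break  # every node can carry its share; the system is stable
        alive = survivors
        history.append(len(alive))
    return history

# 100 hypothetical supernodes with capacities from 1.20 to 2.19 units.
capacities = [1.2 + 0.01 * i for i in range(100)]

# Full pool: each node carries 1.0 unit, below every capacity.
healthy = simulate_cascade(capacities, total_load=100.0)

# Same load after 30 nodes have been knocked out by client crashes.
degraded = simulate_cascade(capacities[:70], total_load=100.0)

print("healthy pool: ", healthy)
print("degraded pool:", degraded)
```

In this model the full pool stays stable, while the pool that starts 30 percent short sheds nodes round after round until none are left. The model leaves out the extra traffic from users restarting their clients, which by Skype's account made the real situation considerably worse.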

Skype intervened by artificially adding more supernodes to the system, which it dubbed "mega-supernodes." It did this by diverting resources from the group video call feature. Skype continued to add these mega-supernodes until the system started to restore itself. The outage lasted for about 24 hours.

Skype says that it is now working on several ways of preventing this from happening in the future. One option under consideration is automatic updates for minor version patches.

Skype is also looking at ways it can detect problems and act on them sooner. The company says that the testing procedures will get an overhaul as well.

Skype has become indispensable to millions of users and, while the basic service is offered for free, people have come to expect that the services they use, particularly one as important as communications, work regardless of what they are paying for them.