Softpedia
 


September 24th, 2010, 07:28 GMT

Facebook's Massive Outage Explained

Facebook was down for 2.5 hours

As many of you undoubtedly noticed, Facebook was down for a long stretch yesterday. The site was unreachable for the better part of two and a half hours.

To make matters worse, the API was down as well, meaning that all the "Like" buttons you've been seeing around the web weren't working either. Facebook has now issued an explanation for the outage and has apologized for the problems it may have caused.

"Early today Facebook was down or unreachable for many of you for approximately 2.5 hours. This is the worst outage we’ve had in over four years, and we wanted to first of all apologize for it," Robert Johnson, Facebook's Director of Software Engineering, wrote.

"We also wanted to provide much more technical detail on what happened and share one big lesson learned. The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed," he explained.

With 500 million users worldwide and a few hundred million checking in every day, Facebook's outage hardly went unnoticed. People took to every outlet they found to complain, from blogs to 4chan.

The automated system at the heart of the issue was designed to keep each client's local cache in sync with the persistent storage. It looked for configuration values that were invalid in the cache and replaced them with the correct values from the persistent store.

The trouble started when an update to the persistent storage copy of a configuration value was itself seen as invalid by the system, which then attempted to fix it. This meant that every client in Facebook's data centers, and there are a lot of them, queried the main database for the correct value at the same time.

The surge in traffic brought the database cluster to its knees, which made the problem worse: the clients interpreted connection errors as invalid values too, so every failed query led to yet another attempt.
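The error-handling flaw described above can be illustrated with a short Python sketch. It is not Facebook's code. The names `OverloadedStore`, `DatabaseUnavailable`, `is_valid`, `naive_refresh` and `safer_refresh` are hypothetical, the validity rule is invented for the example, and the sketch assumes that a client which sees an invalid value drops its cached copy and asks the store again on its next request.

```python
class DatabaseUnavailable(Exception):
    """Raised by the hypothetical store when it cannot answer a query."""


class OverloadedStore:
    """Stand-in for the persistent store: fails once too many queries arrive."""

    def __init__(self, value, capacity):
        self.value = value
        self.capacity = capacity
        self.queries = 0

    def read(self, key):
        self.queries += 1
        if self.queries > self.capacity:
            raise DatabaseUnavailable(key)
        return self.value


def is_valid(value):
    # Hypothetical validity rule; the real checks are not described in the article.
    return isinstance(value, int) and value > 0


def naive_refresh(cache, store, key):
    """Flawed pattern: a failed query is handled exactly like an invalid value."""
    try:
        value = store.read(key)
    except DatabaseUnavailable:
        value = None  # the error is collapsed into "invalid"
    if is_valid(value):
        cache[key] = value
    else:
        cache.pop(key, None)  # cached copy dropped, so the next request hits the store again
    return cache.get(key)


def safer_refresh(cache, store, key):
    """Alternative: on a store error, keep serving the cached copy."""
    try:
        value = store.read(key)
    except DatabaseUnavailable:
        return cache.get(key)  # possibly stale, but no extra load on the store
    if is_valid(value):
        cache[key] = value
    return cache.get(key)
```

With the naive version, an overloaded store causes clients to lose their cached values and generate still more queries. The second version treats "the store did not answer" as a different condition from "the store answered with a bad value", which is the distinction the article says was missing.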

This feedback cycle was impossible to break except by shutting down the whole site, which is what Facebook did. After this, the site slowly recovered and users were able to connect again.
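Why the loop could not break on its own can be seen in a deliberately crude model, sketched here in Python. The function `simulate` and its parameters are hypothetical, and the model assumes that whenever more clients query than the store can handle, every query in that round fails and all of those clients retry in the next round. The real system was certainly more complex.

```python
def simulate(clients, capacity, ticks, shutdown_at=None):
    """Return how many clients are still retrying after each tick.

    Hypothetical model: if the number of querying clients exceeds `capacity`,
    all queries in that tick fail and every client retries on the next tick;
    otherwise all queries succeed and nobody needs to retry.
    """
    retrying = clients
    history = []
    for tick in range(ticks):
        if shutdown_at is not None and tick == shutdown_at:
            retrying = 0  # site turned off: client traffic stops reaching the store
        if retrying <= capacity:
            retrying = 0  # store can keep up, so all pending queries succeed
        history.append(retrying)
    return history


if __name__ == "__main__":
    print(simulate(clients=1000, capacity=100, ticks=5))
    print(simulate(clients=1000, capacity=100, ticks=5, shutdown_at=2))
```

In this model the number of retrying clients never falls while the site stays up, because the load always exceeds what the store can serve. It drops to zero only once the traffic is cut off, which matches the article's account that the site had to be shut down before it could recover.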

Facebook says the automated system that caused the outage has been disabled while a better solution is investigated.
