Sep 24, 2010 07:28 GMT

As many of you undoubtedly noticed, Facebook was down for quite a long time yesterday. The site was unreachable for the better part of two and a half hours.

To make matters worse, the API was down as well, meaning that all the "Like" buttons you've been seeing around the web weren't working either. Facebook has now issued an explanation for the outage and has apologized for the problems it may have caused.

"Early today Facebook was down or unreachable for many of you for approximately 2.5 hours. This is the worst outage we’ve had in over four years, and we wanted to first of all apologize for it," Robert Johnson, Facebook's Director of Software Engineering, wrote.

"We also wanted to provide much more technical detail on what happened and share one big lesson learned. The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed," he explained.

With 500 million users worldwide and a few hundred million checking in every day, Facebook's outage hardly went unnoticed. People took to every outlet they found to complain, from blogs to 4chan.

The automated system that was at the heart of the issue was designed to keep the local cache in sync with the persistent storage. It looked for configuration keys that were invalid in the cache and updated them with the correct values.
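
To make the idea concrete, here is a minimal sketch of that kind of check in Python. It is only an illustration: Facebook has not published its code, so the `sync_key` function, the dictionary-based cache and store, and the `is_valid` rule are all hypothetical stand-ins.

```python
from typing import Callable, Dict

# Hypothetical stand-ins: the real cache and persistent store are not public.
Cache = Dict[str, str]
PersistentStore = Dict[str, str]


def sync_key(key: str, cache: Cache, store: PersistentStore,
             is_valid: Callable[[str], bool]) -> str:
    """Return the cached value for key, repairing it from the store if invalid."""
    value = cache.get(key)
    if value is None or not is_valid(value):
        # Missing or invalid in the cache: fetch the persistent copy
        # and write it back so later reads hit the cache.
        value = store[key]
        cache[key] = value
    return value


# Assumed example data: a numeric setting whose cached copy is invalid.
store = {"max_connections": "100"}
cache = {"max_connections": "-1"}
result = sync_key("max_connections", cache, store, lambda v: v.isdigit())
print(result)  # 100
print(cache)   # {'max_connections': '100'}
```

Note that every repair costs one read from the persistent store, which is harmless as long as only a few clients need a repair at any one time.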

An update to the persistent storage copy of a configuration value was seen as invalid by the system, which then attempted to fix it. As a result, every client in Facebook's data centers, and there are a lot of them, queried the main database for the correct value.

The surge in traffic brought the database cluster to its knees, which made the problem worse, since the clients interpreted connection errors as invalid values and queried the database again.
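
A toy simulation shows why such a loop does not die down on its own. This is a simplified model, not Facebook's system: the `simulate` function, the client count, and the database capacity are all made-up assumptions chosen for illustration.

```python
from typing import List


def simulate(clients: int, db_capacity: int, rounds: int,
             value_is_valid: bool) -> List[int]:
    """Return the number of database queries issued in each round.

    Hypothetical model: every client that considers its value invalid
    queries the database once per round. The database answers at most
    db_capacity queries per round; the rest fail with a connection error.
    """
    pending = clients  # every client starts by querying for the value
    history = []
    for _ in range(rounds):
        history.append(pending)
        served = min(pending, db_capacity)
        failed = pending - served
        # Clients that got an answer query again only if the stored
        # value itself is rejected as invalid.
        retry_after_answer = 0 if value_is_valid else served
        # The flaw: a connection error is treated like an invalid value,
        # so every failed client queries again as well.
        pending = retry_after_answer + failed
    return history


print(simulate(1000, 100, 5, value_is_valid=True))   # [1000, 900, 800, 700, 600]
print(simulate(1000, 100, 5, value_is_valid=False))  # [1000, 1000, 1000, 1000, 1000]
```

In this model, a burst of queries for a valid value drains a little each round, while a stored value that is rejected as invalid keeps every client querying indefinitely, whether the database answers or not.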

This feedback loop was impossible to break except by shutting down the whole site, which is what Facebook did. After this, the site slowly recovered and users were able to connect again.

The automated system that caused the outage has been disabled, Facebook says, as a better solution is being investigated.