Docs became unavailable to all users for about an hour

Sep 10, 2011 09:02 GMT

As some of you may have noticed, Google Docs went down completely earlier this week for about an hour, leaving users without access to their documents or the documents list.

Google fixed the problem fairly quickly, and it is now explaining what happened and what steps it is taking to reduce the chance of a repeat or, at least, to limit the collateral damage.

As usually happens, a small update that started rolling out to all users exposed a previously unknown and unrelated memory management bug.

The bug triggered a domino effect, as is often the case when cloud services go down, and caused the entire infrastructure to fail under heavy use.

"The outage was caused by a change designed to improve real time collaboration within the document list. Unfortunately this change exposed a memory management bug which was only evident under heavy usage," Alan Warren, engineering director at Google, wrote.

"Every time a Google Doc is modified, a machine looks up the servers that need to be updated," he explained.

"Due to the memory management bug, the lookup machines didn’t recycle their memory properly after each lookup, causing them to eventually run out of memory and restart," he said.

"While they restarted, their load was picked up by the remaining lookup machines - making them run out of memory even faster," he added.
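The feedback loop Warren describes can be illustrated with a rough toy simulation. This is only a sketch under invented assumptions, not Google's actual system: the function name, the number of machines, the memory capacity, the leak size and the restart time are all hypothetical values chosen to make the effect visible. Each healthy machine leaks a little memory per lookup, restarts when it runs out, and while it is down the remaining machines share its load.

```python
def simulate(num_machines=5, lookups_per_tick=120, leak_per_lookup=1.0,
             capacity=400.0, restart_ticks=3, ticks=40):
    """Toy model of leaking lookup machines (all parameters are hypothetical).

    Returns a list with the number of healthy machines at the start of each tick.
    """
    # Stagger starting memory so the machines do not all fail on the same tick.
    used = [i * 50.0 for i in range(num_machines)]
    back_at = [0] * num_machines  # tick at which each machine is serving again
    healthy_per_tick = []

    for tick in range(ticks):
        up = [i for i in range(num_machines) if back_at[i] <= tick]
        healthy_per_tick.append(len(up))
        if not up:
            continue  # total outage: every machine is restarting

        # The same total load is split across fewer machines as others restart.
        share = lookups_per_tick / len(up)
        for i in up:
            used[i] += share * leak_per_lookup  # memory is never recycled
            if used[i] >= capacity:
                used[i] = 0.0  # a restart clears the leaked memory
                back_at[i] = tick + 1 + restart_ticks

    return healthy_per_tick


if __name__ == "__main__":
    for tick, healthy in enumerate(simulate()):
        print(f"tick {tick:2d}: {healthy} healthy machine(s)")
```

Printing the per-tick count shows capacity dropping once the first machine restarts, because every restart raises the per-machine load on the survivors and so speeds up their own leak.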

Google noticed the increased failure rate quickly, only 60 seconds after it started accelerating, and the engineers correctly suspected that the increase was related to the code update.

They then rolled back the update and brought the system back to stability. The whole issue took about an hour to resolve.

Google says it has learned several lessons from the incident and will implement a number of changes that should make it harder for this type of bug to affect all of its Docs infrastructure. It also said it will provide more details once its investigation is complete.