A routing capacity problem took the service down for almost two hours

Sep 2, 2009 06:58 GMT

Google runs some of the most reliable services online, but even those aren't perfect and, when they fail, everybody knows. A Gmail outage that lasted about 100 minutes was the talk of the Internet yesterday, affecting millions of users. Google engineers eventually got the service back up, and everything is back to normal now. As is usually the case, the problem was caused by several issues that “worked” together, overloading some servers in a cascade that brought the whole service down.

“Gmail's web interface had a widespread outage earlier today, lasting about 100 minutes. We know how many people rely on Gmail for personal and professional communications, and we take it very seriously when there's a problem with the service,” Ben Treynor, VP of engineering and site reliability czar, wrote. “Thus, right up front, I'd like to apologize to all of you — today's outage was a Big Deal, and we're treating it as such. We've already thoroughly investigated what happened, and we're currently compiling a list of things we intend to fix or improve as a result of the investigation.”

The outage started in a rather innocuous way, when a small number of Gmail servers were taken offline for maintenance. This happens regularly at Google, and at any other large company for that matter, and users are normally unaware of it. However, some recent changes made to the system sent an unexpectedly large amount of traffic to the request routers, the servers that direct incoming requests to the appropriate Gmail server. This created a cascading effect: once a router became overwhelmed, its traffic was diverted to the other routers, which, in turn, became overloaded as well.
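The cascade described above can be illustrated with a toy model. This is only a sketch under simple assumptions and not a description of Google's actual routing system: the function name `simulate_cascade`, the router capacities, and the load figures are all hypothetical. Traffic is assumed to be split evenly among the routers that are still healthy, and a router drops out as soon as its share exceeds its capacity.

```python
def simulate_cascade(capacities, total_load):
    """Toy model of a cascading overload (hypothetical, for illustration).

    capacities: maximum load each router can handle.
    total_load: total traffic, split evenly among healthy routers.
    Returns (rounds, survivors): the routers that drop out in each round,
    and the routers still healthy at the end.
    """
    healthy = set(range(len(capacities)))
    rounds = []
    while healthy:
        share = total_load / len(healthy)
        # A router fails when its even share of traffic exceeds its capacity.
        overloaded = sorted(i for i in healthy if capacities[i] < share)
        if not overloaded:
            break
        rounds.append(overloaded)
        # The failed routers' traffic falls on the remaining ones next round.
        healthy -= set(overloaded)
    return rounds, sorted(healthy)


capacities = [100, 110, 120, 130, 140]  # made-up numbers

# Slightly too much traffic: one router fails, then the rest follow.
print(simulate_cascade(capacities, 520))  # ([[0], [1, 2], [3, 4]], [])

# A little less traffic and nothing fails at all.
print(simulate_cascade(capacities, 480))  # ([], [0, 1, 2, 3, 4])

# Same heavy traffic, but with three extra routers brought online.
print(simulate_cascade(capacities + [140, 140, 140], 520))
```

In this model, a small excess of traffic is enough to take out every router in three rounds, while adding spare routers keeps each one's share well under its limit.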

In a matter of minutes, all of the capacity allocated to Gmail was used up and users couldn't access the web client. They could still access their mail through IMAP/POP, as this traffic doesn't go through the same routers as web requests. Google engineers handled the problem by bringing additional request routers online from Google's massive infrastructure, and Gmail eventually became accessible again. Google says it has learned a lot from this outage and plans to increase routing capacity and to change the way it handles requests and high loads. Even with the outage, Gmail remains one of the most reliable services, at least in the consumer sector, with 99.9 percent availability.
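To put the 99.9 percent figure next to a 100-minute outage, here is a quick back-of-the-envelope calculation. The article does not say over what period the availability is measured, so the 30-day and 365-day windows below are assumptions, and `downtime_budget_minutes` is a hypothetical helper written for this example.

```python
def downtime_budget_minutes(availability, period_days):
    """Minutes of downtime allowed at a given availability over a period."""
    return (1 - availability) * period_days * 24 * 60


OUTAGE_MINUTES = 100  # approximate length of the outage, per the article

# The measurement windows are assumptions, not figures from Google.
for days in (30, 365):
    budget = downtime_budget_minutes(0.999, days)
    verdict = "within" if OUTAGE_MINUTES <= budget else "over"
    print(f"{days} days: {round(budget, 1)} minutes allowed, outage is {verdict} budget")
```

Under these assumptions, 99.9 percent availability allows about 43 minutes of downtime in a 30-day month and about 526 minutes in a year, so a single 100-minute outage exceeds a monthly budget but fits comfortably within a yearly one.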