Apr 29, 2011 14:50 GMT
Amazon promises to improve its operations to prevent similar outages in the future

As promised, Amazon has now provided a detailed account of the events that led to the massive Web Services outage and how it unfolded. The company has also apologized to the customers affected by the issues.

Criticized for its initial lack of communication, the team promised to be more transparent and to communicate more often during outages and other problems.

The issues started shortly after midnight, Pacific Daylight Time, on April 21 and were triggered by a planned upgrade of Amazon's network capacity for one Availability Zone in the US East Region.

A configuration error caused some EBS nodes to become isolated from the network and from their replicas. This prompted the nodes to search for space to "re-mirror" their data.

Since many nodes were affected, there was not enough storage space available to serve them all, leaving them "stuck" in a loop. This sequence of events cascaded and, combined with other issues, led to the outage and, in some instances, to data loss.
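The capacity shortfall described above can be illustrated with a toy model. This is only a sketch under simplified assumptions, not Amazon's actual re-mirroring logic: the function name, the one-slot-per-volume rule, and all the numbers are hypothetical and chosen purely for illustration.

```python
def remirror(stranded_volumes, free_slots, max_rounds=5):
    """Toy model (hypothetical): each stranded volume needs one free slot.

    Returns the number of volumes still stuck and the total number of
    search requests issued across all rounds.
    """
    stuck = stranded_volumes
    search_requests = 0
    for _ in range(max_rounds):
        if stuck == 0:
            break
        # Every volume without a replica searches the cluster again.
        search_requests += stuck
        # Only as many volumes as there are free slots can re-mirror.
        placed = min(stuck, free_slots)
        stuck -= placed
        free_slots -= placed
    return stuck, search_requests


# Few volumes affected: everything re-mirrors in a single round.
print(remirror(stranded_volumes=10, free_slots=100))

# Many volumes affected: capacity runs out and the rest keep retrying.
print(remirror(stranded_volumes=100, free_slots=10))
```

In the second call the free space is used up in the first round, so the remaining volumes repeat their search every round without making progress, which is the "stuck" loop in miniature.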

A detailed timeline of the events is available in the post-mortem Amazon published.

"The trigger for this event was a network configuration change. We will audit our change process and increase the automation to prevent this mistake from happening in the future. However, we focus on building software and services to survive failures," Amazon said, promising changes to prevent this type of event from happening again.

Amazon also said it plans to communicate more with its customers during a crisis, something it failed to do initially.

"In addition to the technical insights and improvements that will result from this event, we also identified improvements that need to be made in our customer communications. We would like our communications to be more frequent and contain more information," Amazon said.

"Last, but certainly not least, we want to apologize. We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services," Amazon said in its apology.