Human error can sometimes cause a domino effect

Mar 2, 2017 22:21 GMT

Countless web services, sites and apps went down the other day due to a problem with Amazon Web Services. What caused all this commotion? A typo.

We wish we were kidding, but we're not. According to a detailed report Amazon released today, an employee entered what they believed to be a routine command to remove servers from an S3 subsystem. They mistyped one of the command's inputs, and far more servers were removed than intended. The servers taken out supported two other S3 subsystems that manage storage and metadata for the entire region. It went downhill from there.

"An authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region," the report explains.

Have you tried turning it off and on again?

Fixing this error should have been as simple as restarting the subsystems, but AWS admits that it had not fully restarted them in years. Given the considerable size S3 has reached since then, the restart took longer than expected.

The company wants to be prepared for this type of situation in the future, and to prevent it if possible. It has therefore announced that it will implement safeguards that have been in the works for some time. They have now become a priority for AWS, which is understandable.
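
To make the idea of such a safeguard concrete, here is a minimal sketch of a pre-flight check that refuses a removal command when the input looks wrong. It only illustrates the general technique and is not AWS's tooling: the function name, the parameters and the limits are all hypothetical.

```python
class UnsafeRemovalError(Exception):
    """Raised when a removal request fails the safety checks."""


def check_removal(active_servers, requested, min_required, max_per_command):
    """Return how many servers remain after the removal, or raise if unsafe.

    Every name and limit here is a hypothetical illustration.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")

    # Guard 1: cap how many servers a single command may remove,
    # so a mistyped number cannot take out a large part of the fleet.
    if requested > max_per_command:
        raise UnsafeRemovalError(
            f"refusing to remove {requested} servers in one command "
            f"(limit is {max_per_command})"
        )

    # Guard 2: never let the subsystem drop below its minimum capacity.
    remaining = active_servers - requested
    if remaining < min_required:
        raise UnsafeRemovalError(
            f"removal would leave {remaining} servers, "
            f"below the required minimum of {min_required}"
        )
    return remaining


# A small, routine removal passes both checks.
print(check_removal(active_servers=100, requested=5,
                    min_required=80, max_per_command=10))
```

With these made-up numbers, a request for 5 servers goes through, while a mistyped request for 50 is rejected before anything is removed.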

"Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further," the company concludes.