Azure suffered outage due to fire suppression system

Oct 4, 2017 11:59 GMT

Microsoft Azure customers in Northern Europe might be aware that a number of Microsoft services went down on September 29, but until now, no specifics on what exactly triggered the outage were available.

As it turns out, the fire suppression system in one of the data centers is to blame for the downtime experienced by Virtual Machines, Cloud Services, Azure Backup, App Services\Web Apps, Azure Cache, Azure Monitor, Azure Functions, Time Series Insights, Stream Analytics, HDInsight, Data Factory, Azure Scheduler, and Azure Site Recovery between 13:27 and 20:15 UTC.

Microsoft says that during periodic maintenance, the staff accidentally configured the fire suppression system to release inert agent in the data center. The system treated the wrong configuration as a potential fire incident and automatically shut down the Air Handler Units (AHU) to prevent oxygen from being distributed inside the room and to limit the damage from a possible fire.

Systems triggering auto-shutdown due to overheating

Microsoft says that although the staff restored the air conditioning quickly, the temperature in specific areas was still very high, which in turn caused systems to automatically shut down or reboot in order to prevent data loss and corruption.

The automatic response, however, caused some servers and storage resources to fail to shut down in a controlled manner, which increased the downtime and required Microsoft staff to manually troubleshoot and restore the machines. It took nearly 7 hours for engineers to bring all systems back online, Microsoft says.
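As a quick check on the "nearly 7 hours" figure, the reported outage window of 13:27 to 20:15 UTC on September 29 can be turned into a duration. This is only a minimal sketch: the helper name outage_duration is made up for illustration, and it assumes both timestamps fall on the same day, as the article indicates.

```python
from datetime import datetime, timezone

# Outage window as reported for September 29, 2017 (times in UTC).
start = datetime(2017, 9, 29, 13, 27, tzinfo=timezone.utc)
end = datetime(2017, 9, 29, 20, 15, tzinfo=timezone.utc)


def outage_duration(start, end):
    """Return the length of the window as (hours, minutes).

    Hypothetical helper for illustration only.
    """
    total_minutes = int((end - start).total_seconds() // 60)
    return divmod(total_minutes, 60)


hours, minutes = outage_duration(start, end)
print(f"{hours} h {minutes} min")  # prints "6 h 48 min"
```

At 6 hours and 48 minutes, the window is consistent with the "nearly 7 hours" that Microsoft cites.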

“We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): Suppression system maintenance analysis continues with facility engineers to identify the cause of the unexpected agent release, and to mitigate risk of recurrence,” Microsoft says.

“Engineering continues to investigate the failure conditions and recovery time improvements for storage resources in this scenario.”

The software giant says the investigation continues and more information on the incident will be published by October 13.