Affecting the deployment service

Mar 18, 2009 10:31 GMT

Over the past weekend, Microsoft's cloud operating system suffered an extensive outage lasting approximately 22 hours. Microsoft acknowledged the malfunction, which lasted from Friday until Saturday and could have affected all Windows Azure Community Technology Preview (CTP) participants. In practice, some CTP testers experienced degraded service and even downtime. The Redmond company confirmed the Windows Azure blackout and explained its source.

“During a routine operating system upgrade on Friday (March 13th), the deployment service within Windows Azure began to slow down due to networking issues. This caused a large number of servers to time out and fail,” revealed a member of the Windows Azure team.

The monitoring system in place alerted the Windows Azure team to the server failures. At the same time, an automatic recovery system designed to bring crashed apps back online began its work. “The Fabric Controller automatically initiated steps to recover affected applications by moving them to different servers. The Fabric Controller is designed to be very cautious about taking broad recovery steps, so it began recovery a few applications at a time. Because this serial process was taking much too long, we decided to pursue a parallel update process, which successfully restored all applications,” the Windows Azure team representative added.
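The difference between recovering a few applications at a time and recovering them in parallel can be illustrated with a minimal Python sketch. It is not the Fabric Controller's actual logic: the names `recover`, `recover_in_batches` and `recover_in_parallel`, the batch size and the worker count are all hypothetical and chosen purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def recover(app):
    """Hypothetical stand-in for moving one application to a healthy server."""
    return f"{app}: restored"


def recover_in_batches(apps, batch_size=2):
    """Cautious mode: handle a few applications at a time, one batch after another."""
    results = []
    for start in range(0, len(apps), batch_size):
        batch = apps[start:start + batch_size]
        # Each application in the batch is recovered before the next batch starts.
        results.extend(recover(app) for app in batch)
    return results


def recover_in_parallel(apps, max_workers=8):
    """Broad mode: hand applications to a pool of workers that run concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input list.
        return list(pool.map(recover, apps))


if __name__ == "__main__":
    affected = ["app-a", "app-b", "app-c", "app-d", "app-e"]
    print(recover_in_batches(affected))
    print(recover_in_parallel(affected))
```

Both functions restore the same set of applications. The batched version limits how much changes at once, which is safer but slower when many servers fail together; the parallel version trades that caution for speed.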

Microsoft explained that applications running only a single instance were taken down by the outage when their server failed. The blackout also affected apps running multiple instances, but in those cases the problems were less severe: mainly degraded service rather than downtime.

“In addition, the ability to perform management tasks from the web portal appeared unavailable for many applications due to the Fabric Controller being backed up with work during the serialized recovery process,” the Windows Azure team member added.

Microsoft promised to tackle the problems head on, starting with the network issues that caused the blackout in the first place. At the same time, the Windows Azure recovery algorithm will be improved so that it can handle similar problems in the future on its own.

“For continued availability during upgrades, we recommend that application owners deploy their application with multiple instances of each role. We'll make two the default in our project templates and samples. We will not count the second instance against quota limits, so CTP participants can feel comfortable running two instances of each application role,” the Azure team member said.
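Why a second instance helps can be shown with a small Python sketch. It assumes a simplified model in which an application is down only when every one of its instances sits on a failed server; the function `app_status` and the server names are hypothetical and are not part of any Windows Azure API.

```python
def app_status(instance_servers, failed_servers):
    """Classify an application by how many of its instances are still on healthy servers."""
    healthy = [server for server in instance_servers if server not in failed_servers]
    if not healthy:
        return "down"       # no instance left to serve requests
    if len(healthy) < len(instance_servers):
        return "degraded"   # some capacity lost, but the application still responds
    return "healthy"


if __name__ == "__main__":
    failed = {"server-1"}
    # A single instance on the failed server takes the whole application down.
    print(app_status(["server-1"], failed))
    # A second instance on another server keeps the application reachable.
    print(app_status(["server-1", "server-2"], failed))
```

Under this model, a single-instance application on a failed server is reported as down, while the same application with a second instance elsewhere is only degraded.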