Only some applications were affected, claims Microsoft

Apr 3, 2009 08:39 GMT  ·  By

Microsoft confirmed that a bug affecting its Windows Azure operating system rendered some applications running in the Cloud inaccessible between 1:15pm PDT and 9:00pm today. The Redmond company identified the issue and already resolved the problem at the time of this article. Windows Azure testers should be able to see that full access and functionality have been restored to all their Windows Azure applications.

“Today a number of applications running in Windows Azure were unreachable for a period of time. The earliest occurrence took place at 1:15pm PDT today, our team restored affected applications throughout the day, and all applications were accessible by 9:00pm PDT. The storage service was unaffected,” revealed a member of the Windows Azure team.

During the debugging process Microsoft temporarily disabled the Fabric Controller, the monitoring and recovery system designed to watch over Windows Azure in order to make sure that nothing would go wrong. Because of the preventative step of shutting down the Controller, testers lost the ability to view application status or to perform any sort of management operations via the web portal. And in fact it was the Fabric Controller at fault for the inaccessible applications.

“A number of servers stopped responding to the Fabric Controller. The Fabric Controller immediately marked those servers as malfunctioning, stopped routing traffic to them, and started migrating affected application instances to new servers. A bug in the Fabric Controller prevented it from moving those instances as quickly as it was designed to do. An application with all its instances on affected servers became unreachable during this period,” the Windows Azure team representative added.

Microsoft revealed that the Fabric Controller was set up to check up with the servers and issue status requests on a regular basis. The move came as the system needed servers and applications to communicate both their status and health. In the eventuality of unresponsive Windows Azure server and Cloud apps, the Fabric Controller would react with a restore process. In mid-March 2009, Windows Azure suffered a major malfunction, also cutting testers from their applications.

“In addition to fixing the particular bug mentioned above, we’re continuing our ongoing work to improve our detection and response algorithms in the Fabric Controller, including giving the Fabric Controller a more granular view of the health of applications,” the Azure team member promised.