The service's management system went down due to a "leap day" software bug

Mar 12, 2012 15:12 GMT

Windows Azure’s management portal went down for several hours on February 29th, the very day Microsoft was launching the Windows 8 Consumer Preview at the Mobile World Congress in Barcelona, Spain.

At the time, the company kept customers informed of how the restoration of the service was advancing via its support dashboard, and also offered details on the matter in a post on the Windows Azure blog.

In that post, the company confirmed that the issue had been resolved and gave a preliminary explanation of what caused the outage: a software bug.

“The issue was quickly triaged and it was determined to be caused by a software bug. While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year,” Bill Laing, corporate VP server and cloud, Microsoft, stated.

Many customers in various regions around the world experienced issues with the service even after Microsoft found the root cause for the outage and started to solve the problem.

In a recent post on the aforementioned Windows Azure blog, Bill Laing delivered full details on why the service went down on February 29th.

“We know that many of our customers were impacted by this event and we want to be transparent about what happened, what issues we found, how we plan to address these issues, and how we are learning from the incident to prevent a similar occurrence in the future,” he notes.

“Again, we sincerely apologize for the disruption, downtime and inconvenience this incident has caused. We will be proactively issuing a service credit to our impacted customers as explained below. Rest assured that we are already hard at work using our learnings to improve Windows Azure.”

He explains that the issue was triggered at 00:00 UTC on February 29th, and that it was caused by a leap day bug.

When creating a certificate, the guest agent uses the current day as the valid-from date and adds one year to calculate the valid-to date. On February 29th, 2012, that calculation produced a valid-to date of February 29th, 2013, a date that does not exist, which caused the certificate creation to fail and the system to crash.
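The date arithmetic described above can be illustrated with a short Python sketch. This is not Microsoft's code: the function names `naive_valid_to` and `safe_valid_to` are hypothetical, and the fallback to February 28th is just one possible way to handle the case, assumed here for illustration.

```python
from datetime import date


def naive_valid_to(valid_from: date) -> date:
    """Add one year by bumping only the year field (the pattern described above)."""
    return valid_from.replace(year=valid_from.year + 1)


def safe_valid_to(valid_from: date) -> date:
    """Add one year, falling back to February 28 when the target day does not exist."""
    try:
        return valid_from.replace(year=valid_from.year + 1)
    except ValueError:
        # For ordinary years, only February 29 can fail here,
        # because the year after a leap year is never a leap year.
        return valid_from.replace(year=valid_from.year + 1, day=28)


leap_day = date(2012, 2, 29)

# The naive version raises ValueError because February 29, 2013 is not a real date.
try:
    naive_valid_to(leap_day)
    naive_failed = False
except ValueError:
    naive_failed = True

print(naive_failed)             # True
print(safe_valid_to(leap_day))  # 2013-02-28
```

On any other day of the year both functions return the same result, which is why a bug of this kind can stay hidden until a leap day arrives.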

The bottom line is that Microsoft is set to spend more time analyzing the issue and that it is committed to making sure this won’t happen again. Those who would like full technical details on the issue and on Microsoft’s solution for it should head over to this blog post on MSDN.