Microsoft: Botched upgrade caused by DNS problem led to Windows Live outage

The Windows Live outage that took down Hotmail and SkyDrive on Sept. 8 was caused by a failed upgrade to a tool that balances network traffic, Microsoft has explained. The update went awry because of a corrupted file in Microsoft’s DNS service.

“A tool that helps balance network traffic was being updated and the update did not work correctly. As a result, configuration settings were corrupted, which caused a service disruption,” Windows Live test and service engineering VP Arthur de Haan wrote in a blog post Tuesday. “We determined the cause to be a corrupted file in Microsoft’s DNS service. The file corruption was a result of two rare conditions occurring at the same time. The first condition is related to how the load balancing devices in the DNS service respond to a malformed input string (i.e., the software was unable to parse an incorrectly constructed line in the configuration file). The second condition was related to how the configuration is synchronized across the DNS service to ensure all client requests return the same response regardless of the connection location of the client. Each of these conditions was tracked to the networking device firmware used in the Microsoft DNS service.”
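The first condition de Haan describes, a parser that cannot handle an incorrectly constructed line, is the kind of failure a validation pass can catch before a configuration is synchronized anywhere. The sketch below illustrates that general idea only; it is not Microsoft's tooling. The one-record-per-line format (`name ttl A address`) and the function names `parse_record`, `validate_config` and `safe_to_sync` are assumptions made for the example.

```python
import ipaddress


def parse_record(line):
    """Parse one hypothetical 'name ttl A address' line.

    Raises ValueError if the line is malformed in any way.
    """
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    name, ttl, rtype, address = fields
    if rtype != "A":
        raise ValueError(f"unsupported record type {rtype!r}")
    if not ttl.isdigit():
        raise ValueError(f"TTL is not a non-negative integer: {ttl!r}")
    # IPv4Address raises AddressValueError, a subclass of ValueError.
    ipaddress.IPv4Address(address)
    return (name, int(ttl), rtype, address)


def validate_config(text):
    """Return (records, errors) for a whole configuration file.

    Blank lines and lines starting with '#' are skipped. Each error is a
    (line_number, message) pair so the bad line can be located.
    """
    records, errors = [], []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        try:
            records.append(parse_record(stripped))
        except ValueError as exc:
            errors.append((lineno, str(exc)))
    return records, errors


def safe_to_sync(text):
    """Allow synchronization only if every line parsed cleanly."""
    _, errors = validate_config(text)
    return not errors
```

Rejecting the whole file when any line fails means one malformed entry blocks the update instead of being replicated to every location.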

DNS problems also took Office 365 offline on the same day, although de Haan’s blog post only discusses Windows Live. The Windows Live outage took more than an hour to resolve, “although it took some time for the changes to replicate around the world and reach all our customers,” he writes. To prevent future outages, Microsoft promised to implement better processes for monitoring, problem identification and recovery, as well as “further hardening [of] the DNS service to improve its overall redundancy and fail-over capability.”

“We are also developing an additional recovery process that will allow a specific property the ability to fail over to restore service and then fail back when the DNS service is restored,” de Haan writes. “In addition, we are reviewing the recovery tools to see if we can make more improvements that will decrease the time it takes to resolve outages. We are determined to deliver the very best possible service to our customers and regret any inconvenience caused by this outage.”
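Microsoft has not published how that recovery process will work. As a rough illustration of the fail-over and fail-back pattern in general, the sketch below tries a primary lookup first, switches to a fallback when the primary fails, and returns to the primary as soon as it answers again. The class name `FailoverResolver` and the convention that a failed lookup raises `LookupError` are assumptions made for this example.

```python
class FailoverResolver:
    """Try a primary lookup; fail over to a fallback, then fail back.

    `primary` and `fallback` are hypothetical callables that take a
    hostname and return an address, raising LookupError on failure.
    """

    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback
        self.using_fallback = False

    def resolve(self, name):
        try:
            answer = self.primary(name)
        except LookupError:
            # Primary is unavailable: fail over to restore service.
            self.using_fallback = True
            return self.fallback(name)
        # Primary answered, so fail back (or simply stay on it).
        self.using_fallback = False
        return answer
```

Because the primary is retried on every request, fail-back is automatic in this sketch; a production system would more likely gate it behind health checks to avoid flapping between the two paths.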

Channel Ars Technica