Routine maintenance gone wrong

Oct 14, 2009 10:15 GMT

The Internet in Sweden broke down on Monday for at least one hour because of an error introduced during a routine maintenance update of the .se zone. Internet service providers had to manually flush the cache of their DNS servers in order to restore proper functionality.

Around 21:45 on Monday, Internet users around the world lost access to domain names ending in .se, the country code top-level domain for Sweden. The .se registry counts almost 905,000 domain names and is operated by the Internet Infrastructure Foundation (.SE), which was the first TLD maintainer in the world to offer DNSSEC services.

Pingdom, a Sweden-based company that provides website performance monitoring, reports on its blog that the error came down to the update script failing to add a single dot character. "We have spoken to a number of industry insiders and what happened is that when updating the data, the script did not add a terminating '.' to the DNS records in the .se zone. That trailing dot is necessary in the settings for DNS to understand that '.se' is the top-level domain. It is a seemingly small detail, but without it, the whole DNS lookup chain broke down," the company explains.
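To illustrate why the missing dot matters, here is a minimal Python sketch of the general zone-file rule that a name without a trailing dot is treated as relative and has the zone origin appended to it. The function name qualify and the host name ns1.example.se are hypothetical and chosen only for illustration; the article does not say which records were affected or how the registry's script works.

```python
def qualify(name: str, origin: str) -> str:
    """Expand a zone-file name to an absolute name.

    Illustrative only: follows the general master-file convention that
    names ending in '.' are absolute and all others are relative to
    the zone origin.
    """
    if name == "@":
        # '@' is shorthand for the origin itself.
        return origin
    if name.endswith("."):
        # Already absolute: used exactly as written.
        return name
    if origin == ".":
        # Relative to the root: just terminate the name.
        return name + "."
    # Relative name: the origin is appended.
    return f"{name}.{origin}"


ORIGIN = "se."  # assumed origin for the .se zone

with_dot = qualify("ns1.example.se.", ORIGIN)
without_dot = qualify("ns1.example.se", ORIGIN)

print(with_dot)     # ns1.example.se.
print(without_dot)  # ns1.example.se.se.
```

Under this rule, a record written without the trailing dot would point at a name under .se.se, which does not exist, so lookups that depend on it would fail.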

The problem was theoretically resolved in about an hour, as the Internet Infrastructure Foundation rolled out a new zone file. However, in practice, once a DNS change occurs, it can take up to 24 hours to propagate across the entire Internet. Most DNS servers do not update their records in real time, but at a predefined interval that may differ from setup to setup.

For example, let’s suppose that DNS Server 1 recorded the original error 20 minutes after it happened. Then DNS Server 2, which updates its records from DNS Server 1, cached the error another 30 minutes later. Finally, DNS Server 3 queried DNS Server 2 for updates an hour after that. This means that DNS Server 3 recorded the error one hour and 50 minutes after it occurred, which is 50 minutes after it had already been fixed at the source.
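The arithmetic in this example can be checked with a short Python sketch. The delays are the ones given above, and the fix time of 60 minutes reflects the "about an hour" figure; the variable names are illustrative, and the model ignores real-world details such as record lifetimes.

```python
from itertools import accumulate

# Delay, in minutes, before each server in the chain picks up the bad
# data from the one before it (Server 1 reads from the source).
delays_min = [20, 30, 60]

# Assumed time, in minutes after the error, at which the source was fixed.
FIX_AT_MIN = 60

# Running total: minutes after the error at which each server caches it.
pickup_times = list(accumulate(delays_min))

for number, minutes in enumerate(pickup_times, start=1):
    if minutes > FIX_AT_MIN:
        status = f"{minutes - FIX_AT_MIN} min after the fix"
    else:
        status = f"{FIX_AT_MIN - minutes} min before the fix"
    print(f"DNS Server {number}: caches the error at {minutes} min ({status})")
```

With these numbers, DNS Server 3 caches the error at 110 minutes, that is, 50 minutes after the corrected zone file was already available at the source.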

Users at the end of a longer chain of servers will both experience the error later and receive the fix later. Therefore, by the time the problem is resolved for users of large providers, the customers of smaller ISPs that are further down the line could just be starting to experience it.

The initial zone file fix, deployed within one hour, intentionally lacked correct DNSSEC signatures, which allowed it to propagate faster. According to the Internet Infrastructure Foundation, a properly signed zone file was distributed by 1:00 am. ISPs were strongly advised to manually flush the cache on their DNS servers.