Main Web Site Experiences Outages Throughout Week
By John A. Hawkinson
MIT’s primary Web site, http://web.mit.edu/, experienced a series of debilitating outages last week. The disruptions began on Sunday, Feb. 5, and web.mit.edu continued to be unavailable for several hours a day through Friday, when a temporary solution was put into place. A more permanent solution, removing web.mit.edu’s dependence on the servers causing the trouble, was installed last night.
Information Services and Technology (IS&T) is responsible for maintaining web.mit.edu, which is actually five separate Web servers operating behind a load balancer. Three of those servers handle normal traffic, and one is dedicated to serving traffic from the Google search engine indexer, according to Jeff I. Schiller ’79, who manages MIT’s network for IS&T. The fifth server provides several miscellaneous services, including taking some Google traffic.
According to Schiller, the root cause of the outage is unclear. MIT’s Web servers depend on the Andrew File System (AFS) to access the data they provide to Web browsers. IS&T traced the problem to a specialized group of AFS file servers, called the net.mit.edu AFS cell, that the web.mit.edu servers depend on but that the general community rarely uses. Repeated failures of those file servers brought the web.mit.edu servers to their knees; Schiller speculated that some machine accessing them was triggering a subtle bug.
The bulk of the Web data that is served by the web.mit.edu servers actually comes from a different group of AFS file servers, called the athena.mit.edu AFS cell; those servers were not affected by the failures last week. Until Monday, the net.mit.edu AFS cell stored the top-level web.mit.edu homepage, as well as ancillary administrative information used by the web servers. Despite the fact that only the net.mit.edu AFS servers were affected, the failures were sufficiently catastrophic to prevent the web.mit.edu Web servers from functioning at all.
To restore service to web.mit.edu as quickly as possible, while the true cause of the net.mit.edu failure was not understood, IS&T restricted access to the net.mit.edu AFS file servers to only a handful of machines (including the web.mit.edu servers), in an effort to prevent the bug from being triggered. Since very few users at MIT need to access the net.mit.edu AFS servers, this was considered an acceptable temporary solution.
Yesterday evening, at around 8:45 p.m., Schiller and Mark V. Silis, manager of network infrastructure and services for IS&T, worked to implement a longer-term solution. According to Schiller, the web.mit.edu servers were reconfigured to remove their dependence on the net.mit.edu AFS servers, and then those AFS servers were upgraded to a more recent software version. Previously they had been running six-year-old software, whereas the athena.mit.edu servers, which did not have these problems, are running software from mid-2005.
IS&T has not been able to provide a clear explanation for why the outages lasted from Sunday until Friday without improvement. Theresa Regan, director of operations and infrastructure services for IS&T, did not respond to inquiries regarding the outage time frame.
Outages were first reported shortly before 10 p.m. on Sunday, Feb. 5. IS&T’s 3DOWN outage announcement service, http://is3down.mit.edu/, did not acknowledge the problem until after 7 a.m. on Wednesday, Feb. 8.
During the web.mit.edu outages, there were other problems as well, which may have affected IS&T’s ability to diagnose and repair the Web problems. For instance, according to Schiller, one of the Webmail servers began to behave erratically, and the problem was ultimately traced to a bad CPU; that machine was removed from service. Webmail is MIT’s Web-based e-mail service, but it does not rely on the web.mit.edu servers. Schiller suggested that the search for a common fault between Webmail and web.mit.edu may have made it more difficult to isolate each individual problem.