The Tech - Online Edition
MIT's oldest and largest newspaper & the first newspaper published on the web

Main Web Site Experiences Outages Throughout Week

By John A. Hawkinson

MIT’s primary Web site experienced a series of debilitating outages last week. The disruptions began on Sunday, Feb. 5, and the site remained unavailable for several hours a day through Friday, when a temporary solution was put into place. A more permanent solution, removing the site’s dependence on the servers causing the trouble, was installed last night.

Information Services and Technology is responsible for maintaining the site, which is actually five separate Web servers operating behind a load balancer. Three of those Web servers are used for normal traffic, and one is dedicated to serving Web traffic from the Google search engine indexer, according to Jeff I. Schiller ’79, who manages MIT’s network for IS&T. The fifth server provides several miscellaneous services, including taking some Google traffic.
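The arrangement Schiller describes can be pictured with a minimal sketch. This is not IS&T's actual configuration: the server labels, the round-robin policy, and the User-Agent check used to spot the Google indexer are all assumptions made for illustration.

```python
from itertools import cycle

# Hypothetical labels; the article does not name the individual machines.
GENERAL_POOL = ["web-1", "web-2", "web-3"]  # three servers for normal traffic
CRAWLER_SERVER = "web-crawler"              # dedicated to the Google indexer
MISC_SERVER = "web-misc"                    # miscellaneous services, some Google traffic

# Assumed policy: plain round-robin within each group of servers.
_general = cycle(GENERAL_POOL)
_crawler = cycle([CRAWLER_SERVER, MISC_SERVER])


def is_google_crawler(user_agent: str) -> bool:
    """Assumed heuristic: a User-Agent mentioning Googlebot is the indexer."""
    return "googlebot" in user_agent.lower()


def pick_server(user_agent: str) -> str:
    """Return the back-end server that should handle this request."""
    if is_google_crawler(user_agent):
        return next(_crawler)
    return next(_general)
```

Under these assumptions, ordinary browser requests rotate among the three general-purpose servers, while indexer requests alternate between the dedicated server and the fifth, miscellaneous one.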

According to Schiller, the root cause of the outage is unclear. MIT’s Web servers depend on the Andrew File System (AFS) to access the data they provide to Web browsers. IS&T traced the problem to a specialized group of AFS file servers, organized as their own AFS cell, that the Web servers depend on but that the general community does not frequently use. Repeated failures of those file servers brought the Web servers to their knees; Schiller speculated that some machine accessing them was triggering a subtle bug.

The bulk of the Web data served by the Web servers actually comes from a different group of AFS file servers, which form a separate AFS cell; those servers were not affected by the failures last week. Until Monday, the specialized AFS cell stored the top-level homepage, as well as ancillary administrative information used by the Web servers. Although only the specialized cell's AFS servers were affected, the failures were sufficiently catastrophic to prevent the Web servers from functioning at all.

To restore service to the site as quickly as possible, while the true cause of the failure was not understood, IS&T restricted access to the affected AFS file servers to only a handful of machines (including the Web servers), in an effort to prevent the bug from being triggered. Since very few users at MIT need to access those AFS servers, this was considered an acceptable temporary solution.

Yesterday evening, at around 8:45 p.m., Schiller and Mark V. Silis, manager of network infrastructure and services for IS&T, worked to implement a more long-term solution. According to Schiller, the Web servers were reconfigured to remove their dependence on the troubled AFS servers, and then those AFS servers were upgraded to a more recent software version. Previously they had been running six-year-old software, whereas the AFS servers that did not have these problems are running software from mid-2005.

IS&T has not been able to provide a clear explanation for why the outages lasted from Sunday until Friday without improvement. Theresa Regan, director of operations and infrastructure services for IS&T, did not respond to inquiries regarding the outage time frame.

Outages were first reported shortly before 10 p.m. on Sunday, Feb. 5. IS&T’s 3DOWN outage announcement service did not acknowledge the problem until after 7 a.m. on Wednesday, Feb. 8.

During the outages, there were other problems as well, which may have affected IS&T’s ability to diagnose and repair the Web problems. For instance, according to Schiller, one of the Webmail servers began to behave erratically, and the problem was ultimately traced to a bad CPU; that machine was removed from service. Webmail is MIT’s Web-based e-mail service, but it does not rely on the main Web servers. Schiller suggested that the search for a common fault between Webmail and the main Web site may have made it more difficult to isolate each individual problem.