The Tech - Online Edition
MIT's oldest and largest newspaper & the first newspaper published on the web

Failures Shut Down Campus Network Twice in Last Week

By Jeremy Hylton

A pair of computer problems disrupted service of the campus network and the Athena Computer Environment yesterday afternoon.

A hardware problem in Building NW12 and a problem with the AFS network software caused the disruptions, according to Gregory A. Jackson, director of the Academic Computing Services division of Information Systems.

"In broad terms, today was a day no one would wish on their enemies," Jackson said.

The network software problems began at 2:20 p.m. yesterday and lasted until about 3:05 p.m., according to Sameer Raheja '96, an Athena consultant.

The problems slowed the system to a crawl, Raheja said. "Basically, it took a very long time to log in [to Athena workstations] and everything was really slow," he said. "In some cases, people could not log in."

The hardware problem shut down network access for most of west campus, including the Student Center and Resnet users in the dormitories, according to Matthew H. Braun '93, a programmer with the Distributed Computing and Network Services division of IS. A router failure in Building NW12 caused the problem, he said.

Problem Thursday, Monday

The other problem was with the Andrew File System (AFS), the software that links file servers and workstations at MIT with each other and with others across the Internet. AFS file servers hold individual users' lockers.

Yesterday's AFS failure is the second in the past week. Last Thursday, the problem caused the top-level MIT AFS server to fail and forced DCNS to shut down and restart all Athena file servers, according to Kimberly A. Carney, a supervisor in DCNS.

The problem developed again yesterday afternoon, but DCNS was not forced to shut down the file servers. "Fortunately, because of Thursday's experience, we were able to take some steps to mitigate the problem," Carney said.

When the servers were restarted Thursday, network service was disrupted for about an hour, Carney said.

One of the ways the problem manifested itself was that client computers overloaded the AFS management servers. There are many servers and normally clients choose one at random to talk to, leading to a relatively even load on the servers, Carney said.

During the problems Thursday and yesterday, it seemed that clients communicated with only a few of the servers and overloaded them, according to Carney.
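
The load-balancing behavior Carney described can be illustrated with a small simulation. The Python sketch below is not AFS code and makes no claim about how AFS clients actually select servers; the server names, client count, and skew weights are all hypothetical, chosen only to show how uniform random choice spreads load evenly while a skewed choice piles it onto a few machines.

```python
import random
from collections import Counter

def simulate_load(num_clients, servers, weights=None, seed=0):
    """Count how many clients land on each server.

    weights=None means every client picks a server uniformly at random;
    otherwise servers are picked in proportion to the given weights.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    picks = rng.choices(servers, weights=weights, k=num_clients)
    counts = Counter(picks)
    return {server: counts.get(server, 0) for server in servers}

# Hypothetical server names; not taken from the MIT installation.
servers = ["srv-a", "srv-b", "srv-c", "srv-d"]

# Normal case: each client chooses one server at random.
even = simulate_load(10_000, servers)

# Assumed failure mode: most clients end up on just two servers.
skewed = simulate_load(10_000, servers, weights=[45, 45, 5, 5])

print("uniform choice:", even)
print("skewed choice: ", skewed)
```

With uniform choice each of the four servers handles roughly a quarter of the clients; with the assumed skew, two servers carry about 90 percent of the load between them.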

DCNS is investigating several possible causes of the problem and has taken steps that will hopefully prevent future problems, Carney said.

Carney said there are three potential problems IS is investigating:

There may be hardware problems with the server computers. Some hardware may be faulty; it is being replaced today, Carney said.

Configuration parameters may need adjustment. There are several parameters that can be set on AFS servers, and adjusting them may improve performance, Carney said. Some adjustments have been made already, she said.

Bugs may exist in the AFS software. "The other sort of big thing is bugs in AFS or bugs with the protocol in which clients talk to server," Carney said.

Another more general cause of trouble may be the sheer size of the MIT network system. The campus network is one of the largest single installations of AFS, and it could be too large to be handled gracefully, Carney said.

Little effect on coursework

Though computers on the campus network are widely used for coursework and research, the effects of the two recent outages were relatively small, Jackson said.

"I don't think either of the recent outages was a major problem in this regard, especially since both occurred during the middle of the day whereas Athena load is concentrated in the late afternoon and evening," Jackson said.

"Of course, any time basic utilities fail there can be interference with work," he said.

"There have certainly been cases where specific software needed in a subject has failed, and the instructors have extended due dates and so forth," Jackson continued.