MIT E-mail Access Interrupted
By Keith J. Winstein
NEWS AND FEATURES DIRECTOR
Nearly five thousand Athena users were without e-mail Monday in what was said to be MIT’s longest mail outage ever.
One of MIT’s five mail servers, known as po11, was taken offline at 10:30 p.m. Sunday. Service was restored at 12:45 a.m. Tuesday.
The outage was later traced to an obscure problem in the Sun Microsystems Solaris operating system and is believed to have been triggered by a hardware failure, said Network Manager Jeffrey I. Schiller ’79 of Information Systems’ Network Services Team, which runs MIT’s e-mail service.
“The mail’s been queued, so we believe we will have no data loss,” said Senior Systems Programmer Thomas J. Coppeto ’89.
Schiller said that mail received and queued while po11 was down would probably be delivered by Tuesday morning.
The machine serves 4,659 users who were randomly assigned to it, Coppeto said.
“This is the longest mail server outage I think we’ve ever had,” he said.
Solaris bug believed responsible
“This is one of those ‘not ever supposed to happen problem[s],’” the team said in a Web page posted to help explain the outage.
“Files on the T3 Raid array on PO11 disappear and re-appear at random,” the page said. A RAID array is a collection of hard drives grouped together to store information with greater reliability and speed.
“File system checks show that everything is fine, all operating logs show no problems, but sometimes files just don’t appear to be there, but then are back a few minutes later,” the page said.
Twenty-four hours into repairing the problem, and with little sleep, the team was still willing to joke a bit. “The bug itself is very weird,” Schiller said. “That’s the technical explanation,” Coppeto joked.
Earlier in the day, Coppeto said po11 was acting like the nonexistent -- and unlucky -- “po13.”
Later, after the system was repaired, Schiller said an MIT alumnus who works for Sun Microsystems, William E. Sommerfeld ’88, “was able to track through [Sun’s] internal databases and find an obscure software problem that we suspect was triggered by a hardware failure.”
The problem has a known workaround, Schiller said, and MIT is now “running on the original unit with the fix in place.”
Users make do in different ways
Users affected by the outage, roughly one-fifth of MIT, reported that they were coping well with the loss of e-mail access.
“I was able to find other things to occupy my mind,” said Physics Professor Frank Wilczek of his day-long loss of e-mail. “I’m kinda worried about what’s going to hit me tomorrow or when it comes back.”
“I haven’t been sweating it much,” said Dina H. Feith ’03, adding that she nonetheless hoped her e-mail would return soon, because she is a TA and her students would be seeking her help preparing for final exams.
“I had a prof who was fairly upset that I didn’t realize that my oral exam got rescheduled,” said Benazeer S. Noorani ’04. “And our house cook got upset that I didn’t realize that we needed more bread for dinner.” Both were the result of unreceived e-mail, she said.
“There were a lot of phone calls in the morning, as people appeared,” wrote User Accounts Consultant Laura E. Baldwin ’89 over Zephyr. Baldwin answers user questions for Information Systems, and was herself without e-mail for some of yesterday because of the outage.
“It’s frustrating to not have a useful thing to tell people,” she said.
Predictions varied through day
Communication with users about the outage varied with the Network Services Team’s decreasing optimism throughout the day.
At 7:00 a.m. Monday, the team’s advisory on the MIT Services Status Page at http://web.mit.edu/3down said “we expect [reconstruction] to be complete by 8 a.m.” At 9:15 a.m., the advisory predicted a return of service in “20 minutes.”
As the scope of the mysterious problem became more apparent, those predictions had to be revised. “At this time we estimate service should be restored at 5pm,” the team’s advisory said at 12:35 p.m., a prediction that was updated to “sometime tonight” at 3 p.m.
Meanwhile, the telephone version of the services status page, at 3-DOWN, continued to report e-mail as “fully functional” throughout the day.
“Data recovery is still progressing and approaching completion,” the team’s advisory said at 11:20 p.m.
Finally, at 12:50 a.m. Tuesday, the advisory announced “po11 is UP!!!” and apologized for the length of the outage.
Shortly after the restoration of service, Coppeto and Schiller called a reporter to announce that the machine was back in service.
Their relief -- 27 hours after the machine was first taken out of service -- was palpable. “Do you want to hear what happened?” Schiller asked.