Over 4,000 community members lost e-mail access early Wednesday morning in an outage that still affects some users.
One of MIT's five e-mail servers, po14, crashed sometime before 8 a.m. on Wednesday, March 7, said Jeffrey I. Schiller '79, MIT Network Manager for Information Services and Technology.
Schiller said the problem arose when po14 experienced a kernel panic (similar to Windows's blue screen of death), triggering an automatic restart of the mail server. Upon restart, the server detected file system corruption that required manual repair by IS&T technicians.
By Thursday afternoon, over 3,000 of the 4,000 users had e-mail restored, though there was a large backlog of incoming messages.
As of Thursday evening, roughly 500 users on po14 were still without e-mail; Schiller estimated the service would be restored for everyone by 9:30 a.m. Friday.
Because of MIT's redundancy and backups, Schiller said he was "not too worried about data loss."
On Wednesday, he estimated a maximum of 10 messages across the whole system would be corrupted, a number he revised to three on Thursday. Those three messages had likely been saved in regular backups, he said.
Users on po14 who forward their e-mail to external servers, such as Gmail, were unaffected by the outage.
While the root cause of the outage is unclear, IS&T's 3DOWN Service Status page characterized the outage as extremely rare. According to Schiller, IS&T simply "didn't [fore]see this happening."
IS&T has traced the error to the file system on po14. MIT's e-mail servers store mail on RAID arrays, which spread messages across multiple hard drives to guard against the failure of any single drive. Something nonetheless caused a small amount of data corruption on the RAID system, eventually triggering the kernel panic that caused po14 to restart, Schiller said.
On reboot, po14 ran 'fsck,' a utility that checks a file system for inconsistencies and repairs them. As it works, fsck reads a small amount of data for every file on the system. It ran for nearly 24 hours as it worked through the nearly 27 million files on po14, Schiller said. "It [was] mind-numbing," he said.
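For a sense of scale, the figures Schiller cited can be turned into a rough average rate. The sketch below is a back-of-the-envelope calculation only: it assumes, hypothetically, that a check would need to touch all 27 million files within 24 hours, and the function name `required_rate` is illustrative rather than anything IS&T uses.

```python
def required_rate(files: int, hours: float) -> float:
    """Average files per second needed to touch every file within `hours`."""
    return files / (hours * 3600)

# Figures from the article: nearly 27 million files, nearly 24 hours.
# Assumption: a full pass over all files in that window.
rate = required_rate(27_000_000, 24)
print(f"{rate:.1f} files per second")
```

Because IS&T halted fsck before it finished, the rate it actually sustained may have been lower than this figure.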
MIT experienced a similar e-mail outage in the first week of May 2003. During that incident, a bug in the operating system of the mail server po11 caused file corruption and triggered a file consistency check.
"In that outage, fsck took four hours to run," said Schiller, who attributed the shorter run to the smaller mail quota at the time, 100 megabytes. The current mail quota is 1 gigabyte.
Because fsck was taking too long, IS&T halted the program Thursday morning and switched to "plan B," copying the files from po14 to a duplicate file system. According to Schiller, po14 mail files are split into four partitions, each with roughly 1,000 users. Three of the partitions were intact, allowing roughly 3,000 users to regain e-mail; one of the partitions was corrupt, requiring manual repairs.
In an e-mail, Jerrold M. Grochow '68, vice president for IS&T, said the outage had lasted "an unacceptable length of time for e-mail to be unavailable to 20 percent of our community." Grochow also outlined a project, begun in early 2006 and slated for completion in summer 2007, to provide "completely redundant" mail service.
Schiller said plans to upgrade e-mail would be finalized in the next week, but considered complete redundancy extremely difficult to attain. One option under consideration is to break apart MIT's five large mail servers into 40 or more servers, so that an outage would impact fewer users and could be repaired more quickly.
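The arithmetic behind that option can be sketched as follows. This is an illustration under stated assumptions, not IS&T's planning model: it assumes users are spread evenly across servers and that the total is about 20,000 (five servers at roughly 4,000 users each, consistent with Grochow's "20 percent" figure). The function name `outage_impact` is hypothetical.

```python
def outage_impact(total_users: int, servers: int) -> tuple[float, float]:
    """Return (users affected, share of community) when one server fails.

    Assumes users are divided evenly among the servers.
    """
    users_affected = total_users / servers
    share = 1 / servers
    return users_affected, share

TOTAL_USERS = 20_000  # assumption: 5 servers x roughly 4,000 users each

for servers in (5, 40):
    users, share = outage_impact(TOTAL_USERS, servers)
    print(f"{servers} servers: {users:.0f} users affected ({share:.1%})")
```

Under these assumptions, a single failure among 40 servers would affect about 500 users, or 2.5 percent of the community, rather than about 4,000 users, or 20 percent.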
Outside services such as Google have offered to run MIT's e-mail, but Schiller is wary of the security and privacy of such services. "Whose mail is it anyway?" he asked.
Michael McGraw-Herdeg contributed to the reporting of this article.