A tiny software "bug" — three faulty computer instructions lurking among several million that modern telephone switches use to route calls — appears to have sparked the sudden, massive failures of local telephone systems in recent weeks, the manufacturer of the equipment said yesterday. The flaws were hidden in minor software changes that the Plano, Tex.-based company, DSC Communications Corp., provided to its telephone company customers in an effort to improve the equipment's performance. The software was sent out without major testing because DSC judged that the changes were too small to require it, DSC Vice President Frank Perpiglia said yesterday.
It was the first detailed explanation of what caused the unprecedented string of failures. In rapid sequence in June and earlier this month, local phone service in the Washington region, Pittsburgh, San Francisco and Los Angeles crashed mysteriously for as long as eight hours.
Perpiglia said the software flaw appeared to be the "root cause" of failures of DSC-built computers called signal transfer points that have been at the heart of the investigation. But he said that in the telephone network as a whole, in which equipment made by many different companies is linked together, other causes might be found as well.
Despite DSC's statement, major telephone companies and equipment manufacturers yesterday continued a high-energy investigation of the incidents, which rank among the most disruptive software-related failures in U.S. history. In the meantime, telephone companies have installed in their switches special software "patches" that DSC provided, and they said that the fixes should prevent any recurrence of the failures.
Software consists of electronic instructions that in complex programs can number in the millions. Computers move through the instructions sequentially to perform tasks as diverse as routing telephone calls, running a word processor, guiding missiles and keeping track of company payrolls.
A single misplaced command — one that tells a machine to look in the wrong data base when a particular piece of information is needed, for instance — can throw a computer into confusion. Usually such bugs are detected during laboratory testing, but many survive this process.
DSC's explanation underlines fears that as computer software grows more complex and takes over greater functions in companies and government, society will be increasingly vulnerable to massive failures.
Software engineers are developing new ways to test software and mathematically prove that it is error-free. Other theories say software should be designed with the assumption that bugs will occur but that any damage they cause will be containable.
Software developers continually come up with refinements in existing programs. Perpiglia said DSC had made changes at the request of one customer to basic software that runs the signal transfer points. Between December last year and April, the change was distributed to five of the seven regional telephone companies.
"DSC did not go through the normal process" of four months of testing before sending the change out, Perpiglia told members of the House subcommittee on telecommunications and finance yesterday. "Because the change was small ... we felt that the change itself did not require such testing."
Analysis of the computer instructions led DSC to pinpoint the problem in a specific section of the change, Perpiglia said. The flaws, he said, consisted of only three "binary digits" of information — three flawed digits among millions of correct ones.
DSC computers are designed to route calls around minor failures that routinely occur in a modern telephone network, such as the failure of a single circuit board. But because of the errors in the program, the computers responded to such failures by pumping out floods of erroneous internal messages. Those messages crowded out the routine messages that serve to route phone calls, causing the systems to shut down.
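The crowding-out effect can be illustrated with a toy model. The sketch below is not DSC's software and does not reflect the real signaling protocol; the function name `simulate`, the buffer size, the per-tick capacity and the `amplification` factor are all assumptions chosen only to show how a message buffer of fixed size behaves when each fault report spawns further fault reports.

```python
from collections import deque


def simulate(ticks, amplification, per_tick=4, buffer_size=16):
    """Toy model of a message buffer; all parameters are hypothetical.

    One call-routing message arrives per tick. The buffer starts with a
    single fault report. Each processed fault report generates
    `amplification` new fault reports (0 models healthy software).
    Returns (routing messages delivered, routing messages dropped).
    """
    queue = deque(["fault"])  # one minor fault, e.g. a failed circuit board
    delivered = dropped = 0
    for _ in range(ticks):
        # A routine call-routing message arrives; it is lost if the buffer is full.
        if len(queue) < buffer_size:
            queue.append("route")
        else:
            dropped += 1
        # The link can handle only a fixed number of messages per tick.
        for _ in range(min(per_tick, len(queue))):
            message = queue.popleft()
            if message == "route":
                delivered += 1
            else:
                # A fault report: the flawed software answers with more reports.
                room = buffer_size - len(queue)
                queue.extend(["fault"] * min(amplification, room))
    return delivered, dropped


print(simulate(ticks=50, amplification=0))  # healthy: no routing messages dropped
print(simulate(ticks=50, amplification=3))  # flawed: routing messages get dropped
```

In the healthy case the lone fault report is handled once and disappears. In the flawed case the fault reports multiply until they fill the buffer, and ordinary routing messages are turned away.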
Phone company executives said they had no way to explain why the failures on both coasts had occurred so close in time to one another. In a closed session yesterday, the Federal Communications Commission decided to step up its own research into network reliability, to establish a special staff for that job and to establish formal reporting requirements for network outages.
"The commission continues to have full faith in the fundamental strength of our public telephone network," FCC Chairman Alfred Sikes said. "And we are persuaded that the recent events are — in all probability — inadvertent side effects of continuing progress, rather than evidence of any fundamental flaws."
Staff writer Evelyn Richards contributed to this report.