From: dhk@teletech.uucp (Don H Kemp)
Here's AT&T's _official_ report on the network problems, courtesy of the AT&T Consultant Liaison Program.
--Don
Technical background on AT&T's network slowdown, January 15, 1990.
At approximately 2:30 p.m. EST on Monday, January 15, one of AT&T's 4ESS toll switching systems in New York City experienced a minor hardware problem which activated normal fault recovery routines within the switch. This required the switch to briefly suspend new call processing until it completed its fault recovery action - a four-to-six second procedure. Such a suspension is a typical maintenance procedure, and is normally invisible to the calling public.
As part of our network management procedures, messages were automatically sent to connecting 4ESS switches requesting that no new calls be sent to this New York switch during this routine recovery interval. The switches receiving this message made a notation in their programs to show that the New York switch was temporarily out of service.
When the New York switch in question was ready to resume call processing a few seconds later, it sent out call attempts (known as IAMs - Initial Address Messages) to its connecting switches. When these switches started seeing call attempts from New York, they started making adjustments to their programs to recognize that New York was once again up-and-running, and therefore able to receive new calls.
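As a rough illustration of the bookkeeping described above, the following C sketch keeps a per-neighbor status entry, marks the neighbor out of service when the maintenance message arrives, and marks it back in service on the first call attempt (IAM) seen afterward. The identifiers (status_map, on_iam, and so on) are illustrative assumptions, not actual 4ESS code.

/* Hypothetical sketch of a connecting switch's status notation.
 * Names and structure are illustrative only.                      */
#include <stdio.h>

enum nbr_state { NBR_IN_SERVICE, NBR_OUT_OF_SERVICE };

struct nbr_entry {
    const char    *name;
    enum nbr_state state;
};

static struct nbr_entry status_map[] = {
    { "NYC-4ESS", NBR_IN_SERVICE },
};

/* "Send no new calls here" maintenance message during recovery. */
static void on_out_of_service(struct nbr_entry *n)
{
    n->state = NBR_OUT_OF_SERVICE;
    printf("%s: hold new calls during recovery\n", n->name);
}

/* First call attempt (IAM) seen from the neighbor after recovery. */
static void on_iam(struct nbr_entry *n)
{
    if (n->state == NBR_OUT_OF_SERVICE) {
        n->state = NBR_IN_SERVICE;   /* neighbor is up and running again */
        printf("%s: accepting new calls again\n", n->name);
    }
}

int main(void)
{
    on_out_of_service(&status_map[0]);
    on_iam(&status_map[0]);
    return 0;
}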
A processor in the 4ESS switch which links that switch to the CCS7 network holds the status information mentioned above. When this processor (called a Direct Link Node, or DLN) in a connecting switch received the first call attempt (IAM) from the previously out-of-service New York switch, it initiated a process to update its status map. As the result of a software flaw, this DLN processor was left vulnerable to disruption for several seconds. During this vulnerable time, the receipt of two call attempts from the New York switch - within an interval of 1/100th of a second - caused some data to become damaged. The DLN processor was then taken out of service to be reinitialized.
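The C sketch below illustrates the kind of vulnerable window described above: a second IAM arrives while the status update triggered by the first is still in progress, the shared data is left damaged, and the DLN takes itself out of service. This is a deliberately simplified model of the failure mode as described here, not the actual flaw in the 4ESS software; all names, flags and timings are assumptions.

/* Hypothetical sketch of the vulnerable window in a DLN processor.
 * Purely illustrative; not AT&T's actual code.                      */
#include <stdio.h>

struct neighbor {
    int in_service;     /* believed state of the far switch        */
    int update_pending; /* a status-map rewrite is still under way */
};

static struct neighbor ny = { 0, 0 };
static int dln_ok = 1;      /* health of this switch's DLN processor */

/* Handle one Initial Address Message (IAM) from the far switch. */
static void handle_iam(int msec)
{
    if (ny.update_pending) {
        /* A second IAM landed inside the vulnerable window: the
         * half-finished update is clobbered and the data is bad.   */
        printf("IAM at %d ms: update already in progress, data damaged\n",
               msec);
        dln_ok = 0;                  /* DLN drops out to reinitialize */
        return;
    }
    ny.update_pending = 1;           /* begin multi-step status rewrite */
    printf("IAM at %d ms: marking neighbor back in service\n", msec);
    ny.in_service = 1;
    /* In this model the rewrite stays open long enough that a second
     * IAM arriving ~10 ms later lands before the flag is cleared.    */
}

static void finish_update(void)
{
    ny.update_pending = 0;
}

int main(void)
{
    handle_iam(0);    /* first IAM after the New York switch recovers  */
    handle_iam(10);   /* second IAM ~1/100 s later, window still open  */
    finish_update();
    printf("DLN %s\n", dln_ok ? "in service" : "removed for reinit");
    return 0;
}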
Since the DLN processor is duplicated, its mate took over the traffic load. However, a second couplet of closely spaced new call messages from the New York 4ESS switch hit the mate processor during the vulnerable period, causing it to be removed from service and temporarily isolating the switch from the CCS7 signaling network. The effect cascaded through the network as DLN processors in other switches similarly went out of service. The unstable condition persisted because of the random nature of the failures and the constant pressure of the network's traffic load, which kept supplying the call-message triggers.
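The duplicated-processor behavior can be sketched the same way: the active DLN is removed from service, its mate takes over the traffic, and a second hit during the same vulnerable window leaves the switch isolated from CCS7 signaling. Again, this is an illustrative assumption about the mechanism described, not AT&T's implementation.

/* Hypothetical sketch of the duplicated DLN pair and its failover. */
#include <stdio.h>

struct dln_pair {
    int ok[2];    /* health of DLN 0 and its mate, DLN 1            */
    int active;   /* which one currently carries the signaling load */
};

/* A closely spaced couplet of IAMs hits whichever DLN is active. */
static void hit_active_dln(struct dln_pair *p)
{
    printf("DLN %d damaged, removed for reinitialization\n", p->active);
    p->ok[p->active] = 0;
    if (p->ok[1 - p->active]) {
        p->active = 1 - p->active;       /* mate takes over the traffic */
        printf("mate DLN %d takes over traffic\n", p->active);
    } else {
        printf("both DLNs down: switch isolated from CCS7 signaling\n");
    }
}

int main(void)
{
    struct dln_pair p = { { 1, 1 }, 0 };
    hit_active_dln(&p);   /* first couplet of closely spaced IAMs  */
    hit_active_dln(&p);   /* second couplet hits the mate as well  */
    return 0;
}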
The software flaw was inadvertently introduced into all the 4ESS switches in the AT&T network as part of a mid-December software update. This update was intended to significantly improve the network's performance by making it possible for switching systems to access a backup signaling network more quickly in case of problems with the main CCS7 signaling network. While the software had been rigorously tested in laboratory environments before it was introduced, the unique combination of events that led to this problem couldn't be predicted.
To troubleshoot the problem, AT&T engineers first tried an array of standard procedures to reestablish the integrity of the signaling network. In the past, these have been more than adequate to regain call processing. In this case, they proved inadequate. So we knew very early on we had a problem we'd never seen before.
At the same time, we were looking at the pattern of error messages and trying to understand what they were telling us about this condition. We have a technical support facility that deals with network problems, and they became involved immediately. Bell Labs people in Illinois, Ohio and New Jersey joined in moments later. Since we didn't understand the mechanism we were dealing with, we had to infer what was happening by looking at the signaling messages that were being passed, as well as looking at individual switches. We were able to stabilize the network by temporarily suspending signaling traffic on our backup links, which helped cut the load of messages to the affected DLN processors. At 11:30 p.m. EST on Monday, we had the last link in the network cleared.
On Tuesday, we took the faulty program update out of the switches and temporarily switched back to the previous program. We then started examining the faulty program with a fine-toothed comb, found the suspicious software, took it into the laboratory, and were able to reproduce the problem. We have since corrected the flaw, tested the change and restored the backup signaling links.
We believe the software design, development and testing processes we use are based on solid, quality foundations. All future releases of software will continue to be rigorously tested. We will use the experience we've gained through this problem to further improve our procedures.
It is important to note that Monday's calling volume was not unusual; in fact, it was less than a normal Monday, and the network handled normal loads on previous weekdays. Although nothing can be guaranteed 100% of the time, what happened Monday was a series of events that had never occurred before. With ongoing improvements to our design and delivery processes, we will continue to drive the probability of this type of incident occurring towards zero.