Networking problems offer many lessons that need to be assimilated by researchers, developers, and operators of highly distributed systems such as computer networks and electric power grids. In this column, I briefly revisit some widespread outages and other extreme network behavior, and explore the ever-increasing need for trustworthy distributed control.
The 1980 four-hour ARPANET collapse (a combination of unchecked memory errors and a weak garbage-collection algorithm) and the 1990 half-day AT&T long-distance network collapse (a bug in newly installed software for automatic recovery from telephone switch failures) are examples of iteratively propagating system failures. Similarly, numerous widespread power outages have occurred, notably the Northeast U.S. blackout of November 1965, the New York City blackout of July 1977, the blackout affecting 10 western U.S. states in October 1984, separate outages in July and August 1996 that took out much of the western U.S. (the second of which also affected parts of Canada and Baja California in Mexico), and the northeastern U.S. again in August 2003. In each case, a single-point failure triggered a cascading effect.
Various other serious power outages include the month-long outage in Quebec in January 1998 (due to the collapse of ice-laden transmission towers) and the week-long area outages in Queens, NY, in July 2006 (due to fires and failures in 100-year-old wiring). Many less dramatic outages have also been reported, including a power failure on October 26, 2006, that triggered fire alarms and the evacuation of Oregon's Portland Convention Center during the ACM OOPSLA conference, and that also shut down the surrounding area, including the light rail system, for almost an hour. Other power outages have been compounded by failures of backup power systems, such as the 1991 four-hour shutdown of New York City's three airports and an ESS telephone system (due to misconfiguration of the backup power, which, instead of being driven by the activated generators, was running on standby batteries until they were drained).
The most recent case of a cascading power outage occurred on November 4, 2006, and affected approximately 10 million people in Germany, Austria, Italy, France, Spain, and Portugal. Among other disruptions, 100 regional Deutsche Bahn trains were affected. The widespread outage began with a supposedly routine event: a power line over the Ems River in northern Germany was shut down to allow a ship (the Norwegian Pearl) to pass safely. In accordance with the so-called "N-1 criterion" for stability, which must be satisfied before such a reconfiguration can be authorized, alternative power was procured to compensate for the line to be shut down, and simulations were run to demonstrate that this preventive measure would provide sufficient power. However, no second-order reevaluation was conducted after the reconfiguration to analyze the increased loads caused by the shutdown itself. The resulting overload propagated across Europe, and roughly four hours elapsed before power could be restored.
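To make that failure mode concrete, the following is a minimal sketch (in Python) of the kind of single-contingency check implied by the N-1 criterion: for each element that could fail, verify that the remaining network can still carry the load. The line names, flow figures, and the proportional redistribution rule here are hypothetical simplifications for illustration only, not an actual grid model; the essential point is that the check must be rerun against the loads that actually exist after a planned reconfiguration, which is the step that was omitted in November 2006.

```python
# Minimal, illustrative N-1 contingency check.
# All line names, flows (MW), and the proportional redistribution rule are
# hypothetical simplifications; a real study would rerun a full load-flow model.

def n_minus_1_ok(flows, capacities):
    """Return True if the loss of any single line leaves all others within limits."""
    for failed, lost_flow in flows.items():
        remaining = [line for line in flows if line != failed]
        headroom = sum(capacities[l] - flows[l] for l in remaining)
        if lost_flow == 0:
            continue  # nothing to redistribute
        if headroom < lost_flow:
            return False  # not enough spare capacity anywhere
        # Crude model: lost flow redistributes in proportion to spare capacity.
        for line in remaining:
            share = (capacities[line] - flows[line]) / headroom
            if flows[line] + share * lost_flow > capacities[line]:
                return False
    return True

# Advance study of the planned configuration (hypothetical numbers): passes.
flows_planned = {"Ems-A": 300.0, "Ems-B": 300.0, "South-1": 300.0}
capacities    = {"Ems-A": 1000.0, "Ems-B": 1000.0, "South-1": 1000.0}
print(n_minus_1_ok(flows_planned, capacities))   # True

# The same check against the heavier flows that actually materialize once
# Ems-A is taken out of service -- the reevaluation that was not performed.
flows_actual = {"Ems-B": 600.0, "South-1": 600.0}
caps_actual  = {"Ems-B": 1000.0, "South-1": 1000.0}
print(n_minus_1_ok(flows_actual, caps_actual))   # False
```

In this toy example, the advance study satisfies N-1, but the same check applied to the post-shutdown flows does not; the missing step in the November 2006 event was, in effect, the second call.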
What is perhaps most alarming about these and other unanticipated outages is that such disruptions continue to occur, with or without remedial efforts. Furthermore, the effects are not simply cascading or unidirectionally propagating; in many cases, feedback loops exacerbate the breadth and speed of the outages. Clearly, more proactive efforts are needed to analyze systems for potential widespread fault modes and their ensuing risks, and to enable appropriate real-time responses and automated remediation. (Of course, proactive defenses are also desirable against environmental disasters such as tidal waves, global warming, and hurricanes.)
Distributed control of distributed systems with distributed sensors and distributed actuators is typically more vulnerable than centralized control to widespread outages and other perverse failure modes: deadlocks, unrecognized hidden interdependencies (particularly among components that are untrustworthy and perhaps even hidden), race conditions and other timing quirks, coordinated denial-of-service attacks, and so on. Achieving reliability, fault tolerance, system survivability, security, and integrity in the face of such adversities in highly distributed systems is problematic; formal analyses and system testing are highly desirable, but are potentially more complex than in systems with more centralized control.
Proactive system architectures and further analytic efforts are needed to prevent and quickly remediate such problems in power distribution and information networks. Long-term planning, closer oversight, and multipartite coordination are also essential, along with improvements in university curricula and operational management.