Blackout Blame Game
Murphy is a nitpicky bastard. Whenever something goes wrong, it’s usually the result of a confluence of small things that, taken by themselves, would likely be considered nuisances rather than critical failures. On Saturday, Slashdot picked up this article from SecurityFocus summarizing the report issued by the blackout investigators.
The problem began when three of FirstEnergy’s high-power lines sagged into trees. Normally, this would have sounded an alarm and the operators would have routed around the trouble. A few people might have lost power, but that would have likely been the end of it. However, it turns out that there was a critical bug in the GE Energy XA/21 Energy Management System (EMS).
Sometimes working late into the night and the early hours of the morning, the team pored over the approximately one-million lines of code that comprise the XA/21’s Alarm and Event Processing Routine, written in the C and C++ programming languages. Eventually they were able to reproduce the Ohio alarm crash in GE Energy’s Florida laboratory, says Unum. “It took us a considerable amount of time to go in and reconstruct the events.” In the end, they had to slow down the system, injecting deliberate delays in the code while feeding alarm inputs to the program. About eight weeks after the blackout, the bug was unmasked as a particularly subtle incarnation of a common programming error called a “race condition,” triggered on August 14th by a perfect storm of events and alarm conditions on the equipment being monitored. The bug had a window of opportunity measured in milliseconds.
“There was a couple of processes that were in contention for a common data structure, and through a software coding error in one of the application processes, they were both able to get write access to a data structure at the same time,” says Unum. “And that corruption led to the alarm event application getting into an infinite loop and spinning.”
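To make the quoted failure concrete, here is a minimal sketch of two writers racing on shared state. It is written in Python rather than the C and C++ of the XA/21, and every name in it is hypothetical. It illustrates the general pattern, not the actual bug. A barrier stands in for the deliberate delays the investigators injected: it forces both threads to read before either one writes, so the lost update happens on every pass instead of once in millions of hours.

```python
import threading

def run_unsynchronized(passes):
    """Two threads do an unlocked read-modify-write on the same shared value."""
    shared = {"alarms": 0}           # hypothetical stand-in for the shared data structure
    barrier = threading.Barrier(2)   # forces the worst-case interleaving on every pass

    def writer():
        for _ in range(passes):
            seen = shared["alarms"]       # both threads read the same value,
            barrier.wait()                # because neither continues until both have read,
            shared["alarms"] = seen + 1   # so one increment silently overwrites the other

    threads = [threading.Thread(target=writer) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["alarms"]

# 2 threads x 100 passes should add up to 200; half of the updates are lost.
lost_update_total = run_unsynchronized(100)
```

Without the barrier the same code would usually produce the right answer, which is exactly why a window measured in milliseconds can survive years of testing.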
The GE representative thinks that this kind of bug would not have been caught, even with more testing.
The company did everything it could, says Unum. “We test exhaustively, we test with third parties, and we had in excess of three million online operational hours in which nothing had ever exercised that bug,” says Unum. “I’m not sure that more testing would have revealed that. Unfortunately, that’s kind of the nature of software… you may never find the problem. I don’t think that’s unique to control systems or any particular vendor software.”
I’m not so sure that he’s right about that. Allowing multiple concurrent accesses to a data structure sounds like a problem of poor locking control over that structure. This could occur because the developers didn’t realize the need for locking in competing modules (which can happen when dependencies aren’t well communicated, or the side-effects aren’t well considered). It can also happen inadvertently through copying of pointers to the data (C and C++ are highly flexible, but this flexibility allows for all kinds of nasty problems like this).
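As a rough illustration of what locking control over a structure can look like, here is a minimal sketch in Python (the XA/21 itself is C and C++, where a mutex member guarded by `std::lock_guard` would play the same role). The class and function names are hypothetical. The point is that the lock lives inside the structure, so a competing module cannot forget to take it, and callers get a copy of the value rather than a reference to the guarded data.

```python
import threading

class AlarmCounter:
    """Hypothetical shared structure that owns the lock guarding its own data."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def record(self):
        # The whole read-modify-write happens while holding the lock,
        # so two writers can never interleave inside it.
        with self._lock:
            seen = self._count
            self._count = seen + 1

    def total(self):
        # Return the value itself, never a handle to the guarded state.
        with self._lock:
            return self._count

def run_locked(passes, writers=2):
    counter = AlarmCounter()

    def writer():
        for _ in range(passes):
            counter.record()

    threads = [threading.Thread(target=writer) for _ in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.total()
```

Calling `run_locked(1000)` returns 2000 no matter how the threads are scheduled, because every update is serialized through the one lock.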
In any event, it may be possible to prevent these kinds of bugs, but the question becomes one of cost. A rigorous, process-based development environment like SEI CMM Level 5 can help alleviate the problem (as an example, the Space Shuttle’s onboard software development team is at CMM Level 5). However, this requires additional time and effort to do rigorous, line-by-line inspection of the code as well as thorough documentation. Reviews are done at every step (requirements, design, development/code, test plans, etc.). Every step is documented as to defects found and corrected, and each defect is analyzed to see where in the process (requirements, design, code) it was injected. Then actions are devised to prevent that type of defect from being repeated.
But this is slow and costly. Most commercial software development is under time and budget pressure, which results in people working long hours with few breaks just to “get it out the door.” Unit testing, functional testing, system testing, and systems integration testing will find the majority of the bugs, but that just means that you’re left with the nasty, pernicious ones: the kinds that only show up after 3 million hours of operation. It requires up-front time and effort to prevent these kinds of bugs.