
The Design of Disaster
Introduction
Narrator: Imagine a state-of-the-art radiation therapy machine, a marvel of modern medicine designed to save lives by targeting cancerous tumors with precision. Now, imagine that same machine, due to a hidden software flaw, delivers a massive overdose of radiation—one hundred times the intended dose—burning patients and leading to several deaths. The most chilling part? No single component of the machine ever broke. Every part worked exactly as it was designed. The disaster wasn't caused by a failure, but by the system's design itself. This real-world tragedy of the Therac-25 machine in the 1980s exposes a fundamental flaw in how we traditionally think about safety. It forces us to ask a difficult question: if we can't prevent accidents by simply making better parts, how can we ever hope to build safe complex systems?
The answer to that profound challenge is explored in Nancy G. Leveson's seminal work, An Introduction to System Safety Engineering. This book dismantles outdated safety models and provides a revolutionary framework for understanding and ensuring safety in the complex, software-driven world we now inhabit. It argues that safety is not about preventing individual failures, but about designing control into the very fabric of our systems.
Traditional Safety Models Are Dangerously Obsolete
Key Insight 1
Narrator: For decades, the dominant approach to safety engineering was based on what are known as chain-of-events models. These models view accidents as the end result of a linear sequence of failures, much like a row of dominoes tipping over. The logic is simple: if you can find and remove the first or any subsequent domino, you can prevent the final accident. This led to a focus on component reliability. The goal was to build stronger, more reliable parts and to conduct exhaustive failure analyses, like a Failure Modes and Effects Analysis (FMEA), to predict how individual components might break.
However, as Leveson explains, this model is fundamentally unequipped to handle the complexity of modern systems, especially those with significant software components. In the case of the Therac-25, there was no "broken" component to blame. The accident occurred because of an unforeseen interaction between the software's user interface and its control logic. A skilled operator, using a common key sequence to edit the treatment parameters, could inadvertently trigger a "race condition" in the software, causing the machine to fire its high-power electron beam while the turntable was in the wrong position, with no target in place to spread and attenuate the beam.
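To make that failure mode concrete, here is a deliberately simplified, hypothetical sketch of a race condition between an operator's edit and the beam-firing logic. It is not the Therac-25's actual code (which was written in assembly language); the class, field names, and timings below are illustrative assumptions only.

```python
import threading
import time

class TreatmentController:
    """Illustrative sketch only: shared state that an operator edit and the
    beam-firing routine both touch, with no synchronization between them."""

    def __init__(self):
        # Machine starts configured for X-ray mode: a high-current beam
        # aimed at a metal target that converts and attenuates it.
        self.beam_energy = "high"
        self.target_in_place = True

    def edit_mode_to_electron(self):
        # Operator quickly re-edits the prescription to electron mode.
        # The turntable responds right away, but re-tuning the beam lags.
        self.target_in_place = False   # target rotated out of the beam path
        time.sleep(0.05)               # slow hardware setup still in progress
        self.beam_energy = "low"       # energy finally lowered for electron mode

    def fire_beam(self):
        # BUG: fires based on whatever the shared state happens to be now,
        # instead of waiting for the edit-and-setup sequence to complete.
        if self.beam_energy == "high" and not self.target_in_place:
            print("HAZARD: high-power beam fired with no target in place")
        else:
            print(f"OK: {self.beam_energy}-energy beam, target={self.target_in_place}")

controller = TreatmentController()
edit = threading.Thread(target=controller.edit_mode_to_electron)
edit.start()
time.sleep(0.01)        # operator confirms treatment before setup has finished
controller.fire_beam()  # reads a half-updated configuration
edit.join()
```

Each routine behaves exactly as written; the hazard appears only in how their timing interleaves, which is why a purely component-centered analysis would never find it.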
The domino model fails here because it cannot account for accidents that emerge from the interactions between perfectly functioning components. It treats systems as simple mechanical aggregates, not as dynamic, interconnected wholes. In today's world of autonomous cars, automated air traffic control, and AI-driven medical diagnostics, accidents are far more likely to arise from flawed requirements, design errors, and unexpected system behaviors than from a simple hardware part breaking. Leveson argues that clinging to this outdated model is not just ineffective; it's a direct threat to public safety.
Safety Is a Control Problem, Not a Reliability Problem
Key Insight 2
Narrator: The central paradigm shift proposed in the book is the redefinition of safety itself. Instead of viewing safety as the absence of failures, Leveson reframes it as a control problem. In this view, every system is designed to achieve certain goals while simultaneously avoiding undesirable outcomes, or hazards. An accident, therefore, is not the result of a component failure, but a consequence of inadequate control that allows the system to enter a hazardous state.
To illustrate this, consider the system of a car. Its purpose is to transport passengers. A hazard is colliding with another object. Safety is maintained through a complex control structure. The driver provides control by steering, accelerating, and braking. The car's internal systems, like anti-lock brakes (ABS), provide automated control. Traffic laws, road signs, and lane markings provide external, societal-level control. An accident occurs when this web of control breaks down. The driver might be distracted, the ABS might not be designed to handle icy roads, or a stop sign might be missing.
In each case, the problem is a flaw in the control loop. A command was given that was unsafe, a necessary command was not given, or a correct command was executed improperly or at the wrong time. This perspective is profoundly important because it shifts the focus of safety engineering away from simply reacting to failures and toward proactively designing and enforcing constraints on system behavior. The critical question is no longer "What can break?" but rather "What are the unsafe control actions that could lead to a hazard, and how do we prevent them?"
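To show what "safety as a control problem" can look like in code, here is a minimal, hypothetical sketch of a single control loop in which a safety constraint is enforced on every control action before it reaches the controlled process. The names (ProcessState, enforce_safety_constraint, the 20-metre threshold) are assumptions for illustration, not anything specified in the book.

```python
from dataclasses import dataclass

@dataclass
class ProcessState:
    """Observed state of the controlled process (the vehicle)."""
    speed_kmh: float
    gap_to_obstacle_m: float

def controller(state: ProcessState, target_speed_kmh: float) -> float:
    """Naive controller: accelerate whenever we are below the target speed.
    Returns a throttle command in arbitrary units (negative means brake)."""
    return 2.0 if state.speed_kmh < target_speed_kmh else 0.0

def enforce_safety_constraint(state: ProcessState, throttle: float) -> float:
    """The safety constraint acts on the control action, not on a component:
    never command acceleration when the gap to an obstacle is too small."""
    if state.gap_to_obstacle_m < 20.0 and throttle > 0.0:
        return -5.0   # override the unsafe action: brake instead
    return throttle

# One pass through the control loop.
state = ProcessState(speed_kmh=50.0, gap_to_obstacle_m=15.0)
raw_action = controller(state, target_speed_kmh=60.0)
applied_action = enforce_safety_constraint(state, raw_action)
print(f"commanded throttle: {raw_action}, applied after constraint: {applied_action}")
```

The controller is allowed to be imperfect; what keeps the system out of the hazardous state is the explicit constraint wrapped around its control actions.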
A New Accident Model for a Complex World: STAMP
Key Insight 3
Narrator: To operationalize this new perspective, Leveson developed a new accident model called STAMP, which stands for Systems-Theoretic Accident Model and Processes. Unlike chain-of-events models, STAMP treats accidents as an emergent property of the entire socio-technical system. It doesn't look for a single root cause or a linear path of failure. Instead, it models the system as a hierarchy of control loops.
At every level of a complex system—from the physical hardware to the software, operators, company management, and government regulators—individuals and groups are making decisions and taking actions to control the process. According to STAMP, accidents happen when these control actions are inadequate. This inadequacy can manifest in four ways:
1. A required control action needed to maintain safety is not provided.
2. An unsafe control action is provided, leading to a hazard.
3. A correct control action is provided too early, too late, or in the wrong sequence.
4. A correct control action is stopped too soon or applied for too long.
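In practice, these four categories work like a checklist applied to every control action in the system. As a purely illustrative sketch, one might encode them along the following lines (the member names and the example action are this summary's inventions, not Leveson's wording):

```python
from enum import Enum, auto

class ControlFlaw(Enum):
    """The four ways a control action can be inadequate under STAMP,
    encoded as a reviewer's checklist. Member names are our own shorthand."""
    NOT_PROVIDED = auto()            # a required action is never given
    UNSAFE_PROVIDED = auto()         # an action is given and creates a hazard
    WRONG_TIMING_OR_ORDER = auto()   # correct action, too early/late/out of order
    STOPPED_TOO_SOON_OR_TOO_LONG = auto()  # correct action, wrong duration

# Walking the checklist for one (hypothetical) control action in a design review.
control_action = "close the inlet valve"
for flaw in ControlFlaw:
    question = flaw.name.replace("_", " ").lower()
    print(f"Could '{control_action}' lead to a hazard if {question}?")
```

The same checklist scales up the hierarchy: management decisions and regulatory actions are control actions too, and they can be inadequate in exactly the same four ways.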
STAMP provides a framework and a language to analyze the entire system, including the organizational structure, safety culture, and management decisions that influence the behavior of the engineers and operators. It recognizes that the design of the management structure is just as critical to safety as the design of the software. An organization with poor communication channels or one that prioritizes schedule over safety is creating a flawed control structure that will inevitably lead to unsafe decisions at the engineering level.
Proactive Hazard Analysis with STPA
Key Insight 4
Narrator: While STAMP is the theoretical model, its practical application for engineers is a hazard analysis technique called STPA, or Systems-Theoretic Process Analysis. STPA is a proactive method used early in the design process to identify potential safety issues before the system is even built. It fundamentally changes the goal of hazard analysis.
Instead of brainstorming a list of possible component failures, STPA follows a structured, top-down process. First, it identifies the system-level accidents and hazards that must be prevented. Second, it models the system as a set of control loops and identifies the safety constraints necessary to prevent those hazards. Finally, and most critically, it analyzes the control actions within the system to determine how they could become unsafe. Using the four categories from STAMP, engineers can systematically identify "unsafe control actions" (UCAs).
For example, in designing an autonomous vehicle, a control action is "apply brakes." STPA would force engineers to analyze how this could be unsafe. Providing the action could be unsafe if the car applies brakes unnecessarily on a busy highway, causing a rear-end collision. Not providing the action is unsafe if a pedestrian steps in front of the car. Applying the brakes too late or for too short a duration is also unsafe. By identifying these UCAs, designers can then create specific requirements and design features—like better sensors, fail-safes, or operator warnings—to mitigate them. This is far more powerful than simply asking, "What if the brake pads fail?"
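Kept as structured data, a fragment of an STPA working table for that braking example might look like the sketch below. The contexts, hazards, and wording are invented here to mirror the text's example; they are not drawn from any published analysis.

```python
# A sketch of recording STPA results for a single control action as data.
# Every entry pairs one of the four unsafe-control-action types with the
# context that makes it unsafe and the hazard it leads to (all illustrative).

control_action = "apply brakes"

unsafe_control_actions = [
    {"type": "provided",         "context": "no obstacle ahead on a busy highway",
     "hazard": "rear-end collision from following traffic"},
    {"type": "not provided",     "context": "a pedestrian steps into the vehicle's path",
     "hazard": "collision with the pedestrian"},
    {"type": "too late",         "context": "an obstacle is detected inside stopping distance",
     "hazard": "insufficient deceleration before impact"},
    {"type": "stopped too soon", "context": "braking is released before the vehicle halts",
     "hazard": "residual speed at the point of impact"},
]

# Each UCA becomes a requirement on the design: a constraint plus mitigations
# (better sensing, fail-safes, operator warnings) traced back to it.
for uca in unsafe_control_actions:
    print(f"UCA [{uca['type']}]: '{control_action}' when {uca['context']} "
          f"-> hazard: {uca['hazard']}")
```

The value is traceability: every design feature added to the braking system can be traced back to a specific unsafe control action it is meant to prevent.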
Conclusion
Narrator: The single most important takeaway from An Introduction to System Safety Engineering is that in our deeply interconnected and software-reliant world, safety can no longer be treated as an afterthought or a matter of component reliability. It must be designed into the system from the very beginning as a fundamental control problem. Accidents are not random events caused by broken parts or bad operators; they are systemic failures that arise from flawed designs, inadequate controls, and dysfunctional organizational structures.
Nancy Leveson's work challenges us to move beyond blame and look at the entire system. The next time you hear of a technological disaster, whether it's a software glitch grounding an airline fleet or an automated factory causing an injury, resist the urge to find a single scapegoat. Instead, ask the more difficult, and more important, question: Where did the control structure fail? What safety constraints were violated, and why were the systems—both technical and human—designed in a way that allowed it to happen? Answering that question is the first step toward building a genuinely safer future.