When Nobody Decides It's Fine

Last month, an operator on my team installed the wrong TLS certificate onto a production resource. A single customer lost access for several hours. The operator made no mistake -- they followed the standard operating procedure exactly as written, ran the command it specified, and the system accepted the input without complaint. The SOP was stale. The underlying code had changed months earlier, but the procedure still called for manually specified parameters that no longer matched. The system was designed to accept arbitrary input, so it did exactly what it was told, even when what it was told was wrong.

I extended the system a year ago to support additional functionality. I didn't know the SOP existed -- it was on a hidden subpage, not linked from where all other procedures live. From my perspective, the system was only ever invoked through automation, which I had updated correctly. A legacy manual path predating my time on the team also called the same system with hand-typed parameters. When I changed the code, the SOP became stale, and no mechanism existed to surface that coupling.

Vaughan (1996) calls this structural secrecy: information exists in the system but is architecturally hidden from those who need it. The stale SOP, the silent default for the missing input, the absent tooling, the original design shortcut -- these are latent failures (Reason, 1990), present in the system but dormant until the right combination of conditions surfaces them.

The Fix Was Fast. The Problem Was Slow.

When the incident happened, I was paged that night as backup. Because I knew the system inside out, I traced the issue within minutes, determined the mitigation, and handed it to the primary operator to execute while I worked on the permanent fixes. Within two hours, I had updated the SOP, changed the system to read directly from the authoritative database instead of accepting arbitrary input, built a tool that performs the operation automatically, and updated the SOP again to reference the tool. Each step removed more human involvement from the error-prone part of the process. The final state: even if someone runs the raw command, the system itself no longer accepts arbitrary input. The failure mode is designed out, not trained out.
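The shape of that final state can be sketched in a few lines. This is a minimal illustration, not the actual system: the names `AUTHORITATIVE_CERTS`, `CertRecord`, `resolve_certificate`, and `install_certificate` are hypothetical, and an in-memory dictionary stands in for the authoritative database. The point is that the entry point takes only a resource identifier, so there is no parameter through which a stale SOP could supply the wrong certificate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CertRecord:
    resource_id: str
    cert_id: str


# Hypothetical stand-in for the authoritative database (an assumption for
# illustration; the real source of truth is not described in this post).
AUTHORITATIVE_CERTS = {
    "resource-a": CertRecord("resource-a", "cert-111"),
    "resource-b": CertRecord("resource-b", "cert-222"),
}


class UnknownResourceError(LookupError):
    """Raised when the authoritative source has no record for a resource."""


def resolve_certificate(resource_id: str) -> CertRecord:
    # Look the certificate up instead of trusting a hand-typed value.
    try:
        return AUTHORITATIVE_CERTS[resource_id]
    except KeyError:
        raise UnknownResourceError(resource_id) from None


def install_certificate(resource_id: str) -> str:
    # The signature has no certificate parameter, so a caller cannot
    # pass one, whether by mistake or by following an outdated procedure.
    record = resolve_certificate(resource_id)
    return f"installed {record.cert_id} on {record.resource_id}"
```

Under these assumptions, a call such as `install_certificate("resource-a", cert_id="cert-999")` fails with a `TypeError` before anything touches production, which is what "designed out, not trained out" looks like at the interface level.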

I didn't need to brainstorm or wait for consensus. I had been advocating against arbitrary input in our SOPs for three years. I had the expertise, and AI-assisted development collapsed what would have been days of implementation into hours. Klein and colleagues found that in 80% or more of cases, professionals with expertise make critical decisions "based on meaningful situational pattern recognition grounded in experience -- rather than by conscious deliberation" (as cited in Patterson, 2017, p. 105). Three years of thinking about the class of vulnerability meant I recognized this instance immediately and already knew the fix.

But the fix being fast doesn't mean the problem was simple. It means the analytical work had already been done -- over three years of thinking about it, plus six months of building the certificate management system itself. The two-hour resolution was three-years-and-two-hours fast.

Why Nobody Fixed It Earlier

I had been flagging this exact class of vulnerability for over three years. Our SOPs routinely called for operators to run raw commands with arbitrary parameters against production infrastructure. I documented the risk, proposed solutions -- tooling to eliminate manual input, system changes to read from authoritative sources -- and raised it repeatedly. The response was always the same: acknowledged, deprioritized, deferred.

The issue never became the squeaky wheel because it had never been attributed to a visible failure -- though it had caused plenty. The causal chain was never traced back to arbitrary input. This is the availability heuristic operating at the organizational level: we judge the probability of an event by how easy it is to think of examples (Goldstein, 2019). A risk that has never been named produces no examples to recall, so it feels improbable.

Vaughan (1996) calls the broader pattern the normalization of deviance. When a risky practice repeatedly does not cause harm, it becomes redefined as acceptable. Each successful execution reinforces the belief that it's safe. Once a configuration and "its known lack of perfection" is accepted by higher levels, it "tends to be a little bit of an umbrella for subsequent decisions" (NASA engineer, as quoted in Vaughan, 1996, p. 112). No one decided this was fine. It just never broke, so it stayed.

Rasmussen (1997) provides the structural explanation. Systems operate within boundaries -- economic failure, unacceptable workload, and safety. Two gradients constantly push the system: management pressure toward efficiency and the effort gradient where people naturally find the easiest path. Wickens (2014) formalizes this as a decision tree: one path is safe but effortful (build the tooling), the other is risky but low-effort (defer). When the risk is abstract and the effort is concrete, the low-effort path wins.

My three years of warnings were attempts to create what Rasmussen calls a counter-gradient -- pressure back from the boundary. But counter-gradients only work as continuous force. The moment you stop pushing, the drift resumes. Rasmussen argues that "the most promising general approach to improved risk management appears to be an explicit identification of the boundaries of safe operation together with efforts to make these boundaries visible to the actors" (p. 192). That's what I tried to do with documents and proposals. It didn't work until the boundary was crossed.

The Humans Were the Safety System

Cook (2000) writes that "failure free operations are the result of activities of people who work to keep the system within the boundaries of tolerable performance." The safety of running arbitrary commands came not from the commands being correct, but from operators exercising judgment -- adapting the arguments on the fly when something seemed wrong, then updating the SOP with whatever worked. Cook calls practitioners "the adaptable element of complex systems," and this adaptability is recognition-primed decision-making (Klein, 1998) in action: experienced operators pattern-match against prior encounters and adjust.

The same flexibility that made the system dangerous also made it survivable -- until this incident, where the failure was invisible at the point of execution. The input was not wrong; it was incomplete. The system accepted it silently, defaulting a missing parameter rather than flagging it. The operator had no signal that anything was missing, and no opportunity to exercise judgment. In Rasmussen's terms, this is an invisible boundary: the system was already in a degraded state, but nothing made the proximity to failure visible to the actors.
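The difference between a silent default and a visible boundary can be shown with a small sketch. Everything here is hypothetical: the parameter names (`resource_id`, `cert_id`, `chain`), the fallback value, and both functions are assumptions chosen for illustration, since the post does not say which parameter was defaulted. The lenient version mirrors the behavior described above; the strict version is one way to give the operator a signal.

```python
DEFAULT_CHAIN = "default-chain"  # hypothetical fallback value

REQUIRED = ("resource_id", "cert_id", "chain")


class MissingParameterError(ValueError):
    """Raised when a required parameter is absent from the request."""


def build_request_lenient(params: dict) -> dict:
    # Incomplete input is accepted: a missing "chain" is filled in silently,
    # so the caller gets no indication that anything was left out.
    return {
        "resource_id": params["resource_id"],
        "cert_id": params["cert_id"],
        "chain": params.get("chain", DEFAULT_CHAIN),
    }


def build_request_strict(params: dict) -> dict:
    # Incomplete input is rejected, and the error names what is missing.
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise MissingParameterError(
            "missing required parameter(s): " + ", ".join(missing)
        )
    return {name: params[name] for name in REQUIRED}
```

With the strict version, the failure moves to the point of execution, where the operator can still exercise judgment.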

The Investigation Almost Failed Too

When the incident happened, the blameless investigation almost stopped too shallow. A post-incident analysis that identifies "system accepted arbitrary input" as the root cause produces a narrow fix -- hardening that one interface -- while the deeper question goes unasked: why the organization normalized running arbitrary commands against production systems in the first place. This is what Dekker (2017) calls the old view of human error: find the broken component, replace it, declare the system safe. It leaves the normalization of deviance (Vaughan, 1996) untouched: when the harm a risky practice causes is attributed to other factors, the practice stays acceptable, and the systemic conditions that made it unremarkable persist unchallenged. Asking the deeper question requires the kind of systemic thinking that Cook (2000) argues is essential but that organizational pressure toward rapid closure discourages. When investigations stop at the proximate cause, the same class of failure recurs under different surface circumstances. The process is designed to produce closure, not understanding -- and the design of the investigation process itself becomes a latent failure.

I pushed the investigation to go deeper. The broader finding was initially accepted -- then narrowed, then dropped entirely because it felt too broad. The organization's own investigation process reproduced the pattern it was supposed to interrupt.

When Operators Default to Rules

Butler et al. (2021) found that firefighters were less likely to exercise operational discretion -- departing from SOPs -- in high-stress emergency scenarios than in routine ones, even when the emergency was precisely the condition that licensed discretion. They suggest this reflects acute stress shifting cognition toward rule-based processing.

I see the same pattern in DevOps, but I'm not convinced stress is the primary mechanism. What the high-stress scenarios in Butler et al. also share is unfamiliarity -- emergencies are by definition non-routine, meaning the operator has fewer prior encounters to draw on. Their scenarios confound these variables: a high-rise fire falls within the domain firefighters are trained for; a scenario with children in a sinkhole does not. Stress and unfamiliarity co-vary in their design.

In my experience, operators default to SOPs in unfamiliar situations not because stress degrades their capacity to think, but because they don't trust their own judgment without the pattern library to recognize what they're looking at. Staal (2004) describes the resulting vicious circle: perceived poor performance produces an emotional response that further decreases performance, deepening the perception. The fix isn't stress inoculation -- it's building the pattern library through varied exposure so the operator recognizes the situation and trusts their judgment when it matters.

The Effort Equation Has Changed

The first investigation into a similar issue, five years ago, improved the SOP but stopped short. It replaced a fully manual process with an automated system that still accepted arbitrary input. The investigation stopped at the proximate cause -- the stale procedure -- while the systemic condition persisted unchallenged. The deeper question was invisible because the practice had been normalized: arbitrary input had never been named as the problem, so there were no examples to recall and the risk felt improbable.

The system could have been fully automated from the start, but the cost made manual input the accepted shortcut -- Wickens's decision tree in action. What has changed is that AI-assisted development has made full automation the low-effort option. The gradient that once pushed toward manual shortcuts now pushes toward proper tooling.

But only for those whose mental model of effort has updated. When decision-makers still calibrate against the old cost -- weeks of implementation, not hours -- a two-hour fix looks suspicious rather than correct. The speed itself becomes grounds for skepticism, and the gradient that should now push toward safety gets blocked by an outdated intuition about how long good work takes.

Cook (2000) reminds us that "complex systems run in degraded mode" and "catastrophe is always just around the corner." The system doesn't need to be visibly broken to be one perturbation away from failure. A tool that encapsulates the operation would survive code changes -- the SOP would name only the tool, so it could no longer drift out of sync with the code. Better still, the system itself could read the data it needs directly, eliminating the manual procedure entirely.

The absence of failure in a system drifting toward its boundary is not evidence of safety. It is evidence that the boundary has not yet been crossed.

References

Butler, P. C., Honey, R. C., & Cohen-Hatton, S. R. (2021). Decision making within and outside standard operating procedures: Paradoxical use of operational discretion in firefighters. Human Factors, 63(8), 1378--1393. https://doi.org/10.1177/00187208211041860

Cook, R. I. (2000). How complex systems fail. Cognitive Technologies Laboratory, University of Chicago. https://how.complexsystems.fail

Dekker, S. (2017). The field guide to understanding human error (3rd ed.). CRC Press.

Goldstein, E. B. (2019). Cognitive psychology: Connecting mind, research, and everyday experience (5th ed.). Cengage Learning.

Klein, G. (1998). Sources of power: How people make decisions. MIT Press.

Patterson, R. E. (2017). Intuitive cognition and models of human-automation interaction. Human Factors, 59(1), 101--115. https://doi.org/10.1177/0018720816659796

Rasmussen, J. (1997). Risk management in a dynamic society: A modelling problem. Safety Science, 27(2), 183--213. https://doi.org/10.1016/S0925-7535(97)00052-0

Reason, J. (1990). Human error. Cambridge University Press. https://doi.org/10.1017/CBO9781139062367

Staal, M. A. (2004). Stress, cognition, and human performance: A literature review and conceptual framework (NASA/TM-2004-212824). National Aeronautics and Space Administration.

Vaughan, D. (1996). The Challenger launch decision: Risky technology, culture, and deviance at NASA. University of Chicago Press.

Wickens, C. D. (2014). Effort in human factors performance and decision making. Human Factors, 56(8), 1329--1336. https://doi.org/10.1177/0018720814558419
