When the Automation Fails

The Irony

We automate things because humans make mistakes. Then the automation removes the practice that kept humans sharp. When the automation eventually fails -- and it always does -- the humans are less prepared than they were before we automated.

This isn't a new observation. In 1951, Paul Fitts proposed a simple framework for dividing work between humans and machines: list what each is good at, and assign tasks accordingly. Humans are good at pattern recognition, flexible reasoning, and improvisation. Machines are good at repetitive computation, sustained monitoring, and consistent execution. Give each the work it does best (Fitts, 1951).

This became known as Fitts' list, and it shaped how systems were designed for decades. It's intuitive. It's also incomplete.

In 1983, Lisanne Bainbridge published a short paper called "Ironies of Automation." She showed what Fitts' list misses: the designer automates the routine tasks and leaves the operator responsible for the tasks that couldn't be automated -- the complex, ambiguous, and unpredictable ones. Simultaneously, the operator's skills degrade from disuse. When automation fails and those skills are needed most, the operator is least prepared to use them (Bainbridge, 1983).

She noted that the knowledge needed to operate manually "develops only through use and feedback about its effectiveness" and that automation can "camouflage system failure by controlling against the variable changes, so that trends do not become apparent until they are beyond control." The automation doesn't just replace the operator's actions -- it hides the information the operator would need to develop and maintain the skills to work without it.

More than three decades later, Strauch (2018) revisited Bainbridge's ironies and found them still unresolved. The same kinds of operator errors kept recurring across decades despite industry-wide awareness of the problem.

What This Looks Like in Practice

Think about a diagnostic tool that collects information from fifteen different sources, correlates the data, and presents a conclusion. The operator runs the tool, reads the output, acts on it. The automation calls the same tool on a schedule. Everything works.

Then the tool breaks. Or the situation is novel enough that the tool's correlation doesn't produce a useful answer. Now the operator needs to do what the tool was doing: go to those fifteen sources, pull the data, interpret the raw output, and reason about what it means.

Do they know what those fifteen sources are? Do they know how to read the raw data without the tool's presentation layer? Do they understand why the tool correlates certain signals together?

The tool handled the how. Over time, the operator loses the why. They can run every tool in the toolkit but can't reason about the system when the tools don't give them an answer.
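
One way to keep the why next to the how is for a tool to return its raw inputs, and the reason each one is consulted, alongside its conclusion. The sketch below is a hypothetical illustration in Python, not a description of any real tool: the Source, Finding, Report, and diagnose names, the simple threshold rule, and the two example sources are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Source:
    """One input the tool consults (hypothetical structure)."""
    name: str
    fetch: Callable[[], float]  # how to pull the raw value
    rationale: str              # why this signal is part of the diagnosis
    threshold: float            # assumed rule: values above this are abnormal


@dataclass
class Finding:
    source: str
    raw_value: float
    rationale: str
    abnormal: bool


@dataclass
class Report:
    conclusion: str
    findings: List[Finding]


def diagnose(sources: List[Source]) -> Report:
    """Collect every source and keep the raw evidence next to the conclusion."""
    findings = []
    for src in sources:
        value = src.fetch()
        findings.append(
            Finding(src.name, value, src.rationale, value > src.threshold)
        )
    abnormal = [f.source for f in findings if f.abnormal]
    conclusion = (
        "suspect: " + ", ".join(abnormal) if abnormal else "no abnormal signals"
    )
    return Report(conclusion, findings)


# Two made-up sources standing in for the fifteen.
example_sources = [
    Source("replication_lag_s", lambda: 42.0,
           "lag grows before failover problems", 30.0),
    Source("disk_used_pct", lambda: 61.0,
           "full disks stall writes", 90.0),
]
report = diagnose(example_sources)
for f in report.findings:
    print(f"{f.source}={f.raw_value} abnormal={f.abnormal} why: {f.rationale}")
print(report.conclusion)
```

An operator who reads a report like this sees which sources were consulted and why on every run, which is at least a small amount of the practice the automation would otherwise remove.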

The best-known illustration comes from aviation. When Air France 447's autopilot disconnected due to iced pitot tubes, the pilots lost reliable airspeed readings. The instruments gave contradictory data. The problem wasn't that the pilots couldn't fly manually -- it was that they couldn't diagnose what was happening when the instruments they relied on were unreliable. Oliver et al. (2017) found that the automation that made the system safe under normal conditions had eroded the crew's capacity to handle disturbances -- the "paradox of almost totally safe systems": the rarer the failure, the less prepared operators are to handle it.

When There's No Tool at All

Bainbridge's irony assumes the operator is doing the same job the automation did, just less well. But there's a worse case: when the automation fails and there's no proper tool for the manual path at all.

On January 31, 2017, a GitLab engineer was troubleshooting a database replication failure. After hours of unsuccessful debugging past midnight, they ran rm -rf on a database directory, intending to wipe the secondary database. They were on the primary. Around 300 GB of production data was deleted in seconds (GitLab, 2017).

The command was executed correctly. The engineer did what they intended to do. But there was no tool for this operation -- just a terminal and a procedure. No confirmation step, no restricted access, no guardrails. The terminal prompts barely distinguished between the primary and secondary servers.
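
A guardrail for this kind of operation does not need to be elaborate. The following is a minimal sketch in Python of a confirmation step, under stated assumptions: the function name confirm_destructive_action, the GuardrailError exception, and the idea that the caller already knows the host's role are all hypothetical, and nothing here reflects GitLab's actual tooling.

```python
import socket
from typing import Callable, Optional


class GuardrailError(Exception):
    """Raised when a destructive action is refused."""


def confirm_destructive_action(
    target_path: str,
    expected_role: str,
    actual_role: str,
    hostname: Optional[str] = None,
    prompt: Callable[[str], str] = input,
) -> bool:
    """Refuse unless the host has the expected role and the operator
    retypes the hostname.

    How actual_role is determined (config file, database query, ...) is
    left to the caller; that lookup is an assumption outside this sketch.
    """
    host = hostname or socket.gethostname()
    if actual_role != expected_role:
        raise GuardrailError(
            f"{host} is '{actual_role}', but this action is only "
            f"allowed on '{expected_role}'"
        )
    typed = prompt(
        f"About to delete {target_path} on {host} ({actual_role}). "
        "Type the hostname to confirm: "
    )
    if typed.strip() != host:
        raise GuardrailError(
            "hostname confirmation did not match; nothing was deleted"
        )
    return True
```

The function deletes nothing itself. It only returns True when both checks pass, so the caller decides what to run afterwards. Making the operator retype the hostname forces a moment of attention on exactly the detail the terminal prompt made easy to miss.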

I've seen the same pattern in my own work. An automated maintenance procedure was paused because it would restart a resource, and the customer couldn't tolerate a restart. The maintenance itself didn't require one -- the automation simply bundled the two together. So an operator had to perform the maintenance manually. No dedicated tool existed, only a written procedure, and that procedure hadn't been updated when the code changed. All procedures were supposed to live on a single searchable page, but this one was buried on a sub-page reachable only through a link, so nobody found it when the code was updated. The operator executed it exactly as written and still caused an issue, because the procedure and the system had diverged.

The operator did nothing wrong. The automation was the tool, and when it couldn't run, there was nothing purpose-built to fall back to.

This is a different problem from Bainbridge's skill erosion, but it compounds it. The operator's skills have atrophied because the automation was handling the work. And the manual path they're forced onto doesn't even have proper tooling -- just raw commands and outdated documents.
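
Drift between a written procedure and the code it depends on is something a small automated check can at least surface. The sketch below, in Python, assumes each procedure records a fingerprint of the code it was last verified against. The fingerprint and procedure_is_stale names and the example code strings are hypothetical and are not taken from any real system.

```python
import hashlib


def fingerprint(code: bytes) -> str:
    """Return a SHA-256 hex digest of the code a procedure depends on."""
    return hashlib.sha256(code).hexdigest()


def procedure_is_stale(verified_fingerprint: str, current_code: bytes) -> bool:
    """True when the code has changed since the procedure was last verified."""
    return fingerprint(current_code) != verified_fingerprint


# Hypothetical example: the procedure was verified against version 1.
code_v1 = b"run_maintenance(restart=True)\n"
code_v2 = b"run_maintenance(restart=False)\n"
recorded = fingerprint(code_v1)

print(procedure_is_stale(recorded, code_v1))  # False: nothing changed
print(procedure_is_stale(recorded, code_v2))  # True: procedure needs review
```

A check like this cannot say whether the procedure is still correct, only that the code changed after the procedure was last reviewed. That is still enough to prompt someone to look before an operator has to rely on it.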

What You Lose When People Leave

Both problems compound with a third: knowledge loss through turnover.

I'm experiencing this right now. I'm moving to a new team. I've done my best to transfer what I know -- written documentation, walkthroughs, recorded decisions and their rationale. But I know there are things I'm taking with me that I couldn't fully write down. The intuition for when a metric looks "off" before it triggers an alert. The knowledge of which service is the actual bottleneck during a specific traffic pattern. The memory of why we made a particular design decision three years ago, and what we tried first that didn't work.

Levallet and Chan (2018) found that implicit expert knowledge -- the kind that is most difficult to codify and most critical during non-routine situations -- is the most vulnerable to loss when someone leaves. Joining a new team means facing the flip side: the things everyone knows but nobody has written down, the failure modes that live only in the memories of the people who experienced them.

Hoffman (2008) found that what we call "tacit" knowledge is better characterized as "inert" knowledge: it can be articulated with the right scaffolding and prompting, but it's accessed only in specific contexts. The knowledge isn't untransferable -- organizations just rarely invest in the structured processes that would make transfer possible.

The person who understood why the diagnostic tool checked those specific sources -- which signals actually mattered during a particular class of incident, what the raw data meant in context, which correlations were reliable and which were coincidental -- that person just left the team. The code tells you what the tool does. It doesn't tell you why it was built that way.

The System Looks Fine Until It Doesn't

Cook (2000) ties all of this together, but his argument goes further than "complex systems are degraded." His central point is that the reliability we observe in complex systems is not a property of the system itself. It is produced, continuously, by the people operating it.

Operators notice things that monitoring misses. They work around known flaws. They compensate for design shortcomings that were never fixed. They carry context about the system's actual behavior -- not the behavior described in the architecture documents, but the real behavior, with all its edge cases and failure modes and things that only go wrong on the third Tuesday after a deployment. The system looks like it's working because the operators are making it work. The safety is their output, not the system's attribute.

This is what makes the problems described in this article, along with a fourth that follows from them, so dangerous. Each one is a way that load-bearing capacity quietly disappears from the system.

Skill erosion: the operators are still there, but less capable. The automation handled the routine work, and the routine work was where they built and maintained the understanding needed for the non-routine work. They're present but diminished.

Missing tooling: the operators are capable but unequipped. They know what needs to happen, but the path to doing it manually is unprotected -- raw commands, stale procedures, no guardrails. The automation was the tool, and now it's gone.

Knowledge loss: the operators are gone entirely. The person who understood why the system behaved a certain way, who remembered what was tried before and why it failed, who could look at a dashboard and see what it wasn't showing -- that person moved to another team, or left the company, or retired. Their knowledge left with them.

Inexperience: the new operators never had the skills to begin with. They arrived after the automation was already in place. The environment that would have built their understanding -- the manual work, the raw data, the feedback from getting it wrong and learning why -- no longer exists. The tenured operators developed their intuition through years of hands-on practice before the system was automated. The new ones have no equivalent path. Omron et al. (2018) showed that even in medicine, clinical experience alone cannot build expertise for rare events because the feedback loop is broken -- clinicians almost never learn about the cases they miss. Simulation-based training with deliberate practice was proposed as the alternative, compressing the learning that real-world experience cannot provide. The same applies here: when automation handles everything, the only way to build the skills needed for when it doesn't is through simulation. Most organizations don't invest in it.

In all four cases, nothing visibly changes. The metrics were never measuring what the humans were contributing. They were measuring the output of a system that included the humans and attributing all of it to the automation. The dashboards show the result of human adaptation and call it system performance.

So the safety margin narrows. The automation is handling everything. The dashboards are green. The people who were compensating for the system's flaws are less skilled, or less equipped, or gone -- or never learned how in the first place. And nobody notices, because the system's apparent reliability -- the only kind anyone measures -- remains unblemished.

Right up until the moment it isn't.

References

Bainbridge, L. (1983). Ironies of automation. Automatica, 19(6), 775--779. https://doi.org/10.1016/0005-1098(83)90046-8

Cook, R. I. (2000). How complex systems fail. Cognitive Technologies Laboratory, University of Chicago. https://how.complexsystems.fail/

Fitts, P. M. (Ed.). (1951). Human engineering for an effective air-navigation and traffic-control system. National Research Council.

GitLab. (2017, February 10). Postmortem of database outage of January 31. https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

Hoffman, R. R. (2008). Human factors contributions to knowledge elicitation. Human Factors, 50(3), 481--488. https://doi.org/10.1518/001872008X312152

Levallet, N., & Chan, Y. E. (2018). Knowledge loss and retention: The paradox of organizational forgetting. Journal of Knowledge Management, 23(1), 1--22. https://doi.org/10.1108/JKM-08-2017-0358

Oliver, N., Calvard, T., & Potočnik, K. (2017). Cognition, technology, and organizational limits: Lessons from the Air France 447 disaster. Organization Science, 28(4), 729--743. https://doi.org/10.1287/orsc.2017.1138

Omron, R., Kotwal, S., Garibaldi, B. T., & Newman-Toker, D. E. (2018). The diagnostic performance feedback "calibration gap": Why clinical experience alone is not enough to prevent serious diagnostic errors. AEM Education and Training, 2(4), 339--342. https://doi.org/10.1002/aet2.10119

Strauch, B. (2018). Ironies of automation: Still unresolved after all these years. IEEE Transactions on Human-Machine Systems, 48(5), 419--433. https://doi.org/10.1109/THMS.2017.2732506
