The Dependency You Inherited at 2am
A production system fails outside business hours. The fix requires understanding a component integrated under pressure and never properly documented. The on-call engineer pays the cost of decisions they never made.
Summary: A system breaks at an unhelpful hour. The person on call is not the person who built the part that broke. The person who built it is not on the team anymore, or is asleep, or never wrote the runbook because the system was supposed to be temporary. The cost of every shortcut taken six months ago is now being paid in real time by someone who did not take any of them.
Pattern
A production system fails outside business hours. The fix requires understanding a component that was integrated quickly during a sprint that has since been forgotten. The person who integrated it is unavailable. The documentation does not exist or does not match the current state. The on-call engineer has to reverse-engineer the component from logs while it is still on fire.
What tends to happen
The integration was added under pressure. A vendor needed to be wired in, or a one-off data pipeline needed to be stood up, or a workaround needed to be patched in. The work was done. It worked. The team moved on.
In the months that followed, the component continued to work. Nobody had reason to revisit it. Documentation was not written because nothing was breaking. The institutional memory of how the component was wired in lived in one person’s head.
That person is no longer the one being paged.
The on-call engineer reads logs they do not fully understand, talks to a vendor support agent who has no context on the integration, and tries to construct a working theory in real time. The mean-time-to-recovery is determined less by the severity of the actual fault and more by how much of someone else’s mental model has to be rebuilt before any action can be taken.
Why it fails
The shortcut taken at integration time was not technically wrong. It was procedurally fragile. It assumed that the person who understood the system would always be reachable, or that the system would be replaced before its understanding mattered, or that the failure mode it was designed to handle was the only failure mode it would ever face.
None of those assumptions are unreasonable in the moment they are made. They are all unreasonable in aggregate, across a team’s lifetime.
Human layer
On-call work has a specific cognitive shape. The engineer is operating with reduced sleep, reduced context, and elevated time pressure. The cost of every undocumented assumption made by someone else is now being paid by someone with the worst possible toolkit for paying it.
This is not a fairness problem. It is a design problem. The system was implicitly designed assuming that the person who built it would always be the one repairing it.
System layer
The system has no map of its own integrations from the perspective of someone who has never seen them. No reading order for the runbooks. No clear list of which components have known fragility versus which are well-understood. The on-call engineer does not have a way to triage what they do not know before the incident forces them to discover it.
What it costs
The direct cost is the duration of the outage. The indirect cost is what the engineer does after the fire is out.
Sometimes they write the documentation that should have existed. Sometimes they file a ticket to refactor the fragile component. Sometimes they say nothing because it is 4am and they have to be in a meeting in five hours.
The category of cost that goes unmeasured is the slow erosion of trust between engineers and the systems they inherited. Each unfamiliar component that breaks at 2am makes the next on-call rotation a little more dread-shaped.
Reduction path
The first intervention is to treat documentation not as a project deliverable but as an integration deliverable. A component is not done when it works. It is done when someone who has never seen it can debug it at 2am.
The second is to identify the components in a system that depend on a single person’s understanding, and to either remove that dependency or label it clearly. Knowing where the fragility lives is a precondition for managing it.
The third is to recognize that the engineer paged at 2am is not just being asked to fix the system. They are being asked to absorb every assumption that anyone else ever skipped writing down. Reducing that burden is a system-design problem, not an individual-resilience problem.