
Key takeaways
A failure at 2pm has a maintenance manager, a production manager, a planner, three technicians and a supervisor within walking distance. A failure at 2am has the operator on the line, an on-call technician 45 minutes away, and a supervisor who may or may not answer the phone. The fix is rarely harder than the daytime equivalent. The decision around the fix is.
The cost shows up two ways. The first is over-escalation: every soft signal becomes a 3am phone call, the on-call technician burns out, and within a quarter the team stops escalating at all. The second is under-escalation: a real critical failure waits until the morning shift, by which point the lost production has compounded.
A runbook is the bridge. It defines, in advance, which failure types get escalated, when, and how. The on-shift operator no longer has to guess.
Class 1: any failure with a safety implication (hot work, hazardous material, machine guarding) or a regulatory implication (food contact, pharma containment). This is a one-step escalation: stop the line, call the supervisor, call EHS. There is no triage; there is no "wait an hour." The runbook makes this the first page so the operator never has to decide.
Class 2: an asset that has stopped the line. The operator runs a 60-second triage (covered below) before calling the on-call technician. The triage either confirms the failure type or buys five minutes of context for the call.
Class 3: the line is still producing, but quality is degrading, with reject rates climbing or dimensions drifting. The runbook says: log it, capture three samples, mark "Watching" in the handoff, and only escalate if rejects exceed a defined threshold within an hour. Most quality drifts at night stabilise on their own or can wait until day shift.
Class 4: a new noise, a slightly different vibration pattern, an intermittent warning light. These get logged and added to the next handoff under "Watching." No escalation. The article on the preventive maintenance schedule covers how those soft signals feed back into PM tuning.
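The four classes above can be read as an ordered set of checks, most severe first. The sketch below is one way to express that ordering, assuming the operator answers three yes/no questions at the asset station; the `Observation` fields and the enum member names are illustrative and not taken from any particular CMMS.

```python
from dataclasses import dataclass
from enum import IntEnum


class FailureClass(IntEnum):
    SAFETY_OR_REGULATORY = 1  # stop the line, call supervisor and EHS
    LINE_STOPPED = 2          # 60-second triage, then on-call technician
    QUALITY_DRIFT = 3         # log, sample, watch for an hour
    SOFT_SIGNAL = 4           # note under "Watching", no escalation


@dataclass
class Observation:
    # Hypothetical yes/no answers given by the operator.
    safety_or_regulatory: bool = False
    line_stopped: bool = False
    quality_degrading: bool = False


def classify(obs: Observation) -> FailureClass:
    """Check the most severe condition first so class 1 always wins."""
    if obs.safety_or_regulatory:
        return FailureClass.SAFETY_OR_REGULATORY
    if obs.line_stopped:
        return FailureClass.LINE_STOPPED
    if obs.quality_degrading:
        return FailureClass.QUALITY_DRIFT
    return FailureClass.SOFT_SIGNAL
```

Ordering the checks this way means a stopped line with a guarding problem is treated as class 1, not class 2, which matches the "no triage" rule for safety and regulatory failures.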
Before any class-2 escalation, the operator runs a fixed triage. The point is not to fix the failure; it is to make the inbound call to the on-call technician useful instead of panicked.
These five fields go to the on-call technician in writing before the phone call. The call itself becomes a 90-second conversation about action, not a 15-minute conversation about what is happening. This is the single highest-leverage step in the runbook.
Each failure class has one tree. The tree is short on purpose.
Class 1: stop line → call supervisor → call EHS → log to CMMS with class-1 flag. No further escalation is needed; the supervisor takes over.
Class 2: run 60-second triage → call on-call technician → log to CMMS with triage data attached. If the on-call technician is unreachable after 10 minutes, call the backup. If the backup is unreachable after another 10 minutes, call the supervisor. After 30 minutes total of no contact, the default action is to hold the line in a safe-stop state and document for day shift.
Class 3: capture three samples → log to CMMS as "in progress, quality watching" → continue running for one hour → if the reject rate is above threshold, call the on-call technician. If the reject rate stabilises or improves, hand off in writing to day shift with the sample data attached. The article on root cause analysis covers the sample-capture protocol that makes day-shift investigation possible.
Class 4: one-line note in the handoff under "Watching." No escalation. Reviewed at morning standup.
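The two trees with timing or threshold logic (line stopped, quality drift) can be sketched as small lookup functions. This is a minimal illustration, not a reference implementation: the function names and action strings are made up for the example, and treating "still below threshold at the end of the hour" as a hand-off is an assumption about how the reject-rate rule is applied.

```python
from typing import Sequence

# Line-stopped tree: minutes with no contact at which each step begins.
CLASS2_STEPS = [
    (0, "call on-call technician"),
    (10, "call backup"),
    (20, "call supervisor"),
    (30, "hold line in safe-stop and document for day shift"),
]


def class2_action(minutes_without_contact: float) -> str:
    """Return the step that applies after this long without reaching anyone."""
    if minutes_without_contact < 0:
        raise ValueError("minutes_without_contact must be non-negative")
    action = CLASS2_STEPS[0][1]
    for starts_at, step in CLASS2_STEPS:
        if minutes_without_contact >= starts_at:
            action = step
    return action


def class3_action(reject_rates: Sequence[float], threshold: float) -> str:
    """Decide after the one-hour watch window.

    reject_rates: readings taken during the hour, as fractions (0.02 = 2%).
    threshold: the site-defined limit; the runbook does not fix a value.
    Assumption: any reading above the threshold triggers the call.
    """
    if not reject_rates:
        raise ValueError("at least one reject-rate reading is required")
    if any(rate > threshold for rate in reject_rates):
        return "call on-call technician"
    return "hand off to day shift with sample data"
```

For example, `class2_action(12)` returns `"call backup"`, and `class3_action([0.02, 0.03, 0.02], threshold=0.05)` returns the day-shift hand-off. The 0.05 threshold here is a placeholder value.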
Most of the triage data is already in the system if the plant runs a unified OEE + CMMS stack. Asset ID, time of failure, prior "Watching" entries and recent OEE events on the asset are all available without the operator typing. The runbook becomes a structured wizard at the asset station, not a paper form.
The escalation tree itself can be automated: a class-2 entry in the CMMS pages the on-call technician with the triage data; if there is no acknowledgement within 10 minutes, the backup is paged. The operator's job becomes confirming the classification and triggering the flow, not running the protocol from memory. See work order management systems for the underlying mechanics.
The risk in a night-shift runbook is over-engineering it on day one. Classes 3 and 4 are tempting to fragment into 12 subcategories; resist this. Four classes, one tree per class, one triage. Anything more complex collapses in week three when the night operator is tired.
A realistic adoption curve:
The runbook works on paper. What changes when the OEE and CMMS systems live in one platform is that the triage fields auto-populate, the "Watching" history from earlier shifts is one tap away, and the escalation tree can fire pages without the operator having to look up numbers in the dark. Fabrico is built so the night-shift operator and the on-call technician see the same row at the same time. To see a runbook structured against your line, book a demo.
The classes and the triage are universal. The escalation tree (who to call) is site-specific. Plants with multiple sites should keep the structure identical and parameterise only the contact list. Inconsistent classifications across sites make group-level rollups useless.
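One way to keep the structure identical across sites is to express each tree as a list of roles and let each site supply only the role-to-number mapping. The sketch below assumes that split; the role names, the `SiteContacts` type and the phone numbers are all placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Shared by every site: who is called for each class, as roles, in order.
# Class 3 lists the technician, who is only called if the reject threshold
# is exceeded. Class 4 has no escalation.
ESCALATION_ROLES: Dict[int, List[str]] = {
    1: ["supervisor", "ehs"],
    2: ["on_call_technician", "backup_technician", "supervisor"],
    3: ["on_call_technician"],
    4: [],
}


@dataclass(frozen=True)
class SiteContacts:
    site: str
    numbers: Dict[str, str]  # role -> phone number (site-specific)


def call_list(failure_class: int, contacts: SiteContacts) -> List[Tuple[str, str]]:
    """Resolve the shared role order against one site's contact list."""
    return [(role, contacts.numbers[role]) for role in ESCALATION_ROLES[failure_class]]


# Hypothetical site with placeholder numbers.
plant_a = SiteContacts(
    site="plant-a",
    numbers={
        "supervisor": "555-0100",
        "ehs": "555-0101",
        "on_call_technician": "555-0102",
        "backup_technician": "555-0103",
    },
)
```

Adding a second site then means adding a second `SiteContacts` value; `ESCALATION_ROLES` stays untouched, so classifications remain comparable in group-level rollups.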
A disagreement between operator and technician over the class is the most useful disagreement to log. Every reclassification (operator said class 2, technician calls it class 3) is a training signal. Two reclassifications on the same operator usually mean the runbook description for that class needs sharper wording, not that the operator is wrong.
Two numbers show whether the runbook is working: average minutes from failure to first contact with the on-call technician (target: under 5 minutes for class 2), and the rate of "wrong class on first call" (target: declining month over month). A runbook that never triggers reclassification is suspiciously perfect; usually it means class 3 and 4 are over-escalating into class 2.
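Both numbers are simple to compute from the escalation log. The sketch below assumes each logged escalation records the minutes to first contact plus the class given by the operator and the class confirmed by the technician; the `EscalationRecord` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class EscalationRecord:
    minutes_to_first_contact: float  # failure time to first technician contact
    operator_class: int              # class chosen by the operator
    technician_class: int            # class confirmed by the technician


def avg_minutes_to_contact(records: Sequence[EscalationRecord]) -> float:
    """Average minutes from failure to first contact."""
    if not records:
        raise ValueError("no escalation records to average")
    return mean(r.minutes_to_first_contact for r in records)


def wrong_class_rate(records: Sequence[EscalationRecord]) -> float:
    """Share of escalations where the technician reclassified the failure."""
    if not records:
        raise ValueError("no escalation records to evaluate")
    wrong = sum(r.operator_class != r.technician_class for r in records)
    return wrong / len(records)
```

Tracked per month, the first figure is compared against the 5-minute target and the second against its own previous value, since the target is a declining trend and not a fixed number.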
Weekends do not need a separate runbook. Weekend on-call is just a different contact list. The classes and the triage are the same. Plants that maintain two separate runbooks usually let one drift while updating the other.
The runbook most often fails when the "Watching" column goes stale because no one reviews the handoff entries. If the soft-signal log is not read by day shift the next morning, operators stop putting effort into it within a quarter, and the early-warning channel collapses. The runbook depends on the handoff being read.