
Key takeaways
The reason most "downtime reduction" initiatives fail is not effort. It is that they start from the wrong place. A new dashboard goes up, an OEE number is published, a kaizen team is assembled, and within two months the production manager, the maintenance manager and the planner are each looking at a different downtime number and quietly assuming the other two are wrong.
That gap is real, and it is structural. Downtime is recorded by operators on the line in one system, captured by maintenance in a CMMS or paper log in another, and inferred by planning from missed production targets in a third. No two of those views see the same thing. The operator counts only stops they had time to log. Maintenance counts only stops that escalated to a work order. Planning counts only stops that broke the schedule. The total never matches and nobody trusts the average.
A 90-day playbook works because it does not try to reorganize the plant. It builds one trusted downtime register, routes every loss type to a single owner, and lets the volume of fixes do the rest.
The fastest way to lose the next 90 days is to instrument every line at once. Pick the three lines that account for the largest share of unplanned downtime, or, if you do not know that yet, the three lines whose operators complain most. Three lines is enough to see the pattern and small enough to fix in one quarter.
Every plant has an unwritten threshold below which a stop "does not count." On most lines that threshold is anywhere from 30 seconds to 5 minutes. Pick one, write it down, and apply it the same way on all three lines. The number matters less than the consistency. A 90-second threshold applied to every line beats a 0-second threshold applied selectively. For a deeper view on how those micro-stops add up, see the article on production loss analysis.
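A minimal sketch of applying one written-down threshold the same way on every line, assuming stops are available as (line, duration in seconds) pairs. The 90-second value and the names MIN_STOP_SECONDS and countable_stops are illustrative assumptions, not part of any particular tool.

```python
# Assumed plant-wide threshold in seconds; the exact value matters less
# than applying it identically to every line.
MIN_STOP_SECONDS = 90

def countable_stops(stops, min_seconds=MIN_STOP_SECONDS):
    """Keep only stops at or above the threshold, regardless of line."""
    return [(line, secs) for line, secs in stops if secs >= min_seconds]

# Made-up sample stops for illustration.
stops = [("Line 1", 45), ("Line 1", 120), ("Line 2", 90), ("Line 3", 30)]
kept = countable_stops(stops)
```

Because the filter takes no per-line argument, there is no way to apply a looser rule to one line without changing the rule for all of them.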
The register can live in a CMMS, an OEE tool or a spreadsheet; what matters is that maintenance, production and planning all read from it. Each entry needs four fields: line, start time, duration, and a free-text reason. Categorisation comes later. Trying to define the perfect 32-reason taxonomy on day one is how the project dies in week three. Start with free text; cluster after week four.
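As a rough illustration, the four-field entry could be modelled like this. The class name DowntimeEntry, the field names and the sample values are assumptions made for the sketch; the same shape works as four spreadsheet columns.

```python
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass(frozen=True)
class DowntimeEntry:
    line: str            # which production line stopped
    start: datetime      # when the stop began
    duration_min: float  # how long it lasted, in minutes
    reason: str          # free text; no category field yet, by design

# Made-up example entry.
entry = DowntimeEntry(
    line="Line 2",
    start=datetime(2024, 3, 4, 6, 15),
    duration_min=12.5,
    reason="guide rail drifted after changeover",
)
row = asdict(entry)  # plain dict, ready to write to a spreadsheet or CSV
```

Leaving out a category field keeps the day-one taxonomy debate out of the data model; a loss type can be derived from the free text later.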
At the end of day 30, reconcile the register against the maintenance log and the planning gap report for the same three lines. Expect them to disagree by 20-40%. Walk the floor and ask the operators which of the three is closest. Publish the reconciled number as the downtime number for those lines. Do not republish a "corrected" number a week later. Trust beats accuracy in this phase.
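One way to put a number on the disagreement before walking the floor, sketched under the assumption that each source can produce a total in minutes per line. The function name and the choice to measure the spread against the highest total are assumptions; the playbook does not prescribe a formula.

```python
def disagreement_pct(register_min, maintenance_min, planning_min):
    """Spread between the highest and lowest total, as a share of the highest."""
    totals = [register_min, maintenance_min, planning_min]
    highest = max(totals)
    if highest == 0:
        return 0.0  # nothing recorded anywhere, so nothing to disagree about
    return 100.0 * (highest - min(totals)) / highest

# Made-up monthly totals in minutes for one line.
gap = disagreement_pct(register_min=1000, maintenance_min=700, planning_min=850)
```

With these sample totals the gap is 30%, inside the 20-40% range the reconciliation should expect.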
By day 30 the register has between 300 and 1,500 entries. Cluster the free-text reasons into seven to ten loss types: changeover, micro-stops, minor mechanical, electrical/sensor, material starvation, quality reject, planned-but-overran, operator-related, and an "other" bucket. More than ten types is too granular to act on; fewer than seven hides the real distribution.
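A first clustering pass can be as simple as keyword matching over the free text. The sketch below assumes a hand-written keyword table; the keywords, the subset of loss types shown and the sample reasons are all hypothetical, and a real plant would derive them by reading its own register.

```python
from collections import Counter

# Hypothetical keyword rules. The first matching loss type wins,
# so order the table from most to least specific.
KEYWORDS = {
    "changeover": ["changeover", "format change"],
    "electrical/sensor": ["sensor", "photo eye", "fuse"],
    "material starvation": ["no material", "waiting for", "starved"],
    "minor mechanical": ["jam", "guide rail", "belt"],
}

def classify(reason):
    """Map a free-text reason to a loss type, falling back to 'other'."""
    text = reason.lower()
    for loss_type, words in KEYWORDS.items():
        if any(word in text for word in words):
            return loss_type
    return "other"

# Made-up free-text reasons.
reasons = [
    "Sensor false trigger at infeed",
    "waiting for film from warehouse",
    "Guide rail drifted",
    "changeover ran long",
    "operator on break",
]
counts = Counter(classify(r) for r in reasons)
```

Anything that falls through to "other" is a prompt to read the entry and either add a keyword or accept that it does not recur.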
This is the single biggest leverage point in the 90 days. Each loss type gets one named owner: a person, not a department. Changeover goes to the production manager. Minor mechanical and electrical go to the maintenance manager. Material starvation goes to the planner. The "other" bucket has no owner; it stays in the register but does not get worked until it shrinks below 5%. The point is not perfection; it is making sure no loss type has two owners (which means none) or zero owners (which means none).
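The one-owner rule can be checked mechanically. In this sketch a dictionary maps each loss type to exactly one owner, so two owners are impossible by construction, and a helper lists any type left without one. The role names stand in for named people, and the names OWNERS, UNOWNED_BY_DESIGN and unowned are assumptions.

```python
# One value per key: a loss type cannot have two owners in this structure.
# Roles are placeholders here; in practice each value is a person's name.
OWNERS = {
    "changeover": "production manager",
    "minor mechanical": "maintenance manager",
    "electrical/sensor": "maintenance manager",
    "material starvation": "planner",
}
UNOWNED_BY_DESIGN = {"other"}  # stays in the register without an owner

def unowned(loss_types, owners=OWNERS):
    """Return loss types that lack an owner and are not deliberately unowned."""
    return sorted(
        t for t in loss_types
        if t not in owners and t not in UNOWNED_BY_DESIGN
    )

missing = unowned(["changeover", "micro-stops", "other", "quality reject"])
```

Here the check flags micro-stops and quality reject as still needing an owner.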
Each owner defines one response loop. Maintenance gets an automatic work order when any minor-mechanical stop exceeds five minutes. Production gets a daily review of changeovers longer than the standard. Planning gets an exception report on starvation. These loops should be boring and repeatable. The goal is not novelty; it is making sure every recurring loss has a path to a fix that does not depend on anyone remembering to look.
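The maintenance loop described above reduces to a single rule. A minimal sketch, assuming stops arrive as (loss type, duration in minutes) pairs; the function name needs_work_order is hypothetical, and actually opening the work order is left to whatever CMMS or log is in use.

```python
def needs_work_order(loss_type, duration_min, limit_min=5):
    """True when a minor-mechanical stop exceeds the limit in minutes."""
    return loss_type == "minor mechanical" and duration_min > limit_min

# Made-up stops for illustration.
stops = [
    ("minor mechanical", 7.0),
    ("minor mechanical", 3.0),
    ("changeover", 40.0),
]
to_open = [stop for stop in stops if needs_work_order(*stop)]
```

The rule is deliberately dull: it runs on every stop, so the fix path does not depend on anyone remembering to look.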
In weeks 7 and 8, the top three recurring causes start showing up as fix candidates. These are usually unglamorous: a sensor that triggers a false stop, a guide rail that drifts after the night-shift changeover, a feed rate setting that gets reset every Monday. Fix three of them. Do not pick the most interesting ones; pick the most frequent. For more on how to identify and prioritise the right ones, see our guide on root cause analysis in manufacturing.
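Picking by frequency rather than by interest is a counting exercise. A small sketch using the standard library, with made-up cause labels and counts:

```python
from collections import Counter

# Made-up cause labels, one per recorded stop.
causes = (
    ["false sensor stop"] * 4
    + ["guide rail drift"] * 3
    + ["feed rate reset"] * 2
    + ["label roll splice"]
)

# The three causes that occur most often, with their counts.
top_three = Counter(causes).most_common(3)
```

Counting occurrences instead of summing minutes is the design choice here: it surfaces the stop that happens every shift, even when each instance is short.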
By day 60, downtime is being measured the same way by everyone. Day 60 to 90 is about changing what gets discussed in the morning meeting. Instead of "we lost 47 minutes to micro-stops yesterday," the question becomes "how many of those micro-stops were the same root cause we saw on Friday?" That shift is what turns the program from reactive to proactive.
For each of the top three recurring causes, define one leading indicator. For minor mechanical, it might be the number of false-stop sensor events per shift. For changeover, mean changeover variance vs standard. For material starvation, hours of inventory cover for the bottleneck. Track these daily. When a leading indicator drifts, the response loop fires before the downtime happens. For a structured way to choose these, the piece on manufacturing KPIs is a useful reference.
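What counts as "drift" needs a definition before a loop can fire on it. One possible definition, sketched below, compares the latest daily value with the mean of the preceding few days. The five-day window, the 25% tolerance and the function name has_drifted are assumptions for illustration, not recommended settings.

```python
from statistics import mean

def has_drifted(daily_values, baseline_days=5, tolerance=0.25):
    """Flag when the latest value exceeds the recent baseline by more than tolerance."""
    if len(daily_values) <= baseline_days:
        return False  # not enough history to form a baseline
    baseline = mean(daily_values[-baseline_days - 1:-1])  # days before the latest
    latest = daily_values[-1]
    return baseline > 0 and (latest - baseline) / baseline > tolerance

# Made-up daily counts of false-stop sensor events.
steady = [4, 5, 4, 5, 4, 5]
rising = [4, 5, 4, 5, 4, 8]
alerts = (has_drifted(steady), has_drifted(rising))
```

This version only flags upward drift, which suits event counts and changeover variance. An indicator where lower is worse, such as hours of inventory cover, would need the comparison reversed.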
The last two weeks are not about new fixes; they are about making sure the existing ones do not erode. A meaningful share of "fixed" downtime causes comes back if the change is not codified into a standard operating procedure or a preventive task. Every fix from days 30-75 gets a written standard, a check, and an owner. This is also where the program connects to the wider preventive maintenance schedule: a confirmed root cause should drive a PM task, not just a one-time repair.
By the end of the 90 days, the three instrumented lines have:
The first three lines become the template for the next ten. The plant has not bought a new system, hired a new team, or rolled out a new methodology. It has just made the existing data agree with itself, which, for most mid-market plants, is the only thing standing between them and a step-change in OEE.
The playbook above is deliberately tool-agnostic, but it works best when the downtime register, the maintenance work orders and the OEE calculations all live in one system rather than in three. Fabrico is a manufacturing platform built specifically for that: real-time OEE monitoring and a field-ready CMMS share the same downtime events, the same asset hierarchy and the same loss taxonomy, so the reconciliation step in days 21-30 is automatic instead of manual. If you want to see how that looks on your own three lines, book a demo and we can walk through your live data.
The honest answer is that the structural part, getting one trusted downtime number, one owner per loss type, and a working response loop, fits comfortably in 90 days. The continuous improvement part never ends. Plants that try to do both at once usually do neither.
On the three instrumented lines, a double-digit reduction in unplanned downtime is realistic if the top three recurring causes get a permanent fix in that window. The plant-wide number moves more slowly because most plants only instrument three lines in the first quarter.
No. The playbook works with whatever is already in place, including spreadsheets. A unified OEE + CMMS platform removes the reconciliation step and shortens the cycle from "stop happens" to "work order opens", but the methodology does not depend on it.
Because what kills downtime programs is the cost of the data argument, not the cost of the fixes. Three lines is enough to surface the recurring loss types and small enough to fix the data layer in 30 days. Whole-plant rollouts almost always stall in the reconciliation phase.
Trying to define the perfect loss-type taxonomy before any data is in the register. Start with free-text reasons and cluster them at week four. Taxonomies designed in a conference room never survive contact with the floor.