Predictive Maintenance at Scale — Lessons from 18 Months in Production

November 8, 2025

Getting a predictive maintenance model working in a pilot is one thing. Keeping it accurate across 400 assets for 18 months is another. Lessons learned.

Predictive maintenance has a pilot problem. A model that predicts equipment failures on historical data in a 12-week proof of concept looks convincing — the metrics are good, the business case math is easy, and leadership approves a rollout. Then the model hits production and reality sets in: sensor deployments have gaps, assets behave differently than the training data suggested, the maintenance team doesn’t trust the alerts, and 6 months in, the model is catching maybe 40% of what it was supposed to catch.

We’ve been running a predictive maintenance system across hundreds of monitored assets in multiple manufacturing facilities for an extended production period. Here’s an honest account of what breaks in production and what we did about it.

What the Model Is Doing

For context: the system monitors rotating equipment — pumps, compressors, fans, and motors — across three process manufacturing facilities. Each asset is instrumented with vibration sensors (tri-axial accelerometers), temperature sensors on bearing housings, and a current transformer on the motor drive. Signals are collected at 5-second intervals and flow through a Kinesis-to-S3 pipeline into a SageMaker-backed MLOps platform.

The model predicts the probability that a given asset will experience a failure event within the next 14 days. An alert is generated when probability exceeds a configurable threshold. Alerts feed into the CMMS as high-priority work order candidates for the reliability engineering team to review.
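The alerting step described above can be sketched as follows. This is a minimal illustration, not the production code; the threshold value and the names `ALERT_THRESHOLD`, `Alert`, and `generate_alerts` are assumptions.

```python
# Sketch of the alert-generation step: per-asset 14-day failure
# probabilities are compared against a configurable threshold, and
# assets above it become work order candidates for review.
from dataclasses import dataclass

ALERT_THRESHOLD = 0.7  # illustrative default; configurable in production


@dataclass
class Alert:
    asset_id: str
    probability: float


def generate_alerts(scores: dict, threshold: float = ALERT_THRESHOLD) -> list:
    """scores: {asset_id: P(failure within 14 days)}."""
    return [Alert(a, p) for a, p in sorted(scores.items()) if p >= threshold]
```

The important property is that the threshold is a reviewable configuration value rather than a constant baked into the model, since the tolerable alert volume depends on the reliability team's capacity.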

At launch, the model performed well on the validation set. Eighteen months in, performance on current production data exceeds the launch metrics on both recall and precision. That improvement didn't happen automatically. Here's what it took.

Problem 1: Model Drift Was Faster Than Expected

The first sign of trouble came within the first few months. Recall on recent production data had dropped well below the validation set performance. Two things had changed that the model wasn’t prepared for.

Equipment age. Three months of production is a small slice of an asset’s lifecycle. The training data skewed toward middle-aged equipment (because that’s what had the most history). Assets that were near end-of-life at deployment started failing in patterns that were underrepresented in training — more frequent, lower-amplitude vibration anomalies preceding catastrophic failure rather than the single large anomaly signature the model had learned.

Seasonal variation. One of the three facilities runs near capacity during summer months due to demand patterns. Higher ambient temperatures affect bearing performance, and the model had been trained mostly on shoulder-season data because the historical dataset didn’t extend far enough back to include two full summer cycles.

The fix was a combination of retraining on the expanded dataset (now including the failure events from months 1–3) and adding temperature-normalized vibration features that reduce the seasonal confound. We also implemented a formal model monitoring job that computes performance metrics on the trailing 30 days every week and logs them to CloudWatch. When metrics drop below thresholds, it triggers a retraining pipeline automatically rather than waiting for someone to notice.
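The evidence-triggered part of that monitoring job can be sketched as below. The metric computation and thresholds are illustrative assumptions; in the actual pipeline the weekly metrics are logged to CloudWatch and a breach starts the retraining pipeline.

```python
# Sketch of the trailing-30-day performance check that gates retraining.
# Each event pairs the model's prediction with the labeled outcome once
# the 14-day horizon has elapsed for that asset.
def trailing_metrics(events):
    """events: list of (predicted_positive, actual_failure) booleans
    for the trailing 30-day window."""
    tp = sum(1 for p, a in events if p and a)
    fp = sum(1 for p, a in events if p and not a)
    fn = sum(1 for p, a in events if not p and a)
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return {"recall": recall, "precision": precision}


def should_retrain(metrics, min_recall=0.75, min_precision=0.5):
    """Trigger retraining on evidence of degradation, not on a calendar.
    Threshold values here are hypothetical defaults."""
    return metrics["recall"] < min_recall or metrics["precision"] < min_precision
```

One design note: because labels only mature after the 14-day prediction horizon closes, the trailing window necessarily lags real time; the weekly cadence is a compromise between label maturity and detection latency.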

This is the most important operational lesson: schedule-based retraining is not enough. Retraining on a calendar schedule (monthly, quarterly) means you’re flying blind between retraining runs. Performance monitoring needs to be continuous, and retraining needs to be triggered by evidence, not by the calendar.

Problem 2: Sensor Failures Changed the Feature Distribution

Several months in, we noticed the model was generating unusually high alert rates for a subset of assets at one facility — far more alerts than the maintenance team could realistically action. Investigation revealed that a number of sensors at that facility had degraded (intermittent dropout, not complete failure) and were producing noisy or zero-value readings at irregular intervals.

The model had been trained on clean data. When it saw zero-value vibration readings, it interpreted the absence of normal vibration signal as a pattern consistent with bearing failure — because a failing bearing sometimes produces reduced vibration before catastrophic failure. False positive rate on those assets spiked to unacceptable levels.

The solution required changes at two layers. At the data layer, we added a sensor health classification step to the Glue ETL pipeline: for each sensor-asset combination, a rolling statistical check flags sensors showing dropout patterns, and the asset’s feature record is annotated with a sensor availability indicator. At the model layer, we retrained with the sensor availability indicator as a feature — allowing the model to discount predictions made under degraded sensor conditions.
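The rolling statistical check at the data layer can be sketched as a dropout-fraction test. The window size and tolerance below are assumptions for illustration; the production check runs per sensor-asset combination inside the Glue ETL job.

```python
# Sketch of the sensor-health check: a sensor is flagged as degraded
# when too many recent samples are missing or zero-valued. For vibration
# channels a sustained zero reading is physically implausible on running
# equipment, so it is treated as dropout rather than signal.
def sensor_available(readings, window=60, max_dropout_frac=0.05):
    """readings: most recent samples, oldest first; None means a missed
    sample. Returns False when the sensor looks degraded, so downstream
    feature records can carry the availability indicator."""
    recent = readings[-window:]
    dropouts = sum(1 for r in recent if r is None or r == 0.0)
    return dropouts / len(recent) <= max_dropout_frac
```

The resulting boolean is exactly the availability indicator the model was retrained with, letting it discount predictions made under degraded sensor conditions instead of misreading dropout as a failing bearing.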

We also built a sensor health dashboard surfaced to the reliability engineers, who can now see sensor status alongside the model’s alert list. An alert on an asset with a degraded sensor is treated differently than an alert on a fully instrumented asset — it goes into a separate review queue rather than directly into the work order system.

Problem 3: CMMS Integration Was the Adoption Bottleneck

The model was working technically by month 4. Adoption by the maintenance team was not. Interviews with the reliability engineers made the problem clear: the alert format didn't match their workflow.

The model produced a CSV file of asset IDs and probability scores, delivered to a shared network folder each morning. The reliability engineers had to manually cross-reference asset IDs with the CMMS to find asset names and locations, evaluate whether a work order already existed for the flagged asset, and create a new work order if warranted. On a busy morning with a high volume of alerts, this was significant administrative work before anyone had touched a wrench.

We rebuilt the integration. Work order draft records are now created automatically in the CMMS for alerts above the high-confidence threshold, pre-populated with asset name, location, relevant sensor readings, and the model’s explanation (top contributing features via SHAP values). Reliability engineers review the draft queue — they still approve every work order before it’s dispatched, maintaining their authority over the maintenance schedule — but the administrative burden dropped dramatically.
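Assembling one of those pre-populated drafts looks roughly like the sketch below. The field names are hypothetical, not the actual CMMS schema, and `shap_values` is assumed to be a per-feature contribution mapping already computed for the alert.

```python
# Illustrative sketch of building a work order draft from an alert.
# The explanation keeps the top contributing features by absolute
# SHAP value, which is what the reliability engineers see in the queue.
def build_work_order_draft(asset, probability, shap_values, top_n=3):
    """asset: {"name": ..., "location": ...};
    shap_values: {feature_name: signed contribution}."""
    top = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return {
        "asset_name": asset["name"],
        "location": asset["location"],
        "failure_probability_14d": round(probability, 3),
        "explanation": [f"{name}: {val:+.2f}" for name, val in top[:top_n]],
        "status": "draft",  # engineers approve before anything is dispatched
    }
```

The `"draft"` status is the key design choice: the system removes administrative work but never bypasses the engineers' authority over the maintenance schedule.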

This change had more impact on actual outcomes than any model improvement. Alerts that don’t result in timely maintenance actions don’t prevent failures regardless of model accuracy.

Problem 4: Equipment Changes Reset the Baseline

Month 7: the largest compressor at facility 2 — one of the highest-risk assets under the model’s monitoring — received a major overhaul. New bearings, rebalanced rotor, new seal. The vibration profile of the overhauled compressor was dramatically different from the pre-overhaul profile the model had been trained on. For the first 6 weeks after the overhaul, the model generated frequent false positive alerts on that compressor because its normal operating signature was below the baseline the model expected for a compressor of its age and duty cycle.

The fix: an equipment change event flag in the data pipeline. When a planned maintenance event is logged in the CMMS (overhaul, bearing replacement, rotor rebalance), the system resets the rolling feature history for that asset and applies a suppression window on alerts during the post-maintenance normalization period. The length of the suppression window is configurable by asset type.
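The suppression logic itself is simple; a sketch is below. The per-asset-type window lengths are illustrative defaults, not the production configuration (though the six weeks of false positives on the overhauled compressor is what motivated a window of that order for compressors).

```python
# Sketch of the post-maintenance alert suppression window. When the
# CMMS logs a planned maintenance event, the asset's rolling feature
# history is reset and alerts are suppressed until the new baseline
# has had time to establish itself.
from datetime import datetime, timedelta

# Hypothetical defaults, configurable per asset type.
SUPPRESSION_DAYS = {"compressor": 42, "pump": 21, "default": 14}


def alert_suppressed(asset_type, last_overhaul, now):
    """True while the asset is inside its post-maintenance
    normalization window."""
    days = SUPPRESSION_DAYS.get(asset_type, SUPPRESSION_DAYS["default"])
    return now < last_overhaul + timedelta(days=days)
```

Suppressed alerts can still be logged for later review; what matters is keeping them out of the work order queue while the overhauled asset's signature is expected to differ from its history.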

This required coordination with the maintenance team to ensure CMMS maintenance events were being logged consistently — itself a process improvement that the reliability team had been trying to enforce for years and that the ML system gave them a concrete operational reason to prioritize.

What Sustained Production Looks Like

Unplanned downtime across the monitored asset population has declined meaningfully since deployment, driven by the shift from run-to-failure reactive maintenance to planned replacement at optimal intervals. Maintenance labor costs have fallen as a result of the same shift.

Model recall on current production data has improved beyond launch performance. CMMS integration rates — the percentage of high-confidence alerts that result in a reviewed work order within 24 hours — have improved substantially since the workflow integration was rebuilt.

The most concrete measure of value is the number of failure events the maintenance team was able to act on before unplanned downtime occurred. The avoided downtime cost across those interventions has significantly exceeded the investment in the platform.

The Real Takeaway

A predictive maintenance deployment is not an ML project with an end date. It’s an operational system that requires active maintenance like any other piece of production infrastructure. The model needs monitoring, retraining, and improvement. The integrations with CMMS and OT systems need maintenance as those systems evolve. The maintenance team’s trust in and engagement with the system needs ongoing investment.

What makes it worthwhile — and what justifies the sustained investment — is that the system gets better over time as it accumulates more failure data, more feedback from the maintenance team, and more coverage as new assets are instrumented. Eighteen months in, we’re running a better system than we launched. That trajectory is what separates a production AI capability from a pilot that got promoted to production and then slowly degraded.


To discuss what a production predictive maintenance deployment looks like for your facility, reach out to our team.
