Predictive Maintenance for Substation Equipment: A Production MLOps Architecture
Substation transformer failures are among the most disruptive and expensive events in grid operations. A failed power transformer in a transmission substation carries a replacement cost measured in millions, a procurement and installation lead time that can stretch over a year, and an outage impact that can affect large numbers of customers in the interim. Distribution transformer failures are less dramatic individually but collectively represent substantial unplanned maintenance costs across the grid.
The technology to predict most of these failures before they happen exists today. The challenge isn’t the model — it’s getting a complete, reliable, production-grade system in place that field crews actually use. This post walks through the full architecture we’ve built and deployed for substation predictive maintenance, including the parts that the machine learning tutorials don’t cover.
The Data Foundation
Predictive maintenance models for substation equipment are only as good as the sensor telemetry feeding them. Useful signals include:
- Dissolved gas analysis (DGA) — hydrogen, acetylene, ethylene, and carbon monoxide dissolved in transformer oil are reliable early indicators of internal faults. Continuous online DGA monitoring has become standard on transmission-class transformers; distribution transformers typically still require periodic oil sampling.
- Thermal telemetry — top oil temperature, winding temperature via fiber optic probes or RTDs, ambient temperature, and cooling system status (fan and pump operation).
- Loading data — real-time load current and voltage, power factor, and harmonic content. Transformers operated consistently at high load factors age faster.
- Vibration — for large power transformers, acoustic and vibration monitoring detects mechanical faults including loose windings and core issues that don’t appear in DGA until late-stage.
- Maintenance history — last oil change, last inspection findings, age, and any previous fault events. This contextual data is often in a CMMS and not in the historian.
Getting all of this into a unified, consistent, ML-ready dataset is the hard part. In practice, DGA monitors report over a mix of protocols (IEC 61850, DNP3, Modbus, or vendor-proprietary interfaces), temperature sensors report through the SCADA historian at varying intervals, load data comes from SCADA and metering systems on different schedules, and maintenance history lives in a work order system that may have free-text fields and inconsistent asset naming.
Ingestion Architecture
The pattern we’ve converged on for utility OT-to-cloud data pipelines uses AWS IoT SiteWise as the ingestion and normalization layer at the IT/OT boundary. SiteWise receives data from the SCADA historian via the SiteWise gateway (on-premises), normalizes it to a consistent asset hierarchy, and streams it to AWS IoT Core. From there, Kinesis Data Streams handles high-frequency telemetry with sub-minute latency; batch exports handle DGA and maintenance history on their respective schedules.
Everything lands in S3 in a raw zone partitioned by asset class and date. A Glue ETL pipeline runs nightly to:
- Validate data quality — flag missing readings, out-of-range values, and sensor dropout sequences
- Resample time-series to consistent intervals, with configurable forward-fill limits per sensor type
- Join with asset metadata (equipment type, age, rated capacity, location) from the utility’s GIS system
- Write validated, enriched records to a curated zone in Parquet
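The validation and resampling steps above can be sketched in pandas. The column names, valid ranges, and fill limits here are illustrative placeholders, not the production configuration:

```python
import pandas as pd

# Hypothetical per-sensor config: plausible value ranges and the maximum
# number of consecutive resample periods we allow forward-fill to bridge.
VALID_RANGES = {"top_oil_temp_c": (-40.0, 150.0), "h2_ppm": (0.0, 5000.0)}
FFILL_LIMITS = {"top_oil_temp_c": 4, "h2_ppm": 1}

def validate_and_resample(df: pd.DataFrame, freq: str = "15min") -> pd.DataFrame:
    """Flag out-of-range values as missing, resample each sensor to a
    consistent interval, and forward-fill only short gaps per sensor type."""
    df = df.set_index("timestamp").sort_index()
    out = {}
    for col, (lo, hi) in VALID_RANGES.items():
        s = df[col].where(df[col].between(lo, hi))   # out-of-range -> NaN
        s = s.resample(freq).mean()                  # consistent interval
        out[col] = s.ffill(limit=FFILL_LIMITS[col])  # bounded gap fill
    return pd.DataFrame(out)
```

Longer dropouts deliberately remain NaN so that downstream imputation (and the sensor-availability features discussed later in this post) can handle them explicitly.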
Feature Engineering
Transformer failure prediction benefits significantly from derived features that capture patterns not visible in raw time-series values. The features that have proven most predictive in our work:
Rogers Ratio and Duval Triangle — classical DGA interpretation methods that classify fault type based on gas ratios. Computing these as features lets the model leverage decades of transformer diagnostic knowledge encoded in IEEE and IEC standards.
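As a sketch, the classic Rogers gas ratios can be computed directly from dissolved-gas concentrations. The column names are assumptions, and a production version would additionally bin the ratios into the standard diagnostic fault codes:

```python
import numpy as np
import pandas as pd

def rogers_ratio_features(gases: pd.DataFrame) -> pd.DataFrame:
    """Compute the three classic Rogers gas ratios as model features.
    Assumes ppm columns: h2, ch4, c2h2, c2h4, c2h6 (hypothetical names)."""
    def safe_ratio(num, den):
        # Guard against divide-by-zero; NaN lets the model see "no data"
        # as distinct from a genuinely low ratio.
        return np.where(den > 0, num / den, np.nan)
    return pd.DataFrame({
        "ratio_c2h2_c2h4": safe_ratio(gases["c2h2"], gases["c2h4"]),
        "ratio_ch4_h2":    safe_ratio(gases["ch4"],  gases["h2"]),
        "ratio_c2h4_c2h6": safe_ratio(gases["c2h4"], gases["c2h6"]),
    }, index=gases.index)
```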
Thermal aging index — the Arrhenius equation gives a principled way to convert time-at-temperature into equivalent insulation aging. Cumulative aging index is a strong predictor of end-of-life proximity.
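The aging acceleration factor from IEEE C57.91 (for thermally upgraded paper with a 110 °C reference hot-spot temperature) makes this concrete; the cumulative helper below is a simplified illustration of the feature, not our production implementation:

```python
import math

def aging_acceleration_factor(hotspot_c: float) -> float:
    """IEEE C57.91 aging acceleration factor for thermally upgraded paper.
    Equals 1.0 at the 110 degC reference hot-spot temperature and roughly
    doubles for every ~6-7 degC above it."""
    return math.exp(15000.0 / 383.0 - 15000.0 / (hotspot_c + 273.0))

def cumulative_aging_hours(hotspot_readings_c, interval_hours: float) -> float:
    """Equivalent insulation-aging hours accumulated over a sequence of
    hot-spot temperature readings taken at a fixed interval."""
    return sum(aging_acceleration_factor(t) * interval_hours
               for t in hotspot_readings_c)
```

An hour at 110 °C contributes one equivalent aging hour; an hour well below contributes a small fraction, which is what makes the cumulative index a meaningful end-of-life proximity feature.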
Load cycling statistics — transformers that experience high-amplitude daily load swings age faster than those with steady load. Rolling statistics on load variability over 30- and 90-day windows capture this.
Time since anomaly — if DGA showed an elevated hydrogen reading 30 days ago that has since stabilized, that context matters. Features that encode recent anomaly history improve recall for intermittent fault patterns.
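Both of these feature families are straightforward to express with rolling windows. This sketch assumes a per-asset daily frame with hypothetical `load_factor` and `dga_anomaly` columns (the anomaly flag standing in for, say, hydrogen above a screening limit):

```python
import pandas as pd

def load_and_anomaly_features(daily: pd.DataFrame) -> pd.DataFrame:
    """daily: one row per day for a single asset, DatetimeIndex,
    with 'load_factor' (float) and 'dga_anomaly' (bool) columns."""
    out = pd.DataFrame(index=daily.index)
    # Load cycling: variability of daily load factor over rolling windows.
    for win in (30, 90):
        out[f"load_std_{win}d"] = (
            daily["load_factor"].rolling(win, min_periods=7).std()
        )
    # Days since the most recent DGA anomaly (large sentinel if none yet).
    anomaly_dates = daily.index.to_series().where(daily["dga_anomaly"])
    out["days_since_dga_anomaly"] = (
        (daily.index.to_series() - anomaly_dates.ffill()).dt.days.fillna(9999)
    )
    return out
```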
Feature engineering runs in SageMaker Processing as part of the training pipeline and is versioned alongside the model — a critical detail for reproducibility and debugging.
Model Architecture and Training
The production model is a gradient-boosted classifier (XGBoost) trained to predict the probability of a failure event in the next 30, 14, and 7 days. Three separate models, one per time horizon. Predicting at multiple horizons matters operationally: a 30-day signal allows procurement of a spare transformer; a 7-day signal triggers dispatch of a field inspection crew.
Training data is constructed by labeling the historical record: for each asset-day combination, the label is whether a failure event occurred within the prediction window. Class imbalance is significant — failures are rare — and we handle this with cost-sensitive training rather than oversampling, which tends to produce overconfident models on rare events.
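A minimal sketch of the cost-sensitive setup, using XGBoost's `scale_pos_weight` to upweight the rare failure class rather than oversampling it. The hyperparameters shown are placeholders, not our production values:

```python
import numpy as np

def imbalance_weight(y: np.ndarray) -> float:
    """scale_pos_weight for XGBoost: ratio of negative to positive labels."""
    n_pos = int(y.sum())
    return (len(y) - n_pos) / max(n_pos, 1)

def build_horizon_model(y_train: np.ndarray):
    """Cost-sensitive classifier for one prediction horizon (7/14/30 days).
    Hyperparameters here are illustrative only."""
    import xgboost as xgb
    return xgb.XGBClassifier(
        n_estimators=300,
        max_depth=4,
        learning_rate=0.05,
        scale_pos_weight=imbalance_weight(y_train),  # upweight rare failures
        eval_metric="aucpr",  # PR-focused metric suits rare-event prediction
    )
```

With failures at, say, 1% of asset-days, each positive example carries roughly 99x the loss weight of a negative one, which pushes the model toward recall without distorting the feature distribution the way synthetic oversampling can.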
SageMaker Pipelines orchestrates the full workflow: data extraction from the feature store, training job, model evaluation, threshold selection (precision-recall tradeoff is tunable per utility based on their operational preferences), and conditional model registration.
Retraining runs on a defined schedule — weekly for the 7-day model, monthly for the 30-day model — with automatic rollback if the new model doesn’t outperform the deployed version on a rolling validation window.
Serving and Operational Integration
The serving pipeline runs as a daily batch job. SageMaker batch inference processes the current feature snapshot for all monitored assets and produces a scored list ranked by failure probability. High-priority alerts (above a configurable threshold) are written to the utility’s outage management system via API. Operations teams receive the list in their existing mobile workflow tooling — no new interface.
Two design decisions here were critical to adoption:
Ranked alerts, not binary alarms. Field crews can’t respond to hundreds of daily alarms. The model output is a ranked list: the top N assets with the highest probability of near-term failure. Operations managers tune N based on available crew capacity. This framing matches how maintenance planning actually works.
Explanation alongside prediction. Each alert includes the top contributing features — “elevated acetylene trend over 14 days, load factor above 95% for 30 days.” Crew members who understand why an asset is flagged are far more likely to trust and act on the alert. We use SHAP values computed during batch inference to generate these explanations.
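The SHAP values themselves come out of `shap.TreeExplainer` during batch inference; turning one asset's row of values into the alert text is then a small post-processing step, sketched here with hypothetical feature names:

```python
import pandas as pd

def top_contributors(shap_row: pd.Series, k: int = 3) -> list:
    """Given one asset's SHAP values (index = feature names), return the k
    features pushing its failure probability up the most. These names feed
    a template that renders the human-readable alert explanation."""
    pushed_up = shap_row[shap_row > 0].sort_values(ascending=False)
    return list(pushed_up.head(k).index)
```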
What Production Looks Like
A model that achieves 90% accuracy in a notebook will not achieve 90% accuracy in production a year later without active maintenance. The things that degrade performance in real deployments:
Sensor failures. When a DGA monitor goes offline, the feature vector for that asset changes — it now contains imputed or zero values. The model needs to handle this gracefully. We build separate imputation models for common sensor failure modes and include a sensor availability feature so the main model can adjust confidence.
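One way to sketch the sensor-availability feature (the median fill here is a simple stand-in for the dedicated imputation models mentioned above):

```python
import pandas as pd

def add_availability_flags(features: pd.DataFrame, sensor_cols: list) -> pd.DataFrame:
    """Append a 0/1 availability flag per sensor column and fill gaps with a
    neutral value, so the model can condition on 'sensor offline' rather
    than silently consuming imputed numbers."""
    out = features.copy()
    for col in sensor_cols:
        out[f"{col}_available"] = features[col].notna().astype(int)
        out[col] = features[col].fillna(features[col].median())  # placeholder imputation
    return out
```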
Asset changes. When a transformer is replaced or rewound, its historical feature values no longer describe the new equipment. Asset change events need to be detected and the historical feature window for that asset needs to be reset.
Grid topology changes. Load patterns change when distribution circuits are reconfigured, new generation connects, or large customers change their demand profile. Features derived from load data can drift significantly after topology changes.
Monitoring for all of these is as important as the initial model. We instrument every serving run with feature distribution statistics, compare them to training-time distributions, and alert when drift exceeds thresholds. Drift alerts trigger retraining out of cycle; the scheduled cadence described earlier is a floor, not the only trigger.
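One common statistic for this training-versus-serving comparison is the population stability index; a sketch, with the conventional 0.2 rule of thumb as an assumed alert threshold:

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between the training-time ('expected') and serving-time ('actual')
    distributions of one feature. A common rule of thumb treats > 0.2 as
    significant drift worth investigating."""
    # Bin edges from training-data quantiles; open outer edges cover new extremes.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))
```

In practice each feature gets its own PSI per serving run, and a drift alert fires when any feature (or some aggregate) crosses the threshold.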
The Business Case
The business case for substation predictive maintenance is straightforward: unplanned transformer failures are far more expensive than planned replacements. When you add equipment cost, emergency procurement premiums, field crew mobilization, outage penalties, and customer impact, an unplanned failure in a critical substation can cost multiples of what a planned replacement would have. A predictive maintenance system that converts a meaningful fraction of unplanned failures into planned actions produces a compelling return.
The harder question is whether the data infrastructure investment is justified before you know the model will work. Our recommendation is always to start with a narrow, well-instrumented asset class where failure history is well-documented and sensor coverage is good — prove the loop works, quantify the value, then expand.
If you’re working through the data infrastructure or model architecture questions for a substation predictive maintenance program, we’re happy to talk through the specifics.