Connecting Legacy SCADA to AWS: An OT/IT Integration Architecture for Industrial AI
The conversation about industrial AI almost always starts in the wrong place. Manufacturers and utilities ask which machine learning model they should use for predictive maintenance, or whether they should build or buy an analytics platform, or how long it will take to get a production model running. These are the wrong first questions. The first question is always: can you actually get the data?
In most industrial environments, the answer is no — not without significant infrastructure work. Process data lives in operational technology systems (SCADA, historians, PLCs, DCS) that were designed for reliability and determinism in a closed-loop control environment, not for cloud connectivity or batch export. The IT/OT integration problem — getting data from those systems into a cloud data store where it can be used for analytics and ML — is where most industrial AI projects stall, often permanently.
This post describes the architecture we’ve converged on after building these integrations in manufacturing and utility environments. It’s not the only architecture that works, but it reflects the decisions that have held up in production environments with real operational constraints.
The Network Reality
Before any architectural decision, you need to understand the network topology you’re working with. Most industrial environments follow the Purdue Model (ISA-95): a hierarchy of network zones from the plant floor (Levels 0-2: sensors, PLCs, SCADA) up through manufacturing operations (Level 3: historian, MES) to business systems (Level 4) and the enterprise network with its external/internet connectivity (Level 5).
The critical constraint is that traffic between zones is tightly controlled — typically unidirectional (plant floor to historian is one-way data push; historian to enterprise is controlled access) and firewall-filtered. Any integration architecture needs to work within these constraints. Solutions that require opening inbound firewall rules to the control network will not get through a competent OT security team and shouldn’t.
The architectural principle that follows: data flows up through the zone hierarchy; commands and queries flow down only under explicit, controlled conditions. For an ML data pipeline, this means we’re building a data extraction path from the historian (Level 3) to AWS via the enterprise DMZ, not a direct connection from the plant floor to the cloud.
The Component Stack
A practical OT-to-cloud integration for industrial AI typically involves these layers:
Historian (Level 3). This is where process data already lives — Ignition by Inductive Automation, OSIsoft PI (now AVEVA PI), Wonderware, or a proprietary historian depending on the plant vintage and vendor preference. The historian aggregates tag data from SCADA and PLCs and provides query interfaces (REST API, JDBC, ODBC, OPC-UA, or proprietary SDK). The historian is your primary data source; don’t bypass it.
Historian-to-cloud connector. The mechanism for moving data from the historian to AWS. Options:
- AWS IoT SiteWise Gateway — runs on-premises (VM or hardware), connects to OPC-UA sources, normalizes to a SiteWise asset hierarchy, and publishes to AWS IoT Core. Good choice when OPC-UA is available and you want the SiteWise asset model abstraction.
- MQTT broker + Kinesis bridge — for environments where the historian supports MQTT publish (Ignition does natively via the MQTT Engine module), a lightweight MQTT broker in the DMZ subscribes to configured tag groups and publishes to Amazon Kinesis Data Streams via the Kinesis Agent or a small custom publisher process. Lower overhead, more flexible, requires more configuration.
- Custom polling agent — for historians with REST or JDBC/ODBC interfaces, a polling agent (Python service on a DMZ server) queries the historian on a schedule, buffers records, and pushes to Kinesis or S3. Less elegant but works with virtually any historian that has a query interface.
The right choice depends on the historian software, the tag count, required latency, and the plant’s IT support model. We’ve used all three in production.
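As a concrete sketch of the custom-polling-agent option, the core loop is simple: pull a window of tag records from the historian, then push them to Kinesis in batches (the `PutRecords` API accepts at most 500 records per call). The historian query itself is omitted here since it varies by product; the `tag` field name and stream layout are assumptions for illustration.

```python
import json
from typing import Iterable, Iterator

KINESIS_PUT_RECORDS_LIMIT = 500  # hard limit of the Kinesis PutRecords API


def batch(records: Iterable[dict], size: int = KINESIS_PUT_RECORDS_LIMIT) -> Iterator[list]:
    """Chunk historian records into Kinesis-sized batches."""
    buf = []
    for rec in records:
        buf.append(rec)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf


def publish(kinesis_client, stream: str, records: list) -> None:
    """Push one batch to Kinesis. Partitioning by tag name keeps readings
    for the same tag ordered within a shard, which matters when you later
    reconstruct temporal sequences for training."""
    kinesis_client.put_records(
        StreamName=stream,
        Records=[
            {"Data": json.dumps(r).encode(), "PartitionKey": r["tag"]}
            for r in records
        ],
    )
```

A production agent would additionally retry the partial failures that `PutRecords` can return, and buffer to local disk when the DMZ-to-AWS link is down.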
Amazon Kinesis Data Streams (real-time path). High-frequency telemetry — tags sampled at 1-second or 1-minute intervals — flows through Kinesis for low-latency delivery to S3 and downstream processing. Kinesis handles bursts gracefully and provides the ordering and replay guarantees that matter when you’re trying to reconstruct temporal sequences for ML training.
Amazon S3 (storage layer). Everything lands in S3 — both real-time telemetry and batch historical exports. S3 is the foundation of the data lake. Partition strategy matters: organizing by asset class, asset ID, and date (e.g., s3://plant-data-lake/raw/compressors/{asset_id}/2026/04/) makes downstream query and ML feature engineering dramatically more efficient than a flat structure.
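A small helper that emits keys in that layout is worth writing once and reusing everywhere data is landed; the object-naming convention below (one JSON object per write, timestamped filename) is an assumption, not a requirement of the scheme.

```python
from datetime import datetime, timezone


def raw_key(asset_class: str, asset_id: str, ts: datetime, prefix: str = "raw") -> str:
    """Build a partitioned S3 object key: class/asset/year/month, matching
    the s3://plant-data-lake/raw/... layout described above."""
    ts = ts.astimezone(timezone.utc)  # partition on UTC, never local time
    return (
        f"{prefix}/{asset_class}/{asset_id}/"
        f"{ts:%Y/%m}/{asset_id}-{ts:%Y%m%dT%H%M%SZ}.json"
    )
```

Keeping the date partitions in UTC avoids the DST duplication problems discussed later, and the `class/asset/year/month` ordering lets Athena prune partitions on the predicates ML feature jobs actually use.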
AWS Glue (transformation layer). ETL jobs run on a defined schedule (typically nightly for the batch path) or continuously as Glue streaming jobs consuming from Kinesis, to validate, normalize, and enrich raw data for the curated zone. This is where you handle the messy reality of production data: sensor gaps, out-of-range values, inconsistent tag naming across plants, and the context joins (asset metadata from your asset register, maintenance events from your CMMS) that make raw telemetry useful.
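The context join is the step most teams underestimate, so here is a minimal sketch of its shape — pure Python rather than Glue/Spark syntax, with hypothetical field names (`asset_id`, `ts`) standing in for whatever your schemas actually use:

```python
def enrich(readings, asset_register, maintenance_windows):
    """Join raw readings with asset metadata and flag readings that fall
    inside a known maintenance window, so downstream jobs can exclude or
    label them instead of training on them blindly.

    asset_register:      {asset_id: {metadata fields}}
    maintenance_windows: {asset_id: [(start_ts, end_ts), ...]}
    """
    out = []
    for r in readings:
        meta = asset_register.get(r["asset_id"], {})
        in_maint = any(
            start <= r["ts"] < end
            for start, end in maintenance_windows.get(r["asset_id"], [])
        )
        out.append({**r, **meta, "maintenance": in_maint})
    return out
```

In a real Glue job the same logic is a broadcast join against the asset register plus a range join against the CMMS export, but the semantics are exactly these.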
Handling Historical Data
New integrations typically have a requirement to backfill historical data — years of historian records that predate the cloud connection and represent the training dataset for the first ML models.
Historical backfill is operationally different from real-time ingestion. The historian can typically serve historical data faster than real-time, but bulk extraction puts load on the historian server and the plant’s IT network. Coordinate with the OT team before running a large backfill, and structure the extraction to be pauseable and resumable.
The pattern that works well: partition the historical dataset by asset group and date range, submit extraction jobs in parallel across partitions (calibrated not to saturate the historian), checkpoint completion state to DynamoDB, and resume from the last completed checkpoint after any interruption. Large backfills for plants with extensive tag histories are tractable with this approach, typically completing within weeks rather than months.
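The checkpoint-and-resume pattern above reduces to very little code. The sketch below uses a plain dict as a stand-in for the DynamoDB checkpoint table and an injected `extract` callable for the actual historian pull; the partition shape (asset group × month) is one reasonable choice, not the only one.

```python
from itertools import product


def backfill(partitions, extract, checkpoints):
    """Run extraction partition-by-partition, skipping anything already
    marked complete. `checkpoints` stands in for the DynamoDB table;
    the key is only written after a successful extraction, so an
    interrupted run resumes from the last completed partition."""
    for part in partitions:
        key = f"{part['asset_group']}#{part['month']}"
        if checkpoints.get(key) == "done":
            continue  # completed in a previous run
        extract(part)              # pull this slice from the historian
        checkpoints[key] = "done"  # commit only after success


# Partition a two-year history by asset group and month: 2 x 24 = 48 jobs.
months = [f"{y}-{m:02d}" for y in (2023, 2024) for m in range(1, 13)]
partitions = [{"asset_group": g, "month": mo}
              for g, mo in product(("compressors", "pumps"), months)]
```

The same structure parallelizes naturally — hand disjoint slices of `partitions` to worker processes, throttled so the aggregate query load stays within what the OT team has agreed the historian can absorb.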
Data Quality: The Real Work
Raw historian data is not ML-ready. In every industrial environment we’ve worked in, the data quality issues are significant and domain-specific:
Sensor failures. Sensors fail, connections drop, and calibration drifts. A temperature sensor reading 0°C in a furnace that operates at 900°C is a sensor failure, not a data point. Your ETL pipeline needs detection logic for common failure patterns: zero-value sequences, readings outside calibrated range, and sudden step changes that exceed physical plausibility.
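Those three failure patterns can be screened with straightforward rule-based checks before any statistical anomaly detection. A sketch, with thresholds (`lo`, `hi`, `max_step`, `zero_run`) that would in practice come from each sensor's calibration record rather than being hard-coded:

```python
def flag_bad_readings(values, lo, hi, max_step, zero_run=5):
    """Return a boolean mask marking suspect readings: out-of-calibrated-range
    values, physically implausible step changes, and sustained runs of exact
    zeros (a classic dead-sensor signature)."""
    bad = [False] * len(values)
    run = 0
    for i, v in enumerate(values):
        if not (lo <= v <= hi):
            bad[i] = True  # outside calibrated range
        # Step check only against a trusted previous point, so the
        # recovery after a dead-sensor run isn't itself flagged.
        if i > 0 and not bad[i - 1] and abs(v - values[i - 1]) > max_step:
            bad[i] = True
        run = run + 1 if v == 0 else 0
        if run >= zero_run:
            for j in range(i - run + 1, i + 1):
                bad[j] = True  # flag the whole zero run
    return bad
```

Applied to the furnace example: a 900°C process reading 0°C fails both the range check and the step check, and a stuck-at-zero stretch gets flagged end to end.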
Maintenance events. Equipment that was shut down for scheduled maintenance generates a period of atypical readings that doesn’t reflect normal operating behavior. If you train a model on this data without labeling maintenance windows, you’re training it to predict maintenance, not failure.
Tag naming inconsistency. Plants that have been operating for 20+ years and have had multiple control system upgrades often have inconsistent tag naming conventions. TIC-101.PV, TI_101, and TAG_101_TEMP might all refer to the same physical sensor in different system vintages. Building a tag-to-canonical-asset mapping table is tedious but necessary.
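The mapping table itself is usually just an explicit lookup, built by hand with the plant's controls engineers. A useful refinement is a fallback that normalizes unmapped tags into a predictable form so they surface for triage instead of vanishing. The entries below are hypothetical:

```python
import re

# Hand-built mapping from every tag variant seen across control-system
# vintages to one canonical (asset, measurement) pair. Illustrative only.
TAG_MAP = {
    "TIC-101.PV":   ("furnace-1", "temperature"),
    "TI_101":       ("furnace-1", "temperature"),
    "TAG_101_TEMP": ("furnace-1", "temperature"),
}


def canonical(tag: str):
    """Resolve a raw historian tag to (asset, measurement). Unmapped tags
    are normalized (uppercased, punctuation collapsed to underscores) and
    bucketed under UNMAPPED so they are easy to find and triage."""
    if tag in TAG_MAP:
        return TAG_MAP[tag]
    return ("UNMAPPED", re.sub(r"[^A-Z0-9]+", "_", tag.upper()).strip("_"))
```

Counting the UNMAPPED bucket per ETL run gives a cheap completeness metric for the mapping effort.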
Time zone and DST issues. Historians record timestamps in local time, or sometimes UTC, and the documentation is often ambiguous. Daylight saving time transitions create duplicate or missing hours that corrupt time-series features if not handled explicitly. Standardize everything to UTC at the ingestion layer.
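In Python, the standard library handles the fall-back ambiguity explicitly via the `fold` attribute (PEP 495): the same wall-clock time occurs twice when clocks go back, and `fold` selects which occurrence you mean instead of letting the library pick silently. The plant time zone below is an assumption for illustration:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

PLANT_TZ = ZoneInfo("America/Chicago")  # hypothetical plant time zone


def to_utc(naive_local: datetime, fold: int = 0) -> datetime:
    """Attach the plant time zone to a naive historian timestamp and
    convert to UTC. During the fall-back DST transition, fold=0 selects
    the first occurrence of the repeated hour and fold=1 the second."""
    return naive_local.replace(tzinfo=PLANT_TZ, fold=fold).astimezone(timezone.utc)
```

On 2021-11-07, US Central time fell back at 02:00, so a historian record stamped 01:30 local is genuinely ambiguous — an hour apart in UTC depending on which occurrence it was. That ambiguity is why the conversion belongs at the ingestion layer, where the ordering of arriving records can still disambiguate it.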
What You’re Building Toward
The ingestion architecture described here is not the end goal — it’s the foundation. Once you have a reliable, clean, ML-ready time-series dataset in S3, the machine learning work that was previously impossible becomes tractable. SageMaker training jobs can read the curated dataset (or a feature store built from it) directly. Data scientists can query the curated zone in Athena without waiting for manual data exports. Model retraining pipelines can run on a schedule against the current dataset without manual intervention.
More importantly, you have the infrastructure to run the model improvement loop that makes ML valuable over time: new data arrives continuously, models are evaluated against it, retraining fires when performance degrades, and updated models deploy to wherever they’re needed — the cloud, the edge, or both.
That loop is what turns a one-time data science project into a production capability. And it starts with getting the data out.
If you’re working through the OT/IT integration problem for an industrial AI initiative, we’re happy to talk through your specific environment.