AWS IoT Greengrass V2 in Production — Managing Edge ML Deployments at Scale

April 9, 2026

Greengrass V2's component model changes how you think about edge ML deployment. Here's how to structure inference, connectivity, and OTA updates.


AWS IoT Greengrass V2 introduced a fundamentally different architecture from its predecessor: a component-based model that treats each piece of edge functionality as an independently versioned, independently deployable unit. For edge ML deployments, this shift matters a great deal. It means the inference runtime, the model artifact, the connectivity layer, and the data buffering service are all separate components — each can be updated without touching the others. It means your fleet of edge devices can run different component versions, and you can roll out a new model version to a subset of devices before committing to the full fleet.

This post covers how we structure production edge ML deployments on Greengrass V2, from component architecture to OTA update patterns to the operational monitoring that keeps a fleet of edge inference devices running reliably.

The Component Model

In Greengrass V2, every piece of functionality running on the device is a component. The Greengrass Nucleus — the core runtime — is itself a component. AWS publishes a catalog of pre-built components (Stream Manager, Shadow Manager, Token Exchange Service, Docker Application Manager, and others) that handle common edge infrastructure needs. Your application logic, including ML inference, is packaged as custom components.

Each component has a recipe — a YAML file that defines the component’s metadata, configuration schema, artifact locations, dependencies on other components, and lifecycle scripts. The lifecycle scripts define what happens at each stage: install (download dependencies, set up the environment), startup (start background services), run (the main process), shutdown (graceful stop), and recover (what to do if the component crashes). Greengrass manages the lifecycle of all components, restarts them on failure, and handles the dependency graph — if component B depends on component A, Greengrass ensures A is healthy before starting B.

For an ML inference deployment, we typically structure the application as four components:

1. The inference component. Contains the inference server (a small Python process, typically), the model artifact, and the runtime dependencies. The inference server listens on an IPC topic for inference request messages, runs the model, and publishes the result to an output IPC topic.
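The core of that inference server can be sketched as a single handler function. This is a minimal, hypothetical sketch — the JSON message shape (`request_id`, `image_ref`) is an assumption, and in a real component the function would be wired to the input and output IPC topics via the Greengrass IPC SDK (`awsiotsdk`'s `GreengrassCoreIPCClientV2`):

```python
import json

def handle_request(payload: bytes, model, threshold: float) -> bytes:
    """Core of a hypothetical inference handler. In a real component this is
    invoked from a Greengrass IPC topic subscription and its return value is
    published to the output IPC topic. The message shape is an assumption."""
    request = json.loads(payload)
    # `model` is any callable returning (label, confidence) for an input reference.
    label, confidence = model(request["image_ref"])
    result = {
        "request_id": request["request_id"],
        "label": label,
        "confidence": confidence,
        # Only flag a detection when confidence clears the configured threshold.
        "detected": confidence >= threshold,
    }
    return json.dumps(result).encode("utf-8")
```

Keeping the handler a pure function of (payload, model, threshold) also makes it trivially unit-testable off-device, which matters when the device is a Jetson on a plant floor.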

2. The trigger component. Handles the plant-floor integration: subscribing to OPC-UA signals from the CNC or PLC via the SiteWise OPC-UA collector, converting them to the message format the inference component expects, and publishing to the inference input IPC topic. Separating this from the inference component means you can update the integration logic independently of the model.
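The conversion step can also be a pure function. The tag names and fields below are illustrative, not the actual SiteWise collector output format:

```python
import json

def to_inference_request(opcua_reading: dict) -> bytes:
    """Convert a raw OPC-UA tag reading (e.g. a part-in-position signal from
    the PLC) into the JSON message the inference component expects. Field
    names are assumptions for illustration."""
    return json.dumps({
        "request_id": f'{opcua_reading["tag"]}-{opcua_reading["timestamp"]}',
        "image_ref": opcua_reading["camera_frame_ref"],
        "line_id": opcua_reading.get("line_id", "unknown"),
    }).encode("utf-8")
```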

3. The stream manager component. The AWS-published Stream Manager component handles local buffering and export of inference results (images, model outputs, confidence scores) to S3. It manages its own retry logic and persistence buffer, so inference results aren’t lost during network outages. The service itself is configured via component configuration (store directory, server port), while the streams — export bucket, batch size, maximum local buffer size, retry behavior — are defined by your components at runtime through the Stream Manager SDK.

4. The shadow component. The AWS-published Shadow Manager component maintains a local device shadow, synchronized with IoT Core when connectivity is available. We use the shadow to track device operational state — current model version, inference enabled/disabled flag, per-class confidence thresholds — so that operators can view and update device configuration through the IoT Core console or API without needing SSH access to the device.
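Applying a shadow delta on the device side might look like the following sketch, assuming a shadow document with `inference_enabled`, `confidence_thresholds`, and `model_version` keys (the exact document layout is up to you):

```python
def apply_desired_config(local_config: dict, desired_delta: dict) -> dict:
    """Apply a shadow 'desired' delta to the local runtime configuration.
    Returns the new config, which the component would then publish back to
    the shadow as 'reported' state. Key names are illustrative."""
    allowed = {"inference_enabled", "confidence_thresholds", "model_version"}
    merged = dict(local_config)
    for key, value in desired_delta.items():
        if key in allowed:  # ignore unknown keys rather than failing
            merged[key] = value
    return merged
```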

Component Recipes in Practice

A custom component recipe looks like this in structure (simplified):

RecipeFormatVersion: '2020-01-25'
ComponentName: com.example.inference
ComponentVersion: '1.4.0'
ComponentDependencies:
  aws.greengrass.StreamManager:
    VersionRequirement: '>=2.1.0'
    DependencyType: HARD

ComponentConfiguration:
  DefaultConfiguration:
    ModelS3URI: 's3://your-bucket/models/detector/v1.4/'
    ConfidenceThreshold: '0.75'
    InferenceTopicIn: 'local/inference/request'
    InferenceTopicOut: 'local/inference/result'

Manifests:
  - Platform:
      os: linux
      architecture: aarch64
    Lifecycle:
      Install:
        Script: |
          pip3 install -r {artifacts:decompressedPath}/inference/requirements.txt
      Run:
        Script: |
          python3 {artifacts:decompressedPath}/inference/inference_server.py \
            --model-path {configuration:/ModelS3URI} \
            --threshold {configuration:/ConfidenceThreshold}
    Artifacts:
      - URI: 's3://your-bucket/greengrass-artifacts/inference/1.4.0/inference.zip'
        Unarchive: ZIP

A few things worth noting. The ComponentConfiguration section defines the component’s configurable parameters with defaults — these can be overridden per device or per device group at deployment time, which is how you set different confidence thresholds for different production lines without maintaining separate component versions. Each entry in the Manifests section targets a platform — you can have separate manifests for aarch64 (Jetson) and x86_64 (industrial PC) in the same recipe, and Greengrass selects the right one based on the device architecture. Artifacts are pulled from S3 at install time, which means component updates don’t require manual file transfers to devices.
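A per-device-group override is expressed in the deployment document’s configurationUpdate merge field. A sketch of such a deployment document (the ARN and values are illustrative):

```json
{
  "targetArn": "arn:aws:iot:us-east-1:123456789012:thinggroup/line-3-devices",
  "components": {
    "com.example.inference": {
      "componentVersion": "1.4.0",
      "configurationUpdate": {
        "merge": "{\"ConfidenceThreshold\":\"0.85\"}"
      }
    }
  }
}
```

The merge value is a JSON string; Greengrass merges it over the recipe’s DefaultConfiguration on the targeted devices, so line 3 runs at 0.85 while everything else keeps the 0.75 default.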

NVIDIA Jetson Considerations

Greengrass V2 runs as a systemd service on Jetson hardware (both JetPack 4.x and 5.x are supported). The inference component needs access to the GPU, which means the system user the component runs as must be able to open /dev/nvhost-* and /dev/nvidia* — typically by setting the component’s run-as user to one in the video group. Alternatively, running the component’s lifecycle scripts as root (via the RequiresPrivilege flag in the recipe) simplifies GPU access at the cost of a broader security posture — acceptable in isolated OT environments, but worth evaluating against your security requirements.

For model optimization on Jetson, we compile models to TensorRT format before packaging them as component artifacts. TensorRT optimization is architecture-specific (a TensorRT engine compiled for Jetson AGX Orin will not run on Jetson Orin NX), so the build pipeline needs to produce separate optimized artifacts per target hardware variant. The SageMaker processing job that packages the model artifact runs the TensorRT compilation step against a containerized Jetson simulation environment, or directly on a Jetson device registered as a SageMaker edge device. The resulting .engine file is uploaded to S3 and referenced by the component artifact URI.
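Since each hardware variant needs its own compiled engine, the packaging pipeline (or device-side install step) has to resolve the right artifact per variant. A minimal sketch, with assumed variant names and S3 key layout:

```python
# Illustrative mapping from Jetson hardware variant to the per-variant
# TensorRT engine artifact; variant names and S3 layout are assumptions.
ENGINE_ARTIFACTS = {
    "jetson-agx-orin": "models/detector/v1.4/agx-orin/model.engine",
    "jetson-orin-nx": "models/detector/v1.4/orin-nx/model.engine",
    "jetson-xavier-nx": "models/detector/v1.4/xavier-nx/model.engine",
}

def engine_key_for(variant: str) -> str:
    """Resolve the S3 key of the TensorRT engine built for this hardware
    variant. Fails loudly rather than serving a mismatched engine, since a
    TensorRT engine only runs on the architecture it was compiled for."""
    try:
        return ENGINE_ARTIFACTS[variant]
    except KeyError:
        raise ValueError(f"no TensorRT engine built for variant {variant!r}")
```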

ONNX Runtime is a practical alternative when TensorRT compilation per-variant is operationally unwieldy — ONNX Runtime with the TensorRT execution provider provides most of the inference performance benefit with a single portable model format. The tradeoff is slightly lower throughput compared to a natively compiled TensorRT engine.

OTA Updates and Fleet Management

The deployment workflow for a model update:

  1. A new model version is trained, evaluated, and approved in SageMaker Model Registry.
  2. A packaging step (Lambda function triggered by Model Registry approval) runs TensorRT compilation, bundles the model with the inference server code, uploads the artifact to S3, and creates a new Greengrass component version via the Greengrass API.
  3. A new Greengrass deployment is created targeting a staging device group — a handful of devices in a non-production context.
  4. The staging deployment runs for a defined evaluation window. A CloudWatch dashboard monitors inference latency, throughput, and error rate on the staging devices. An automated check compares these metrics to the previous component version’s baseline.
  5. If staging passes, the deployment is promoted to the full fleet device group. Greengrass handles the rollout: it installs the new component version alongside the running version, validates that the new version’s startup lifecycle completes successfully, then switches traffic to the new version. If the startup lifecycle fails (the component crashes before its health check passes), Greengrass automatically rolls back to the previous version.
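Step 2 above can be sketched as follows — a hypothetical helper that builds an inline recipe (Greengrass accepts JSON recipes as well as YAML) for the new component version. The packaging Lambda would pass the result to boto3.client("greengrassv2").create_component_version(inlineRecipe=...); that call is omitted here since it requires AWS credentials:

```python
import json

def build_inference_recipe(version: str, artifact_uri: str) -> bytes:
    """Build an inline JSON recipe for a new inference component version.
    Component name and structure mirror the recipe shown earlier; the
    lifecycle script is abbreviated for illustration."""
    recipe = {
        "RecipeFormatVersion": "2020-01-25",
        "ComponentName": "com.example.inference",
        "ComponentVersion": version,
        "Manifests": [{
            "Platform": {"os": "linux", "architecture": "aarch64"},
            "Lifecycle": {
                "Run": {"Script": "python3 {artifacts:decompressedPath}/inference/inference_server.py"},
            },
            "Artifacts": [{"URI": artifact_uri, "Unarchive": "ZIP"}],
        }],
    }
    return json.dumps(recipe).encode("utf-8")
```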

This rollback behavior is one of the most operationally important features of Greengrass V2. A botched model deployment on a production line that rolls back automatically is a minor event. A botched deployment that takes the inference system offline and requires manual SSH intervention is a production incident.

The key to making rollback work reliably is the component’s recover lifecycle script. When Greengrass detects that a component has crashed (exited with a non-zero status), it executes the recover script before restarting the component. Use this to clean up any state that might cause the restart to fail — open file handles, stale lock files, partially initialized model weights in shared memory. A component that crashes cleanly and recovers deterministically is much easier to operate than one that requires manual cleanup.
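A recover script in the recipe might look like this sketch — the lock-file and shared-memory paths are illustrative, not fixed conventions:

```yaml
Lifecycle:
  Recover:
    Script: |
      # Clean up state that could make the restart fail; paths are illustrative.
      rm -f /tmp/inference_server.lock
      rm -f /dev/shm/model_weights_*
  Run:
    Script: |
      python3 {artifacts:decompressedPath}/inference/inference_server.py
```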

Operational Monitoring

A Greengrass device fleet in production needs monitoring at two levels: device health and component-level application metrics.

Device health is handled through IoT Core’s fleet indexing feature. Greengrass publishes device connectivity and component health to IoT Core thing shadows; fleet indexing makes these queryable across the device population. A QuickSight dashboard pulling from a DynamoDB table populated by an IoT Core rule on health events gives operations teams visibility into which devices are online, which components are running, and which are in a degraded state.

Application metrics come from the inference component itself. We instrument inference components to publish a CloudWatch metric per inference call: latency (time from input received to output published), confidence score distribution, and a defect-detected flag. CloudWatch alarms on p95 inference latency and on sustained low-confidence periods (which often indicate that the input image conditions have changed — lighting, occlusion, part orientation) surface problems that device health monitoring doesn’t catch.
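One way to emit these without an API call per inference is CloudWatch embedded metric format (EMF), where a structured log line is ingested as metrics. A sketch, with an assumed namespace and dimension names:

```python
import json
import time

def emf_record(latency_ms: float, confidence: float, defect: bool, line_id: str) -> str:
    """Build a CloudWatch embedded-metric-format log line for one inference
    call. Namespace and dimensions are illustrative; once the line reaches a
    CloudWatch log stream, the metrics are extracted automatically."""
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "EdgeInference",
                "Dimensions": [["LineId"]],
                "Metrics": [
                    {"Name": "LatencyMs", "Unit": "Milliseconds"},
                    {"Name": "Confidence", "Unit": "None"},
                    {"Name": "DefectDetected", "Unit": "Count"},
                ],
            }],
        },
        "LineId": line_id,
        "LatencyMs": latency_ms,
        "Confidence": confidence,
        "DefectDetected": 1 if defect else 0,
    })
```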

The combination of device-level health monitoring and application-level metric instrumentation closes the observability gap between “the device is running” and “the model is working correctly.” Both matter in production.

What Production Readiness Looks Like for an Edge Fleet

A Greengrass-based edge ML deployment is production-ready when:

  1. Model updates ship as versioned components through an automated packaging pipeline, not manual file transfers.
  2. Every rollout passes through a staging device group with automated metric checks before reaching the full fleet.
  3. Failed deployments roll back automatically, and components recover deterministically after a crash.
  4. Both device-level health and application-level inference metrics are monitored and alarmed on.
  5. Every device runs exactly the component versions its deployment configuration specifies.

That last point is underappreciated. In multi-device production environments, version drift — where different devices end up running different component versions due to partial deployment failures or manual interventions — is the source of a disproportionate share of hard-to-debug issues. Treating the Greengrass deployment configuration as the authoritative record of what should be running on each device, and making any deviation from that configuration an alarm condition, is the operational discipline that keeps edge fleets manageable over time.
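The drift check itself is simple once expected and reported versions are in hand (for example, from the deployment configuration on one side and fleet indexing on the other). A sketch, with assumed input shapes:

```python
def find_version_drift(expected: dict, fleet_reported: dict) -> dict:
    """Compare each device's reported component versions against the
    deployment's expected versions; any mismatch is an alarm condition.
    Input shapes are assumptions: expected is {component: version}, and
    fleet_reported is {device: {component: version}}."""
    drifted = {}
    for device, reported in fleet_reported.items():
        diffs = {
            comp: (want, reported.get(comp))
            for comp, want in expected.items()
            if reported.get(comp) != want
        }
        if diffs:
            drifted[device] = diffs  # {component: (expected, actual)}
    return drifted
```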


If you’re deploying or planning to deploy edge ML inference on Greengrass V2 and want to talk through the architecture, we’re happy to get into the specifics with you.
