Nextflow in Production — Deploying Genomics Pipelines with Wave, Fusion, and AWS Batch

August 19, 2025

Running Nextflow reliably in production — at scale, with reproducibility guarantees — requires more than a working pipeline. Here's the stack.

Nextflow has become the dominant workflow orchestration framework in life sciences for good reasons: it abstracts compute infrastructure, handles parallelism naturally through the dataflow programming model, and has a rich ecosystem of community pipelines (nf-core) covering most common genomics workflows. A bioinformatician who knows Nextflow DSL2 can write a pipeline that runs locally, on an HPC cluster, and in the cloud with minimal modification.

What the Nextflow documentation and most tutorials don’t cover is what it takes to run these pipelines reliably in a production research environment — at scale, under regulatory constraints, with the operational discipline that separates a pipeline you run once from a pipeline you run a thousand times. This post covers the infrastructure decisions that matter, based on our experience deploying production Nextflow environments for pharma and biotech clients on AWS.

The Stack We Deploy

Our production Nextflow deployments for life sciences clients on AWS use:

- Seqera Platform for centralized launching, monitoring, and governance
- AWS Batch as the compute executor
- Wave for on-demand, digest-pinned container builds
- Fusion for S3-native data access
- Terraform for infrastructure provisioning and change control

Each of these is a deliberate choice. Here’s the reasoning.

Why Seqera Platform

Nextflow pipelines can be launched from the command line with nothing more than the Nextflow binary and an execution environment. For research and development, that’s fine. For production, it creates problems: no centralized record of what was run and when, no way for team members to monitor running pipelines without SSH access to the head node, no governance over which pipeline versions are in use, and no integration point for workflow approval processes.

Seqera Platform addresses all of these. It provides a web interface for launching, monitoring, and reviewing pipeline runs — accessible to computational scientists, lab scientists, and quality reviewers without requiring command-line access. Run history is preserved with full metadata: pipeline version, input parameters, compute environment used, resource consumption, and run status.

For regulated research environments, Seqera Platform’s workspace and team features support role-based access control aligned to GxP requirements. Approved pipelines can be published to a shared workspace; scientists select from the approved list rather than running arbitrary pipeline code.

The API is well-designed and supports integration with existing LIMS and ELN systems — a pipeline run can be triggered programmatically when a sequencing run completes and results can be pushed back to the LIMS when the analysis finishes.
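As a sketch of what that LIMS integration can look like, the snippet below builds and submits a launch request against the Platform's workflow launch API. The endpoint path, field names, and all identifiers here are illustrative — verify them against the API version you are running:

```python
import json
import urllib.request

API_URL = "https://api.cloud.seqera.io"  # or your Seqera Platform Enterprise URL


def build_launch_request(pipeline, revision, compute_env_id, params):
    """Build the JSON body for a workflow launch.

    Field names follow the Platform launch API; confirm them against
    the API documentation for your deployment.
    """
    return {
        "launch": {
            "pipeline": pipeline,              # Git URL of the pipeline
            "revision": revision,              # pinned commit SHA, not a branch
            "computeEnvId": compute_env_id,    # target AWS Batch compute environment
            "paramsText": json.dumps(params),  # pipeline parameters as JSON text
        }
    }


def launch(workspace_id, token, body):
    """POST the launch request; returns the new workflow ID on success."""
    req = urllib.request.Request(
        f"{API_URL}/workflow/launch?workspaceId={workspace_id}",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["workflowId"]
```

A LIMS webhook fired on sequencing-run completion would call `build_launch_request` with the run's sample sheet location and then `launch` it, closing the loop without anyone touching a terminal.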

AWS Batch as the Compute Executor

Nextflow supports several compute executors on AWS: AWS Batch, EKS, and EC2 directly. For most life sciences workloads, AWS Batch is the right choice.

Batch manages the underlying EC2 fleet automatically — spinning up instances when jobs are queued, terminating them when the queue is empty — which means you pay only for compute you’re actually using. For genomics workloads that are bursty (a batch of samples arrives for analysis, compute is needed for hours or days, then nothing until the next batch), this is significantly more cost-effective than a persistent HPC cluster.

The configuration that matters:

Managed compute environments. We typically configure two Batch compute environments per research team: an on-demand environment for jobs with strict completion time requirements (samples urgently needed for a clinical decision, for example) and a Spot environment for routine analytical runs where a Spot interruption and restart is acceptable. Spot instances for genomics workloads (c5, r5, and m5 families) typically run at 30–70% of on-demand pricing.

Job queue priority. Multiple job queues with different priorities route urgent versus routine workloads. The Nextflow process.queue directive maps pipeline processes to the appropriate queue based on resource requirements and urgency.

Instance type selection. Genomics workloads are heterogeneous. Alignment and variant calling are memory-intensive; model training tasks may need GPU instances; file format conversion is I/O-bound. We configure Batch compute environments with instance fleets that cover the range of instance types the workload requires, letting Batch select the most cost-effective available instance that meets each job's resource request.
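Queue routing and resource requests come together in the pipeline's Nextflow configuration. A minimal sketch, with hypothetical queue names and labels:

```groovy
// nextflow.config — queue names and labels are illustrative
process {
    executor = 'awsbatch'
    queue    = 'routine-spot'            // default: Spot-backed queue

    withLabel: 'urgent' {
        queue = 'urgent-ondemand'        // strict-deadline jobs on on-demand capacity
    }
    withLabel: 'alignment' {
        cpus   = 16
        memory = '64 GB'                 // memory-heavy alignment and variant calling
    }
    withLabel: 'io_bound' {
        cpus   = 4
        memory = '8 GB'                  // format conversion and similar I/O work
    }
}

aws.region = 'us-east-1'
```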

Container image registry. All pipeline containers are stored in Amazon ECR private registries. ECR replication across regions ensures containers are available near the Batch compute regardless of which AWS region is primary.

Wave Containers for Reproducibility

Software environment reproducibility is a significant challenge in bioinformatics. A pipeline that produces different results on two runs — because a dependency was updated, or a container image was rebuilt with a different base — is not fit for use in regulated research.

Wave is Seqera’s container augmentation service that solves this problem. Instead of relying on container images tagged with a version (which can be overwritten) or by digest (which is correct but operationally cumbersome), Wave builds containers on-demand from a Conda or Spack package specification or a Dockerfile, caches the built image, and returns a digest-pinned reference that is immutable.

In practice: a Nextflow pipeline that uses Wave doesn’t reference a Docker Hub image that might change. It references a Wave-built container whose content is deterministic from the package specification. The same pipeline, run against the same Wave specification 6 months from now, will use the same software environment. That’s the guarantee that regulated research requires.
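Concretely, a Wave-enabled configuration might look like the fragment below — the process name and package version are hypothetical, but the pattern is the point: the container is derived from a pinned Conda specification rather than referenced by a mutable tag.

```groovy
// nextflow.config — build containers with Wave from Conda specs
wave {
    enabled  = true
    strategy = ['conda']        // build from each process's conda directive
}
tower.accessToken = secrets.TOWER_ACCESS_TOKEN

process {
    withName: 'SAMTOOLS_SORT' {
        conda = 'bioconda::samtools=1.19'   // deterministic package spec
    }
}
```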

Wave also integrates with the Fusion filesystem, eliminating an S3 staging step that has historically been a significant performance bottleneck.

Fusion Filesystem for S3-Native Data Access

Traditional Nextflow pipelines stage input data to local storage before processing and stage output data back to S3 after processing. For large genomics datasets — whole genome sequencing FASTQ files can be very large per sample — this staging step adds significant time and cost: data is copied to the local EBS volume (storage cost + IOPS cost), processed, and then copied back to S3 (egress cost).

Fusion is a FUSE-based filesystem layer that makes S3 buckets appear as a local filesystem to the pipeline process. Input data is read directly from S3 with prefetching, and output data is written directly to S3 without staging. The local storage requirement drops to a small buffer rather than the full dataset.

In practice, pipeline wall-clock time for I/O-intensive stages like FASTQ processing and BAM sorting is meaningfully faster when using Fusion compared to traditional staging. Storage costs on EBS drop to near zero for the data staging layer.

Configuration is minimal — Fusion is enabled via the fusion.enabled = true setting in the Nextflow config and the appropriate AWS permissions on the Batch job role. It requires Wave to be enabled (Wave handles the Fusion layer injection into the container).
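Put together, the relevant fragment of a production config is short; the bucket name below is a placeholder:

```groovy
// nextflow.config — Fusion requires Wave to be enabled
wave.enabled    = true
fusion.enabled  = true
workDir         = 's3://my-work-bucket/work'   // hypothetical work bucket
process.scratch = false                         // no local staging needed
```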

Infrastructure as Code

All of this infrastructure — the Batch compute environments and job queues, the ECR registries, the VPC and subnet configuration, the IAM roles, the S3 buckets and lifecycle policies, the Seqera Platform configuration — is provisioned and managed via Terraform. Nothing is configured through the AWS console.

This matters for several reasons:

Reproducibility. A Terraform-managed environment can be reproduced exactly in a different AWS account or region. Development, staging, and production environments are deployed from the same Terraform configuration with environment-specific variable files.

Change control. All infrastructure changes are made via pull request, reviewed, and applied via a CI/CD pipeline. The audit trail of what changed, when, and who approved it is in Git. For regulated environments, this satisfies the change control requirements for the compute infrastructure that runs validated pipelines.

Documentation. Terraform configuration is, itself, documentation of the infrastructure. Auditors reviewing the compute environment can read the Terraform code to understand what’s deployed — far more reliable than documentation that might be out of date.
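To make this concrete, here is a sketch of what a Spot-backed Batch compute environment looks like in Terraform. Resource names, variables, and the vCPU ceiling are hypothetical; the attribute names follow the AWS provider's `aws_batch_compute_environment` resource:

```hcl
# Hypothetical names — a minimal Spot-backed Batch compute environment
resource "aws_batch_compute_environment" "spot" {
  compute_environment_name = "genomics-spot-${var.environment}"
  type                     = "MANAGED"
  service_role             = aws_iam_role.batch_service.arn

  compute_resources {
    type                = "SPOT"
    allocation_strategy = "SPOT_CAPACITY_OPTIMIZED"
    bid_percentage      = 70
    min_vcpus           = 0      # scale to zero when the queue is empty
    max_vcpus           = 1024
    instance_type       = ["c5", "r5", "m5"]
    subnets             = var.private_subnet_ids
    security_group_ids  = [aws_security_group.batch.id]
    instance_role       = aws_iam_instance_profile.ecs_instance.arn
    spot_iam_fleet_role = aws_iam_role.spot_fleet.arn
  }
}
```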

Operational Patterns That Matter

Beyond the technology stack, a few operational practices distinguish stable production Nextflow environments from ones that generate constant support requests:

Pin everything. Pipeline revisions (Git commit SHA, not branch name), container images (digest, not tag), and Nextflow version should all be pinned in production configurations. A pipeline that ran correctly last month should run identically next month without any changes.
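In configuration terms, the pins look like this — the commit SHA, version, and digest below are placeholders, not real values:

```groovy
// prod.config — every moving part pinned (values are placeholders)
// Launch with a pinned revision and Nextflow version, e.g.:
//   NXF_VER=24.10.4 nextflow run org/pipeline -r 9f3c2e1 -c prod.config

process.container = 'quay.io/biocontainers/bwa@sha256:<digest>'  // digest, not tag
```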

Design for Spot interruption. Nextflow’s Batch executor handles Spot interruptions gracefully by default — interrupted tasks are retried. But pipelines need to be designed with checkpointing in mind: long-running processes that can’t be checkpointed should be run on on-demand instances; short-running processes that can be cheaply restarted are appropriate for Spot.

Separate compute environments by workload type. Don’t mix alignment jobs (high memory, moderate CPU) with variant calling jobs (high CPU, moderate memory) in the same compute environment if you can avoid it. Dedicated compute environments with appropriately sized instance types improve cost efficiency and reduce queue contention.

Monitor resource consumption, not just job status. Seqera Platform and CloudWatch provide resource utilization data per task. Regularly reviewing CPU and memory efficiency across pipeline runs reveals over-provisioned resource requests (common in pipelines copied from HPC origins) and under-provisioned requests that cause job failures. Optimizing resource requests can meaningfully reduce Batch costs for mature pipelines.
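One lightweight way to do this review is to parse the trace file Nextflow writes with `-with-trace` and flag over-provisioned tasks. The sketch below assumes the default `name`, `memory`, and `peak_rss` fields rendered in human-readable units (e.g. "6.8 GB"); adjust the parsing to match your trace settings:

```python
import csv
import io

# Multipliers for human-readable sizes as rendered in trace files
_UNITS = {"B": 1, "KB": 2**10, "MB": 2**20, "GB": 2**30, "TB": 2**40}


def _to_bytes(value):
    """Parse a size like '6.8 GB' into bytes."""
    number, unit = value.split()
    return float(number) * _UNITS[unit]


def memory_efficiency(trace_text, threshold=0.5):
    """Flag tasks whose peak RSS used less than `threshold` of requested memory.

    Expects the tab-separated trace produced with -with-trace, containing
    'name', 'memory' (requested), and 'peak_rss' (observed) columns.
    Returns (task name, efficiency) pairs for under-utilized tasks.
    """
    flagged = []
    for row in csv.DictReader(io.StringIO(trace_text), delimiter="\t"):
        requested = _to_bytes(row["memory"])
        used = _to_bytes(row["peak_rss"])
        if requested and used / requested < threshold:
            flagged.append((row["name"], round(used / requested, 2)))
    return flagged
```

Run against a few weeks of traces, a report like this quickly surfaces the processes whose `memory` directives were copied from an HPC submission script and never revisited.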

What Production Readiness Looks Like

A Nextflow pipeline is production-ready when you can answer yes to all of these:

- Is the pipeline pinned to a specific Git commit, every container pinned by digest, and the Nextflow version fixed?
- Can the full compute environment be rebuilt from version-controlled infrastructure code in a new account or region?
- Is there a complete, accessible record of every run — pipeline version, parameters, compute environment, and outcome?
- Are resource requests based on measured utilization rather than copied defaults?
- Is the behavior under Spot interruption understood and acceptable for every process?

Most pipelines that are described as “in production” can’t answer yes to all of these. The ones that can are the ones that are still running reliably two years after deployment.


Nebulaworks is a Seqera partner with production experience deploying Wave, Fusion, and AWS Batch-based Nextflow environments in pharma and biotech settings. If you’re working on a production genomics pipeline deployment, we’d be glad to talk through your requirements.
