Fusion using AWS Batch and S3 object storage

Fusion simplifies the execution of Nextflow pipelines with AWS Batch and makes it more efficient in several ways:

  1. No need to use the AWS CLI tool to copy data to and from S3, as shown in the sketch after this list.
  2. No need to create a custom AMI or custom containers to include the AWS CLI tool.
  3. Fusion uses an efficient data transfer and caching algorithm that provides much higher throughput than the AWS CLI and does not require a local copy of the data files.
  4. By replacing the AWS CLI with a native API client, transfers are much more robust at scale.
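
For example, because Fusion presents S3 objects through a POSIX file interface, a task script can operate on S3-hosted data with ordinary shell commands. The following is a minimal sketch, where the bucket, file path, and process name are hypothetical:

process COUNT_LINES {
    input:
    path sample_file          // staged from S3 by Fusion, no 'aws s3 cp' needed

    output:
    path 'line_count.txt'     // written back to the S3 work directory by Fusion

    script:
    """
    wc -l ${sample_file} > line_count.txt
    """
}

workflow {
    // Hypothetical input object; Fusion exposes it to the task as a local file.
    COUNT_LINES(Channel.fromPath('s3://my-bucket/data/sample.txt'))
}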

A minimal pipeline configuration looks like the following:

process.executor = 'awsbatch'
process.queue = '<YOUR AWS BATCH QUEUE>'
aws.region = '<YOUR AWS REGION>'
fusion.enabled = true
wave.enabled = true

In the above snippet, replace YOUR AWS BATCH QUEUE and YOUR AWS REGION with the AWS Batch queue and region of your choice, then save the configuration to a file named nextflow.config in the pipeline launch directory.
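
For reference, a filled-in configuration might look like the following, where the queue name and region are hypothetical values:

process.executor = 'awsbatch'
process.queue = 'my-batch-queue'   // hypothetical AWS Batch queue name
aws.region = 'eu-west-1'           // hypothetical AWS region
fusion.enabled = true
wave.enabled = true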

Then launch the pipeline execution with the usual run command:

nextflow run <YOUR PIPELINE SCRIPT> -w s3://<YOUR-BUCKET>/work

Replace YOUR PIPELINE SCRIPT with the URI of your pipeline Git repository and YOUR-BUCKET with an S3 bucket of your choice.
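
For example, assuming a hypothetical repository and bucket:

nextflow run https://github.com/my-org/my-pipeline -w s3://my-bucket/work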

To achieve the best performance, make sure to set up SSD volumes as the temporary directory. See the SSD storage section for details.
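
For instance, assuming the instances in your AWS Batch compute environment have a local NVMe SSD mounted on the host at a hypothetical path such as /scratch, one way to expose it to tasks as the temporary directory is the aws.batch.volumes configuration option:

aws.batch.volumes = '/scratch:/tmp'   // assumes the host SSD is mounted at /scratch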