AWS Batch
Fusion Snapshots enable checkpoint/restore functionality for Nextflow processes running on AWS Batch Spot instances. When a Spot instance interruption occurs, AWS provides a guaranteed 120-second warning window to checkpoint and save the task state before the instance terminates.
Seqera Platform compute environment requirements
Fusion Snapshots require the following Seqera Platform compute environment configuration:
- Provider: AWS Batch
- Work directory: S3 bucket in the same region as compute resources
- Fusion Snapshots (beta): Enabled
- Config mode: Batch Forge
- Provisioning model: Spot
- AMI: See Selecting an AMI for details
- Instance type: See Selecting an EC2 instance for details
Fusion Snapshots work with sensible defaults (e.g., 5 automatic retry attempts). For configuration options, see Advanced configuration.
Selecting an AMI
For optimal performance, Fusion Snapshots require instances that run an ECS container-optimized AMI based on Amazon Linux 2023, which ships with Linux kernel 6.1.
Seqera Cloud
Seqera Cloud AWS Batch compute environments use an ECS container-optimized AMI by default. No additional AMI configuration is required.
Seqera Enterprise
Specify an Amazon Linux 2023 ECS-optimized AMI for your region when creating your compute environment.
To find the recommended AMI:
- Retrieve the recommended AMI metadata from AWS Systems Manager:

      export REGION=<AWS_REGION>
      aws ssm get-parameter --name "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended" --region $REGION

  Replace <AWS_REGION> with your AWS region (for example, eu-central-1).

  The output for the eu-central-1 region is similar to the following:

      {
          "Parameter": {
              "Name": "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
              "Type": "String",
              "Value": "{\"ecs_agent_version\":\"1.88.0\",\"ecs_runtime_version\":\"Docker version 25.0.6\",\"image_id\":\"ami-0281c9a5cd9de63bd\",\"image_name\":\"al2023-ami-ecs-hvm-2023.0.20241115-kernel-6.1-x86_64\",\"image_version\":\"2023.0.20241115\",\"os\":\"Amazon Linux 2023\",\"schema_version\":1,\"source_image_name\":\"al2023-ami-minimal-2023.6.20241111.0-kernel-6.1-x86_64\"}",
              "Version": 61,
              "LastModifiedDate": "2024-11-18T17:08:46.926000+01:00",
              "ARN": "arn:aws:ssm:eu-central-1::parameter/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
              "DataType": "text"
          }
      }

- Identify the image_id in your output (for example, ami-0281c9a5cd9de63bd in the output above) and set it in the Advanced options > AMI ID field when you create your Seqera compute environment.
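The Value field in the command output is a JSON document encoded as a string, so the image_id has to be read in two parsing steps. The following is a minimal sketch of that extraction, assuming the command output has been captured as text. The function name extract_image_id and the abbreviated sample payload are illustrative and not part of any Seqera or AWS tooling.

```python
import json


def extract_image_id(ssm_output: str) -> str:
    """Return the image_id from `aws ssm get-parameter` JSON output."""
    parameter = json.loads(ssm_output)["Parameter"]
    # The Value field is itself JSON encoded as a string, so parse it again.
    ami_metadata = json.loads(parameter["Value"])
    return ami_metadata["image_id"]


# Abbreviated sample built from the example output above (illustrative only).
sample_output = json.dumps({
    "Parameter": {
        "Name": "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended",
        "Type": "String",
        "Value": json.dumps({
            "image_id": "ami-0281c9a5cd9de63bd",
            "os": "Amazon Linux 2023",
        }),
    }
})

print(extract_image_id(sample_output))
```

In practice, the input would be the real command output for your region rather than the sample payload.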
Selecting an EC2 instance
AWS provides a guaranteed 120-second reclamation window. Select instance types that can transfer checkpoint data within this timeframe. Checkpoint time is primarily determined by memory usage. Other factors like the number of open file descriptors also affect performance.
When you select an EC2 instance:
- Select instances with guaranteed network bandwidth, not "up to" values.
- Keep the ratio between memory (GiB) and network bandwidth (Gbps) at about 5:1 or lower.
- Prefer NVMe storage instances (those with a d suffix: c6id, r6id, m6id).
- Use x86_64 instances for incremental snapshots.
For example, a c6id.8xlarge instance provides 64 GiB of memory and 12.5 Gbps of guaranteed network bandwidth, a ratio of about 5:1. This configuration can transfer the entire memory contents to S3 in approximately 70 seconds. Instances with substantially higher memory:bandwidth ratios may not complete transfers before termination, which risks task failures.
| Instance type | Cores | Memory (GiB) | Network bandwidth (Gbps) | Memory:bandwidth ratio | Estimated snapshot time |
|---|---|---|---|---|---|
| c6id.4xlarge | 16 | 32 | 12.5 | 2.56:1 | ~45 seconds |
| c6id.8xlarge | 32 | 64 | 12.5 | 5.12:1 | ~70 seconds |
| r6id.2xlarge | 8 | 16 | 12.5 | 1.28:1 | ~20 seconds |
| m6id.4xlarge | 16 | 64 | 12.5 | 5.12:1 | ~70 seconds |
| c6id.12xlarge | 48 | 96 | 18.75 | 5.12:1 | ~70 seconds |
| r6id.4xlarge | 16 | 128 | 12.5 | 10.24:1 | ~105 seconds |
| m6id.8xlarge | 32 | 128 | 25 | 5.12:1 | ~70 seconds |
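The ratio column can be reproduced with simple arithmetic, and the same figures give a theoretical lower bound on transfer time. The sketch below assumes the link sustains its full rated bandwidth with no protocol or checkpoint overhead, which is why its results come out lower than the estimated snapshot times in the table. The function names are illustrative, and the instance figures are copied from the table rather than queried from AWS.

```python
# Figures copied from the table: (instance type, memory in GiB, bandwidth in Gbps).
INSTANCES = [
    ("c6id.4xlarge", 32, 12.5),
    ("c6id.8xlarge", 64, 12.5),
    ("r6id.4xlarge", 128, 12.5),
    ("m6id.8xlarge", 128, 25),
]

RECLAMATION_WINDOW_SECONDS = 120  # Spot interruption warning window


def memory_bandwidth_ratio(memory_gib: float, bandwidth_gbps: float) -> float:
    """Memory (GiB) divided by network bandwidth (Gbps)."""
    return memory_gib / bandwidth_gbps


def ideal_transfer_seconds(memory_gib: float, bandwidth_gbps: float) -> float:
    """Lower bound on the time to send all memory at full line rate.

    Assumes 1 GiB = 2**30 bytes and 1 Gbps = 1e9 bits per second,
    with no overhead of any kind (an idealized assumption).
    """
    bits = memory_gib * 2**30 * 8
    return bits / (bandwidth_gbps * 1e9)


for name, memory_gib, bandwidth_gbps in INSTANCES:
    ratio = memory_bandwidth_ratio(memory_gib, bandwidth_gbps)
    seconds = ideal_transfer_seconds(memory_gib, bandwidth_gbps)
    print(
        f"{name}: ratio {ratio:.2f}:1, ideal transfer at least {seconds:.0f} s "
        f"of the {RECLAMATION_WINDOW_SECONDS} s window"
    )
```

Treat the computed time as a floor, not a prediction: real snapshot times depend on actual memory usage and other factors such as open file descriptors.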
Incremental snapshots are enabled by default on x86_64 instances.
Resource limits
A single job can request more resources than are available on a single instance. To prevent this, set resource limits using the process.resourceLimits directive in your Nextflow configuration. See Resource limits for more information.
Manual cleanup
The /fusion folder in object storage may need manual cleanup. Administrators should verify Fusion has properly cleaned up and remove the folder if necessary.