RNA-Seq
This guide details how to run bulk RNA sequencing (RNA-Seq) data analysis, from quality control to differential expression analysis, on an AWS Batch compute environment in Platform. It includes:
- Creating an AWS Batch compute environment to run your pipeline and analysis environment
- Adding pipelines to your workspace
- Importing your pipeline input data
- Launching the pipeline and monitoring execution from your workspace
- Setting up a custom analysis environment with Studios
- Resource allocation guidance for RNA-Seq data
You will need the following to get started:
- Admin permissions in an existing organization workspace. See Set up your workspace to create an organization and workspace from scratch.
- An existing AWS cloud account with access to the AWS Batch service.
- Existing access credentials with permissions to create and manage resources in your AWS account. See IAM for guidance to set up IAM permissions for Platform.
Compute environment
Compute and storage requirements for RNA-Seq analysis depend on the number of samples and the sequencing depth of your input data. See RNA-Seq data and requirements for details on RNA-Seq datasets and the CPU and memory requirements for important steps of RNA-Seq pipelines.
In this guide, you will create an AWS Batch compute environment with sufficient resources allocated to run the nf-core/rnaseq pipeline with a large dataset. This compute environment will also be used to run a Studios RStudio environment for interactive analysis of the resulting pipeline data.
The compute recommendations below are based on internal benchmarking performed by Seqera. See RNA-Seq data and requirements for more information.
Recommended compute environment resources
The following compute resources are recommended for production RNA-Seq pipelines, depending on the size of your input dataset:
Setting | Value |
---|---|
Instance Types | m5,r5 |
vCPUs | 2 - 8 |
Memory (GiB) | 8 - 32 |
Max CPUs | >500 |
Min CPUs | 0 |
Fusion file system
The Fusion file system enables seamless read and write operations to cloud object stores, leading to simpler pipeline logic and faster, more efficient execution. While Fusion is not required to run nf-core/rnaseq, it is recommended for optimal performance. See nf-core/rnaseq performance in Platform at the end of this guide.
Fusion works best with AWS NVMe instances (fast instance storage) as this delivers the fastest performance when compared to environments using only AWS EBS (Elastic Block Store). Batch Forge selects instances automatically based on your compute environment configuration, but you can optionally specify instance types. To enable fast instance storage (see Create compute environment below), you must select EC2 instances with NVMe SSD storage (m5d or r5d families).
Fusion requires a license for use in Seqera Platform compute environments or directly in Nextflow. Fusion can be trialed at no cost. Contact Seqera for more details.
Create compute environment
From the Compute Environments tab in your organization workspace, select Add compute environment and complete the following fields:
Field | Description |
---|---|
Name | A unique name for the compute environment. |
Platform | AWS Batch |
Credentials | Select existing credentials, or + to create new credentials: |
Access Key | AWS access key ID. |
Secret Key | AWS secret access key. |
Region | The target execution region. |
Pipeline work directory | An S3 bucket path in the same execution region. |
Enable Wave Containers | Use the Wave containers service to provision containers. |
Enable Fusion v2 | Access your S3-hosted data via the Fusion v2 file system. |
Enable fast instance storage | Use NVMe instance storage to speed up I/O and disk access. Requires Fusion v2. |
Config Mode | Batch Forge |
Provisioning Model | Choose between Spot and On-demand instances. |
Max CPUs | Sensible values for production use range between 2000 and 5000. |
Enable Fargate for head job | Run the Nextflow head job using the Fargate container service to speed up pipeline launch. Requires Fusion v2. |
Allowed S3 buckets | Additional S3 buckets or paths to be granted read-write permission for this compute environment. Add data paths to be mounted in your data studio here, if different from your pipeline work directory. |
Resource labels | name=value pairs to tag the AWS resources created by this compute environment. |
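The table above states two dependencies between fields: fast instance storage and the Fargate head job both require Fusion v2, and the work directory must be an S3 path. The following is a minimal sketch of a pre-flight check for those rules before you fill in the form. The `ComputeEnvSettings` class and `check_settings` function are hypothetical helpers written for this guide, not part of any Platform API, and the example bucket and region are placeholders.

```python
from dataclasses import dataclass


@dataclass
class ComputeEnvSettings:
    """Hypothetical mirror of the form fields above (not a Platform API object)."""
    name: str
    region: str
    work_dir: str
    enable_fusion_v2: bool = True
    fast_instance_storage: bool = True
    fargate_head_job: bool = True
    max_cpus: int = 2000


def check_settings(s: ComputeEnvSettings) -> list[str]:
    """Return a list of problems; an empty list means no conflicts were found."""
    problems = []
    if not s.work_dir.startswith("s3://"):
        problems.append("Pipeline work directory should be an S3 bucket path (s3://...).")
    if s.fast_instance_storage and not s.enable_fusion_v2:
        problems.append("Enable fast instance storage requires Fusion v2.")
    if s.fargate_head_job and not s.enable_fusion_v2:
        problems.append("Enable Fargate for head job requires Fusion v2.")
    if s.max_cpus <= 0:
        problems.append("Max CPUs must be a positive number.")
    return problems


if __name__ == "__main__":
    # Placeholder values; replace with your own bucket and region.
    settings = ComputeEnvSettings(
        name="rnaseq-batch",
        region="eu-west-1",
        work_dir="s3://my-bucket/work",
        enable_fusion_v2=False,
    )
    for problem in check_settings(settings):
        print(problem)
```

With Fusion v2 disabled, the example reports both dependent options as conflicts, which matches the "Requires Fusion v2" notes in the table.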
Add pipeline to Platform
The nf-core/rnaseq pipeline is a highly configurable and robust workflow designed to analyze RNA-Seq data. It performs quality control, alignment and quantification.
Seqera Pipelines is a curated collection of quality open-source pipelines that can be imported directly to your workspace Launchpad in Platform. Each pipeline includes a dataset to use in a test run to confirm compute environment compatibility in just a few steps.
To use Seqera Pipelines to import the nf-core/rnaseq pipeline to your workspace:
- Search for nf-core/rnaseq and select Launch next to the pipeline name in the list. In the Add pipeline tab, select Cloud or Enterprise depending on your Platform account type, then provide the information needed for Seqera Pipelines to access your Platform instance:
- Seqera Cloud: Paste your Platform Access token and select Next.
- Seqera Enterprise: Specify the Seqera Platform URL (hostname) and Base API URL for your Enterprise instance, then paste your Platform Access token and select Next.
Tip: If you do not have a Platform access token, select Get your access token from Seqera Platform to open the Access tokens page in a new browser tab.
- Select your Platform Organization, Workspace, and Compute environment for the imported pipeline.
- (Optional) Customize the Pipeline Name and Pipeline Description.
- Select Add Pipeline.
To add a custom pipeline not listed in Seqera Pipelines to your Platform workspace, see Add pipelines for manual Launchpad instructions.
Pipeline input data
The nf-core/rnaseq pipeline works with input datasets (samplesheets) containing sample names, FASTQ file locations (paths to FASTQ files in cloud or local storage), and strandedness. For example, the dataset used in the test_full profile is derived from the publicly available iGenomes collection of datasets, commonly used in bioinformatics analyses.
This dataset represents RNA-Seq samples from various human cell lines (GM12878, K562, MCF7, and H1) with biological replicates, stored in an AWS S3 bucket (s3://ngi-igenomes) as part of the iGenomes resource. These RNA-Seq datasets consist of paired-end sequencing reads, which can be used to study gene expression patterns in different cell types.
nf-core/rnaseq test_full profile dataset
sample | fastq_1 | fastq_2 | strandedness |
---|---|---|---|
GM12878_REP1 | s3://ngi-igenomes/test-data/rnaseq/SRX1603629_T1_1.fastq.gz | s3://ngi-igenomes/test-data/rnaseq/SRX1603629_T1_2.fastq.gz | reverse |
GM12878_REP2 | s3://ngi-igenomes/test-data/rnaseq/SRX1603630_T1_1.fastq.gz | s3://ngi-igenomes/test-data/rnaseq/SRX1603630_T1_2.fastq.gz | reverse |
K562_REP1 | s3://ngi-igenomes/test-data/rnaseq/SRX1603392_T1_1.fastq.gz | s3://ngi-igenomes/test-data/rnaseq/SRX1603392_T1_2.fastq.gz | reverse |
K562_REP2 | s3://ngi-igenomes/test-data/rnaseq/SRX1603393_T1_1.fastq.gz | s3://ngi-igenomes/test-data/rnaseq/SRX1603393_T1_2.fastq.gz | reverse |
MCF7_REP1 | s3://ngi-igenomes/test-data/rnaseq/SRX2370490_T1_1.fastq.gz | s3://ngi-igenomes/test-data/rnaseq/SRX2370490_T1_2.fastq.gz | reverse |
MCF7_REP2 | s3://ngi-igenomes/test-data/rnaseq/SRX2370491_T1_1.fastq.gz | s3://ngi-igenomes/test-data/rnaseq/SRX2370491_T1_2.fastq.gz | reverse |
H1_REP1 | s3://ngi-igenomes/test-data/rnaseq/SRX2370468_T1_1.fastq.gz | s3://ngi-igenomes/test-data/rnaseq/SRX2370468_T1_2.fastq.gz | reverse |
H1_REP2 | s3://ngi-igenomes/test-data/rnaseq/SRX2370469_T1_1.fastq.gz | s3://ngi-igenomes/test-data/rnaseq/SRX2370469_T1_2.fastq.gz | reverse |
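Before uploading your own samplesheet, it can help to check that it follows the same four-column layout as the table above. The following is a minimal sketch of such a check. The accepted strandedness values, the `.fastq.gz`/`.fq.gz` suffixes, and the rule that `fastq_2` may be empty for single-end data are assumptions to confirm against the pipeline documentation for your version, and `validate_samplesheet` is a hypothetical helper, not part of the pipeline.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]
# Assumption: accepted strandedness values; confirm for your pipeline version.
STRANDEDNESS = {"forward", "reverse", "unstranded", "auto"}
# Assumption: FASTQ files are gzip-compressed with one of these suffixes.
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz")


def validate_samplesheet(text: str) -> list[str]:
    """Return problems found in a CSV samplesheet (empty list if none)."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"header is missing column(s): {', '.join(missing)}"]

    problems = []
    for line_no, raw in enumerate(reader, start=2):  # line 1 is the header
        # Normalize short rows (None values) and ignore surplus fields (None key).
        row = {k: (v or "").strip() for k, v in raw.items() if k is not None}
        if not row["sample"]:
            problems.append(f"line {line_no}: empty sample name")
        if not row["fastq_1"].endswith(FASTQ_SUFFIXES):
            problems.append(f"line {line_no}: fastq_1 is not a gzipped FASTQ path")
        # Only check fastq_2 when present (assumed empty for single-end data).
        if row["fastq_2"] and not row["fastq_2"].endswith(FASTQ_SUFFIXES):
            problems.append(f"line {line_no}: fastq_2 is not a gzipped FASTQ path")
        if row["strandedness"] not in STRANDEDNESS:
            problems.append(f"line {line_no}: unexpected strandedness {row['strandedness']!r}")
    return problems


if __name__ == "__main__":
    sheet = (
        "sample,fastq_1,fastq_2,strandedness\n"
        "GM12878_REP1,"
        "s3://ngi-igenomes/test-data/rnaseq/SRX1603629_T1_1.fastq.gz,"
        "s3://ngi-igenomes/test-data/rnaseq/SRX1603629_T1_2.fastq.gz,"
        "reverse\n"
    )
    print(validate_samplesheet(sheet) or "samplesheet looks consistent")
```

The check only inspects the text of the samplesheet. It does not confirm that the FASTQ files exist or that your credentials can read them.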
In Platform, samplesheets and other data can be made easily accessible in one of two ways:
- Use Data Explorer to browse and interact with remote data from AWS S3, Azure Blob Storage, and Google Cloud Storage repositories, directly in your organization workspace.
- Use Datasets to upload structured data to your workspace in CSV (Comma-Separated Values) or TSV (Tab-Separated Values) format.
Add a cloud bucket via Data Explorer
Private cloud storage buckets accessible with the credentials in your workspace are added to Data Explorer by default. You can also add custom directory paths within buckets to your workspace to simplify direct access.
To add individual buckets (or directory paths within buckets):
- From the Data Explorer tab, select Add cloud bucket.
- Specify the bucket details:
- The cloud Provider.
- An existing cloud Bucket path.
- A unique Name for the bucket.
- The Credentials used to access the bucket. For public cloud buckets, select Public.
- An optional bucket Description.
- Select Add.
You can now select data directly from this bucket as input when launching your pipeline, without the need to interact with cloud consoles or CLI tools.
Add a dataset
From the Datasets tab, select Add Dataset.
Specify the following dataset details:
- A Name for the dataset, such as nf-core-rnaseq-dataset.
- A Description for the dataset.
- Select the First row as header option to prevent Platform from parsing the header row of the samplesheet as sample data.
- Select Upload file and browse to your CSV or TSV samplesheet file in local storage, or drag and drop it into the box.
The dataset is now listed in your organization workspace datasets and can be selected as input when launching your pipeline.
Platform does not store the data used for analysis in pipelines. The dataset must specify the locations of data stored on your own infrastructure.
Launch pipeline
This guide is based on version 3.15.1 of the nf-core/rnaseq pipeline. Launch form parameters and tools may differ in other versions.
With your compute environment created, nf-core/rnaseq added to your workspace Launchpad, and your samplesheet accessible in Platform, you are ready to launch your pipeline. Navigate to the Launchpad and select Launch next to nf-core-rnaseq to open the launch form.
The launch form consists of General config, Run parameters, and Advanced options sections to specify your run parameters before execution, and an execution summary. Use section headings or select the Previous and Next buttons at the bottom of the page to navigate between sections.
General config
- Pipeline to launch: The pipeline Git repository name or URL. For saved pipelines, this is prefilled and cannot be edited.
- Revision number: A valid repository commit ID, tag, or branch name. For saved pipelines, this is prefilled and cannot be edited.
- Config profiles: One or more configuration profile names to use for the execution. Config profiles must be defined in the nextflow.config file in the pipeline repository.
- Workflow run name: An identifier for the run, pre-filled with a random name. This can be customized.
- Labels: Assign new or existing labels to the run.
- Compute environment: Your AWS Batch compute environment.
- Work directory: The cloud storage path where pipeline scratch data is stored. Platform will create a scratch sub-folder if only a cloud bucket location is specified.
Note: The credentials associated with the compute environment must have access to the work directory.
Run parameters
There are three ways to enter Run parameters prior to launch:
- The Input form view displays form fields to enter text or select attributes from lists, and browse input and output locations with Data Explorer.
- The Config view displays raw configuration text that you can edit directly. Select JSON or YAML format from the View as list.
- Upload params file allows you to upload a JSON or YAML file with run parameters.
Platform uses the nextflow_schema.json file in the root of the pipeline repository to dynamically create a form with the necessary pipeline parameters.
Specify your pipeline input and output and modify other pipeline parameters as needed.
input
Use Browse to select your pipeline input data:
- In the Data Explorer tab, select the existing cloud bucket that contains your samplesheet, browse or search for the samplesheet file, and select the chain icon to copy the file path before closing the data selection window and pasting the file path in the input field.
- In the Datasets tab, search for and select your existing dataset.
outdir
Use the outdir parameter to specify where the pipeline outputs are published. outdir must be unique for each pipeline run. Otherwise, your results will be overwritten.
Browse and copy cloud storage directory paths using Data Explorer, or enter a path manually.
Modify other parameters to customize the pipeline execution through the parameters form. For example, under Read trimming options, change the trimmer and select fastp instead of trimgalore.
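If you prefer the Config view or the Upload params file option, the same three parameters discussed above (input, outdir, and trimmer) can be written as JSON. The following is a minimal sketch that also derives a new outdir for each run by appending a UTC timestamp. The bucket paths and the `build_params` helper are hypothetical, and the timestamp scheme is just one possible way to keep outdir unique.

```python
import json
from datetime import datetime, timezone


def build_params(samplesheet: str, results_root: str, run_label: str) -> dict:
    """Assemble run parameters with a timestamped, per-run output directory."""
    # A second-resolution UTC timestamp keeps outdir distinct between runs
    # that are launched at different times.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return {
        "input": samplesheet,
        "outdir": f"{results_root.rstrip('/')}/{run_label}-{stamp}",
        "trimmer": "fastp",  # selected instead of trimgalore
    }


if __name__ == "__main__":
    # Placeholder paths; replace with your own bucket and samplesheet.
    params = build_params(
        samplesheet="s3://my-bucket/samplesheet.csv",
        results_root="s3://my-bucket/results",
        run_label="rnaseq",
    )
    # Paste this output into the Config view, or save it as a .json file to upload.
    print(json.dumps(params, indent=2))
```

Two runs launched within the same second would receive the same outdir, so add a further suffix if you launch runs in rapid succession.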
Advanced settings
- Use resource labels to tag the computing resources created during the workflow execution. While resource labels for the run are inherited from the compute environment and pipeline, workspace admins can override them from the launch form. Applied resource label names must be unique.
- Pipeline secrets store keys and tokens used by workflow tasks to interact with external systems. Enter the names of any stored user or workspace secrets required for the workflow execution.
- See Advanced options for more details.
After you have filled in the necessary launch details, select Launch. The Runs tab shows your new run in a submitted status at the top of the list. Select the run name to navigate to the View Workflow Run page and view the configuration, parameters, status of individual tasks, and run report.
Run monitoring
Select your new run from the Runs tab list to view the run details.
Run details page
As the pipeline runs, run details will populate with the following tabs:
- Command-line: The Nextflow command invocation used to run the pipeline. This includes details about the pipeline version (-r flag) and profile, if specified (-profile flag).
- Parameters: The exact set of parameters used in the execution. This is helpful for reproducing the results of a previous run.
- Resolved Nextflow configuration: The full Nextflow configuration settings used for the run. This includes parameters, but also settings specific to task execution (such as memory, CPUs, and output directory).
- Execution Log: A summarized Nextflow log providing information about the pipeline and the status of the run.
- Datasets: Link to datasets, if any were used in the run.
- Reports: View pipeline outputs directly in Platform.
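To illustrate how the -r and -profile flags in the Command-line tab relate to the launch form, the following minimal sketch assembles a comparable `nextflow run` command from a pipeline name, revision, and profile list. The `nextflow_command` helper is hypothetical, and the command Platform actually records may contain additional options, so treat this as an approximation rather than the exact invocation.

```python
import shlex


def nextflow_command(pipeline, revision=None, profiles=(), params_file=None) -> str:
    """Build a shell-safe `nextflow run` command string."""
    args = ["nextflow", "run", pipeline]
    if revision:
        args += ["-r", revision]  # pipeline version: commit ID, tag, or branch
    if profiles:
        args += ["-profile", ",".join(profiles)]  # comma-separated config profiles
    if params_file:
        args += ["-params-file", params_file]  # JSON or YAML run parameters
    return shlex.join(args)


if __name__ == "__main__":
    print(nextflow_command("nf-core/rnaseq", "3.15.1", ["test_full"], "params.json"))
```

Comparing a command built this way with the one shown in the Command-line tab is a quick way to confirm which revision and profiles a past run used.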
View reports
Most Nextflow pipelines generate reports or output files which are useful to inspect at the end of the pipeline execution. Reports can contain quality control (QC) metrics that are important to assess the integrity of the results.
For example, for the nf-core/rnaseq pipeline, view the MultiQC report generated. MultiQC is a helpful reporting tool to generate aggregate statistics and summaries from bioinformatics tools.
The paths to report files point to a location in cloud storage (in the outdir directory specified during launch), but you can view the contents directly and download each file without navigating to the cloud or a remote filesystem.
See Reports for more information.
View general information
The run details page includes general information about who executed the run, when it was executed, the Git commit ID and/or tag used, and additional details about the compute environment and Nextflow version used.
View details for a task
Scroll down the page to view:
- The progress of individual pipeline Processes
- Aggregated stats for the run (total walltime, CPU hours)
- Workflow metrics (CPU efficiency, memory efficiency)
- A Task details table for every task in the workflow
The task details table provides further information on every step in the pipeline, including task statuses and metrics.
Task details
Select a task in the task table to open the Task details dialog. The dialog has three tabs:
- The About tab contains extensive task execution details.
- The Execution log tab provides a real-time log of the selected task's execution. Task execution and other logs (such as stdout and stderr) are available for download from here, if still available in your compute environment.
- The Data Explorer tab allows you to view the task working directory directly in Platform.
Nextflow hash-addresses each task of the pipeline and creates unique directories based on these hashes. Data Explorer allows you to view the log files and output files generated for each task in its working directory, directly within Platform. You can view, download, and retrieve the link for these intermediate files in cloud storage from the Data Explorer tab to simplify troubleshooting.
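As an illustration of the hash-based layout, Nextflow places each task directory under the work directory using the first two characters of the task hash as a parent folder and the remaining characters as the folder name. The following minimal sketch builds that path. The `task_work_dir` helper and the bucket path are hypothetical, and hashes shown in logs are usually abbreviated, so the result may be a path prefix rather than the complete directory name.

```python
def task_work_dir(work_dir: str, task_hash: str) -> str:
    """Map a task hash such as 'a1/b2c3d4' to its location under the work directory."""
    digest = task_hash.replace("/", "")  # accept hashes with or without the slash
    if len(digest) < 3:
        raise ValueError("task hash is too short to split into <2 chars>/<rest>")
    return f"{work_dir.rstrip('/')}/{digest[:2]}/{digest[2:]}"


if __name__ == "__main__":
    # Placeholder work directory and an abbreviated hash as it appears in logs.
    print(task_work_dir("s3://my-bucket/work", "a1/b2c3d4"))
```

You can use the resulting path as a starting point when browsing the work directory in Data Explorer.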