Datasets

Datasets are CSV (comma-separated values) and TSV (tab-separated values) files stored in, or linked to, a workspace. Use them as pipeline inputs to simplify data management, reduce data-entry errors, and support reproducible analyses.

On the datasets screen, you can:

Upload a dataset directly or link to an externally hosted dataset.
View the count of pipeline runs in the workspace that have used a specific dataset input.
Apply multiple labels to datasets for easier searching and grouping.
Sort datasets by name, most recently updated, and most recently used.
Hide datasets that are not used in the workspace.
View dataset metadata (created by, last updated, last used).
Edit dataset details (name, description, and labels).
Create new versions of an uploaded dataset.

Benefits

Datasets reduce errors from manual data entry when you launch pipelines.
Datasets can be generated automatically in response to events (such as new-file notifications from S3 storage).
Datasets can simplify differential data analysis when you use the same pipeline to launch a run for each dataset as it becomes available.

Format

The most commonly used datasets for Nextflow pipelines are sample sheets, where each row contains a sample identifier, the location of that sample's files (such as FASTQ files), and other sample details. For example, nf-core/rnaseq works with input datasets (sample sheets) that include sample names, FASTQ file locations, and strandedness annotations. The Seqera Community Showcase sample dataset for nf-core/rnaseq looks like this:

Example rnaseq dataset

sample	fastq_1	fastq_2	strandedness
WT_REP1	s3://nf-core-awsmegatests/rnaseq/...	s3://nf-core-awsmegatests/rnaseq/...	reverse
WT_REP1	s3://nf-core-awsmegatests/rnaseq/...	s3://nf-core-awsmegatests/rnaseq/...	reverse
WT_REP2	s3://nf-core-awsmegatests/rnaseq/...	s3://nf-core-awsmegatests/rnaseq/...	reverse
RAP1_UNINDUCED_REP1	s3://nf-core-awsmegatests/rnaseq/...		reverse
RAP1_UNINDUCED_REP2	s3://nf-core-awsmegatests/rnaseq/...		reverse
RAP1_UNINDUCED_REP2	s3://nf-core-awsmegatests/rnaseq/...		reverse
RAP1_IAA_30M_REP1	s3://nf-core-awsmegatests/rnaseq/...	s3://nf-core-awsmegatests/rnaseq/...	reverse
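
As a rough illustration of how a sample sheet like this can be read programmatically, the sketch below parses tab-separated text with Python's standard `csv` module and groups rows by sample name. The inline text copies a few rows from the example above, with the elided paths left as `...`. The names `SAMPLE_SHEET` and `group_rows_by_sample` are hypothetical and are not part of Seqera or nf-core. Treating an empty `fastq_2` column as single-end is an assumption made for this sketch.

```python
import csv
import io
from collections import defaultdict

# Inline copy of a few rows from the example above; paths stay elided.
SAMPLE_SHEET = (
    "sample\tfastq_1\tfastq_2\tstrandedness\n"
    "WT_REP1\ts3://nf-core-awsmegatests/rnaseq/...\ts3://nf-core-awsmegatests/rnaseq/...\treverse\n"
    "WT_REP1\ts3://nf-core-awsmegatests/rnaseq/...\ts3://nf-core-awsmegatests/rnaseq/...\treverse\n"
    "RAP1_UNINDUCED_REP1\ts3://nf-core-awsmegatests/rnaseq/...\t\treverse\n"
)


def group_rows_by_sample(text, delimiter="\t"):
    """Return {sample: [row, ...]}, treating the first row as the header."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    groups = defaultdict(list)
    for row in reader:
        groups[row["sample"]].append(row)
    return dict(groups)


groups = group_rows_by_sample(SAMPLE_SHEET)
for sample, rows in groups.items():
    # Assumption: an empty fastq_2 value means the row has no second read file.
    paired = all(row["fastq_2"] for row in rows)
    print(sample, len(rows), "paired-end" if paired else "single-end")
```

Passing `delimiter=","` would handle a CSV version of the same sheet.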

note

Use Data Explorer to browse for cloud storage objects directly and copy the object paths to be used in your datasets.

Automation and pipeline schemas

Combine datasets, secrets, and actions to automate workflows that curate your data and maintain and launch pipelines in response to specific events. See workflow-automation for an example of pipeline workflow automation.

For your pipeline to use your dataset as input during runtime, information about the dataset and file format must be included in the relevant parameters of your pipeline schema. The pipeline schema specifies the accepted dataset file type in the mimetype attribute (either text/csv or text/tsv).
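
To make the mimetype requirement concrete, the sketch below searches a schema for parameters that declare a `mimetype` attribute. The schema fragment is a trimmed, hypothetical example rather than a copy of any real pipeline's nextflow_schema.json, and `find_mimetypes` is an illustrative helper, not a Seqera or Nextflow function.

```python
import json

# Hypothetical, trimmed schema fragment; a real nextflow_schema.json has more fields.
SCHEMA_TEXT = """
{
  "properties": {
    "input": {
      "type": "string",
      "format": "file-path",
      "mimetype": "text/csv"
    },
    "outdir": {
      "type": "string"
    }
  }
}
"""


def find_mimetypes(node):
    """Recursively collect {parameter_name: mimetype} from a parsed schema."""
    found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, dict) and "mimetype" in value:
                found[key] = value["mimetype"]
            found.update(find_mimetypes(value))
    elif isinstance(node, list):
        for item in node:
            found.update(find_mimetypes(item))
    return found


schema = json.loads(SCHEMA_TEXT)
print(find_mimetypes(schema))
```

The search is recursive because parameters may be nested inside grouped definitions rather than sitting at the top level of the schema.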

Dataset file content requirements and validation

Datasets can point to files stored in Amazon S3, GitHub, Hugging Face, and other locations. To stage the file paths defined in the dataset, Nextflow requires access to the infrastructure where the files reside, whether on cloud or HPC systems. Add the access keys for data sources that require authentication to your secrets.

note

Seqera doesn't validate your dataset file contents. While datasets can contain static file links, you're responsible for maintaining access to that data.

Add a dataset

All Seqera user roles have access to the datasets feature in organization workspaces. There are two ways to add a dataset:

Direct upload: Best when you need immutability and the file is under 10 MB.
Link to an externally hosted file: Best for large files, but availability and immutability depend on the external hosting service.

Direct upload

In the sidebar navigation, select Datasets.
Select Add Dataset and choose Upload file.
Complete the Name and Description fields using information relevant to your dataset.
Optionally add one or more Labels to your dataset. You can use labels as a search filter, but they don't apply to other resources in Seqera.
Upload a dataset to your workspace with drag-and-drop, or use the Upload file dialog to browse for the file.
For datasets that use their first row for column names, customize the dataset view using the Set first row as header option.
Select Add.

warning

The size of the uploaded dataset file cannot exceed 10 MB.
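
A local check before uploading can catch oversized files early. The sketch below is one possible pre-flight check, not part of Seqera. It assumes the 10 MB limit means 10 * 1000 * 1000 bytes (the stricter reading, since the exact byte count isn't stated here) and that the file extension reflects the format. `check_upload_candidate` is a hypothetical helper name.

```python
import os
import tempfile

# Assumption: treat 10 MB as 10 * 1000 * 1000 bytes, the stricter interpretation.
MAX_UPLOAD_BYTES = 10 * 1000 * 1000
# Assumption: the file extension is a reliable indicator of CSV or TSV content.
ALLOWED_EXTENSIONS = {".csv", ".tsv"}


def check_upload_candidate(path):
    """Return a list of problems; an empty list means these local checks passed."""
    problems = []
    extension = os.path.splitext(path)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    size = os.path.getsize(path)
    if size > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size} bytes, over the {MAX_UPLOAD_BYTES}-byte limit")
    return problems


# Usage: write a tiny sample sheet to a temporary directory and check it.
with tempfile.TemporaryDirectory() as workdir:
    candidate = os.path.join(workdir, "samplesheet.csv")
    with open(candidate, "w", newline="") as handle:
        handle.write("sample,fastq_1,fastq_2,strandedness\n")
    print(check_upload_candidate(candidate))
```

Files that fail the size check are candidates for linking to an externally hosted file instead.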

Link to an externally hosted file

In the sidebar navigation, select Datasets.
Select Add Dataset and choose Link to URL.
Complete the Name and Description fields using information relevant to your dataset.
Optionally add one or more Labels to your dataset. You can use labels as a search filter, but they don't apply to other resources in Seqera.
Copy and paste the dataset URL into the Dataset URL field.
For datasets that use their first row for column names, customize the dataset view using the Set first row as header option.
Select Add.
The dataset appears with a Linked badge.

Manage dataset versions

For directly uploaded datasets, Seqera can manage multiple versions.

note

For linked datasets, versioning is unavailable.

Add a dataset version

Select the three dots next to the dataset that you want to add a version to.
Select Add version.
Upload a dataset to your workspace with drag-and-drop, or use the Upload file dialog to browse for the file.
For datasets that use their first row for column names, customize the dataset view using the Set first row as header option.
Select Add.

caution

All subsequent versions of a dataset must use the same format (CSV or TSV) as the initial version.

View dataset versions

To see all versions of a dataset, use the Show drop-down in the Preview tab. Seqera automatically displays a preview of the most recent version and flags it as (latest), unless that version is disabled.

To preview previous dataset versions, change the version from the Show drop-down. The Created by and Created on values also change.

To download a dataset version, select the Download icon.

To copy a permalink to the dataset, select the Copy icon.

Disable a dataset version

To disable one or more dataset versions, select Disable version. A disabled version cannot be selected as a pipeline input. If you disable the most recent version, the most recent non-disabled version is flagged as (latest).

note

For compliance reasons, datasets and dataset versions cannot be deleted. Datasets can only be hidden, and dataset versions can only be disabled.

Once disabled, a dataset version cannot be re-enabled.
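
The (latest) rule described above can be modeled in a few lines. This is a sketch of the stated behavior, not Seqera's implementation. The `DatasetVersion` class and `latest_enabled` function are hypothetical, and the sketch assumes that higher version numbers are more recent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetVersion:
    number: int  # Assumption: a higher number means a more recent version.
    disabled: bool = False


def latest_enabled(versions: List[DatasetVersion]) -> Optional[DatasetVersion]:
    """Return the most recent version that is not disabled, or None if all are disabled."""
    enabled = [version for version in versions if not version.disabled]
    return max(enabled, key=lambda version: version.number, default=None)


versions = [DatasetVersion(1), DatasetVersion(2), DatasetVersion(3)]
print(latest_enabled(versions).number)

# Disabling the most recent version moves the (latest) flag to the one before it.
versions[2].disabled = True
print(latest_enabled(versions).number)
```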

Use a dataset

To use a dataset with pipelines added to your workspace:

Open any pipeline that contains a pipeline schema from the Launchpad.
Select the input field for the pipeline, removing any default values.
Pick the dataset to use as input to your pipeline.

note

The input field drop-down displays only datasets that match the file type specified in the nextflow_schema.json of the chosen pipeline. If the schema specifies "mimetype": "text/csv", no TSV datasets are available for use with that pipeline, and vice versa. If multiple dataset versions exist, the pipeline input always defaults to the latest version.
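
The filtering described in this note amounts to an exact match on the mimetype. The sketch below models it with hypothetical dataset records. The record fields and the `selectable_datasets` helper are assumptions for illustration and do not reflect how Seqera stores datasets.

```python
# Hypothetical dataset records; the field names are illustrative only.
DATASETS = [
    {"name": "rnaseq_samples", "mimetype": "text/csv"},
    {"name": "chipseq_samples", "mimetype": "text/tsv"},
    {"name": "atacseq_samples", "mimetype": "text/csv"},
]


def selectable_datasets(datasets, schema_mimetype):
    """Return the names of datasets whose type matches the schema's mimetype."""
    return [d["name"] for d in datasets if d["mimetype"] == schema_mimetype]


csv_names = selectable_datasets(DATASETS, "text/csv")
tsv_names = selectable_datasets(DATASETS, "text/tsv")
print(csv_names)
print(tsv_names)
```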

Manage datasets

View runs

To view a list of all pipeline runs in a workspace that have used a specific dataset as input, do either of the following:

Select the three dots next to a dataset and select View runs.
Select the number in the Runs column.

Toggle dataset visibility

To hide a dataset that is no longer used in your workspace, select the three dots next to the dataset and select Mark dataset as hidden. To show a hidden dataset, select Mark dataset as visible. This visibility setting applies to all workspace users.

You can toggle between Visible, Hidden, and All datasets in the Show drop-down on the main datasets page.

note

Hidden datasets do not count toward your per-workspace quota.

Filter datasets

Filter the list of datasets to display only datasets that match one or more filters defined in the Search datasets field. Select the info icon to see the list of available filters.

Edit dataset details

Select the three dots next to a dataset to edit the name, description, and labels associated with a dataset.
