Data lineage

info

Data lineage in Platform is in public preview. It requires Nextflow 25.04 or later, AWS S3 object storage, and Amazon Simple Queue Service (SQS). For best results, use Nextflow 26.04 or later.

warning

The feature is experimental and subject to change. This page provides the latest configuration recommendations and limitations.

Data lineage tracks the full provenance of every pipeline run at both the task and workflow level, including what executed, what data it consumed, and what outputs it produced. Use it to audit results, verify reproducibility, and trace file provenance.

Why use data lineage

Production pipelines generate results that teams need to trust, audit, and reproduce. Data lineage provides a precise, immutable record of how each result was produced.

Reproducibility: Every run, task, and output file receives a unique lineage ID (LID), a traversable URI that points to a structured record of what ran. Verify that two runs produced identical results, or identify where they diverged.
Auditing and compliance: For teams in regulated industries such as pharma, clinical genomics, and contract research organizations (CROs), lineage provides the audit trail needed for regulatory compliance. Each record captures inputs, outputs, parameters, compute environment, and the user who launched the run.
Debugging: When a cached task unexpectedly re-executes, or a pipeline produces an unexpected result, lineage traces backward from any output to all contributing tasks and parameters. Compare two task runs to isolate what changed.
Broader team access: Exploring Nextflow lineage previously required CLI access and comfort reading raw JSON. Platform now surfaces lineage data in pipeline run detail pages and Data Explorer. Users can inspect provenance directly.
Cross-workflow discoverability: Workflow output labels make output files discoverable across runs. Navigate lineage records by label to find all matching outputs workspace-wide, without knowing which specific run produced a file.

How data lineage works

When lineage is enabled, Nextflow generates a structured JSON record for each entity in your pipeline during workflow execution:

Record type	Description
WorkflowRun	Full pipeline execution: repository, commit ID, parameters, compute environment, session ID, and Platform context (user, workspace, pipeline)
TaskRun	Individual task execution: script, code checksum, inputs, outputs, container, and dependencies
FileOutput	Output file: path, checksum, size, timestamp, and links back to the task and workflow that produced it

Each record gets a lineage ID (LID), a lid:// URI that uniquely identifies the entity. In Platform, every LID and lineage label renders as a clickable link, so you can navigate to all related entities across your organization.
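As a rough illustration of handling a LID programmatically, the sketch below splits a lid:// URI into its parts using only the Python standard library. The example LID value is made up. The sketch assumes that the first component identifies the record and that any remaining path points to an item beneath it. This page does not specify the exact LID layout, so treat this as illustrative only.

```python
from urllib.parse import urlparse


def split_lid(lid: str) -> tuple[str, str]:
    """Split a lid:// URI into (record_id, sub_path).

    Assumption: the component right after 'lid://' identifies the record,
    and anything after the next '/' is a path beneath that record.
    """
    parsed = urlparse(lid)
    if parsed.scheme != "lid":
        raise ValueError(f"not a lineage ID: {lid!r}")
    # urlparse places the first component in netloc and the rest in path.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical LID, for illustration only.
record_id, sub_path = split_lid("lid://0123456789abcdef/results/report.html")
print(record_id, sub_path)
```

Rejecting any other scheme early keeps ordinary s3:// or https:// paths from being mistaken for lineage references.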

Functional flow

Nextflow appends lineage record objects (*.data.json) to the defined object storage bucket.
The bucket filters for objects matching .data.json and sends object-created notifications to the SQS queue.
The SQS queue receives s3:ObjectCreated:* events.
Platform reads the queue to identify the newly created lineage objects and indexes them in the database.
The index enriches the run details.
The index enriches the display of workflow-generated objects in Data Explorer with links to the origin pipeline run and task, sources of the object, and any lineage labels associated with the object.
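The notification step in this flow can be sketched in Python. The function below takes the body of one SQS message carrying a standard S3 event notification and returns the bucket and key of each newly created object whose key ends in .data.json. This is not how Platform itself consumes the queue, which this page does not describe. The bucket name and keys in the sample are hypothetical.

```python
import json
from urllib.parse import unquote_plus

LINEAGE_SUFFIX = ".data.json"


def lineage_objects(message_body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for lineage records in one S3 event message."""
    event = json.loads(message_body)
    found = []
    for record in event.get("Records", []):
        # Inside the event payload the name has no "s3:" prefix, e.g. "ObjectCreated:Put".
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.endswith(LINEAGE_SUFFIX):
            found.append((bucket, key))
    return found


# Hypothetical message body with one lineage record and one unrelated object.
sample = json.dumps({
    "Records": [
        {"eventName": "ObjectCreated:Put",
         "s3": {"bucket": {"name": "example-lineage-bucket"},
                "object": {"key": "lineage/abc123/.data.json"}}},
        {"eventName": "ObjectCreated:Put",
         "s3": {"bucket": {"name": "example-lineage-bucket"},
                "object": {"key": "lineage/abc123/notes.txt"}}},
    ]
})
print(lineage_objects(sample))
```

The suffix check repeats the filter that the bucket notification already applies, so a queue that also receives other events would still yield only lineage records.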

Enable data lineage

To start collecting data lineage for all pipeline runs in your workspace:

Open Settings > Workspace settings.
Select Lineage. If you don't see Lineage listed, contact your system administrator.
Toggle Enable lineage by default on to collect data lineage for all pipeline runs in the workspace, or off to require lineage to be enabled for each pipeline launch. Then choose either Manual or Automatic configuration for lineage resources:
- Manual: Define the credentials, region, object storage bucket and path, SQS queue name, and (optionally) SQS queue ARN.
- Automatic: Define the credentials, region, and (optionally) the object storage bucket and path where lineage data is stored and indexed. This is the default setting. If the storage bucket field is empty, a default bucket is generated for storing lineage data.
Once lineage is configured and enabled, all pipeline runs in the workspace generate data lineage. See Lineage for more information about the settings.

danger

Updating the lineage settings after pipelines have generated lineage data results in the loss of historic lineage data in Platform. The lineage index is tied to the lineage storage bucket and path, so changing either makes existing records inaccessible. To avoid data loss when updating the storage location, first copy all existing lineage data to the new bucket and path (for example, aws s3 cp --recursive s3://old-bucket/path s3://new-bucket/path), then update the workspace setting.

When you launch a pipeline in a lineage-enabled workspace, the Enable lineage toggle in the pipeline Run setup reflects the Enable lineage by default workspace setting. Turn it off to explicitly exclude data lineage for that pipeline run.

tip

Users with the Maintain role or above can toggle lineage on or off when launching a specific pipeline run.

IAM permissions required

Data lineage requires additional AWS IAM permissions. The permissions required depend on the role:

Platform integration credentials (IAM user): see AWS Batch — Data lineage or AWS Cloud — Data lineage
EC2 instance role / head job role (manually managed): see Manual AWS Batch configuration

Lineage labels

Assign lineage labels to output files using the label directive in your Nextflow process definitions. Labels appear in lineage records.

Both Platform labels and Nextflow lineage labels propagate to lineage records. Platform excludes resource labels because they relate to underlying compute resources, not the data itself.

info

Nextflow lineage labels are immutable. They are set at execution time and cannot be changed. Platform labels are mutable by design and can change after a run launches. Changing Platform labels after launch produces a mismatch between Platform run labels and Nextflow lineage labels.

Changing or disabling data lineage

If data lineage is changed from automatically provisioned to manually provisioned, the effect depends on which resources the manual configuration reuses:

New object storage bucket: The bucket notification rule is cleared and the Platform-managed SQS queue is deleted. Some events may be missed. The bucket and its data are preserved.
Same object storage bucket, different SQS queue: The bucket notification rule is redirected to the new SQS queue ARN, and the old Platform-managed SQS queue is deleted. Some events may be missed. The bucket and its data are preserved.
Same object storage bucket, same SQS queue: No cloud provider resources change. All events, the bucket, and its data are preserved.
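The three cases above can be condensed into a small lookup, sketched here in Python as a reading aid. The function name and the keys of the returned dictionary are invented for this example and do not correspond to any Platform API.

```python
def switch_to_manual(same_bucket: bool, same_queue: bool) -> dict:
    """Summarize the effect of moving from automatic to manual provisioning."""
    if not same_bucket:
        # New bucket: the notification rule on the old bucket is cleared.
        rule, managed_queue_deleted = "cleared", True
    elif not same_queue:
        # Same bucket, different queue: notifications point at the new queue.
        rule, managed_queue_deleted = "redirected", True
    else:
        # Same bucket and same queue: nothing changes on the cloud side.
        rule, managed_queue_deleted = "unchanged", False
    return {
        "notification_rule": rule,
        "managed_queue_deleted": managed_queue_deleted,
        # Events may be missed in exactly the cases where the managed queue is deleted.
        "events_may_be_missed": managed_queue_deleted,
        # The bucket and its data are preserved in every case.
        "bucket_data_preserved": True,
    }


print(switch_to_manual(same_bucket=True, same_queue=False))
```

Only reusing both the bucket and the queue avoids the risk of missed events.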

If data lineage is changed from manually provisioned to automatically provisioned, Platform creates a new object storage bucket, SQS queue, and notification. The previously defined bucket and its data, SQS queue, and notifications are preserved.

If data lineage is deactivated:

Automatically provisioned: The notification rule is cleared on the bucket and the SQS queue is deleted. The bucket and its data are preserved.
Manually provisioned: No change to cloud resources. Bucket and data are preserved.

Data lineage displayed in Platform

Workflow run details

When a run is executed with lineage enabled, the run details page displays lineage data across the following tabs:

Run Info: Shows the lineage ID, lineage labels, and the full Platform context captured at execution time: user, workspace, compute environment, pipeline name, revision, and commit ID.
Tasks: Displays the lineage ID and lineage labels for each TaskRun alongside existing task data. You can trace any task back to its lineage record. All task file inputs and outputs, and upstream and downstream tasks linked by lineage records, are displayed.
Inputs: Lists all input datasets and parameters with file paths, types, and lineage IDs and lineage labels where available.
Outputs: Lists all FileOutput records linked to the workflow run: output name, file path, type, lineage ID, and lineage labels. Files link directly to Data Explorer.

tip

All LIDs and lineage labels are clickable links. Click any LID to open the organization-level lineage search pre-filled with that identifier.

note

If more than one Nextflow run publishes a file to the same destination, each run produces its own lineage record for that file. The FileOutput records for published files are saved under the lineage ID of the workflow run, which can be used to differentiate them.

Data Explorer

Output objects from a lineage-enabled run display their LID and any lineage labels when you preview the object in Data Explorer. You can trace any file back to the pipeline run that produced it.

Advanced: Experimenting with data lineage

To test or troubleshoot data lineage for a specific pipeline, add the following to your Nextflow config file under Advanced options when adding a pipeline to the launchpad.

lineage.enabled = true
lineage.store.location = '<PATH_TO_STORAGE>'

To test for a single pipeline run, add the same code to your Nextflow config file under Advanced options when launching the pipeline run.

warning

If data lineage is defined for a workspace, only that data is displayed in Platform. Lineage data configured separately for a specific pipeline or a single pipeline run is only accessible via the AWS S3 console and other related services (such as Amazon Athena).

Costs associated with data lineage

Monthly S3 object storage bucket and SQS costs scale based on the number of pipeline runs launched with lineage enabled.

For a single rnaseq pipeline run launched once per day, typical SQS queue costs are less than $10 USD per month.
