Overview
Data Studios is a unified platform where you can host a combination of container images and compute environments for interactive analysis using your preferred tools, like JupyterLab and RStudio Notebooks, Visual Studio Code IDEs, or Xpra remote desktops. Each data studio session is an individual interactive environment that encapsulates the live environment for dynamic data analysis.
On Seqera Cloud, the free tier permits only one running data studio session at a time. To run simultaneous sessions, contact Seqera for a Seqera Cloud Pro license.
Data Studios is currently in public preview and is available from Seqera Platform v24.1. Contact Seqera support if you experience any problems during the deployment process. Data Studios in Enterprise is not enabled by default. You can enable Data Studios in the environment variables configuration.
Requirements
Before you get started, you need the following:
- Valid credentials to access your cloud storage data resources.
- At least the Maintain role set of permissions.
- A compute environment with sufficient resources. The resources required depend heavily on the volume of data you want to process, but we recommend at least 2 CPUs and 8192 MB of memory. See AWS Batch for more information about compute environment configuration.
- Data Explorer is enabled.
Currently, Data Studios only supports AWS Batch compute environments that do not have Fargate enabled.
Limitations
If you configured your AWS Batch compute environment to include an EFS file system with EFS file system > EFS mount path, the mount path must be explicitly specified. The mount path cannot be the same as your compute environment (CE) work directory. If the EFS file system is mounted as your CE work directory, Data Studios snapshots cannot be saved and data studio sessions fail.
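The mount path constraint above can be checked before you create the compute environment. The following is a minimal sketch, not part of Seqera Platform: the function name `efs_mount_conflicts` is hypothetical, and it assumes that "the same" means the two paths are equal after normalization (it does not flag a work directory nested under the mount path).

```python
import posixpath


def efs_mount_conflicts(mount_path: str, work_dir: str) -> bool:
    """Return True when the EFS mount path equals the CE work directory.

    Hypothetical helper: compares the two paths after normalization only.
    """
    if not mount_path:
        # The mount path must be explicitly specified.
        raise ValueError("the EFS mount path must be explicitly specified")
    return posixpath.normpath(mount_path) == posixpath.normpath(work_dir)
```

For example, `efs_mount_conflicts("/mnt/efs/", "/mnt/efs")` reports a conflict, because the trailing slash is removed during normalization.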
For more information on AWS Batch configuration, see AWS Batch.
Container image templates
Data Studios provides four container image templates: JupyterLab, RStudio Server, Visual Studio Code, and Xpra. The image templates install a very limited number of packages when the session container is built. You can install additional packages as needed during a session.
The image template tag includes the version of the analysis application, an optional incompatibility flag, and the Seqera Connect version. Connect is the proprietary Seqera webserver client that manages communication with the container. The tag string looks like this:
<tool_version>-[u<update_version>]-<connect_version>
- <tool_version>: Third-party analysis application that follows its own semantic versioning <major>.<minor>.<patch>, such as 4.2.5 for JupyterLab.
- <update_version>: Optional analysis application update version, such as u1, for instances where a backwards incompatible change is introduced.
- <connect_version>: Seqera Connect client version, such as 0.7 or 0.7.0.
Additionally, the Seqera Connect client version string has the format:
<major>.<minor>.<patch>
- <major>: Signifies major version changes in the underlying Seqera Connect client.
- <minor>: Signifies breaking changes in the underlying Seqera Connect client.
- <patch>: Signifies patch (non-breaking) changes in the underlying Seqera Connect client.
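To make the tag layout concrete, here is a minimal sketch that splits a tag into its components. It is an illustration, not a Seqera tool: the names `TAG_PATTERN`, `TemplateTag`, and `parse_template_tag` are hypothetical, and the pattern assumes that the tool version always has three numeric parts and that the optional update segment appears as `-u<digits>`.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: <tool_version>[-u<update_version>]-<connect_version>,
# where the Connect version has two or three numeric parts.
TAG_PATTERN = re.compile(
    r"^(?P<tool>\d+\.\d+\.\d+)"
    r"(?:-u(?P<update>\d+))?"
    r"-(?P<connect>\d+\.\d+(?:\.\d+)?)$"
)


class TemplateTag(NamedTuple):
    tool_version: str
    update_version: Optional[str]
    connect_version: str


def parse_template_tag(tag: str) -> TemplateTag:
    """Split an image template tag into its components."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized image template tag: {tag!r}")
    update = match.group("update")
    return TemplateTag(
        tool_version=match.group("tool"),
        update_version=f"u{update}" if update is not None else None,
        connect_version=match.group("connect"),
    )
```

With these assumptions, `parse_template_tag("4.2.5-0.7")` yields a tool version of `4.2.5`, no update version, and a Connect version of `0.7`.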
When pushed to the container registry, an image template is tagged with the following tags:
- <tool_version>-<major>.<minor>, such as 4.2.3-0.7. When adding a new data studio container template image, this is the tag displayed in Seqera Platform.
- <tool_version>-<major>.<minor>.<patch>, such as 4.2.3-0.7.1.
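The two tags listed above can be derived from a tool version and a full Connect version. This is a small sketch under that reading. The function name `registry_tags` is hypothetical, and the optional update segment is left out because the text does not say how it appears in pushed tags.

```python
from typing import List


def registry_tags(tool_version: str, connect_version: str) -> List[str]:
    """Return the <major>.<minor> and <major>.<minor>.<patch> tag forms."""
    parts = connect_version.split(".")
    if len(parts) != 3:
        raise ValueError("connect_version must be <major>.<minor>.<patch>")
    major, minor, _patch = parts
    return [
        f"{tool_version}-{major}.{minor}",
        f"{tool_version}-{connect_version}",
    ]
```

For JupyterLab 4.2.3 with Connect 0.7.1, this produces `4.2.3-0.7` and `4.2.3-0.7.1`, which matches the examples above.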
To view the latest versions of the images, see public.cr.seqera.io. You can also augment the Seqera-provided image templates or use your own custom container image templates. This approach is recommended for managing reproducible analysis environments. For more information, see Custom environments.
JupyterLab 4.2.5
The default user is the root account. The following conda-forge packages are available by default:
python=3.13.0
pip=24.2
jedi-language-server=0.41.4
jupyterlab=4.2.5
jupyter-collaboration=1.2.0
jupyterlab-git=0.50.1
jupytext=1.16.4
jupyter-dash=0.4.2
ipywidgets=7.8.4
pandas[all]=2.2.3
scikit-learn=1.5.2
statsmodels=0.14.4
itables=2.2.2
seaborn[stats]=0.13.2
altair=5.4.1
plotly=5.24.1
r-ggplot2=3.5.1
nb_black=1.0.7
qgrid=1.3.1
To install additional Python packages during a running session, execute !pip install <packagename> commands in your notebook environment. Additional system-level packages can be installed in a terminal window using apt install <packagename>.
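Before installing anything, it can help to check which Python packages are already present in the session. The following is a minimal sketch using the standard library `importlib.metadata` module. The helper name `installed_versions` is hypothetical, and it only sees Python distributions, so non-Python conda packages such as r-ggplot2 are reported as missing.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Not installed as a Python distribution in this environment.
            found[name] = None
    return found
```

In a notebook cell, `installed_versions(["pandas", "scikit-learn", "plotly"])` returns a dictionary that you can compare against the versions listed above.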
To see the list of all JupyterLab image templates available, see public.cr.seqera.io/repo/platform/data-studio-jupyter.
RStudio Server 4.4.1
The default user is the root account. To install R packages during a running session, execute install.packages("<packagename>") commands in your notebook environment. Additional system-level packages can be installed in a terminal window using apt install <packagename>.
To see the list of all RStudio Server image templates available, see public.cr.seqera.io/repo/platform/data-studio-rstudio.
Visual Studio Code 1.93.1
Visual Studio Code is an integrated development environment (IDE) that supports many programming languages. The default user is the root account. To install extensions during a running session, select Extensions. Additional system-level packages can be installed in a terminal window using apt install <packagename>.
To see the list of all Visual Studio Code image templates available, see public.cr.seqera.io/platform/data-studio-vscode.
Xpra 6.2.0
Xpra, known as screen for X, allows you to run X11 programs by giving you remote access to individual graphical applications. The container template image also installs NVIDIA Linux x64 (AMD64/EM64T) drivers for Ubuntu 22.04 for running GPU-enabled applications. To use these GPU drivers, your compute environment must specify GPU instance families.
The default user is the root account. The image is based on ubuntu:jammy. Additional system-level packages can be installed during a running session in a terminal window using apt install <package_name>.
To see the list of all Xpra image templates available, see public.cr.seqera.io/repo/platform/data-studio-xpra.
Session statuses
Data studios have the following possible statuses:
- building: When a custom environment is building the template image for a new data studio session. The Wave service performs the build action. For more information on this status, see Inspect custom container template build status.
- build-failed: When a custom environment build has failed. This is a non-recoverable error. Logs are provided to assist with troubleshooting. For more information on this status, see Inspect custom container template build status.
- starting: The data studio is initializing.
- running: When a data studio session is running, you can connect to it, copy the data studio URL, or stop it. In addition, the session can continue to process requests/run computations in the absence of an ongoing connection.
- stopping: The recently-running session is in the process of being stopped.
- stopped: When a session is stopped, the associated compute resources are deallocated. You can start or delete the data studio when it's in this state.
- errored: This state most often indicates that there has been an error starting the data studio session but it is in a stopped state. There might be errors reported by the session itself but these will be overwritten with a running status if the data studio session is still running.
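The statuses above, and the actions the descriptions attach to them, can be modeled as a small lookup table. This is an illustrative sketch, not the Seqera Platform API: the names `StudioStatus`, `ALLOWED_ACTIONS`, and `can_perform` are hypothetical, and only the actions explicitly named for the running and stopped states are included.

```python
from enum import Enum
from typing import Dict, FrozenSet


class StudioStatus(Enum):
    BUILDING = "building"
    BUILD_FAILED = "build-failed"
    STARTING = "starting"
    RUNNING = "running"
    STOPPING = "stopping"
    STOPPED = "stopped"
    ERRORED = "errored"


# Only actions named in the status descriptions are listed. Treating every
# other status as having no available actions is an assumption of this sketch.
ALLOWED_ACTIONS: Dict[StudioStatus, FrozenSet[str]] = {
    StudioStatus.RUNNING: frozenset({"connect", "copy_url", "stop"}),
    StudioStatus.STOPPED: frozenset({"start", "delete"}),
}


def can_perform(status: StudioStatus, action: str) -> bool:
    """Return True if the action is listed for the given status."""
    return action in ALLOWED_ACTIONS.get(status, frozenset())
```

For example, `can_perform(StudioStatus("running"), "stop")` is true, while `can_perform(StudioStatus.STOPPED, "connect")` is false.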
If you encounter an error with the public preview release of Data Studios, contact Seqera support.
Session checkpoints
When you start a session, it automatically creates a checkpoint. A checkpoint saves changes that you make to the root filesystem and stores it in the compute environment's pipeline work directory in the .studios/checkpoints folder with a unique name. The checkpoint is updated every five minutes.
When you stop and start a data studio session, or start a new data studio session from a previously created checkpoint, changes such as installed software packages and configuration files are restored and made available in the data studio session. Changes made to mounted data are not included in a checkpoint.
You can rename checkpoints. Each checkpoint name must be unique within a data studio. Spaces in checkpoint names are converted to underscores automatically.
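The naming rules above can be sketched as follows. The function names are hypothetical, and two details are assumptions rather than documented behavior: each space becomes exactly one underscore, and uniqueness is checked case-sensitively after the conversion.

```python
from typing import Iterable


def normalize_checkpoint_name(name: str) -> str:
    """Replace each space with an underscore."""
    return name.replace(" ", "_")


def rename_checkpoint(new_name: str, existing_names: Iterable[str]) -> str:
    """Return the stored name, or raise if it is already used in this data studio."""
    normalized = normalize_checkpoint_name(new_name)
    if normalized in set(existing_names):
        raise ValueError(f"checkpoint name already in use: {normalized!r}")
    return normalized
```

Under these assumptions, renaming a checkpoint to `after pip install` stores it as `after_pip_install`, and the rename is rejected if that name already exists in the same data studio.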
Checkpoint files in the compute environment work directory may be shared by multiple data studios. Each checkpoint file is cleaned up asynchronously after the last data studio referencing the checkpoint is deleted.
Cleanup is performed on a best-effort basis and is not guaranteed. Seqera attempts to remove the checkpoint, but removal can fail if, for example, the compute environment credentials do not have sufficient permissions to delete objects from storage buckets.
Session volume automatic resizing
By default, a session allocates an initial 2 GB of storage. Available disk space is continually monitored, and if it drops below a 1 GB threshold, the file system is dynamically resized to include an additional 2 GB of available disk space.
This approach ensures that a session doesn't initially include unnecessary free disk space, while providing the flexibility to accommodate installation of large software packages required for data analysis.
The maximum storage allocation for a session is limited by the compute environment boot disk size. By default, this is 30 GB. This limit is shared by all sessions running in the same compute environment.
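The resizing rule can be illustrated with a short model. This is a sketch of the behavior as described, not the actual implementation: the constants come from the text above, while the function name `allocation_after_write`, the repeated 2 GB steps, and the hard cap at the maximum are modeling assumptions. It models a single session and ignores the fact that the limit is shared between sessions.

```python
INITIAL_GB = 2     # initial allocation for a new session
THRESHOLD_GB = 1   # resize when free space drops below this value
INCREMENT_GB = 2   # space added by each resize
MAX_GB = 30        # default compute environment boot disk size


def allocation_after_write(allocated_gb: float, used_gb: float,
                           max_gb: float = MAX_GB) -> float:
    """Return the allocation after resizing for the given usage.

    Grows in INCREMENT_GB steps while free space is below THRESHOLD_GB,
    and never exceeds max_gb.
    """
    while allocated_gb - used_gb < THRESHOLD_GB and allocated_gb < max_gb:
        allocated_gb = min(allocated_gb + INCREMENT_GB, max_gb)
    return allocated_gb
```

Starting from the initial 2 GB, a session that has used 1.5 GB has only 0.5 GB free, so this model grows the allocation to 4 GB. A session at 30 GB cannot grow further.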
If the maximum allocation size is reached, it is possible to reclaim storage space using a snapshot.
Stop the active session to trigger a snapshot from the active volume. Data Studios uploads the snapshot to cloud storage with Fusion. When you start a session from the newly saved snapshot, all previous data is loaded and the new session has 2 GB of available space.