Version: 23.2

AWS troubleshooting

Unable to mount the FSx filesystem

When mounting a newly created FSx filesystem in the compute environment, the following error appears in /var/log/tower-forge.log:

mount.lustre: Can't parse NID 'fs-xxxxxxxxxxxx.fsx.us-east-1.amazonaws.com@tcp:/xxxxxxx

This mount helper should only be invoked via the mount (8) command, e.g. mount -t lustre dev dir

SOLUTION

Enable DNS hostnames on your VPC so that the DNS name of the FSx filesystem can be resolved from the compute environment.
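The attribute can be changed in the VPC console or from the command line. The following is a minimal sketch using the AWS CLI, assuming the CLI is installed and configured with permission to modify the VPC. The function name `enable_vpc_dns_hostnames` and the VPC ID in the usage note are hypothetical placeholders, not values from this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: enable DNS hostnames on a VPC, then print the
# resulting attribute value so the change can be confirmed.
enable_vpc_dns_hostnames() {
  local vpc_id="${1:?usage: enable_vpc_dns_hostnames <vpc-id>}"

  # Turn on the enableDnsHostnames attribute for the given VPC.
  aws ec2 modify-vpc-attribute \
    --vpc-id "$vpc_id" \
    --enable-dns-hostnames '{"Value":true}'

  # Read the attribute back; the CLI is expected to print the new value.
  aws ec2 describe-vpc-attribute \
    --vpc-id "$vpc_id" \
    --attribute enableDnsHostnames \
    --query 'EnableDnsHostnames.Value' \
    --output text
}
```

For example, `enable_vpc_dns_hostnames vpc-0123456789abcdef0` (a placeholder ID) would apply the change to that VPC. DNS hostnames can only be enabled when DNS resolution (the enableDnsSupport attribute) is also enabled on the VPC.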

Workflow execution fails with CannotStartContainerError

When running workflows in an AWS Batch compute environment that uses a custom VPC and subnets, execution can fail with CannotStartContainerError.

Workflow execution completed unsuccessfully
The exit status of the task that caused the workflow execution to fail was: -
CannotStartContainerError: Error response from daemon: failed to initialize logging driver: failed to create Cloudwatch log stream: RequestError: send request failed
caused by: Post https://logs.us-east-2.amazonaws.com/: dial tcp 10.20.10.16:443: i/o time

SOLUTION

This error occurs when the custom VPC configuration is not specified in the Advanced settings of the AWS Batch compute environment in Tower.
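To see which subnets and security groups a compute environment actually received, the AWS Batch API can be queried directly. This is a rough sketch, assuming the AWS CLI is configured for the same account and region. The function name `show_batch_ce_network` is hypothetical, and the compute environment name must be taken from your own AWS Batch console.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the subnets and security groups attached to
# an AWS Batch compute environment, for comparison with the custom VPC.
show_batch_ce_network() {
  local ce_name="${1:?usage: show_batch_ce_network <compute-environment-name>}"

  aws batch describe-compute-environments \
    --compute-environments "$ce_name" \
    --query 'computeEnvironments[0].computeResources.{subnets: subnets, securityGroupIds: securityGroupIds}' \
    --output json
}
```

If the subnets or security groups listed do not belong to the custom VPC, the compute environment was likely created without the intended Advanced settings.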

Workflow execution fails with java.net.UnknownHostException

Workflow execution fails on launch, and the following error appears in the .nextflow.log file:

java.net.UnknownHostException: <YOUR_TOWER_HOSTNAME>
at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)
at java.base/java.net.Socket.connect(Socket.java:609)
at java.base/java.net.Socket.connect(Socket.java:558)
at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:182)
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:474)
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:569)
at java.base/sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:341)
at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:362)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1253)
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)
at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1015)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1592)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1520)
at nextflow.file.http.XFileSystemProvider.newInputStream(XFileSystemProvider.groovy:291)
at java.base/java.nio.file.Files.newInputStream(Files.java:156)
at java.base/java.nio.file.Files.newBufferedReader(Files.java:2839)
at org.apache.groovy.nio.extensions.NioExtensions.newReader(NioExtensions.java:1404)
at org.apache.groovy.nio.extensions.NioExtensions.getText(NioExtensions.java:397)
at nextflow.scm.ProviderConfig.getFromFile(ProviderConfig.groovy:270)
at nextflow.scm.ProviderConfig.getDefault(ProviderConfig.groovy:287)
at nextflow.scm.AssetManager.<init>(AssetManager.groovy:107)
at nextflow.cli.CmdRun.getScriptFile(CmdRun.groovy:360)
at nextflow.cli.CmdRun.run(CmdRun.groovy:265)
at nextflow.cli.Launcher.run(Launcher.groovy:475)
at nextflow.cli.Launcher.main(Launcher.groovy:657)

SOLUTION

This error indicates that Nextflow, running in AWS Batch jobs, cannot connect to your Tower instance.

Specify the correct VPC and security group when creating the compute environment. For further information, refer to the note on networking configuration after point 17 in the AWS compute environment setup guide.
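Because the exception is raised at hostname resolution, a quick DNS check from an instance in the same VPC and subnet can help confirm the diagnosis. The sketch below assumes a Linux host where `getent` is available. The function name `check_tower_dns` is hypothetical, and the hostname is whatever your Tower instance uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a hostname resolves from this host.
check_tower_dns() {
  local host="${1:?usage: check_tower_dns <hostname>}"

  # getent consults the system resolver configuration (hosts file and DNS).
  if getent hosts "$host" >/dev/null; then
    echo "resolved: $host"
  else
    echo "unresolved: $host" >&2
    return 1
  fi
}
```

A non-zero exit status for the Tower hostname suggests that the subnet's DNS settings or the chosen VPC prevent the Batch jobs from resolving it.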

Long-running workflow fails because of a failure to retrieve ECS metadata

Long-running workflows eventually fail with partially completed processes and the following error message:

Error when retrieving credentials from container-role: Error retrieving metadata: Received error when attempting to retrieve ECS metadata: Read timeout on endpoint URL: `http://<YOUR_HOST_IP>/v2/credentials/xxxxxxxxxx-a707-2cea702a1fb9`

SOLUTION

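Increase the throttling limits for the ECS task metadata endpoint in the user data script of the launch template. ECS_TASK_METADATA_RPS_LIMIT takes a comma-separated pair: the steady-state and burst request rates.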

echo "ECS_TASK_METADATA_RPS_LIMIT=120,180" >> /etc/ecs/ecs.config
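If the user data script may run more than once, or the key may already be present in the file, a plain append can leave duplicate entries. The following sketch writes the setting idempotently. The function name `set_ecs_config` is hypothetical, and the default path is the /etc/ecs/ecs.config file used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set KEY=VALUE in an ECS agent config file,
# replacing any existing assignment of the same key.
set_ecs_config() {
  local key="$1" value="$2" file="${3:-/etc/ecs/ecs.config}"
  local tmp
  tmp="$(mktemp)"

  # Keep every line except an existing assignment of this key.
  # grep exits non-zero when nothing is left, so ignore its status.
  if [ -f "$file" ]; then
    grep -v "^${key}=" "$file" > "$tmp" || true
  fi

  # Append the new assignment and write the result back in place.
  echo "${key}=${value}" >> "$tmp"
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

In the user data script this would be called as `set_ecs_config ECS_TASK_METADATA_RPS_LIMIT 120,180`, which leaves a single entry for the key no matter how many times it runs.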