seqeralabs / nf-tower

Nextflow Tower system

Home Page: https://tower.nf

Not able to execute docker-based workflows on the Google Life Sciences backend

daniloimparato opened this issue

Hi!

I'm bringing this issue over from gitter.

I haven't been able to execute docker-based workflows with Tower on the Google Life Sciences backend.

Below is a minimal example:

#!/usr/bin/env nextflow

nextflow.enable.dsl=2

process echo_remote_file_content {

  container = "docker.io/biocontainers/biocontainers:v1.2.0_cv1" // does not work, used by nf-core/viralrecon
  // container = "docker.io/docker/whalesay:latest" // works! both images are public

  input: path remote_file

  output: stdout emit: cat

  script: "cat $remote_file"
}

workflow {
  echo_remote_file_content(params.remote_file)
  println echo_remote_file_content.out.cat.view()
}

This is the execution log:

DataflowVariable(value=null)
Monitor the execution with Nextflow Tower using this url https://tower.nf/orgs/sensitive-org/workspaces/temp-workspace/watch/37ekJjsC1yeInU
[e3/683404] Submitted process > echo_remote_file_content
Error executing process > 'echo_remote_file_content'
Caused by:
  Process `echo_remote_file_content` terminated with an error exit status (9)
Command executed:
  cat str.txt
Command exit status:
  9
Command output:
  (empty)
Command error:
  Execution failed: generic::failed_precondition: while running "nf-e36834045cc88fa1d609cea576f05312-main": unexpected exit status 1 was not ignored
Work dir:
  gs://sensitive-bucket/scratch/37ekJjsC1yeInU/e3/6834045cc88fa1d609cea576f05312
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
my post-run script

Resolved configuration:

docker {
   enabled = true
}

params {
   remote_file = 'https://raw.githubusercontent.com/daniloimparato/hello/master/data/str.txt'
}

timeline {
   enabled = true
   file = '/.nextflow/cache/timeline-37ekJjsC1yeInU.html'
}

process {
   executor = 'google-lifesciences'
}

workDir = 'gs://sensitive-bucket/scratch/37ekJjsC1yeInU'

google {
   zone = 'us-east1-b,us-east1-c,us-east1-d'
   lifeSciences {
      bootDiskSize = '64.GB'
      preemptible = true
   }
}

runName = 'ecstatic_colden'

tower {
   enabled = true
   endpoint = '-'
}

This is the content of gs://sensitive-bucket/scratch/37ekJjsC1yeInU/nf-37ekJjsC1yeInU.log:

Oct-28 16:17:57.020 [main] DEBUG nextflow.cli.Launcher - $> nextflow run 'https://github.com/daniloimparato/hello' -name ecstatic_colden -params-file nf-37ekJjsC1yeInU.params.json -with-tower -r master -latest
Oct-28 16:17:57.188 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 21.08.0-edge
Oct-28 16:17:57.332 [main] DEBUG nextflow.plugin.PluginsFacade - Using Default plugins manager
Oct-28 16:17:57.356 [main] INFO  org.pf4j.DefaultPluginStatusProvider - Enabled plugins: []
Oct-28 16:17:57.360 [main] INFO  org.pf4j.DefaultPluginStatusProvider - Disabled plugins: []
Oct-28 16:17:57.369 [main] INFO  org.pf4j.DefaultPluginManager - PF4J version 3.4.1 in 'deployment' mode
Oct-28 16:18:01.156 [main] DEBUG nextflow.scm.AssetManager - Repository URL: https://github.com/daniloimparato/hello; Project: daniloimparato/hello; Hub provider: github
Oct-28 16:18:01.179 [main] INFO  nextflow.cli.CmdRun - Pulling daniloimparato/hello ...
Oct-28 16:18:01.181 [main] DEBUG nextflow.scm.RepositoryProvider - Request [credentials tower-nf:****************************************] -> https://api.github.com/repos/daniloimparato/hello/contents/nextflow.config
Oct-28 16:18:01.679 [main] DEBUG nextflow.scm.RepositoryProvider - Request [credentials tower-nf:****************************************] -> https://api.github.com/repos/daniloimparato/hello/contents/main.nf
Oct-28 16:18:01.856 [main] DEBUG nextflow.scm.RepositoryProvider - Request [credentials tower-nf:****************************************] -> https://api.github.com/repos/daniloimparato/hello
Oct-28 16:18:02.081 [main] DEBUG nextflow.scm.AssetManager - Pulling daniloimparato/hello -- Using remote clone url: https://github.com/daniloimparato/hello.git
Oct-28 16:18:03.840 [main] INFO  nextflow.cli.CmdRun -  downloaded from https://github.com/daniloimparato/hello.git
Oct-28 16:18:03.925 [main] INFO  nextflow.cli.CmdRun - Launching `daniloimparato/hello` [ecstatic_colden] - revision: 5ebfa90eee [master]
Oct-28 16:18:03.957 [main] DEBUG nextflow.config.ConfigBuilder - Found config base: /.nextflow/assets/daniloimparato/hello/nextflow.config
Oct-28 16:18:03.957 [main] DEBUG nextflow.config.ConfigBuilder - Found config local: /nextflow.config
Oct-28 16:18:03.965 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /.nextflow/assets/daniloimparato/hello/nextflow.config
Oct-28 16:18:03.966 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /nextflow.config
Oct-28 16:18:03.990 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Oct-28 16:18:04.079 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Oct-28 16:18:04.324 [main] DEBUG nextflow.plugin.PluginsFacade - Setting up plugin manager > mode=prod; plugins-dir=/.nextflow/plugins
Oct-28 16:18:04.325 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins default=[nf-tower@1.2.0, nf-google@1.1.0]
Oct-28 16:18:04.387 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins local root: .nextflow/plr/26c4b07213abceea955732d96c780d46
Oct-28 16:18:04.397 [main] INFO  org.pf4j.DefaultPluginStatusProvider - Enabled plugins: []
Oct-28 16:18:04.398 [main] INFO  org.pf4j.DefaultPluginStatusProvider - Disabled plugins: []
Oct-28 16:18:04.403 [main] INFO  org.pf4j.DefaultPluginManager - PF4J version 3.4.1 in 'deployment' mode
Oct-28 16:18:04.426 [main] INFO  org.pf4j.AbstractPluginManager - No plugins
Oct-28 16:18:04.426 [main] DEBUG nextflow.plugin.PluginUpdater - Installing plugin nf-tower version: 1.2.0
Oct-28 16:18:04.453 [main] INFO  org.pf4j.AbstractPluginManager - Plugin 'nf-tower@1.2.0' resolved
Oct-28 16:18:04.454 [main] INFO  org.pf4j.AbstractPluginManager - Start plugin 'nf-tower@1.2.0'
Oct-28 16:18:04.478 [main] DEBUG nextflow.plugin.BasePlugin - Plugin started nf-tower@1.2.0
Oct-28 16:18:04.478 [main] DEBUG nextflow.plugin.PluginUpdater - Installing plugin nf-google version: 1.1.0
Oct-28 16:18:04.495 [main] INFO  org.pf4j.AbstractPluginManager - Plugin 'nf-google@1.1.0' resolved
Oct-28 16:18:04.495 [main] INFO  org.pf4j.AbstractPluginManager - Start plugin 'nf-google@1.1.0'
Oct-28 16:18:04.533 [main] DEBUG nextflow.plugin.BasePlugin - Plugin started nf-google@1.1.0
Oct-28 16:18:04.549 [main] DEBUG nextflow.plugin.PluginUpdater - Starting plugin nf-tower version: 1.2.0
Oct-28 16:18:04.721 [main] DEBUG nextflow.Session - Session uuid: e785c19b-f59c-40e6-80f2-682c2899ec98
Oct-28 16:18:04.723 [main] DEBUG nextflow.Session - Run name: ecstatic_colden
Oct-28 16:18:04.727 [main] DEBUG nextflow.Session - Executor pool size: 2
Oct-28 16:18:04.903 [main] DEBUG nextflow.config.ConfigBuilder - Found config base: /.nextflow/assets/daniloimparato/hello/nextflow.config
Oct-28 16:18:04.904 [main] DEBUG nextflow.config.ConfigBuilder - Found config local: /nextflow.config
Oct-28 16:18:04.904 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /.nextflow/assets/daniloimparato/hello/nextflow.config
Oct-28 16:18:04.904 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /nextflow.config
Oct-28 16:18:04.905 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Oct-28 16:18:05.001 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Oct-28 16:18:05.191 [main] DEBUG nextflow.cli.CmdRun -
  Version: 21.08.0-edge build 5609
  Created: 30-08-2021 16:39 UTC
  System: Linux 5.4.144+
  Runtime: Groovy 3.0.8 on OpenJDK 64-Bit Server VM 11.0.12+7-LTS
  Encoding: UTF-8 (UTF-8)
  Process: 181@c9ccd7167d28 [172.18.0.2]
  CPUs: 1 - Mem: 982.2 MB (68.9 MB) - Swap: 0 (0)
Oct-28 16:18:05.337 [main] DEBUG nextflow.file.FileHelper - Can't check if specified path is NFS (1): /scratch/37ekJjsC1yeInU

Oct-28 16:18:05.337 [main] DEBUG nextflow.Session - Work-dir: gs://sensitive-bucket/scratch/37ekJjsC1yeInU [null]
Oct-28 16:18:05.337 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /.nextflow/assets/daniloimparato/hello/bin
Oct-28 16:18:05.412 [main] DEBUG nextflow.executor.ExecutorFactory - Extension executors providers=[GoogleLifeSciencesExecutor]
Oct-28 16:18:05.448 [main] DEBUG nextflow.Session - Observer factory: DefaultObserverFactory
Oct-28 16:18:05.465 [main] DEBUG nextflow.Session - Observer factory: TowerFactory
Oct-28 16:18:05.600 [main] DEBUG nextflow.util.CustomThreadPool - Creating default thread pool > poolSize: 2; maxThreads: 1000
Oct-28 16:18:05.817 [main] DEBUG nextflow.Session - Session start invoked
Oct-28 16:18:05.837 [main] DEBUG io.seqera.tower.plugin.TowerClient - Creating Tower observer -- endpoint=https://api.tower.nf; requestInterval=1s; aliveInterval=1m; maxRetries=5; backOffBase=3; backOffDelay=250
Oct-28 16:18:06.875 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Oct-28 16:18:06.947 [main] DEBUG nextflow.Session - Workflow process names [dsl2]: echo_remote_file_content
Oct-28 16:18:07.113 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: google-lifesciences
Oct-28 16:18:07.113 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'google-lifesciences'
Oct-28 16:18:07.116 [main] DEBUG nextflow.executor.Executor - [warm up] executor > google-lifesciences
Oct-28 16:18:07.135 [main] DEBUG n.processor.TaskPollingMonitor - Creating task monitor for executor 'google-lifesciences' > capacity: 1000; pollInterval: 10s; dumpInterval: 5m
Oct-28 16:18:07.158 [main] DEBUG n.c.g.l.GoogleLifeSciencesExecutor - Google Life Science config=GoogleLifeSciencesConfig(project:bioinfo-dev-248419, zones:[us-east1-b, us-east1-c, us-east1-d], regions:[], preemptible:true, remoteBinDir:null, location:us-central1, disableBinDir:false, bootDiskSize:64 GB, cpuPlatform:null, sshDaemon:false, sshImage:gcr.io/cloud-genomics-pipelines/tools, debugMode:null, copyImage:google/cloud-sdk:slim, usePrivateAddress:false, enableRequesterPaysBuckets:false, network:null, subnetwork:null, serviceAccountEmail:null, parallelThreadCount:1, downloadMaxComponents:8, keepAliveOnFailure:false, maxParallelTransfers:4, maxTransferAttempts:1, delayBetweenAttempts:10s)
Oct-28 16:18:07.825 [main] DEBUG nextflow.util.CustomThreadPool - Creating default thread pool > poolSize: 2; maxThreads: 1000
Oct-28 16:18:08.567 [main] INFO  io.seqera.tower.plugin.TowerClient - Monitor the execution with Nextflow Tower using this url https://tower.nf/orgs/sensitive-org/workspaces/temp-workspace/watch/37ekJjsC1yeInU
Oct-28 16:18:08.569 [main] DEBUG nextflow.Session - Ignite dataflow network (1)
Oct-28 16:18:08.570 [main] DEBUG nextflow.processor.TaskProcessor - Starting process > echo_remote_file_content
Oct-28 16:18:08.575 [main] DEBUG nextflow.script.ScriptRunner - > Await termination
Oct-28 16:18:08.575 [main] DEBUG nextflow.Session - Session await
Oct-28 16:18:09.061 [Actor Thread 1] DEBUG n.util.BlockingThreadExecutorFactory - Thread pool name=FileTransfer; maxThreads=6; maxQueueSize=18; keepAlive=1m
Oct-28 16:18:09.778 [FileTransfer-thread-1] DEBUG nextflow.file.FilePorter - Copying foreign file https://raw.githubusercontent.com/daniloimparato/hello/master/data/str.txt to work dir: gs://sensitive-bucket/scratch/37ekJjsC1yeInU/stage/b1/4dc6278e23587b7c7eb8d37c533a60/str.txt
Oct-28 16:18:11.270 [Task submitter] DEBUG n.c.g.l.GoogleLifeSciencesTaskHandler - [GLS] Task submitted > echo_remote_file_content - Pipeline Id: 4190794867895990563
Oct-28 16:18:11.271 [Task submitter] INFO  nextflow.Session - [e3/683404] Submitted process > echo_remote_file_content
Oct-28 16:20:07.446 [Task monitor] DEBUG n.c.g.l.GoogleLifeSciencesTaskHandler - [GLS] Task complete > echo_remote_file_content - Start Time: 2021-10-28T16:18:22.497406473Z - End Time: 2021-10-28T16:19:58.823535922Z
Oct-28 16:20:07.520 [Task monitor] DEBUG n.c.g.l.GoogleLifeSciencesTaskHandler - [GLS] Cannot read exitstatus for task: `echo_remote_file_content` | gs://sensitive-bucket/scratch/37ekJjsC1yeInU/e3/6834045cc88fa1d609cea576f05312/.exitcode
Oct-28 16:20:07.524 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 1; name: echo_remote_file_content; status: COMPLETED; exit: 9; error: -; workDir: gs://sensitive-bucket/scratch/37ekJjsC1yeInU/e3/6834045cc88fa1d609cea576f05312]
Oct-28 16:20:07.606 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: gs://sensitive-bucket/scratch/37ekJjsC1yeInU/e3/6834045cc88fa1d609cea576f05312/.command.out
Oct-28 16:20:07.619 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'echo_remote_file_content'

Caused by:
  Process `echo_remote_file_content` terminated with an error exit status (9)

Command executed:

  cat str.txt

Command exit status:
  9

Command output:
  (empty)

Command error:
  Execution failed: generic::failed_precondition: while running "nf-e36834045cc88fa1d609cea576f05312-main": unexpected exit status 1 was not ignored

Work dir:
  gs://sensitive-bucket/scratch/37ekJjsC1yeInU/e3/6834045cc88fa1d609cea576f05312

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
Oct-28 16:20:07.644 [main] DEBUG nextflow.Session - Session await > all process finished
Oct-28 16:20:07.733 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: Process `echo_remote_file_content` terminated with an error exit status (9)
Oct-28 16:20:07.792 [main] DEBUG nextflow.Session - Session await > all barriers passed
Oct-28 16:20:07.808 [main] DEBUG nextflow.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=0; failedCount=1; ignoredCount=0; cachedCount=0; pendingCount=0; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=0ms; failedDuration=1m 50s; cachedDuration=0ms;loadCpus=0; loadMemory=0; peakRunning=1; peakCpus=1; peakMemory=0; ]
Oct-28 16:20:07.810 [main] DEBUG nextflow.trace.TimelineObserver - Flow completing -- rendering html timeline
Oct-28 16:20:09.540 [main] DEBUG nextflow.CacheDB - Closing CacheDB done
Oct-28 16:20:09.541 [main] INFO  org.pf4j.AbstractPluginManager - Stop plugin 'nf-google@1.1.0'
Oct-28 16:20:09.542 [main] DEBUG nextflow.plugin.BasePlugin - Plugin stopped nf-google
Oct-28 16:20:09.542 [main] INFO  org.pf4j.AbstractPluginManager - Stop plugin 'nf-tower@1.2.0'
Oct-28 16:20:09.542 [main] DEBUG nextflow.plugin.BasePlugin - Plugin stopped nf-tower
Oct-28 16:20:09.570 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye

This looks like a problem with the container image, though it can be tricky to troubleshoot. Please try adding the line below to your pipeline's config or to the "Nextflow config" field when launching the pipeline:

google.lifeSciences.sshDaemon = true

If it fails, please upload the content of the bucket work dir (including all subdirectories).
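For reference, one way to pull the whole work dir out of the bucket before attaching it is a recursive gsutil copy. This is only a minimal sketch, assuming gsutil is installed and authenticated; the helper name fetch_workdir and the local folder are hypothetical, and the bucket path is the one from the log above.

```shell
# Hypothetical helper: copy a run's bucket work dir, including all
# subdirectories, into a local folder.
#   $1: gs:// URL of the work dir   $2: local destination folder
fetch_workdir() {
  mkdir -p "$2" && gsutil -m cp -r "$1" "$2"
}

# Example call (bucket path taken from the log above; local folder is arbitrary):
# fetch_workdir gs://sensitive-bucket/scratch/37ekJjsC1yeInU ./workdir-dump
```

The resulting local folder can then be zipped and attached to the issue.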

Hi @pditommaso!

Please try adding this line to the config of your pipeline or the "Nextflow config" field when launching the pipeline.

So I should add some code snippet to the Nextflow config textbox? I think something happened to your message, because the line you mentioned is missing.

You are right, fixed the comment.

sensitive-bucket.zip

There you go. This is the resulting folder structure:

sensitive-bucket/
└── scratch
    └── 4TLhlhNhyPcIH8
        ├── 81
        │   └── f02215b6d6936536c88bf211037c6e
        │       └── google
        │           └── logs
        │               ├── action
        │               │   ├── 1
        │               │   │   ├── stderr
        │               │   │   └── stdout
        │               │   ├── 2
        │               │   │   ├── stderr
        │               │   │   └── stdout
        │               │   ├── 3
        │               │   │   ├── stderr
        │               │   │   └── stdout
        │               │   └── 4
        │               │       ├── stderr
        │               │       └── stdout
        │               └── output
        ├── nf-4TLhlhNhyPcIH8.log
        ├── nf-4TLhlhNhyPcIH8.txt
        ├── stage
        │   └── ab
        │       └── 36021f2d5424707f66dd5db97c252b
        │           └── str.txt
        └── timeline-4TLhlhNhyPcIH8.html

When that option is used, the debug logs are copied into the google directory you can see above.

The output file contains the following:

2021/11/01 14:41:04 Listening on [::]:22...
/bin/bash: /scratch/4TLhlhNhyPcIH8/81/f02215b6d6936536c88bf211037c6e/.command.log: Permission denied
+ trap 'err=$?; exec 1>&2; gsutil -m -q cp -R /scratch/4TLhlhNhyPcIH8/81/f02215b6d6936536c88bf211037c6e/.command.log gs://sensitive-bucket/scratch/4TLhlhNhyPcIH8/81/f02215b6d6936536c88bf211037c6e/.command.log || true; [[ $err -gt 0 || $GOOGLE_LAST_EXIT_STATUS -gt 0 || $NXF_DEBUG -gt 0 ]] && { ls -lah /scratch/4TLhlhNhyPcIH8/81/f02215b6d6936536c88bf211037c6e || true; gsutil -m -q cp -R /google/ gs://sensitive-bucket/scratch/4TLhlhNhyPcIH8/81/f02215b6d6936536c88bf211037c6e; } || rm -rf /scratch/4TLhlhNhyPcIH8/81/f02215b6d6936536c88bf211037c6e; exit $err' EXIT
+ err=1
+ exec
+ gsutil -m -q cp -R /scratch/4TLhlhNhyPcIH8/81/f02215b6d6936536c88bf211037c6e/.command.log gs://sensitive-bucket/scratch/4TLhlhNhyPcIH8/81/f02215b6d6936536c88bf211037c6e/.command.log
+ [[ 1 -gt 0 ]]
+ ls -lah /scratch/4TLhlhNhyPcIH8/81/f02215b6d6936536c88bf211037c6e
total 48K
drwxr-xr-x 3 root root 4.0K Nov  1 14:41 .
drwxr-xr-x 3 root root 4.0K Nov  1 14:41 ..
-rw-r--r-- 1 root root  744 Nov  1 14:41 .command.log
-rw-r--r-- 1 root root  12K Nov  1 14:41 .command.run
-rw-r--r-- 1 root root   28 Nov  1 14:41 .command.sh
drwx------ 2 root root  16K Nov  1 14:39 lost+found
-rw-r--r-- 1 root root    4 Nov  1 14:41 str.txt
+ gsutil -m -q cp -R /google/ gs://sensitive-bucket/scratch/4TLhlhNhyPcIH8/81/f02215b6d6936536c88bf211037c6e

The error looks related to write permission on .command.log (Permission denied). I don't know why this is happening, but I suspect the container runs as a custom (non-root) user, so at run time it cannot write to a file that is writable only by root.
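To check that suspicion, the image's configured default user can be read from its metadata. The following is a minimal sketch, assuming Docker is available locally and the image has already been pulled; image_user is a hypothetical helper name, and an empty User field is reported as root because that is Docker's default.

```shell
# Hypothetical helper: print the default user configured in an image.
# An empty Config.User means the container runs as root.
#   $1: image reference (must already be present locally)
image_user() {
  local user
  user="$(docker image inspect --format '{{.Config.User}}' "$1")" || return 1
  echo "${user:-root}"
}

# Example call, using the image from the minimal example above:
# docker pull docker.io/biocontainers/biocontainers:v1.2.0_cv1
# image_user docker.io/biocontainers/biocontainers:v1.2.0_cv1
```

If this prints anything other than root (or 0), the container's default user would not be able to write to files that, as in the listing above, are owned by root with mode rw-r--r--.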

@pditommaso Thanks for the reply.

Sure, that would make sense. However, I took that image straight from nf-core/viralrecon, which works fine on the AWS showcase environment. It also "works on my machine", all jokes aside.

This happened with other nf-core pipelines as well, but I couldn't test all of them. Can you confirm whether or not other users are currently able to run nf-core pipelines on the Google Life Sciences backend? It seems to me that this problem might be platform-specific...