ENCODE-DCC / hic-pipeline

HiC uniform processing pipeline

Running without docker flag and with custom docker container

HenryCWong opened this issue · comments

Describe the bug
My university's compute system forces all jobs to go through a Docker container. Thus, if I want to run the hic-pipeline on the compute cluster, I have to run a Docker container within a Docker container. The only way I could think of to run the pipeline was to create a Docker container based on the one provided in this repository and add Caper to it.

Once I did, I got the Python error below. I don't have the slightest idea how to resolve it, so I was hoping for some direction. I really appreciate your help.

OS/Platform

  • OS/Platform: Fedora 32
  • Pipeline version: I have no idea where to find this but I pulled the most recent one
  • Caper version: 1.6.2

Caper configuration file
backend=local
local-hash-strat=path+modtime
local-loc-dir=/storage1/fs1/dspencer/Active/wongh/hic-pipeline
cromwell=/home/wongh/.caper/cromwell_jar/cromwell-59.jar
womtool=/home/wongh/.caper/womtool_jar/womtool-59.jar

Error log
/usr/bin/python3: can't find '__main__' module in ''
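
(For reference, this appears to be the message python3 prints when it is handed an empty string as the script path, for example if a construct like python3 $(which some_script.py) expands to nothing because the script is not on PATH; the script name here is purely illustrative.)

$ python3 ""
/usr/bin/python3: can't find '__main__' module in ''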

It is possible to run Docker in Docker, but it's only intended for specific purposes (e.g. continuous integration services). I wouldn't recommend it for production workflows, certainly not for running this pipeline.

Is your compute system supported by Caper? If so, you might be able to use a custom backend to get the correct job submission command; see https://github.com/ENCODE-DCC/caper#running-pipelines-on-a-custom-backend . If your backend is not supported by Caper, there are a couple of additional options if you invoke Cromwell directly; see https://cromwell.readthedocs.io/en/stable/backends/Backends/

There are alternative WDL runners like miniwdl that may be helpful as well. miniwdl in particular looks like it can be run inside Docker, see https://miniwdl.readthedocs.io/en/latest/runner_advanced.html . However, I have no experience with it (or any other WDL runners besides Caper/Cromwell) and cannot guarantee it will work nor provide support for it.
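
For reference, the miniwdl invocation would look roughly like the following (flags per the miniwdl docs linked above; the input JSON path is a placeholder and I haven't tested this):

$ miniwdl run hic.wdl -i input.json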

It's a little hard for me to provide more specific help without knowing the details of your compute system. In general you will only be able to run this pipeline on a supported platform. Other methods of running it are not guaranteed to work.

My university uses LSF, which is not supported by Caper.
Is there any way I can get Python working here? It seems like it might just be a path thing.

Thanks for the miniwdl tip. I'll try that out and get back to y'all on whether this pipeline works with miniwdl.

Cromwell does support LSF: https://cromwell.readthedocs.io/en/stable/backends/LSF/ , although I'm not sure if it will work in your case. Might be worth looking into
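
Roughly, invoking Cromwell directly with a custom backend config would look like this (jar path taken from your Caper conf; the config file and input JSON names are placeholders, and I haven't tried this on LSF):

$ java -Dconfig.file=lsf.backend.conf -jar /home/wongh/.caper/cromwell_jar/cromwell-59.jar \
    run hic.wdl --inputs input.json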

It appears my university's compute cluster legitimately won't let any sort of docker container run, even if it is a sibling instance. I also can't run singularity due to restrictions on specific permissions on the compute cluster.

If you cannot run Docker or Singularity, then you would need to install all of the pipeline software locally: BWA, Juicer and its dependencies, and the custom scripts in this repo. I can't provide much help with that, nor can I say whether it will work, but that would seem to be the way to go about it. I know Juicer is designed to work without containers on different clusters, so you may want to look at its documentation. This pipeline is almost a one-to-one WDL wrapper around the Juicer pipeline.
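
Very roughly, a non-container setup might look something like the following; the package and repository names are only illustrative, and the Dockerfile in this repo is the authoritative list of dependencies:

$ conda create -n hic-local -c bioconda -c conda-forge bwa samtools openjdk
$ git clone https://github.com/aidenlab/juicer.git            # Juicer scripts and juicer_tools jar
$ git clone https://github.com/ENCODE-DCC/hic-pipeline.git    # custom scripts called by the WDL tasks
# then put the Juicer scripts and this repo's scripts on PATH for the jobs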

If I run the encodedcc/hic-pipeline docker image (with caper in it) in an interactive session, will it run the pipeline?

I got LSF up and running with the following backend config file

backend {
  providers {
    pbs {
      config {
        submit = """bsub -J ${job_name} -o ${out} -e ${err} -G compute-oncology -q oncology-interactive -n 2 -M 64GB -R "rusage[mem=64GB] span[hosts=1]" -a "docker(encodedcc/hic-pipeline:0.4.0)" /bin/bash  ${script}
"""
        kill = "bkill ${job_id}"
        check-alive = "bjobs ${job_id}"
        job-id-regex = "(\\d+)"
      }
    }
  }
}

The -a flag is specific to my university's compute cluster; it defines the docker container that the task is run within, so every bsub/qsub request is run inside a docker container (in this case the encodedcc/hic-pipeline container). I set it to the encodedcc/hic-pipeline:0.4.0 container and I still get error messages saying it can't find the Python module:

* Recursively finding failures in calls (tasks)...

==== NAME=hic.normalize_assembly_name, STATUS=RetryableFailure, PARENT=
SHARD_IDX=-1, RC=1, JOB_ID=4562
START=2021-06-04T20:31:50.152Z, END=2021-06-04T20:32:06.386Z
STDOUT=/storage1/fs1/dspencer/Active/wongh/hic2/hic-pipeline/hic/8e41ffca-a2ee-4699-988e-e661965650aa/call-normalize_assembly_name/execution/stdout
STDERR=/storage1/fs1/dspencer/Active/wongh/hic2/hic-pipeline/hic/8e41ffca-a2ee-4699-988e-e661965650aa/call-normalize_assembly_name/execution/stderr
STDERR_CONTENTS=
/usr/bin/python3: can't find '__main__' module in ''
/usr/bin/python3: can't find '__main__' module in ''


==== NAME=hic.normalize_assembly_name, STATUS=Failed, PARENT=
SHARD_IDX=-1, RC=1, JOB_ID=4647
START=2021-06-04T20:32:09.788Z, END=2021-06-04T20:32:22.358Z
STDOUT=/storage1/fs1/dspencer/Active/wongh/hic2/hic-pipeline/hic/8e41ffca-a2ee-4699-988e-e661965650aa/call-normalize_assembly_name/attempt-2/execution/stdout
STDERR=/storage1/fs1/dspencer/Active/wongh/hic2/hic-pipeline/hic/8e41ffca-a2ee-4699-988e-e661965650aa/call-normalize_assembly_name/attempt-2/execution/stderr
STDERR_CONTENTS=
/usr/bin/python3: can't find '__main__' module in ''
/usr/bin/python3: can't find '__main__' module in ''


==== NAME=hic.get_ligation_site_regex, STATUS=RetryableFailure, PARENT=
SHARD_IDX=-1, RC=1, JOB_ID=4589
START=2021-06-04T20:31:53.805Z, END=2021-06-04T20:32:06.386Z
STDOUT=/storage1/fs1/dspencer/Active/wongh/hic2/hic-pipeline/hic/8e41ffca-a2ee-4699-988e-e661965650aa/call-get_ligation_site_regex/execution/stdout
STDERR=/storage1/fs1/dspencer/Active/wongh/hic2/hic-pipeline/hic/8e41ffca-a2ee-4699-988e-e661965650aa/call-get_ligation_site_regex/execution/stderr
STDERR_CONTENTS=
/usr/bin/python3: can't find '__main__' module in ''
/usr/bin/python3: can't find '__main__' module in ''


==== NAME=hic.get_ligation_site_regex, STATUS=Failed, PARENT=
SHARD_IDX=-1, RC=1, JOB_ID=4622
START=2021-06-04T20:32:07.791Z, END=2021-06-04T20:32:19.855Z
STDOUT=/storage1/fs1/dspencer/Active/wongh/hic2/hic-pipeline/hic/8e41ffca-a2ee-4699-988e-e661965650aa/call-get_ligation_site_regex/attempt-2/execution/stdout
STDERR=/storage1/fs1/dspencer/Active/wongh/hic2/hic-pipeline/hic/8e41ffca-a2ee-4699-988e-e661965650aa/call-get_ligation_site_regex/attempt-2/execution/stderr
STDERR_CONTENTS=
/usr/bin/python3: can't find '__main__' module in ''
/usr/bin/python3: can't find '__main__' module in ''

I am running caper like this: caper run hic.wdl -i /storage1/fs1/dspencer/Active/wongh/hic2/hic-pipeline/tests/functional/json/test_hic.json --docker --backend-file lsf.backend.conf
With or without the --docker flag I get the same error message.

Your backend is set as backend=local in your conf, so any modification to the pbs backend is just ignored.
Please don't use --docker for this case; it works completely differently.
Caper/Cromwell will try to find outputs in the specified output directory (local-out-dir in your conf). I am not sure whether the output directory is correctly mapped into the docker container (keeping the same directory structure, like Singularity does).

So I suggest using Conda instead; don't use Docker or Singularity.
Create a Conda environment and install any dependencies there.

Remove the docker stuff from your backend conf file and add -V (or something equivalent to it).
For example, qsub -V passes all environment variables to worker compute nodes, so dependencies in the Conda environment will propagate to the worker nodes too.
Also, please don't modify the original backend conf too much; please leave all the SINGULARITY stuff and the memory settings:

backend {
  providers {
    pbs {
      config {
        submit = """if [ -z \"$SINGULARITY_BINDPATH\" ]; then export SINGULARITY_BINDPATH=${singularity_bindpath}; fi; \
if [ -z \"$SINGULARITY_CACHEDIR\" ]; then export SINGULARITY_CACHEDIR=${singularity_cachedir}; fi;

echo "${if !defined(singularity) then '/bin/bash ' + script
        else
          'singularity exec --cleanenv ' +
          '--home ' + cwd + ' ' +
          (if defined(gpu) then '--nv ' else '') +
          singularity + ' /bin/bash ' + script}" | \
qsub \
    -N ${job_name} \
    -o ${out} \
    -e ${err} \
    ${true="-lnodes=1:ppn=" false="" defined(cpu)}${cpu}${true=":mem=" false="" defined(memory_mb)}${memory_mb}${true="mb" false="" defined(memory_mb)} \
    ${'-lwalltime=' + time + ':0:0'} \
    ${'-lngpus=' + gpu} \
    ${'-q ' + pbs_queue} \
    ${pbs_extra_param} \
    -V
"""
        kill = "qdel ${job_id}"
        check-alive = "qstat ${job_id}"
        job-id-regex = "(\\d+)"
      }
    }
  }
}

Replace qsub with bsub, and carefully modify these lines (a rough bsub sketch follows after the example commands below):
remove the gpu line, and make sure cpu and memory are correctly defined via bsub's parameters.

    ${true="-lnodes=1:ppn=" false="" defined(cpu)}${cpu}${true=":mem=" false="" defined(memory_mb)}${memory_mb}${true="mb" false="" defined(memory_mb)} \
    ${'-lwalltime=' + time + ':0:0'} \
    ${'-lngpus=' + gpu} \
    ${'-q ' + pbs_queue} \
$ source activate your-conda-env
$ caper run ... --backend pbs --backend-file your.backend.conf

Please keep caper run ... --backend pbs alive until the whole pipeline is done. This command will run bsub each_task.sh ... for every task, so memory, cpu, and walltime should be correctly defined in the conf file.
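
For illustration only, the qsub block in the submit line might translate to bsub roughly like this, keeping the echo/singularity part above it unchanged. The ${...} placeholders are the same Cromwell runtime attributes used in the conf above, and the exact flag spellings and units should be checked against your cluster's LSF documentation:

bsub \
    -J ${job_name} \
    -o ${out} \
    -e ${err} \
    -n ${cpu} \
    -M ${memory_mb}MB \
    -R "rusage[mem=${memory_mb}MB] span[hosts=1]" \
    -W ${time}:00 \
    ${'-q ' + pbs_queue} \
    /bin/bash ${script}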

Dependency propagation shouldn't be a problem (I think) because I specify on my compute cluster that I only want 1 host to run the entire process. The backend is set to pbs, not local. Docker shouldn't affect the file system (I think).

I will look to see if LSF has a -V alternative, but even if it does, my university's compute cluster requires every single job to be run inside of a docker container. If I don't add a "-a docker(some container)" flag, the process will not run.

I think this is my last idea for getting caper running on my university's compute cluster before I default to Juicer or spend the next few months writing my own pipeline. Any other ideas?

I see. If your job has to run inside a docker container then how does the volume mapping work (docker run -v) on your cluster?

If you want to run the whole pipeline on a single large compute node (e.g. 60 GB, 4 CPUs), then you don't have to use the pbs backend method.

Let's use local and think about how to run caper run WDL_INSIDE_DOCKER -i INPUT_JSON_INSIDE_DOCKER --max-concurrent-tasks 1 inside the container. If the file paths defined in the input JSON are all valid inside the docker container, then it will work and will write all outputs to the CWD inside the docker container.

--max-concurrent-tasks 1 will serialize all tasks (e.g. mapping for each replicate) for the run so that it does not exceed the reserved resources.
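
Concretely, that could mean submitting the whole run as one large job, something roughly like the following (resource values and the -a syntax borrowed from your earlier bsub example; this assumes caper is installed inside the image and that all paths in the input JSON are valid inside the container):

$ bsub -n 4 -M 64GB -R "rusage[mem=64GB] span[hosts=1]" \
    -G compute-oncology -q oncology-interactive \
    -a "docker(encodedcc/hic-pipeline:0.4.0)" \
    caper run hic.wdl -i input.json --backend local --max-concurrent-tasks 1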

I ended up getting the backend file up and running by just hard-coding all the file paths into the hic.wdl file. However, now I am running into other issues, which I will save for another issue since they do not pertain to this specific topic. Thanks for all your help!