blobtoolkit / pipeline

[Archived] SnakeMake pipeline to run BlobTools on public assemblies

Home Page: https://blobtoolkit.genomehubs.org


running pipeline locally with singularity (or docker)

alxsimon opened this issue · comments

Hi,
From the documentation, it seems possible to run the pipeline locally on a draft assembly using the Docker container (either with Docker itself or with Singularity).

However, I'm having a hard time navigating all the configuration options and requirements. Would it be possible to add an example of this use case to the repository or documentation?

Thanks,
Alexis

Hi Alexis

Thanks for the suggestion - I'll try to add a clear example to the docs next week.

I've added a new page to the Pipeline docs at blobtoolkit.genomehubs.org/pipeline/pipeline-tutorials/running-the-pipeline-in-a-container/ that hopefully makes things a bit clearer for the specific case of running a local assembly in Docker. Do let me know how you get on following this.

Thank you very much, it is now very clear how the different layers of tools interact.

I encountered a first problem, which was easy to solve: in the snakemake command you should add the option
-s /blobtoolkit/insdc-pipeline/Snakefile.
Otherwise Snakemake complains that it cannot find the Snakefile.

Thanks for spotting that - I've added the option to the docs.

I don't think this is related to the containerized execution (unless the pipeline version in the container is too old), but I have an error when fetching the NCBI BLAST db:
No matches on pattern 'nt_v5.??.tar.gz'

Looking in ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/, I indeed don't find any nt_v5 file; maybe the pattern should be changed to nt.??.tar.gz?

It should definitely be nt.??.tar.gz. The pipeline used to have to look for nt_v5 before BLAST made v5 the default and decided not to keep an alias. It should only use the old pattern if nt_v5 is given as the db name in the config - I thought I'd removed all instances of nt_v5 from the docs, but I've just found a few that slipped through.
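For reference, the difference between the two glob patterns can be checked with Python's fnmatch. This is just an illustration: the filenames below are made-up examples following the current NCBI nt chunk naming scheme, not a listing of the server.

```python
from fnmatch import fnmatch

# Illustrative filenames following the current NCBI nt db naming scheme
filenames = ["nt.00.tar.gz", "nt.01.tar.gz", "nt.42.tar.gz"]

# The old pattern no longer matches anything named this way
old_matches = [f for f in filenames if fnmatch(f, "nt_v5.??.tar.gz")]

# The corrected pattern matches every chunk
new_matches = [f for f in filenames if fnmatch(f, "nt.??.tar.gz")]

print(old_matches)  # []
print(new_matches)
```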

Hi, the pipeline was running smoothly until I hit an error in the blobtoolkit_create rule.
The Snakemake error is a missing output file (even after increasing --latency-wait).

I am wondering whether the shell command should be blobtools create instead of blobtools replace?

In fact, the error in the log of the blobtools_create rule is as follows:

Traceback (most recent call last):
  File "/blobtoolkit/blobtools2/lib/add.py", line 65, in <module>
    import blob_db
  File "/blobtoolkit/blobtools2/lib/blob_db.py", line 10, in <module>
    import cov
  File "/blobtoolkit/blobtools2/lib/cov.py", line 15, in <module>
    import pysam
  File "/home/blobtoolkit/miniconda3/envs/btk_env/lib/python3.7/site-packages/pysam/__init__.py", line 5, in <module>
    from pysam.libchtslib import *
ModuleNotFoundError: No module named 'pysam.libchtslib'

If I create the conda env from blobtools2.yaml myself, I can import the faulty module correctly, so I don't know where this comes from.
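A quick way to narrow down this kind of failure inside a container is to check whether the interpreter can locate a spec for the module at all. This is a generic diagnostic sketch, not part of the pipeline; the pysam names in the comments just mirror the traceback above.

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if the interpreter can locate a spec for the module."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is itself missing
        return False

# Inside the container one could check, for example:
#   module_available("pysam")            -> is the package installed at all?
#   module_available("pysam.libchtslib") -> is the compiled extension present?
```

Comparing the results between the Docker and Singularity runs would show whether the compiled extension is missing or just not on the path.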

The error above occurred when using Singularity as the container manager.
However, switching back to Docker, I get a completely different error:

Loading sequences from Gallo_Med_v1.fasta
Traceback (most recent call last):
  File "/blobtoolkit/blobtools2/lib/add.py", line 153, in <module>
    main()
  File "/blobtoolkit/blobtools2/lib/add.py", line 120, in main
    meta=meta)
  File "/blobtoolkit/blobtools2/lib/fasta.py", line 56, in parse
    _gc_portions[seq_id], _n_counts[seq_id] = base_composition(seq_str)
  File "/blobtoolkit/blobtools2/lib/fasta.py", line 29, in base_composition
    gc_portion = float("%.4f" % (gc_count / acgt_count))
ZeroDivisionError: division by zero

EDIT: sorry, I did not see issue #7 of blobtools2.

Hi - sorry not to have been active on this over the last week. Thanks for pasting in the errors from the log files.

The Docker error looks like it is due to a sequence with no ACGT bases - I'll need to add some code to catch this and print a warning. Could you check your assembly for contigs containing only Ns to confirm this? EDIT: sorry, I did not see your edit above.
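Such a guard might look like this. This is a sketch only, not the actual blobtools2 code: it mirrors the division in the traceback and returns zero GC for sequences with no ACGT bases instead of dividing by zero.

```python
def base_composition(seq_str):
    """Return (gc_portion, n_count) for a sequence string (sketch)."""
    seq = seq_str.upper()
    gc_count = seq.count("G") + seq.count("C")
    at_count = seq.count("A") + seq.count("T")
    n_count = seq.count("N")
    acgt_count = gc_count + at_count
    if acgt_count == 0:
        # N-only (or empty) sequence: avoid ZeroDivisionError
        return 0.0, n_count
    return float("%.4f" % (gc_count / acgt_count)), n_count
```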

As for Singularity - I'm not sure why it isn't finding the module. Sometimes Docker images don't behave as expected with Singularity, so I expect I'll have to build a specific Singularity image rather than relying on the Docker one.

Thanks, and no worries - I only got back to it today myself.

Indeed, there were some N-only sequences, which I have now removed from the reference. I relaunched the pipeline with Docker and will see if it finishes.
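For reference, contigs consisting only of Ns can be listed with a small stdlib-only scan like the one below. It is a sketch, and the input filename in the usage line is a placeholder.

```python
def n_only_contigs(fasta_path):
    """Yield names of contigs containing no ACGT bases (case-insensitive)."""
    name, has_acgt = None, False
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Report the previous contig before starting a new one
                if name is not None and not has_acgt:
                    yield name
                name, has_acgt = line[1:].split()[0], False
            elif any(base in "ACGT" for base in line.upper()):
                has_acgt = True
        # Don't forget the final contig in the file
        if name is not None and not has_acgt:
            yield name
```

Usage would be something like print(list(n_only_contigs("assembly.fasta"))).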

I tried to use Singularity because I know it far better than Docker, but I guess having a Docker-only image is fine.
(In fact, there is another reason I wanted to use Singularity: I plan to include the insdc-pipeline in a bigger Snakemake pipeline. I don't know if this will work in the end, but I thought using Singularity would simplify compatibility.)

It finished OK when using Docker.