PengNi / longmethyl

longmethyl

Learning Nextflow - A demo nextflow pipeline of methylation detection using long reads

Installation

  • (1) Install conda from Conda if needed.

  • (2) Install nextflow.

# create a new environment and install nextflow in it
conda create -n nextflow -c conda-forge -c bioconda nextflow

# or install nextflow in an existing environment
conda install -c conda-forge -c bioconda nextflow
  • (3) Download longmethyl from GitHub.
git clone https://github.com/PengNi/longmethyl.git
# install graphviz, which nextflow uses to render the pipeline DAG (e.g. as .svg)
conda install -c conda-forge graphviz
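
Before moving on, it can help to confirm that the tools installed above are actually on PATH. The following is a minimal sketch, not part of longmethyl: missing_tools is a hypothetical helper name, and dot is assumed to be the executable provided by the graphviz package.

```shell
# Print the name of every listed tool that is not found on PATH.
# Returns 0 when all tools are found, 1 otherwise.
# (missing_tools is a hypothetical helper, not part of longmethyl.)
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example, run inside the nextflow environment:
#   missing_tools conda nextflow git dot
```

An empty output means every tool was found; any printed name still needs to be installed or activated.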

Demo data

Check longmethyl/demo for demo data:

  • fast5_chr20.tar.gz: 60 HG002 fast5s which align to human genome chr20:10000000-10100000.
  • chr20_demo.fa: reference sequence of human chr20:10000000-10100000.
  • hg002_bsseq_chr20_demo.bed: HG002 BS-seq results of region chr20:10000000-10100000.

If you run longmethyl with Conda, also download the deepsignal CpG model (model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+.tar.gz) from Google Drive.
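
A quick way to confirm that a checkout contains the three demo files listed above is sketched below. missing_demo_files is a hypothetical helper, and the directory passed to it is whatever path holds your copy of longmethyl/demo.

```shell
# Print each of the three demo files that is absent from the given directory.
# Prints nothing when all of them are present.
# (missing_demo_files is a hypothetical helper, not part of longmethyl.)
missing_demo_files() {
    demo_dir=$1
    for f in fast5_chr20.tar.gz chr20_demo.fa hg002_bsseq_chr20_demo.bed; do
        [ -f "$demo_dir/$f" ] || echo "$f"
    done
}

# Example:
#   missing_demo_files longmethyl/demo
```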

Usage

The longmethyl pipeline calls methylation from nanopore reads, as follows:

[Figure: longmethyl-tubemap]

Option 1. Run with singularity (recommended)

The first time you run with singularity (e.g. using -profile singularity), the following command first caches the default singularity image (--singularity_name) in the --singularity_cache directory (default: local_singularity_cache). A .img file will then appear in the --singularity_cache directory.

# activate nextflow environment
conda activate nextflow

# run longmethyl; this command caches the singularity image before processing the data
nextflow run ~/tools/longmethyl -profile singularity \
    --dsname test \
    --genome chr20_demo.fa \
    --input fast5_chr20.tar.gz
# or, run longmethyl using GPU, set CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0 nextflow run ~/tools/longmethyl -profile singularity \
    --dsname test \
    --genome chr20_demo.fa \
    --input fast5_chr20.tar.gz

The cached .img file is then reused in later runs, without being downloaded again:

# this time nextflow will not download the singularity image again, because it is
# already in the --singularity_cache directory.
nextflow run ~/tools/longmethyl -profile singularity \
    --dsname test \
    --genome chr20_demo.fa \
    --input fast5_chr20.tar.gz
# or
nextflow run ~/tools/longmethyl -profile singularity \
    --singularity_cache local_singularity_cache \
    --dsname test \
    --genome chr20_demo.fa \
    --input fast5_chr20.tar.gz
# or
nextflow run ~/tools/longmethyl -profile singularity \
    --singularity_name local_singularity_cache/nipengcsu-longmethyl-0.3.img \
    --dsname test \
    --genome chr20_demo.fa \
    --input fast5_chr20.tar.gz
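
The choice between the last two variants can be scripted. The sketch below assumes the cached image keeps the file name shown above (nipengcsu-longmethyl-0.3.img); singularity_image_args is a hypothetical helper, not a longmethyl command.

```shell
# Print the image-related options for a run:
#   --singularity_name <file>  when the cached .img already exists,
#   --singularity_cache <dir>  otherwise, so that nextflow caches it there.
# (singularity_image_args is a hypothetical helper; the .img file name is
# taken from the example above and may differ for other image versions.)
singularity_image_args() {
    cache_dir=${1:-local_singularity_cache}
    img="$cache_dir/nipengcsu-longmethyl-0.3.img"
    if [ -f "$img" ]; then
        echo "--singularity_name $img"
    else
        echo "--singularity_cache $cache_dir"
    fi
}

# Example (left unquoted on purpose so the two words are split;
# this breaks if the cache path contains spaces):
#   nextflow run ~/tools/longmethyl -profile singularity \
#       $(singularity_image_args local_singularity_cache) \
#       --dsname test --genome chr20_demo.fa --input fast5_chr20.tar.gz
```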

The singularity image can also be pulled before running the command. The resulting .sif file only needs to be downloaded once.

# pull the singularity image (only needed once); this creates a .sif file
singularity pull docker://nipengcsu/longmethyl:0.3

# run longmethyl
nextflow run ~/tools/longmethyl -profile singularity \
    --singularity_name longmethyl_0.3.sif \
    --dsname test \
    --genome chr20_demo.fa \
    --input fast5_chr20.tar.gz

Option 2. Run with docker

  • (1) Pull the docker image (only needed once).

It is better to pull the docker image before running the pipeline for the first time, because pulling may be time-consuming and may run into network issues. This step is optional, however: the image is pulled automatically the first time the pipeline runs.

docker pull nipengcsu/longmethyl:0.3
  • (2) Run longmethyl using -profile docker.
# activate nextflow environment
conda activate nextflow

# run longmethyl using cpu
nextflow run ~/tools/longmethyl -profile docker \
    --dsname test \
    --genome chr20_demo.fa \
    --input fast5_chr20.tar.gz

Currently longmethyl CANNOT run with docker on a GPU machine.

# TODO: run longmethyl using GPU, set CUDA_VISIBLE_DEVICES and --gpu
CUDA_VISIBLE_DEVICES=0 nextflow run ~/tools/longmethyl -profile docker --gpu true \
    --dsname test \
    --genome chr20_demo.fa \
    --input fast5_chr20.tar.gz

Related issues:

  1. For "No swap limit support"
# for Ubuntu

# (1) With sudo, edit the /etc/default/grub file. Add or edit the GRUB_CMDLINE_LINUX line
# so that it contains the following two key-value pairs
GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"

# (2) Update GRUB
sudo update-grub

# (3) Restart the machine
sudo reboot

Ref: https://unix.stackexchange.com/questions/342735/docker-warning-no-swap-limit-support

  2. For "docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]."
# for Ubuntu

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update

sudo apt-get install -y nvidia-docker2

sudo systemctl restart docker

Ref: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

  3. For "Failed to initialize NVML: Driver/library version mismatch"

Ref: NVIDIA/nvidia-docker#584
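
For the "No swap limit support" issue above, the following sketch checks whether a GRUB defaults file already carries both key-value pairs, so the edit and reboot can be skipped when it does. grub_has_swap_accounting is a hypothetical helper; it only inspects the GRUB_CMDLINE_LINUX line and does not tell you whether the running kernel was booted with those settings.

```shell
# Succeed (return 0) when the GRUB_CMDLINE_LINUX line of the given file
# contains both cgroup_enable=memory and swapaccount=1; fail otherwise.
# (grub_has_swap_accounting is a hypothetical helper for illustration.)
grub_has_swap_accounting() {
    grub_file=${1:-/etc/default/grub}
    line=$(grep '^GRUB_CMDLINE_LINUX=' "$grub_file") || return 1
    case $line in *cgroup_enable=memory*) ;; *) return 1 ;; esac
    case $line in *swapaccount=1*) ;; *) return 1 ;; esac
    return 0
}

# Example:
#   grub_has_swap_accounting /etc/default/grub && echo "already configured"
```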

Option 3. Run with conda

  • (1) Install the conda environment named longmethyl (only needed once).
# on a GPU machine, make sure CUDA 10.0 and the CUDA driver are already installed
conda env create -f longmethyl/environment.yml
# or, on a CPU-only machine
conda env create -f longmethyl/environment-cpu.yml
  • (2) Install Guppy from the ONT community (only needed once). Guppy is not open source, so it has to be obtained from ONT separately.

  • (3) Download the pre-trained deepsignal model for calling modifications (the deepsignal CpG model model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+.tar.gz on Google Drive).

  • (4) Run longmethyl using -profile conda and the longmethyl environment.

# activate nextflow environment
conda activate nextflow

# run longmethyl
nextflow run ~/tools/longmethyl -profile conda \
    --conda_name /home/nipeng/tools/miniconda3/envs/longmethyl \
    --dsname test \
    --genome chr20_demo.fa \
    --input fast5_chr20.tar.gz \
    --deepsignalDir model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+.tar.gz
# or, run longmethyl using GPU, set CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0 nextflow run ~/tools/longmethyl -profile conda \
    --conda_name /home/nipeng/tools/miniconda3/envs/longmethyl \
    --dsname test \
    --genome chr20_demo.fa \
    --input fast5_chr20.tar.gz \
    --deepsignalDir model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+.tar.gz
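
The --conda_name value above is an absolute path that is specific to one machine. A minimal sketch for looking up the path of your own longmethyl environment follows; conda_env_path is a hypothetical helper that parses the output of conda env list, assuming the environment path is the last whitespace-separated field on each line (so paths containing spaces are not handled).

```shell
# Read `conda env list` output on stdin and print the path of the
# environment whose name matches the first argument.
# (conda_env_path is a hypothetical helper, not part of conda or longmethyl.)
conda_env_path() {
    awk -v name="$1" '$1 == name { print $NF }'
}

# Example:
#   conda env list | conda_env_path longmethyl
```

The printed path can then be passed as --conda_name.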

Extra 1. Run longmethyl and the benchmark process

To benchmark the ONT 5mCpG calling pipeline against an independent method such as BS-seq, set --eval_methcall to true and provide the BS-seq results in bedmethyl format using --bs_bedmethyl:

nextflow run ~/tools/longmethyl -profile singularity \
    --dsname test \
    --genome chr20_demo.fa \
    --input fast5_chr20.tar.gz \
    --eval_methcall true \
    --bs_bedmethyl hg002_bsseq_chr20_demo.bed
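
Before starting a benchmark run, the BS-seq file can be given a rough sanity check. The sketch below assumes the ENCODE-style bedMethyl layout of 11 tab-separated columns; whether longmethyl requires exactly this layout is not stated here, so treat the threshold as an assumption. count_short_bedmethyl_lines is a hypothetical helper.

```shell
# Print the number of lines that have fewer than 11 tab-separated fields.
# A result of 0 suggests the file is at least shaped like bedMethyl.
# (count_short_bedmethyl_lines is a hypothetical helper; the 11-column
# threshold assumes the ENCODE-style bedMethyl layout.)
count_short_bedmethyl_lines() {
    awk -F '\t' 'NF < 11 { n++ } END { print n + 0 }' "$1"
}

# Example:
#   count_short_bedmethyl_lines hg002_bsseq_chr20_demo.bed
```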

Extra 2. Resume a run

Use -resume to re-run a failed job; nextflow reuses the cached results of tasks that already completed, which saves time:

nextflow run ~/tools/longmethyl -profile singularity \
    --dsname test \
    --genome chr20_demo.fa \
    --input fast5_chr20.tar.gz \
    -resume
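
The retry can also be automated. The following is a minimal sketch, assuming that appending -resume to the end of the original command line is acceptable (as in the example above); run_with_resume is a hypothetical wrapper, not a nextflow feature.

```shell
# Run the given command once; if it fails, run it a second time with
# -resume appended. Returns the exit status of the last attempt.
# (run_with_resume is a hypothetical wrapper for illustration.)
run_with_resume() {
    "$@" && return 0
    echo "first attempt failed; retrying with -resume" >&2
    "$@" -resume
}

# Example:
#   run_with_resume nextflow run ~/tools/longmethyl -profile singularity \
#       --dsname test --genome chr20_demo.fa --input fast5_chr20.tar.gz
```

A blind retry only helps with transient failures such as network errors; a run that fails because of a wrong parameter will fail again.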

Outputs

The output directory should look like the following:

longmethyl_results/
├── pipeline_info
│   ├── execution_report_2022-11-12_10-33-35.html
│   ├── execution_timeline_2022-11-12_10-33-35.html
│   ├── execution_trace_2022-11-12_10-33-35.txt
│   └── pipeline_dag_2022-11-12_10-33-35.svg
└── test-ds
    ├── test_deepsignal_eval_genomelevel.forplot.txt
    ├── test_deepsignal_eval_genomelevel.txt
    ├── test_deepsignal_eval_readlevel.txt
    ├── test_deepsignal_per_read_combine.tsv.gz
    └── test_deepsignal_sitemods_freq.bed.gz
  • pipeline_info: Information about the workflow execution, generated automatically by nextflow.
  • test-ds: methylation calling results
    • test_deepsignal_eval*: Read-level/genome-level evaluation results, produced when --eval_methcall and --bs_bedmethyl are set.
    • test_deepsignal_per_read_combine.tsv.gz: Per-read methylation predictions.
    • test_deepsignal_sitemods_freq.bed.gz: Genome-level methylation frequencies.
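
To get a first impression of a gzipped result file without unpacking it to disk, its lines can be counted directly. This is a sketch only: count_gz_records is a hypothetical helper, and it assumes one record per line with no header line, which is not confirmed here for either output file.

```shell
# Print the number of lines in a gzip-compressed text file.
# (count_gz_records is a hypothetical helper; it counts lines, so a
# header line, if any, would be included in the total.)
count_gz_records() {
    gzip -dc "$1" | wc -l | tr -d ' '
}

# Example:
#   count_gz_records longmethyl_results/test-ds/test_deepsignal_sitemods_freq.bed.gz
```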

Acknowledgements

development: nextflow_develop.md

TODO

  • add summary
  • test case with no basecall/resquiggle steps
  • --fast5out not necessary in basecall; tombo-anno split from tombo-resquiggle, and make it optional
  • dockerfile
  • cpu settings (do not use task.cpus for all processes)
  • clean work dir
  • test with gpu (with docker, running with gpu and with cpu cannot both succeed in a single container, because of guppy)
  • how to set a default deepsignal model
  • result_summary_statistics/for visualization?
  • add test demo, including benchmark and evaluation
  • test a 20x hg002 dataset
  • add deepsignal2
  • add multi_to_single step
  • vbz issue
  • update deepsignal?
  • try filelist/multi_inputs, modify code to enable running in parallel; learn more; how to enable parallel runs and avoid copying files many times at the same time
  • Does nextflow support cross-process parallelism (when processes are related in a DAG, like untar->basecall)? (maybe not)
  • add visualization (Rmarkdown/html?)
  • freq.bed to bedgraph/wig for visualization?

About

License: MIT License

