m-mahgoub / nf-core-hic

Pipeline for Hi-C data analysis

hic

Pipeline for Hi-C/Capture-C data analysis

Introduction

nf-core-hic is a bioinformatics best-practice analysis pipeline for Hi-C/Capture-C data analysis. It is designed for large-scale analysis on High Performance Computing (HPC) clusters and in cloud computing environments (e.g. AWS). It can also be executed in hybrid environments (e.g. an LSF/AWS hybrid run).

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers, making installation trivial and results highly reproducible. This workflow is based on the nf-core template.

Workflow Summary

1) Hi-C Workflow (default)

  1. fastq2pair (per library):
    1. Preprocessing (fastp) >> library.html & library.json
    2. Alignment (bwa) >> library.cram/library.cram.crai
    3. Extract ligation junctions (pairtools)
    4. Remove PCR/optical duplicates (pairtools) >> library.pairs.gz & library.dedup.stats.txt
    5. Make pairs cram file (pairtools & samtools) >> library.pairs.cram/library.pairs.cram.crai
  2. Merge all library.pairs.gz & library.pairs.cram for libraries per individual sample >> sample.pairs.gz & sample.pairs.cram
  3. Make .mcool file (cooler) >> sample.mcool
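
To make the data flow of the per-library steps easier to picture, here is a simplified dry-run sketch that only prints a command chain. It is not the pipeline's actual implementation: the function name `print_fastq2pair`, the FASTQ file names and the `<PLACEHOLDERS>` are hypothetical, the CRAM outputs of steps 2 and 5 are left out, and the option values are the defaults listed under "Tools specific parameters".

```shell
# print_fastq2pair LIB: print (do not run) a simplified per-library command chain.
# File names and <PLACEHOLDERS> are hypothetical; the CRAM outputs are omitted.
print_fastq2pair() {
    lib="$1"
    cat <<EOF
fastp -i ${lib}_R1.fastq.gz -I ${lib}_R2.fastq.gz --stdout -q 15 -j ${lib}.json -h ${lib}.html \\
  | bwa mem -p -SP5M <INDEX_PREFIX> - \\
  | pairtools parse --chroms-path <CHROMSIZES> --min-mapq 1 --walks-policy 5unique --max-inter-align-gap 30 \\
  | pairtools sort \\
  | pairtools dedup --max-mismatch 1 --output-stats ${lib}.dedup.stats.txt --output ${lib}.pairs.gz
EOF
}

print_fastq2pair TESTA
```

Chaining the tools with pipes mirrors the design described in the Notes, where intermediate files are avoided at the cost of higher memory pressure.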

2) Capture-C Workflow

  • Initial steps are similar to the Hi-C workflow (steps 1-3)
  4. QC for Capture (bait regions coverage)
  5. Make bam file compatible with the CHiCAGO algorithm (samtools)

3) QC Workflow

  • This workflow is intended to check library complexity from shallow-depth sequencing as a QC step before deep sequencing. It is based on this Dovetail tutorial.
  1. fastq2pair (per library): Same steps as in the Hi-C and Capture-C workflows.
  2. Estimate library complexity (preseq) >> sample.preseq.txt. For interpretation of these results, refer to the Dovetail tutorial.

Quick Start

  1. Install Nextflow (>=22.10.1)

  2. Install any of Docker, Singularity (you can follow this tutorial), Podman, Shifter or Charliecloud for full pipeline reproducibility (this pipeline can NOT be run with conda). This step is not needed when running the pipeline on the WashU RIS cluster. The pipeline has also been tested successfully on Amazon Web Services (AWS). For details on how to run Nextflow pipelines on AWS, refer to the Nextflow documentation and to this excellent tutorial.

  3. Download the pipeline and test it on a minimal dataset with a single command:

    nextflow run dhslab/nf-core-hic -profile test,YOURPROFILE(S) --outdir <OUTDIR>
  4. Start running your own analysis!

    1. Hi-C workflow (default)
    nextflow run dhslab/nf-core-hic -r dev -latest \
    -profile YOURPROFILE(S) \
    --input <SAMPLESHEET> \
    --fasta <FASTA> \
    --bwa_index <INDEX_PREFIX> \
    --chromsizes <CHROMSIZES> \
    --genome <GENOME_NAME> \
    --outdir <OUTDIR> 
    2. Capture-C workflow
    nextflow run dhslab/nf-core-hic -r dev -latest \
    -entry capture \
    -profile YOURPROFILE(S) \
    --input <SAMPLESHEET> \
    --fasta <FASTA> \
    --bwa_index <INDEX_PREFIX> \
    --chromsizes <CHROMSIZES> \
    --genome <GENOME_NAME> \
    --baits_bed <BAITS_BED> \
    --outdir <OUTDIR>
    3. QC workflow
    nextflow run dhslab/nf-core-hic -r dev -latest \
    -entry qc \
    -profile YOURPROFILE(S) \
    --input <SAMPLESHEET> \
    --fasta <FASTA> \
    --bwa_index <INDEX_PREFIX> \
    --chromsizes <CHROMSIZES> \
    --genome <GENOME_NAME> \
    --outdir <OUTDIR> 
  • Any number of profiles/config files can be used. Just consider how configuration priorities are set in Nextflow, as documented here.
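
The three commands above differ only in the `-entry` option and, for Capture-C, the extra `--baits_bed` parameter. A small helper can make that explicit. This is a sketch only: `build_cmd` is a hypothetical function name, and it prints the command instead of running it so the placeholders can be filled in first.

```shell
# build_cmd WORKFLOW: print the nextflow command for hic (default), capture, or qc.
# Hypothetical helper; it only prints the command and never runs nextflow.
build_cmd() {
    workflow="${1:-hic}"
    extra=""
    case "$workflow" in
        hic)     extra="" ;;
        capture) extra="-entry capture --baits_bed <BAITS_BED>" ;;
        qc)      extra="-entry qc" ;;
        *)       echo "unknown workflow: $workflow" >&2; return 1 ;;
    esac
    echo "nextflow run dhslab/nf-core-hic -r dev -latest ${extra:+$extra }-profile YOURPROFILE(S) --input <SAMPLESHEET> --fasta <FASTA> --bwa_index <INDEX_PREFIX> --chromsizes <CHROMSIZES> --genome <GENOME_NAME> --outdir <OUTDIR>"
}

build_cmd capture
```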

Usage

Required parameters:

  1. Input samplesheet.csv, which provides paths to the fastq1 and fastq2 raw reads and their metadata (id, sample, library, flowcell). This can be provided either in a configuration file or as --input path/to/samplesheet.csv command line parameter. An example sheet is located in assets/samplesheet.csv.
  2. Genome fasta, either in a configuration file or as --fasta path/to/genome.fasta command line parameter.
  3. BWA index, either in a configuration file or as --bwa_index path/to/bwa_index/with_prefix command line parameter. It is important to provide the full path including index prefix.
  4. Chromosome sizes file, either in a configuration file or as --chromsizes path/to/chromsizes command line parameter.
  5. Genome name (e.g. hg38), either in a configuration file or as --genome genome_name command line parameter.
  6. Capture-C baits bed file (only in the Capture-C workflow), either in a configuration file or as --baits_bed path/to/baits_bed command line parameter.
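
As an illustration of the first requirement, the snippet below writes a small samplesheet and checks that every row has as many fields as the header. The column names are taken from the list above, but their order, the row values and the FASTQ paths are assumptions made for this sketch; the authoritative layout is the example in assets/samplesheet.csv.

```shell
# Hypothetical samplesheet: column names come from the list above,
# but their order and all values are assumptions for illustration.
cat > samplesheet.csv <<'EOF'
id,sample,library,flowcell,fastq1,fastq2
TESTA_FC1,TEST,TESTA,FC1,/path/to/TESTA_R1.fastq.gz,/path/to/TESTA_R2.fastq.gz
TESTB_FC1,TEST,TESTB,FC1,/path/to/TESTB_R1.fastq.gz,/path/to/TESTB_R2.fastq.gz
EOF

# Sanity check: count rows whose field count differs from the header's.
bad_rows=$(awk -F',' 'NR == 1 { n = NF; next } NF != n { c++ } END { print c + 0 }' samplesheet.csv)
echo "rows with wrong field count: ${bad_rows}"
```

A check like this catches stray or missing commas before the pipeline's own samplesheet validation runs.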

Tools specific parameters:

The following parameters are set to the default values shown, but can be modified when required, either on the command line or in user-provided config files:

Preprocessing options for fastp

| Parameter | Description | Type | Default |
|---|---|---|---|
| trim_qual | fastp -q option for the quality value that a base is qualified | integer | 15 |

pairtools options

| Parameter | Description | Type | Default |
|---|---|---|---|
| parsemq | pairtools parse --min-mapq option for the minimal MAPQ score to consider a read as uniquely mapped | integer | 1 |
| parse_walks_policy | pairtools parse --walks-policy option. See pairtools documentation for details | string | 5unique |
| parse_max_gap | pairtools parse --max-inter-align-gap option. See pairtools documentation for details | integer | 30 |
| max_mismatch | pairtools dedup --max-mismatch option. Pairs with both sides mapped within this distance (bp) from each other are considered duplicates | integer | 1 |

mcool file options

| Parameter | Description | Type | Default |
|---|---|---|---|
| resolutions | cooler zoomify --resolutions option: comma-separated list of target resolutions | string | 1000000,500000,250000,100000,50000,20000,10000,5000 |
| min_res | Minimum resolution for the mcool file (from the resolutions list provided) | integer | 5000 |
| mcool_mapq_threshold | MAPQ threshold(s) for the mcool file. Judging by the test output, one mcool file is written per value (TEST.mapq_1.mcool, TEST.mapq_30.mcool) | string | 1 30 |
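
Since min_res has to come from the resolutions list, a quick consistency check before launching a run can save a failed job. The sketch below uses the default values shown above. It also tests that every resolution is an integer multiple of min_res, which assumes min_res serves as the base bin size for cooler zoomify.

```shell
resolutions="1000000,500000,250000,100000,50000,20000,10000,5000"
min_res=5000

# Is min_res one of the listed resolutions?
case ",${resolutions}," in
    *",${min_res},"*) in_list=1 ;;
    *)                in_list=0 ;;
esac

# Is every resolution an integer multiple of min_res?
# (Assumption: min_res is used as the base resolution.)
multiples_ok=1
for r in $(printf '%s' "$resolutions" | tr ',' ' '); do
    if [ $((r % min_res)) -ne 0 ]; then
        multiples_ok=0
        echo "resolution ${r} is not a multiple of ${min_res}" >&2
    fi
done

echo "min_res in list: ${in_list}; all multiples of min_res: ${multiples_ok}"
```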

Capture-C options

| Parameter | Description | Type | Default |
|---|---|---|---|
| baits_bed | Bed file for regions targeted by Capture baits. Required only for Capture-C workflow | string | None |
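
As an example of the "user-provided config file" route, the snippet below writes a small Nextflow config that overrides a few of the defaults listed above. The file name `my_params.config` and the chosen values are hypothetical and are not recommendations.

```shell
# Write a hypothetical user config; values are illustrative only.
cat > my_params.config <<'EOF'
// Overrides for tool-specific parameters
params {
    trim_qual    = 20
    parsemq      = 30
    max_mismatch = 1
    resolutions  = '1000000,100000,10000'
    min_res      = 10000
}
EOF

# The file would then be passed with Nextflow's -c option, for example:
#   nextflow run dhslab/nf-core-hic -profile YOURPROFILE(S) -c my_params.config --outdir <OUTDIR> ...
cat my_params.config
```

Keeping overrides in a config file makes a run easier to reproduce than a long command line. Command-line parameters still take priority over values set in config files.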

Directory tree for test run output (default workflow):

.
├── pipeline_info
│   ├── execution_report_2023-02-17_23-57-34.html
│   ├── execution_timeline_2023-02-17_23-57-34.html
│   ├── execution_trace_2023-02-17_23-57-34.txt
│   ├── pipeline_dag_2023-02-17_23-57-34.html
│   ├── samplesheet.valid.csv
│   └── software_versions.yml
└── samples
    └── TEST
        ├── fastq2pairs
        │   ├── TESTA
        │   │   ├── TESTA.cram
        │   │   ├── TESTA.cram.crai
        │   │   ├── TESTA.dedup.stats.txt
        │   │   ├── TESTA.fastp.html
        │   │   ├── TESTA.fastp.json
        │   │   ├── TESTA.pairs.cram
        │   │   ├── TESTA.pairs.cram.crai
        │   │   └── TESTA.pairs.gz
        │   ├── TESTB
        │   │   ├── TESTB.cram
        │   │   ├── TESTB.cram.crai
        │   │   ├── TESTB.dedup.stats.txt
        │   │   ├── TESTB.fastp.html
        │   │   ├── TESTB.fastp.json
        │   │   ├── TESTB.pairs.cram
        │   │   ├── TESTB.pairs.cram.crai
        │   │   └── TESTB.pairs.gz
        │   ├── TESTC
        │   │   ├── TESTC.cram
        │   │   ├── TESTC.cram.crai
        │   │   ├── TESTC.dedup.stats.txt
        │   │   ├── TESTC.fastp.html
        │   │   ├── TESTC.fastp.json
        │   │   ├── TESTC.pairs.cram
        │   │   ├── TESTC.pairs.cram.crai
        │   │   └── TESTC.pairs.gz
        │   └── merged
        │       ├── TEST.pairs.cram
        │       ├── TEST.pairs.cram.crai
        │       └── TEST.pairs.gz
        └── mcool
            ├── TEST.mapq_1.mcool
            └── TEST.mapq_30.mcool
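
The tree above can be turned into a simple completeness check after a run. The sketch below is a hypothetical helper (`check_outputs` is not part of the pipeline) that looks for the per-library and merged files shown in the tree. It skips the mcool files, because their names depend on the MAPQ thresholds used.

```shell
# check_outputs OUTDIR SAMPLE LIBRARY...: report missing files, return non-zero if any.
check_outputs() {
    outdir="$1"; sample="$2"; shift 2
    missing=0
    # Per-library outputs, as listed in the tree above.
    for lib in "$@"; do
        for ext in cram cram.crai dedup.stats.txt fastp.html fastp.json pairs.cram pairs.cram.crai pairs.gz; do
            f="${outdir}/samples/${sample}/fastq2pairs/${lib}/${lib}.${ext}"
            [ -e "$f" ] || { echo "missing: $f"; missing=$((missing + 1)); }
        done
    done
    # Merged per-sample outputs.
    for ext in pairs.cram pairs.cram.crai pairs.gz; do
        f="${outdir}/samples/${sample}/fastq2pairs/merged/${sample}.${ext}"
        [ -e "$f" ] || { echo "missing: $f"; missing=$((missing + 1)); }
    done
    echo "missing files: ${missing}"
    [ "$missing" -eq 0 ]
}

# Usage (hypothetical output directory):
#   check_outputs results TEST TESTA TESTB TESTC
```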


Notes:

  • The pipeline was developed and optimized to run on the WashU RIS (LSF) HPC, but it can be deployed in any HPC environment supported by Nextflow.
  • The pipeline does NOT support conda.
  • The test workflow can be run on a personal computer, but this is not advised. Testing is recommended in an environment with at least 16 GB of memory. If the test workflow fails (especially at the fastq2pair step), try re-running it with more allocated resources. Such errors are likely caused by broken pipes due to maxed-out memory, because the pipeline chains several steps through pipes to avoid writing large intermediate files.
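
Before running the test profile locally, it can help to confirm that the machine meets the 16 GB recommendation. The check below is a Linux-only sketch that reads /proc/meminfo. It reports total installed memory, not what is currently free or what a container runtime is allowed to use.

```shell
required_gb=16

if [ -r /proc/meminfo ]; then
    # MemTotal is reported in kB; round to the nearest GiB.
    total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    total_gb=$(( (total_kb + 524288) / 1048576 ))
    if [ "$total_gb" -lt "$required_gb" ]; then
        echo "warning: ${total_gb} GB RAM detected, below the recommended ${required_gb} GB" >&2
    else
        echo "ok: ${total_gb} GB RAM detected"
    fi
else
    echo "cannot read /proc/meminfo; check available memory manually" >&2
fi
```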

License:MIT License

