hic

Pipeline for Hi-C/Capture-C data analysis

Introduction

nf-core-hic is a bioinformatics best-practice analysis pipeline for Hi-C/Capture-C data analysis. It is optimized for large-scale analysis on High Performance Computing (HPC) clusters and in cloud computing environments (e.g. AWS). It can also be executed in hybrid environments (e.g. an LSF/AWS hybrid run).

The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It uses Docker/Singularity containers, which makes installation straightforward and results highly reproducible. This workflow is based on the nf-core template.

Workflow Summary

1) Hi-C Workflow (default)

fastq2pair (per library):
1. Preprocessing (fastp) >> library.html & library.json
2. Alignment (bwa) >> library.cram/library.cram.crai
3. Extract ligation junctions (pairtools)
4. Remove PCR/optical duplicates (pairtools) >> library.pairs.gz & library.dedup.stats.txt
5. Make pairs cram file (pairtools & samtools) >> library.pairs.cram/library.pairs.cram.crai
- These steps are based on this Dovetail tutorial. Check the link for more details
Merge all library.pairs.gz & library.pairs.cram for libraries per individual sample >> sample.pairs.gz & sample.pairs.cram
Make .mcool file (cooler) >> sample.mcool
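
As a rough illustration of how steps 2-5 can be chained through pipes, the sketch below follows the general shape of the public Dovetail tutorial. It is not the pipeline's actual code: the function name fastq2pair_sketch, its argument order, the thread counts and the output file names are all assumptions made for this example, and the fastp preprocessing step is left out.

```shell
# Hypothetical helper, for illustration only (not taken from the pipeline).
# Arguments: BWA index prefix, genome fasta, chromsizes file, R1, R2, library name.
fastq2pair_sketch() {
  bwa_index="$1"; fasta="$2"; chromsizes="$3"; r1="$4"; r2="$5"; lib="$6"

  # Align, extract ligation junctions, sort, mark duplicates, then split the
  # stream into a pairs file and a CRAM file without large intermediates.
  bwa mem -5SP -T0 -t 8 "$bwa_index" "$r1" "$r2" \
    | pairtools parse --min-mapq 1 --walks-policy 5unique \
        --max-inter-align-gap 30 --chroms-path "$chromsizes" \
    | pairtools sort --nproc 8 \
    | pairtools dedup --max-mismatch 1 --mark-dups \
        --output-stats "$lib.dedup.stats.txt" \
    | pairtools split --output-pairs "$lib.pairs.gz" --output-sam - \
    | samtools sort -O cram --reference "$fasta" -o "$lib.pairs.cram" -
}
```

The option values shown mirror the defaults listed under the pairtools options in this README. Because every stage streams into the next, a failure in one tool tends to surface as a broken pipe in its neighbours.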

2) Capture-C Workflow

- Initial steps are the same as in the Hi-C workflow (steps 1-3)
4. QC for capture (bait region coverage)
5. Make a BAM file compatible with the CHiCAGO algorithm (samtools)

3) QC Workflow

This workflow checks library complexity from shallow-depth sequencing as a QC step before deep sequencing. It is based on this Dovetail tutorial.

fastq2pair (per library): Same steps as in the Hi-C and Capture-C workflows.
Estimate library complexity (preseq) >> sample.preseq.txt. For interpretation of these results, refer to the Dovetail tutorial.

Quick Start

Install Nextflow (>=22.10.1)
Install any of Docker, Singularity (you can follow this tutorial), Podman, Shifter or Charliecloud for full pipeline reproducibility (this pipeline can NOT be run with conda). This step is not required when running the pipeline on the WashU RIS cluster. The pipeline has also been tested successfully on Amazon Web Services (AWS). For details on how to run Nextflow pipelines on AWS, refer to the Nextflow documentation and to this excellent tutorial.

Download the pipeline and test it on a minimal dataset with a single command:

nextflow run dhslab/nf-core-hic -profile test,YOURPROFILE(S) --outdir <OUTDIR>

Start running your own analysis!

Hi-C workflow (default)

nextflow run dhslab/nf-core-hic -r dev -latest \
-profile YOURPROFILE(S) \
--input <SAMPLESHEET> \
--fasta <FASTA> \
--bwa_index <INDEX_PREFIX> \
--chromsizes <CHROMSIZES> \
--genome <GENOME_NAME> \
--outdir <OUTDIR>

Capture-C workflow

nextflow run dhslab/nf-core-hic -r dev -latest \
-entry capture \
-profile YOURPROFILE(S) \
--input <SAMPLESHEET> \
--fasta <FASTA> \
--bwa_index <INDEX_PREFIX> \
--chromsizes <CHROMSIZES> \
--genome <GENOME_NAME> \
--baits_bed <BAITS_BED> \
--outdir <OUTDIR>

QC workflow

nextflow run dhslab/nf-core-hic -r dev -latest \
-entry qc \
-profile YOURPROFILE(S) \
--input <SAMPLESHEET> \
--fasta <FASTA> \
--bwa_index <INDEX_PREFIX> \
--chromsizes <CHROMSIZES> \
--genome <GENOME_NAME> \
--outdir <OUTDIR>

Any number of profiles/config files can be used. Just keep in mind how configuration priorities are set in Nextflow, as documented here.

Usage

Required parameters:

Input samplesheet.csv, which provides paths for the fastq1 and fastq2 raw reads and their metadata (id, sample, library, flowcell). This can be provided either in a configuration file or as the --input path/to/samplesheet.csv command line parameter. An example sheet is located in assets/samplesheet.csv.
Genome fasta, either in a configuration file or as --fasta path/to/genome.fasta command line parameter.
BWA index, either in a configuration file or as --bwa_index path/to/bwa_index/with_prefix command line parameter. It is important to provide the full path including index prefix.
Chromosome sizes file, either in a configuration file or as --chromsizes path/to/chromsizes command line parameter.
Genome name (e.g. hg38), either in a configuration file or as --genome genome_name command line parameter.
Capture-C Baits bed file (Only in Capture-C workflow) either in a configuration file or as --baits_bed path/to/baits_bed command line parameter.
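
As a minimal sketch of what such a samplesheet could look like, the snippet below writes a two-library sheet and checks that every row has as many fields as the header. The column names come from the description above, but their order, the header spelling and all values are assumptions; assets/samplesheet.csv in the repository remains the authoritative layout.

```shell
# Illustrative samplesheet: column order and all values are assumed,
# check assets/samplesheet.csv for the real layout.
cat > samplesheet.csv <<'EOF'
id,sample,library,flowcell,fastq1,fastq2
LIB_A,SAMPLE1,LIB_A,FLOWCELL1,/path/to/LIB_A_R1.fastq.gz,/path/to/LIB_A_R2.fastq.gz
LIB_B,SAMPLE1,LIB_B,FLOWCELL1,/path/to/LIB_B_R1.fastq.gz,/path/to/LIB_B_R2.fastq.gz
EOF

# Sanity check: every row must have the same number of fields as the header.
if awk -F',' 'BEGIN { bad = 0 } NR == 1 { n = NF } NF != n { bad = 1 } END { exit bad }' samplesheet.csv; then
  echo "samplesheet.csv: field counts are consistent"
else
  echo "samplesheet.csv: some rows have a different number of fields" >&2
fi
```

The resulting file would then be passed to the pipeline as --input samplesheet.csv. Libraries that share the same sample value are the ones merged into a single sample.pairs.gz in the Hi-C workflow.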

Tools specific parameters:

The following parameters are set to the shown default values, but should be modified when required in command line, or in user-provided config files:

Preprocessing options for fastp

Parameter	Description	Type	Default
`trim_qual`	fastp `-q` option: the minimum quality value for a base to count as qualified	`integer`	15

pairtools options

Parameter	Description	Type	Default
`parsemq`	pairtools parse `--min-mapq` option for the minimal MAPQ score to consider a read as uniquely mapped	`integer`	1
`parse_walks_policy`	pairtools parse `--walks-policy` option. See pairtools documentation for details	`string`	5unique
`parse_max_gap`	pairtools parse `--max-inter-align-gap` option. See pairtools documentation for details	`integer`	30
`max_mismatch`	pairtools dedup `--max-mismatch` option. Pairs with both sides mapped within this distance (bp) from each other are considered duplicates	`integer`	1

mcool file options

Parameter	Description	Type	Default
`resolutions`	cooler zoomify `--resolutions` option: Comma-separated list of target resolutions	`string`	1000000,500000,250000,100000,50000,20000,10000,5000
`min_res`	Minimum resolution for the mcool file (from the resolutions list provided)	`integer`	5000
`mcool_mapq_threshold`	Space-separated list of MAPQ thresholds; judging by the test output, one .mcool file is written per threshold (e.g. TEST.mapq_1.mcool and TEST.mapq_30.mcool)	`string`	1 30
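
When changing `resolutions` and `min_res` together, a quick consistency check can catch mistakes before a long run. The sketch below verifies that `min_res` appears in the list and that every resolution is an integer multiple of it. The multiple-of rule reflects how cooler zoomify generally builds coarser bins from a base bin size; that `min_res` plays the role of that base size here is an assumption, not something this README states.

```shell
# Values copied from the defaults in the table above.
resolutions="1000000,500000,250000,100000,50000,20000,10000,5000"
min_res=5000

status="ok"
found="no"
for res in $(echo "$resolutions" | tr ',' ' '); do
  # min_res must itself be one of the listed resolutions.
  if [ "$res" -eq "$min_res" ]; then
    found="yes"
  fi
  # Each resolution should be an integer multiple of min_res.
  if [ $((res % min_res)) -ne 0 ]; then
    status="bad"
    echo "resolution $res is not a multiple of $min_res" >&2
  fi
done

if [ "$found" != "yes" ]; then
  status="bad"
  echo "min_res $min_res is not in the resolutions list" >&2
fi
echo "resolutions check: $status"
```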

Capture-C options

Parameter	Description	Type	Default
`baits_bed`	Bed file for regions targeted by Capture baits. Required only for Capture-C workflow	`string`	None
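
The tool-specific parameters above can be collected in a user-provided config file instead of being repeated on the command line. The following is a minimal sketch that writes such a file; the file name custom.config and the chosen values are purely illustrative, while the parameter names are the ones listed in the tables.

```shell
# Write an illustrative override file; the values are examples, not recommendations.
cat > custom.config <<'EOF'
params {
    trim_qual   = 20
    resolutions = '1000000,500000,100000,50000,10000'
    min_res     = 10000
}
EOF

echo "wrote custom.config"
```

The file can then be supplied with Nextflow's -c option, for example by adding -c custom.config to any of the nextflow run commands in Quick Start. Parameters given directly on the command line take precedence over values from config files.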

Directory tree for test run output (default workflow):

.
├── pipeline_info
│   ├── execution_report_2023-02-17_23-57-34.html
│   ├── execution_timeline_2023-02-17_23-57-34.html
│   ├── execution_trace_2023-02-17_23-57-34.txt
│   ├── pipeline_dag_2023-02-17_23-57-34.html
│   ├── samplesheet.valid.csv
│   └── software_versions.yml
└── samples
    └── TEST
        ├── fastq2pairs
        │   ├── TESTA
        │   │   ├── TESTA.cram
        │   │   ├── TESTA.cram.crai
        │   │   ├── TESTA.dedup.stats.txt
        │   │   ├── TESTA.fastp.html
        │   │   ├── TESTA.fastp.json
        │   │   ├── TESTA.pairs.cram
        │   │   ├── TESTA.pairs.cram.crai
        │   │   └── TESTA.pairs.gz
        │   ├── TESTB
        │   │   ├── TESTB.cram
        │   │   ├── TESTB.cram.crai
        │   │   ├── TESTB.dedup.stats.txt
        │   │   ├── TESTB.fastp.html
        │   │   ├── TESTB.fastp.json
        │   │   ├── TESTB.pairs.cram
        │   │   ├── TESTB.pairs.cram.crai
        │   │   └── TESTB.pairs.gz
        │   ├── TESTC
        │   │   ├── TESTC.cram
        │   │   ├── TESTC.cram.crai
        │   │   ├── TESTC.dedup.stats.txt
        │   │   ├── TESTC.fastp.html
        │   │   ├── TESTC.fastp.json
        │   │   ├── TESTC.pairs.cram
        │   │   ├── TESTC.pairs.cram.crai
        │   │   └── TESTC.pairs.gz
        │   └── merged
        │       ├── TEST.pairs.cram
        │       ├── TEST.pairs.cram.crai
        │       └── TEST.pairs.gz
        └── mcool
            ├── TEST.mapq_1.mcool
            └── TEST.mapq_30.mcool

Notes:

The pipeline is developed and optimized to run on the WashU RIS (LSF) HPC, but it can be deployed in any HPC environment supported by Nextflow.
The pipeline does NOT support conda.
The test workflow can be run on a personal computer, but this is not advised. Testing is recommended in an environment with at least 16 GB of memory. If the test workflow fails (especially at the fastq2pair step), try re-running it with more allocated resources. Such errors are most likely broken pipes caused by maxed-out memory: the pipeline chains several steps through pipes to avoid writing large intermediate files.
