qiubio/chipseq is a bioinformatics analysis pipeline for chromatin immunoprecipitation sequencing (ChIP-seq) data, based on nf-core/chipseq.
The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a very portable manner. It comes with docker containers, making installation trivial and results highly reproducible.
In addition to the original output of the nf-core/chipseq pipeline, this pipeline generates a UCSC genome browser track hub and metagene analysis results.
The most important change from the nf-core/chipseq pipeline is that I tried to improve reproducibility when the pipeline depends on conda rather than docker, for the following two reasons:
- I cannot use docker on our cluster.
- The memory required by the pipeline is too heavy for a personal computer when using docker.
However, conda always throws errors when creating environments, even when I use modules. To work around this, I added a conda_softwares section to the module.conf settings, which makes the pipeline more flexible. I also changed the installation of R/Bioconductor packages from conda to BiocManager, which is much slower than a conda installation. The reason is that many of these packages are broken in conda; installing them with BiocManager avoids the dependency issues.
- Raw read QC (FastQC)
- Adapter trimming (Trim Galore!)
- Alignment (BWA)
- Mark duplicates (picard)
- Merge alignments from multiple libraries of the same sample (picard)
  - Re-mark duplicates (picard)
  - Filtering to remove:
    - reads mapping to blacklisted regions (SAMtools, BEDTools)
    - reads that are marked as duplicates (SAMtools)
    - reads that aren't marked as primary alignments (SAMtools)
    - reads that are unmapped (SAMtools)
    - reads that map to multiple locations (SAMtools)
    - reads containing > 4 mismatches (BAMTools)
    - reads that have an insert size > 2kb (BAMTools; paired-end only)
    - reads that map to different chromosomes (Pysam; paired-end only)
    - reads that aren't in FR orientation (Pysam; paired-end only)
    - reads where only one read of the pair fails the above criteria (Pysam; paired-end only)
  - Alignment-level QC and estimation of library complexity (picard, Preseq)
  - Create normalised bigWig files scaled to 1 million mapped reads (BEDTools, bedGraphToBigWig)
  - Generate gene-body meta-profile from bigWig files (deepTools)
  - Calculate genome-wide IP enrichment relative to control (deepTools)
  - Calculate strand cross-correlation peak and ChIP-seq quality measures including NSC and RSC (phantompeakqualtools)
- Call broad/narrow peaks
  - By MACS2
    - Call peaks
    - Differential binding analysis by shell script
      - Annotate peaks relative to gene features (HOMER)
      - Create consensus peakset across all samples and create tabular file to aid in the filtering of the data (BEDTools)
      - Count reads in consensus peaks (featureCounts)
      - Differential binding analysis, PCA and clustering (R, DESeq2)
    - Differential binding analysis by DiffBind
  - By HOMER
    - Call peaks
    - Differential binding analysis by DiffBind
    - Annotate peaks relative to gene features (ChIPpeakAnno)
    - Enrichment analysis. If gsea-cli.sh and the c2.all.v7.2 molecular signatures database are available, GSEA enrichment analysis will also be done.
- Visualise the tracks.
- Present QC for raw read, alignment, peak-calling and differential binding results (MultiQC, R)
- Create index.html (R)
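Two of the steps above are easy to illustrate in isolation. The sketch below is not taken from the pipeline source: the function names `fails_flag_filter` and `scale_factor` are hypothetical. It only mirrors the flag-based read filters (unmapped, not primary, duplicate) using the standard SAM FLAG bits, and the "scaled to 1 million mapped reads" normalisation as a plain division.

```shell
# Hypothetical helpers for illustration; not the pipeline's own code.

# Succeeds (exit 0) if a SAM FLAG value would be dropped by the flag-based
# filters: unmapped (0x4), not primary (0x100) or duplicate (0x400).
fails_flag_filter() {
    [ $(( $1 & 0x4 )) -ne 0 ] ||
    [ $(( $1 & 0x100 )) -ne 0 ] ||
    [ $(( $1 & 0x400 )) -ne 0 ]
}

# Scale factor that normalises coverage to 1 million mapped reads.
# $1 is the number of mapped reads in the filtered BAM file.
scale_factor() {
    awk -v mapped="$1" 'BEGIN { printf "%.6f\n", 1000000 / mapped }'
}

# Try a few FLAG values: a properly paired read, an unmapped read,
# a duplicate and a secondary alignment.
for flag in 99 4 1123 256; do
    if fails_flag_filter "$flag"; then
        echo "$flag removed"
    else
        echo "$flag kept"
    fi
done

# A library with 25 million mapped reads.
scale_factor 25000000
```

In a real run the mapped read count would come from a tool such as `samtools flagstat`, and the filters are applied by the tools named in the list rather than by shell arithmetic.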
# install nextflow
wget -qO- https://get.nextflow.io | bash
# download the pipeline and run the test profile
nextflow pull jianhong/chipseq
nextflow run jianhong/chipseq -profile test,conda
# update the pipeline
nextflow pull jianhong/chipseq
# remove the pipeline
nextflow drop jianhong/chipseq
For DiffBind to run, the antibody column must be provided.
The control column may be empty, and the track_color and track_group columns are optional.
Save the design table as a CSV file. See the examples in assets/design*.
| group | replicate | fastq_1 | fastq_2 | antibody | control | track_color | track_group |
|---|---|---|---|---|---|---|---|
| WT | 1 | fastq/WT1.fastq.gz | | ANT1 | Input | #E69F00 | SAMPLE |
| WT | 2 | fastq/WT2.fastq.gz | | ANT1 | Input | #E69F00 | SAMPLE |
| KD | 1 | fastq/KD1.fastq.gz | | ANT1 | Input | #0000FF | SAMPLE |
| KD | 2 | fastq/KD2.fastq.gz | | ANT1 | Input | #0000FF | SAMPLE |
| Input | 1 | fastq/KD1.fastq.gz | | | | #000000 | SAMPLE |
| Input | 2 | fastq/KD2.fastq.gz | | | | #000000 | SAMPLE |
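As a rough illustration, the table could be written to a CSV file as follows. This is a sketch under assumptions: the column order is taken from the table header, empty cells are left blank, and `check_design` is a hypothetical helper that only verifies the field count. The files in assets/design* remain the authoritative examples.

```shell
# Write the example design table as a CSV file.
# Empty fields (fastq_2 for single-end data, antibody and control for the
# Input rows) are simply left blank.
cat > design.csv <<'EOF'
group,replicate,fastq_1,fastq_2,antibody,control,track_color,track_group
WT,1,fastq/WT1.fastq.gz,,ANT1,Input,#E69F00,SAMPLE
WT,2,fastq/WT2.fastq.gz,,ANT1,Input,#E69F00,SAMPLE
KD,1,fastq/KD1.fastq.gz,,ANT1,Input,#0000FF,SAMPLE
KD,2,fastq/KD2.fastq.gz,,ANT1,Input,#0000FF,SAMPLE
Input,1,fastq/KD1.fastq.gz,,,,#000000,SAMPLE
Input,2,fastq/KD2.fastq.gz,,,,#000000,SAMPLE
EOF

# Hypothetical sanity check: every line must have exactly 8 comma-separated
# fields, otherwise the offending line is reported and the check fails.
check_design() {
    awk -F',' '
        NF != 8 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
        END { exit bad }
    ' "$1"
}

check_design design.csv && echo "design.csv looks consistent"
```

A check like this catches the most common mistake, a dropped comma for an empty fastq_2 or control cell, before the pipeline is started.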
The pipeline can run metagene analysis on predefined BED files, passed with the --genomicElements parameter:
nextflow run jianhong/chipseq -profile test -resume --genomicElements beds/*.bed
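Before passing BED files to the pipeline it can help to confirm they are well formed. The following is a minimal sketch, not part of the pipeline: `check_bed` is a hypothetical helper, the file `beds/example.bed` and its feature names are made-up example data, and tab-separated columns are assumed.

```shell
# Hypothetical pre-flight check: a BED line needs at least chrom, start and
# end, with integer coordinates and start < end (BED starts are 0-based).
check_bed() {
    awk -F'\t' '
        /^(#|track|browser)/ { next }   # skip comment and header lines
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 {
            printf "%s:%d: invalid BED line\n", FILENAME, NR; bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Made-up example data for the sketch.
mkdir -p beds
printf 'chr1\t1000\t2000\tpromoter_A\nchr2\t500\t1500\tpromoter_B\n' > beds/example.bed

for bed in beds/*.bed; do
    check_bed "$bed" && echo "$bed OK"
done
```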
Here is an example profile file, named sample.config, for human samples.
params {
    config_profile_name = 'Full test profile'
    config_profile_description = 'Full test dataset to check pipeline function'

    // Input data
    input = 'path_to_design.csv'

    // Genome references
    genome = 'GRCh38'
}
Then run the pipeline with conda:
nextflow run jianhong/chipseq -c sample.config --conda
Or with docker:
nextflow run jianhong/chipseq -c sample.config --docker
Here is an example that downloads data from the GEO database and runs the analysis. The downloader filters the data by seqtype ChIP-Seq. You can change the seqtype with the --seqtype parameter.
nextflow run jianhong/chipseq --input GSE36107 --genome dm6
To raise the E-utilities request limit, you may want to supply an E-utilities api_key. Suppose you have created a key whose value is “ABCD123”. NOTE: the docker version does not support this.
nextflow run jianhong/chipseq --input GSE90661 --genome R64-1-1 --api_key ABCD123
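To avoid typing the key on every command line, it can be kept in an environment variable and appended only when it is set. This is just a sketch: `build_geo_cmd` is a hypothetical helper and `NCBI_API_KEY` is a variable name chosen for this example. It uses only the `--input`, `--genome` and `--api_key` options shown above, and the note about the docker version still applies.

```shell
# Assumption: the key was exported beforehand, for example
#   export NCBI_API_KEY=ABCD123
# NCBI_API_KEY is a name chosen for this sketch, not one the pipeline reads.
build_geo_cmd() {
    accession=$1
    genome=$2
    cmd="nextflow run jianhong/chipseq --input $accession --genome $genome"
    # Only append --api_key when a key is actually set.
    if [ -n "${NCBI_API_KEY:-}" ]; then
        cmd="$cmd --api_key $NCBI_API_KEY"
    fi
    echo "$cmd"
}

# Print the command so it can be reviewed before it is run.
build_geo_cmd GSE90661 R64-1-1
```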
First create a config file following this format:
params {
    // change conda software version.
    conda_softwares {
        samtools = "bioconda::samtools=1.09"
        trimgalore = "bioconda::cutadapt=1.18 bioconda::trim-galore=0.6.6"
    }
    // change module parameters, for example for homer findpeaks and macs2
    modules {
        'homer_findpeaks' {
            args = ['h3k4me1': "-region -size 1000 -minDist 2500 -C 0",
                    'h3k4me3': "-region -nfr",
                    'h3k9me1': "-region -size 1000 -minDist 2500 -C 0",
                    'h3k9me2': "-region -size 1000 -minDist 2500 -C 0",
                    'h3k9me3': "-region -size 1000 -minDist 2500 -C 0",
                    'h3k14ac': "-region -size 1000 -minDist 2500 -C 0",
                    'h4k20me1': "-region -size 1000 -minDist 2500 -C 0",
                    'h3k27me3': "-region -size 1000 -minDist 2500 -C 0",
                    'h3k27ac': "-region -nfr",
                    'h3k36me3': "-region -size 1000 -minDist 2500 -C 0",
                    'h3k79me2': "-region -size 1000 -minDist 2500 -C 0",
                    'h3k79me3': "-region -size 1000 -minDist 2500 -C 0",
                    'others': ""]
            publish_dir = "homer"
        }
        'macs2_callpeak' {
            args = "--keep-dup all"
            publish_dir = "macs2"
        }
    }
}
For more module settings, please refer to the modules.config file.
Then run the pipeline with the following command:
nextflow run jianhong/chipseq -c sample.config -c path/to/your/module/config/file --conda -resume
Please create an issue to submit your questions.
See citation