Step-by-step instructions for how to analyze ChIP-Seq data starting from raw FASTQ files and ending with analysis-ready peak files.
Most of the work in the pipeline is done by various scripts in the code directory (created by setup.sh). These scripts are submitted to the Slurm job scheduler as "batch jobs" using the scripts located in the sbatch directory. Each "code" script has a corresponding "sbatch" script with the same numeric prefix (e.g. code/00_process.md5.R and code/sbatch/00_sbatchmd5.sh).
Most of the sbatch scripts are "array jobs", which means that one job is sent to the scheduler per file. An array job script has #SBATCH --array 1-n as the last #SBATCH argument at the top of the script, where n is the number of individual files that need to be processed. Additionally, the output and error files created for each job end in _%A_%a.out or _%A_%a.err for array jobs and _%j.out or _%j.err for non-array jobs.
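As a rough illustration of the array-job pattern described above, the sketch below shows a header with the array-style log names and a helper that picks one file per task. The job name, log prefix, `TODO` variable, and `pick_task_file` helper are hypothetical; the pipeline's own sbatch scripts may select files differently.

```bash
#!/bin/bash
# Sketch of an array-job header. The job name, log prefix, and todo-file
# layout are assumptions for illustration, not the pipeline's own values.
#SBATCH --job-name=example
#SBATCH --output=example_%A_%a.out
#SBATCH --error=example_%A_%a.err
#SBATCH --array 1-3

# Print line N (1-based) of a todo file: the file handled by array task N.
pick_task_file() {
    sed -n "${2}p" "$1"
}

# Slurm sets SLURM_ARRAY_TASK_ID inside an array job; outside Slurm this is skipped.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ] && [ -f "${TODO:-}" ]; then
    echo "Task $SLURM_ARRAY_TASK_ID processes: $(pick_task_file "$TODO" "$SLURM_ARRAY_TASK_ID")"
fi
```

Here %A expands to the array job id and %a to the task index, which is why array-job logs end in _%A_%a.out and _%A_%a.err.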
Each job submission creates an output file and an error file with a descriptive prefix denoting the tool it came from, along with the job or array id. These files are written to whichever directory you are in when you submit to Slurm. There is a logs directory with sub-directories for each step in the pipeline that are designed to hold these log files.
In addition to the main processing steps, there are also some qc scripts that are housed in the code/qc directory. You will see them mentioned throughout the steps of the pipeline. They mainly parse different log outputs from the main tools and create qc plots for review.
There are also a number of analysis scripts that have yet to be generalized and have not been designed to be submitted to Slurm. They're located in code/analysis.
- Make sure that you have the following in your .bashrc, .bash_profile, or equivalent:
    export sdata="/path/to/LIBXXXXXMS"
    export stool="/path/to/this/installation"
    export BIOCODERS="/path/to/BioCoders/"
    export R_LIBS_USER="$BIOCODERS/InstalledLibraries/R"
    export PATH="$PATH:$BIOCODERS/Applications/anaconda2/bin"
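Before submitting anything, it can help to confirm that these variables are actually set in the current shell. The helper below is a hypothetical convenience, not part of the pipeline; it only checks the four variable names listed above.

```bash
# Hypothetical helper (not part of the pipeline): report which of the
# variables used by the pipeline are unset or empty in this shell.
check_pipeline_env() {
    missing=0
    for var in sdata stool BIOCODERS R_LIBS_USER; do
        # eval gives a portable indirect lookup of the variable named in $var
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "missing: $var"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_pipeline_env` prints one line per missing variable and returns a non-zero status if any are missing.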
- Double-check that R is pointing to the right directory by running the following:
    ~$ R
    > .libPaths()
    [1] "/home/exacloud/lustre1/BioCoders/InstalledLibraries/R"
  The path listed above should be the first result. You will likely have multiple paths listed after this one.
- Run sh $stool/setup.sh to create empty directories.
- Follow directions from MPSSR to transfer files from nix.
  - FastQC goes in $sdata/data/FastQC.
  - Reports, Stats, readme.txt go in $sdata/data/extras.
  - Fastq files go in $sdata/data/00_fastqs.
- Check Fastq transfer using md5 sums.
    ~$ sbatch $sdata/code/sbatch/00_sbatchmd5.sh
    ~$ cd $sdata/data/00_fastqs
    ~$ diff calculated.md5.sums.txt md5sum.sorted.txt
    ~$ mv calculated.md5.sums.txt md5sum.sorted.txt md5sum.txt $sdata/data/extras/md5
    ~$ mv $sdata/code/sbatch/md5_* $sdata/logs/00_md5
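The diff above compares calculated checksums against the ones delivered with the data. As an alternative sketch, GNU `md5sum --check` can verify files directly against a checksum file. This assumes the checksum file uses the standard `md5sum` output format with file names relative to the fastq directory, which may not match the delivered md5sum.txt; `verify_md5` is a hypothetical helper.

```bash
# Hypothetical helper: verify every file listed in a checksum file.
# Assumes standard md5sum format ("<hash>  <filename>") with names
# relative to the directory given as the first argument.
verify_md5() {
    dir=$1
    sums=$2
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$dir" && md5sum --check --quiet "$sums" )
}
```

A call such as `verify_md5 $sdata/data/00_fastqs md5sum.txt` prints only the files that fail and returns non-zero if any checksum does not match.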
- Run multiqc on FastQC files.
    ~$ MULTIQC=$BIOCODERS/Applications/anaconda2/bin/multiqc
    ~$ $MULTIQC $sdata/data/FastQC
    <copy files to local drive to view>
- Unzip. (Not necessary, but useful if you want to take a look at the fastq files.)
    ~$ sbatch $sdata/code/sbatch/01_sbatchUnzip.sh
    ~$ mv $sdata/code/sbatch/unzip_* $sdata/logs/01_unzip
- Trim adapter sequence, reformat log files, and make plots. (These steps still need to be generalized a bit more.)
  - For the 01_processTrimLog.sh step, be sure to check the file name indicators (e.g. "L005" may not be present in your file names).
  - For the 02_trimViz.R step, check that your files are in the indicated directories.
    ~$ sbatch $sdata/code/sbatch/02_sbatchTrimSeq.sh
    ~$ mv $sdata/data/01_trim/*_report.txt $sdata/data/02_trimLog
    ~$ cd $sdata/data/02_trimLog
    ~$ for file in *_report.txt; do name=${file%%_L005*}; sh $sdata/code/qc/01_processTrimLog.sh $file $name trimLogProcessed/; done
    ~$ Rscript $sdata/code/qc/02_trimViz.R --summaryDir $sdata/data/qc/trimLogProcessed/summary/ --trimDistDir $sdata/data/qc/trimLogProcessed/trimDist --meta $sdata/meta/meta.txt --outDir $sdata/data/qc/plots/trimQC/
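The `${file%%_L005*}` expansion in the loop above removes everything from the first `_L005` onward, which is why the lane indicator matters. The sketch below wraps the same expansion in a hypothetical `sample_name` helper so the lane can be changed; the report file names used with it are made up for illustration.

```bash
# Strip everything from the first "_<lane>" onward, mirroring ${file%%_L005*}.
# If the lane string is absent, the file name comes back unchanged, which is
# the failure mode to watch for.
sample_name() {
    file=$1
    lane=$2
    echo "${file%%_${lane}*}"
}
```

For example, `sample_name DNA180319MS_CM_IP_1_S28_L005_R1_report.txt L005` gives `DNA180319MS_CM_IP_1_S28`, while a file without `_L005` in its name is returned as-is.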
- Create bowtie index.
    ## Double check that $IN is the appropriate path
    ## Double check that $BASE is appropriate for your genome
    ~$ sbatch $sdata/code/sbatch/03_sbatchBowtieBuild.sh
- Run bowtie2.
    ~$ sbatch $sdata/code/sbatch/10_sbatchBowtie.sh
    ~$ mv $sdata/code/sbatch/bowtie2_* $sdata/logs/10_bowtie
- Convert sam files to bam.
    ~$ sbatch $sdata/code/sbatch/20_sbatchSam2Bam.sh
    ~$ mv $sdata/code/sbatch/s2b_* $sdata/logs/20_s2b
- Filter data and get some QC info.
  - Split: split into unmapped, multi-mapped, and uniquely-mapped. Further split into good and bad reads via MAPQ score.
    - unmapped: use -f 4
    - multi-mapped: use -F 4 and grep for "XS:i:"
    - unique-mapped: use -F 4 and inverse grep for "XS:i:"
  - MapQ: print MAPQ scores for each alignment (5th column of bam file)
    ~$ sbatch $sdata/code/sbatch/30_sbatchFilterQC.sh
    ~$ mv $sdata/code/sbatch/filterQC_* $sdata/logs/30_filter_and_qc
  - Aggregate/reformat QC files
    ~$ Rscript $sdata/code/qc/10_bowtie_alignment_qc.R --inputDir $sdata/logs/10_bowtie --outDir $sdata/data/qc/summary/
    ~$ for dir in $sdata/data/qc/*_mapq; do Rscript $sdata/code/qc/11_bowtie_mapqDistr.R -i $dir -o $sdata/data/qc/summary -f "2,3,4,5"; done
  - Make plots
    ~$ Rscript $sdata/code/qc/12_plot_alignment_qc.R --inputFile $sdata/data/qc/summary/bowtie2.alignment.QC.summary.txt \
         --outDir $sdata/data/qc/plots/alignQC/ \
         --treat 1 --type 2 --rep 3
    ~$ Rscript $sdata/code/qc/13_mapq_alignment_qc.R --uniqInputFile $sdata/data/qc/summary/uniq_mapq_summary.txt \
         --multiInputFile $sdata/data/qc/summary/multi_mapq_summary.txt \
         --treat 1 --type 2 --rep 3 --cutOff 10 --outDir $sdata/data/qc/plots/alignQC/
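The split rules above (FLAG bit 4 for unmapped reads, presence of "XS:i:" for multi-mapped reads) can be illustrated on plain SAM text. This is only a sketch of the logic, not what 30_sbatchFilterQC.sh runs; `classify_sam` is a hypothetical helper that counts records instead of writing separate files.

```bash
# Classify SAM records read from stdin into unmapped / multi / unique,
# mirroring -f 4, -F 4 plus grep "XS:i:", and -F 4 plus grep -v "XS:i:".
classify_sam() {
    awk -F '\t' '
        /^@/ { next }                              # skip SAM header lines
        int($2 / 4) % 2 == 1 { unmapped++; next }  # FLAG bit 0x4 set: read is unmapped
        /XS:i:/ { multi++; next }                  # a second-best alignment score is present
        { unique++ }                               # mapped, and no XS tag
        END { printf "unmapped=%d multi=%d unique=%d\n", unmapped, multi, unique }
    '
}
```

Assuming samtools is on your PATH, `samtools view sample.bam | classify_sam` gives a quick tally to compare against the QC summaries.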
- Mark duplicates with picard tools, reformat log files for plotting, and plot.
    ~$ sbatch $sdata/code/sbatch/40_sbatchRemDup.sh
    ~$ mv $sdata/code/sbatch/remDup_* $sdata/logs/40_remDup
    ~$ sh $sdata/code/qc/20_processDupLog.sh $sdata/data/41_remDupLog $sdata/data/qc/summary
    ~$ Rscript $sdata/code/qc/21_plot_markDup_qc.R --inputFile $sdata/data/qc/summary/dupSummary.txt \
         --outDir $sdata/data/qc/plots/alignQC/ \
         --treat 1 --type 2 --rep 3
- Call peaks using MACS2 (following instructions from: https://github.com/taoliu/MACS/wiki/Build-Signal-Track)
  - -B tells MACS2 to store fragment pileup scores in bedGraph files.
  - --SPMR tells MACS2 to generate pileup signal as 'fragment pileup per million reads'.
  - --qvalue 0.05 is the default, included here as a reminder. It is the minimum cutoff to call significant regions and uses Benjamini-Hochberg adjustment of p-values.
  - --gsize hs is for the mappable genome size of humans. Set to 'mm' for mouse.
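To show how the flags above fit together, the sketch below prints (but does not run) a `macs2 callpeak` command. The `build_callpeak_cmd` helper, the file names, and the output directory are placeholders; 50_sbatchCallPeaks.sh may pass additional options.

```bash
# Hypothetical wrapper: print (not run) a MACS2 callpeak command using the
# flags described above. All four arguments are placeholders.
build_callpeak_cmd() {
    treat=$1; control=$2; name=$3; outdir=$4
    echo "macs2 callpeak -t $treat -c $control -n $name --outdir $outdir" \
         "--gsize hs -B --SPMR --qvalue 0.05"
}
```

Printing the command first makes it easy to review the treatment/control pairing before anything is submitted.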
- 50_sbatchCallPeaks.sh will also create a "bed" version that prepends "chr" to the chromosome column and removes non-standard chromosomes.
    ### Create todo files
    ~$ ls -v $sdata/data/40_remDup | grep -v Input > $sdata/todo/50_callPeaks.txt
    ~$ ls -v $sdata/data/40_remDup | grep Input > $sdata/todo/50_ctl.txt
    ### Run
    ~$ sbatch $sdata/code/sbatch/50_sbatchCallPeaks.sh
    ~$ mv $sdata/code/sbatch/callPeaks_* $sdata/logs/50_callPeaks
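The "bed" conversion mentioned above can be sketched with awk. This is an illustration only: the definition of a "standard" chromosome used here (numbered chromosomes plus X and Y) is an assumption, and the pipeline script may keep a different set. `to_chr_bed` is a hypothetical name.

```bash
# Prefix chromosome names with "chr" and keep only standard chromosomes.
# Assumption: "standard" means numbered chromosomes plus X and Y; adjust
# the pattern if your genome or the pipeline script differs.
to_chr_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         $1 ~ /^([0-9]+|X|Y)$/ { $1 = "chr" $1; print }'
}
```

For example, `to_chr_bed < sample_peaks.narrowPeak > sample_peaks.bed` would drop scaffold entries such as GL000220.1 and rename `1` to `chr1`.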
- Count peaks to check the quality of all samples.
    ~$ sh $sdata/code/qc/30_countPeaks.sh $sdata/data/50_peaks $sdata/data/qc/summary
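Since a narrowPeak file has one peak per line, a peak count is just a line count. The sketch below shows that idea with a hypothetical `count_peaks` helper; the actual output format of 30_countPeaks.sh is not shown here and may differ.

```bash
# Print "<sample><TAB><number of peaks>" for every narrowPeak file in a directory.
count_peaks() {
    for f in "$1"/*_peaks.narrowPeak; do
        [ -e "$f" ] || continue   # no matches: the glob stays literal, so skip it
        # One peak per line, so the line count is the peak count.
        printf '%s\t%s\n' "$(basename "$f" _peaks.narrowPeak)" "$(awk 'END { print NR }' "$f")"
    done
}
```

A sample with far fewer peaks than its replicates is worth a closer look before the correlation and IDR steps.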
- Run MACS2 again, this time with bdgcmp instead of callpeak.
  - macs2 bdgcmp will 'deduct noise by comparing two signal tracks in bedGraph'
    ~$ sbatch $sdata/code/sbatch/51_callPeaksBDGCMP.sh
    ~$ mv $sdata/code/sbatch/callPeaks_BDGCMP_* $sdata/logs/51_callPeaksBDGCMP
- Convert bedGraph files to bigWig files.
  - A few extra scripts are required (located in public/). See the MACS2 wiki link above for more detailed instructions.
    ~$ sbatch $sdata/code/sbatch/52_sbatchBdg2bw.sh
    ~$ mv $sdata/code/sbatch/bdg2bw_* $sdata/logs/52_bdg2bw
- Run correlation on bigWig files to determine if replicates are good enough to combine.
  - Copy $sdata/todo/50_callPeaks.txt to $sdata/todo/53_wigCorrelate.txt
  - Reformat so that all of the samples for each treatment are on a single line, with each file separated by a space.
  - Additionally, change the suffix to that of the Fold Enrichment bigWig file.
  - Example:
    ~$ cat $sdata/todo/50_callPeaks.txt
    DNA180319MS_CM_IP_1_S28.bam
    DNA180319MS_CM_IP_2_S29.bam
    DNA180319MS_CM_IP_3_S30.bam
    ~$ cat $sdata/todo/53_wigCorrelate.txt
    DNA180319MS_CM_IP_1_S28_FE.bw DNA180319MS_CM_IP_2_S29_FE.bw DNA180319MS_CM_IP_3_S30_FE.bw
  - Run:
    ~$ sbatch $sdata/code/sbatch/53_sbatchWigCorrelate.sh
    ~$ mv $sdata/code/sbatch/wigCorrelate_* $sdata/logs/53_wigCorrelate
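The todo-file reformatting above can be done by hand, but a small awk sketch shows the idea. It assumes file names follow the pattern in the example, `<treatment>_<replicate>_S<number>.bam`, and groups on everything before the replicate number; `make_wigcorrelate_todo` is a hypothetical helper, and other naming schemes would need a different grouping rule.

```bash
# Turn one-bam-per-line input into one line per treatment of _FE.bw files.
# Assumes names look like <treatment>_<replicate>_S<number>.bam.
make_wigcorrelate_todo() {
    awk '{
        name = $0
        sub(/\.bam$/, "", name)               # drop the .bam suffix
        group = name
        sub(/_[0-9]+_S[0-9]+$/, "", group)    # drop replicate and S-number
        if (!(group in line)) {
            order[++n] = group                # remember first-seen order
            line[group] = name "_FE.bw"
        } else {
            line[group] = line[group] " " name "_FE.bw"
        }
    }
    END { for (i = 1; i <= n; i++) print line[order[i]] }'
}
```

A possible call is `make_wigcorrelate_todo < $sdata/todo/50_callPeaks.txt > $sdata/todo/53_wigCorrelate.txt`; check the result by eye before submitting.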
- Check the output files and create signal tracks if appropriate.
  - Copy $sdata/todo/53_wigCorrelate.txt to $sdata/todo/54_signalTrack.txt
  - Change the suffix back to the original bam file name rather than _FE.bw
    ~$ sbatch $sdata/code/sbatch/54_sbatchSignalTracks.sh
    ~$ mv $sdata/code/sbatch/signalTrack_* $sdata/logs/54_signalTrack
  - If you make signalTracks, you can also make the bedGraph and bigWig files. Use 55_sbatchSignalTrackBDGCMP.sh the same way as 51_sbatchCallPeaksBDGCMP.sh and use 56_sbatchSignalTrackBdg2bw.sh the same way as 52_sbatchBdg2bw.sh.
    - Note that you will have to make a new todo file. It should contain the "basenames" of each signalTrack.
- Run idr on samples as well.
  - Copy $sdata/todo/54_signalTrack.txt to $sdata/todo/60_idr.txt
  - Change suffixes to _peaks.narrowPeak rather than .bam
    ~$ sbatch $sdata/code/sbatch/60_sbatchIDR.sh
    ~$ mv $sdata/code/sbatch/idr_* $sdata/logs/60_idr
  - Check the .err files to see IDR results.
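The two suffix changes described in the signal track and IDR steps are simple text substitutions. The sed sketches below are one way to do them; the function names are hypothetical, and they assume the suffixes appear exactly as `_FE.bw` and `.bam` in the todo files.

```bash
# 53_wigCorrelate.txt -> 54_signalTrack.txt: restore the .bam suffix.
to_signaltrack_todo() {
    sed 's/_FE\.bw/.bam/g'
}

# 54_signalTrack.txt -> 60_idr.txt: point at the narrowPeak files instead.
to_idr_todo() {
    sed 's/\.bam/_peaks.narrowPeak/g'
}
```

For example, `to_idr_todo < $sdata/todo/54_signalTrack.txt > $sdata/todo/60_idr.txt` keeps the one-line-per-treatment layout and only changes the suffixes.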
A lot of different files have been produced. Now to review what everything is and what its potential purpose is.
The 50_peaks directory contains the original output of the MACS2 peak calling step. There are a few different file formats that contain essentially the same information, along with some QC information. Data in this directory can be used to visualize peaks in a genome browser, but if you ran the signal track steps above, those results are better to use for that purpose.
- [sample]_control_lambda.bdg
  - bedGraph of control peak windows for determining lambda
    - Chromosome name
    - Start of window
    - End of window
    - Maximum local lambda. Estimated using
      - extsize
      - slocal
      - llocal
  - lambda is the expected number of reads in a window, so the "control lambda" is basically the expected noise
  - View this file along with the [sample]_treat_pileup.bdg to compare the treated peaks against the control noise.
- [sample]_model.r
  - Run this script to produce 'model shift size' and 'cross correlation' plots based on the MACS2 run
  - Generates files:
    - [sample]_peakModel.pdf (model shift size)
    - [sample]_crossCor.pdf (cross correlation)
- [sample]_peaks.narrowPeak
  - BED6+4 with peak locations and summit
    - Chromosome name
    - Start position of peak (0-based)
    - End position of peak
    - Peak name
    - Integer score int(-10*log10(qvalue))
    - Strand (MACS2 writes '.', since peaks are unstranded)
    - Fold enrichment for peak summit
    - -log10(pvalue) for peak summit
    - -log10(qvalue) for peak summit
    - Relative summit position to peak start
  - Can be loaded directly into the UCSC genome browser
- [sample]_peaks.xls
  - Contains information about called peaks. One line per peak, plus header lines
    - Chromosome name
    - Start position of peak (1-based)
    - End position of peak
    - Length of peak region
    - Absolute peak summit position
    - Pileup height at peak summit
    - -log10(pvalue) for the peak summit
    - Fold enrichment for the peak summit
      - Enrichment is compared against a random Poisson distribution with local lambda
    - -log10(qvalue) of peak summit
    - Name of peak
  - NOTE THAT XLS COORDINATES ARE 1-BASED, WHICH IS DIFFERENT FROM BED'S 0-BASED
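Because the xls start is 1-based and BED is 0-based, converting between them means subtracting one from the start only. The awk sketch below illustrates this using the column order listed above; `xls_to_bed` is a hypothetical helper and assumes the header consists of "#" comment lines, a blank line, and a column-name line beginning with "chr".

```bash
# Convert a MACS2 peaks.xls stream to BED5 (chrom, start, end, name, -log10 q).
xls_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ || /^$/ || $1 == "chr" { next }   # skip comments, blanks, column header
         { print $1, $2 - 1, $3, $10, $9 }'     # 1-based start becomes 0-based
}
```

The end coordinate is left alone because BED intervals are half-open, so a 1-based inclusive end and a 0-based exclusive end are the same number.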
- [sample]_summits.bed
  - BED file with peak summit location for each peak
    - Chromosome name
    - Start position of summit (0-based). Will be (narrowPeak start) + (narrowPeak relative summit position)
    - End position of summit. Will be one more than start position.
    - Peak name
    - -log10(qvalue) of peak summit
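The relationship between the narrowPeak and summits files can be checked with a short awk sketch that rebuilds summit rows from narrowPeak columns, following the column descriptions above. `narrowpeak_to_summits` is a hypothetical helper meant as a sanity check, not a replacement for the MACS2 output.

```bash
# Rebuild summit rows from narrowPeak input: summit start is the peak start
# (column 2) plus the relative summit offset (column 10).
narrowpeak_to_summits() {
    awk 'BEGIN { FS = OFS = "\t" }
         { s = $2 + $10; print $1, s, s + 1, $4, $9 }'
}
```

Comparing its output with [sample]_summits.bed is a quick way to confirm that the two files describe the same peaks.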
- [sample]_treat_pileup.bdg
  - bedGraph file of treatment peak windows
    - Chromosome name
    - Start of window
    - End of window
    - Pileup score
      - Scaled up or down relative to the control sample
  - View this in IGV or the UCSC browser and compare with the control sample
This directory contains the exact same information as can be found in the [sample]_peaks.narrowPeak files in 50_peaks. The only difference is that the 1st column now has "chr" prepended to the chromosome number, which is required for some downstream tools.
The bdgcmp subcommand is designed to generate noise-subtracted tracks. The MACS2 developer explains a little about it here. These files are also good to view in a genome browser. The 52_bw directory contains the same information as that in 51_bdgcmp, except in the smaller bigWig binary format instead. It's recommended to use these files for viewing, since they are the easiest to transfer from exacloud.
- [sample]_FE.bdg
  - Linear fold enrichment
  - Simple descriptive measurement of the difference between ChIP and control
  - Can introduce high variability at low signals
- [sample]_logLR.bdg
  - log10 likelihood ratio between ChIP and control
  - Based on a dynamic Poisson model
  - Statistical evaluation of enrichment
There is nothing in the wigCorrelate output directory to use in downstream applications. Each output file lists the input files used for the correlation as well as the correlation score. You will have already looked at these scores to determine whether or not to proceed with the signal track construction.
This directory contains the exact same file types as described in 50_peaks, 51_bdgcmp, and 52_bw. There is now one file for each treatment/group and all replicates have been combined into that file. These peaks are much higher confidence than those in the individual files. Use these results for visualization.
IDR (Irreproducible Discovery Rate) is used to measure the reproducibility of results from replicate experiments, described here in detail. If the --plot option is selected, a few QC plots will be created in addition to the consensus peak files.
- [sample]_idr
  - modified BED file (20 total columns)
    - Chromosome name
    - Start position of peak (0-based)
    - End position of peak
    - Name given to a region. '.' is used if nothing assigned. (my results have '.')
    - Scaled IDR value: min(int(-125*log2(IDR)), 1000)
      - IDR of 0 corresponds to score of 1000
      - IDR of 0.05 corresponds to 540
      - IDR of 1 corresponds to 0
    - Strand (+, -, .)
    - Signal value. Measurement of enrichment for the region for merged peaks
    - Merged peak p-value
    - Merged peak q-value
    - Merged peak summit
    - Local IDR value: -log10(localIDR)
    - Global IDR value: -log10(globalIDR)
    - rep1 start position of peak. Shifted based on offset
    - rep1 end position of peak
    - rep1 signal measure
      - If the --rank option is set to signal.value, then this value will be the same as col7 of rep1's narrowPeak file
      - If the --rank option is set to p.value, then it will be the same as col8 of rep1's narrowPeak file
    - rep1 summit value
    - rep2 start, end, signal, summit
    - repN start, end, signal, summit
- [sample]_idr.png
  - 4 different plots, see link above for full description.
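The scaled IDR score in column 5 can be reproduced from an IDR value to check the examples listed above (0 gives 1000, 0.05 gives 540, 1 gives 0). The sketch below assumes the scaling is min(int(-125*log2(IDR)), 1000), which is the form consistent with those three examples; `scaled_idr` is a hypothetical helper.

```bash
# Compute the scaled IDR score min(int(-125 * log2(IDR)), 1000) for one value.
scaled_idr() {
    awk -v idr="$1" 'BEGIN {
        if (idr <= 0) { print 1000; exit }      # log2(0) is undefined; score is capped
        if (idr >= 1) { print 0; exit }         # log2(1) is 0
        score = int(-125 * log(idr) / log(2))   # awk log() is the natural log
        if (score > 1000) score = 1000
        print score
    }'
}
```

Going the other way, a score of 540 or higher in column 5 corresponds to an IDR of roughly 0.05 or lower, which is a common way to filter the consensus peaks.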