2018 Danforth analysis
This repository contains scripts to download data, run analyses, and recreate many of the figures from our 2018 Danforth analysis.
General organization
This has been tested on a Linux server. Pipelines are meant to be used with drmr (can be downloaded from the Parker Lab GitHub). The general organization of the repo is as follows:
src
,bin
, andsample_information
directories should be self-explanatory.data
directory contains our raw data and other 'generic' data such as fasta files, bwa indices, etc. (some elements are distributed in the repository itself, some are created/downloaded in the steps outlined below).control
directory contains the actual analysis scripts (setting up subdirectories in thework
directory, and in many cases producingpipeline
files that end up in thework
directory).sw
contains some third-party software (RNA-Enrich).- The
work
directory will be created as themake
commands outlined below are run. Some pipeline files will be printed into there, and the actual analysis results will be put in there (e.g. bam files, DESeq2 results, etc). - The
figures
directory will be created as themake
commands outlined below are run. If figures are recreated using themake
commands outlined later, they will appear in here (along with some necessary intermediate files).
There is a Makefile
in the top level of the repository as well. All the make
commands mentioned below refer to this Makefile
and should therefore be run from the top directory.
Dependencies not included in repo
As stated, many of the pipelines for this analysis utilize drmr
to submit jobs to a resource manager (we use SLURM). Therefore, drmr
will need to be downloaded from the Parker Lab GitHub. Also, python (we used v. 2.7.13) and R (v. 3.3.3) will need to be present on the system.
Tools run from the command line
These tools must be in your $PATH:
- cta (v. 0.1.2; can be downloaded from the Parker Lab GitHub)
- fastqc (v. 0.11.5)
- bwa (v. 0.7.15-r1140)
- picard (v. 2.8.1)
- ataqv (v. 1.0; can be downloaded from the Parker Lab GitHub)
- STAR (v. 2.5.2b)
- QoRTs (v. 1.0.7)
- bnMapper.py
- macs2 (v. 2.1.1.20160309)
- samtools (v. 1.3.1, using htslib 1.3.2)
- bedtools (v. 2.26.0)
- blat (v. 36x2)
- SRA toolkit (for fastq-dump; v. 2.8.1)
- fastx_trimmer (FASTX Toolkit v. 0.0.14)
- mysql (for querying UCSC; v. 15.1 Distrib 10.1.26-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2)
- phantompeakqualtools (for ChIP-seq QC; v. 2.0)
R packages
- DESeq2 (v. 1.14.1)
- dplyr (v. 0.7.4)
- tidyr (v. 0.7.0)
- ggplot2 (v. 2.2.1)
- ggrepel (v. 0.6.5)
- optparse (v. 1.4.4)
- vsn (v. 3.42.3)
- biomaRt (v. 2.30.0)
- cowplot (v. 0.8.0)
- QoRTs (v. 1.1.8)
Additional python modules
- pybigwig (v. 0.3.9)
- pysam (v. 0.11.2.1)
- numpy (v. 1.13.1)
Setup
- Clone the repository
- Set your environmental variable $DANFORTH_HOME to the repository path (of course, you may want to add each of these
export
commands to your.bashrc
so that you don't need to re-set them each time you log out and back in to the server):
export DANFORTH_HOME='/path/to/repo'
- Add the src directory to your $PYTHONPATH and the bin directory to your $PATH:
export PYTHONPATH="$PYTHONPATH:${DANFORTH_HOME}/src"
export PATH="$PATH:${DANFORTH_HOME}/bin"
- If interested in the ChIP-seq analyses, set a variable representing the path to phantompeakqualtool's
run_spp.R
script:
export RUN_SPP_PATH="/path/to/run_spp.r"
- If interested in the RNA-seq analyses, set variables representing the path to QoRTs files:
export QORTS_JAR="/path/to/QoRTs.jar"
export QORTS_GEN_MULTI_QC="/path/to/qortsGenMultiQC.R"
and set a variable to indicate which genome the RNA-seq reads should be mapped to (mm9 for all analyses discussed in the manuscript; this variable is present only because it is needed in the case that one wants to recreate Fig. S1):
export DANFORTH_RNASEQ_GENOME="mm9"
- Install a small included R package for linking peaks to their nearest TSS:
make setup
- Prepare the
data
directory, containing all generic data (e.g., fasta files, bwa indices, etc). When the GEO repository becomes public, this will also download our ATAC-seq/RNA-seq fasta files from GEO (for now, this will obviously not happen):
make data
This will run most commands necessary to set up the data/ directory immediately in a consecutive fashion; however, because bwa indices take several hours to put together, it will submit jobs to drmr for the index creation (job names BWAINDEX). You should wait for those jobs to finish before proceeding, with the exception of processing the RNA-seq data as that does not use BWA.
Running analyses
Now the pipelines can be run. Order is somewhat important, as some pipelines depend on others. Such dependences are noted below.
ATAC-seq processing and differential peak calling
To carry out the primary processing of the ATAC-seq data (adapter trimming, mapping, duplicate removal, filtering, peak calling, creating ataqv sessions), one can just run (from the top level of the repository):
make atacseq
The results of this analysis will end up in the ${DANFORTH_HOME}/work/atacseq
directory. Once this pipeline has finished running, one can merge the peak calls to get the set of 'master peaks', and determine the number of reads that fall within each of these master peaks for each ATAC-seq experiment (this information will be used in the differential peak calling):
make master_peaks # requires atacseq
The output from that will be in the ${DANFORTH_HOME}/work/master_peaks
directory. After this has completed, one can perform the differential peak calling:
make differential_peaks # requires master_peaks
The output will be in the ${DANFORTH_HOME}/work/differential_peaks
directory.
To generate an ataqv session for the ATAC-seq data, run:
make ataqv_session
The session will be created in the ${DANFORTH_HOME}/work/ataqv_session
directory.
To create the signal tracks (bigwig files) that can be used to visually compare samples, one must normalize the signal for sequencing depth. This can be done by running:
make atacseq_normalization # requires atacseq
The output will be in the ${DANFORTH_HOME}/work/gb/atacseq
directory.
RNA-seq processing, differential gene expression analysis, and KEGG pathway enrichment analysis:
To carry out the primary processing of the RNA-seq data, one must run:
make rnaseq # primary RNA-seq processing pipeline
The output will be in the ${DANFORTH_HOME}/work/rnaseq
directory. Once this has completed, the differential peak calling can be run:
make differential_gene_expression # requires rnaseq
The output will be in the ${DANFORTH_HOME}/work/differential_gene_expression
directory. After this has finished, one can run the KEGG pathway enrichment analysis:
make go # requires differential gene expression
The output will be in the ${DANFORTH_HOME}/work/go
directory.
To create the signal tracks (bigwig files) that can be used to visually compare samples, one must normalize the signal for sequencing depth (also, separate the signal by strand). This can be done by running:
make rnaseq_normalization # requires rnaseq
The output will be in the ${DANFORTH_HOME}/work/gb/rnaseq
directory.
Download and processing of the Weedon et al ChIP-seq data
Note that this will download and process more than just the H3K4me1 ChIP-seq data, but we only utilize the H3K4me1 data.
To do the downloading and processing:
make weedon_chipseq
The output will be in the ${DANFORTH_HOME}/work/weedon_chipseq
directory. In order to allow for comparisons with other experiments, the signal needs to be normalized for sequencing depth. This can be done by:
make weedon_normalization # normalizes the bigwigs created by weedon_chipseq
The output will be in the ${DANFORTH_HOME}/work/gb/weedon_chipseq
directory.
Download and processing of the Roadmap Epigenomics data
To download and process the Roadmap Epigenomics ChIP-seq data, one must run:
make roadmap_chipseq
The output will be in the ${DANFORTH_HOME}/work/roadmap_chipseq
directory. To normalize the signal from these experiments in order to allow for cross-tissue comparisons, once the above pipeline is complete you must run:
make roadmap_normalization
The output will be in the ${DANFORTH_HOME}/work/gb/roadmap_chipseq
directory.
Determining locations of the Ptf1a binding sites from Masui et al:
make masui
Re-creating figures
RNA-seq volcano plot and ATAC-seq volcano plot
To create this figure, one must first have recreated the ATAC-seq data processing, master peak creation, and differential peak analysis outlined above, as well as the RNA-seq data processing and differential gene expression calling. Once these pipelines have been successfully run, one can create these volcano plots using the command:
make atacseq_and_rnaseq_volcano_plots
H3K4me1 barplot
To create this figure, the Weedon et al data must have been processed and normalized, and the Roadmap Epigenomics data must have been processed and normalized. Also, because the signal refers to the signal over the region orthologous to the differential Gm13344 peak, the ATAC-seq data must have been processed and differential peak calling performed. Once this has been done, this figure can be created by running:
make h3k4me1_barplot
RNA-seq heatmap
To create this figure, one must have processed the RNA-seq data and run the differential gene expression analysis. Once these pipelines have successfully completed, you can re-create this figures with the command:
make rnaseq_heatmap
KEGG enrichment volcano plot
To create this figure, the KEGG enrichment analysis needs to have been run. Once this is done, the KEGG plot can be created using the command:
make kegg_volcano_plot
Plot of RNA-seq signal at/near insertion (Fig S1)
To create this figure, the RNA-seq processing (make rnaseq
) should be run with the $DANFORTH_RNASEQ_GENOME
environmental variable set to "danforth"
rather than "mm9"
. Once this is done, run make rnaseq_normalization
(keeping $DANFORTH_RNASEQ_GENOME
set to "danforth"
), and then run:
make transcription_off_insertion
Re-creating table of FPKM values (in supplement):
To generate this table, run:
make fpkm_table