This repository is a modification of the snakePipes workflows forked from https://github.com/maxplanck-ie/snakepipes
- DNA-mapping*
- ChIP-seq*
- RNA-seq*
- ATAC-seq*
- scRNA-seq
- Hi-C
- Whole Genome Bisulfite Seq/WGBS
(*Also available in "allele-specific" mode)
We have made modifications to the DNA-mapping and ChIP-seq workflows in order to make them compatible with current practices in the Zwart lab. These include mapping single-end reads with BWA, marking duplicates with Picard, filtering out reads with a MAPQ score below 20, and estimating fragment size with phantompeakqualtools for MACS2 peak calling of single-end reads. The other workflows remain intact from the original pipeline, and their functionality should not be affected.
Begin by logging into the harris server and entering the terminal environment.
First initialize conda in your environment with:
/opt/miniconda3/bin/conda init
Then run:
source ~/.bashrc
To ensure there are no issues with initializing conda, please log out of RStudio, open a new browser window, log into RStudio, and start a new instance of terminal. At the terminal prompt you should see:
(base) your.name@harris:~$
If you do not, run:
source ~/.bashrc
again. If you still do not see (base), seek help.
Ensure you have the proper conda path (/opt/miniconda3/bin/conda) by running:
which conda
Ensure conda is properly initialized by running:
conda --version
Configure the directory for pkgs to be installed with:
conda config --add pkgs_dirs ~/.conda/pkgs/
Change into your home directory (or wherever you wish to clone this repository) with:
cd ~
If this is your first install you can skip this step, but if you previously downloaded this repository, remove the previous version with:
rm -rf snakepipes
Clone this repository into your desired location with:
git clone https://github.com/csijcs/snakepipes.git
Change directory into the snakepipes folder with:
cd snakepipes
Install a new snakepipes environment with:
conda env create --file snakepipes.yaml
Activate the snakepipes environment with:
conda activate snakepipes
At the terminal prompt you should see:
(snakepipes) your.name@harris:~$
Run the build script with:
sh build.sh
Create or update the various environments required for the pipelines by running:
snakePipes createEnvs --condaDir ~/.conda/envs/snakepipes
You do not need to create indices for hg19 or hg38. We provide premade indices stored in a shared location for hg19 and hg38, using the exact fasta and annotation files from the core facility. If you do need additional indices for another organism or genome build, seek assistance from a computational expert.
**Note - all of your sequencing filenames should contain a wz number (e.g. wz3909). Make sure to submit your samples with a wz number in the name or this script will not work. If there are two files with the same wz number (e.g. the same sample split across two lanes), the second file will be renamed wzNUMBER_2 (e.g. wz3909_2). If there are more than two (not likely, but possible), the script will give an error and not rename your additional files. If you do actually have more than two (i.e. the same sample split across more than two lanes), seek professional help.
Before starting a pipeline, it's best to rename your files. The files from the core come with a very long filename (e.g. 5905_25_wz3909_TGACTTCG_S35.bam), and we will shorten this to just the wz number (e.g. wz3909.bam).
To accomplish this, we have provided an R script above (rename_files.R). This script can be run either from within R or from the terminal. To run it from within R, set your working directory to the folder containing your files (bam or fastq):
setwd("/DATA/first_initial.last_name/your_files/")
If you copy the script to the same folder as your files, you can run:
source('rename_files.R')
Otherwise this can be run from the RStudio script window.
If you prefer, you can also run from terminal by copying the script into the folder containing your files and running:
Rscript rename_files.R
Either way, this will rename all your files and move them into a folder called "rename". All of the files should have been moved into this folder, so if there are any remaining then something went wrong and you should seek help.
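For reference, the core of the renaming logic can be sketched in plain shell. This is a simplified, hypothetical illustration of what rename_files.R does, not the supported script (it copies rather than moves, handles only .bam files, and covers at most two files per wz number):

```shell
# Demo in a throwaway directory; in practice, run rename_files.R inside
# the folder that holds your raw files from the core.
cd "$(mktemp -d)"
touch 5905_25_wz3909_TGACTTCG_S35.bam 5906_01_wz3909_ACGTACGT_S36.bam

mkdir -p rename
for f in *.bam; do
  [ -e "$f" ] || continue
  # pull the wz number (e.g. "wz3909") out of the long core-facility filename
  wz=$(printf '%s\n' "$f" | grep -o 'wz[0-9][0-9]*' | head -n 1)
  [ -n "$wz" ] || continue              # skip files without a wz number
  if [ -e "rename/${wz}.bam" ]; then
    cp "$f" "rename/${wz}_2.bam"        # same sample split across two lanes
  else
    cp "$f" "rename/${wz}.bam"
  fi
done

ls rename    # wz3909.bam  wz3909_2.bam
```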
Once your files are renamed, you are now ready to proceed with the appropriate pipeline below.
**Note - New projects in the lab should be getting mapped to hg38, while ongoing projects that were previously mapped to hg19 should stay with hg19. Ensure you are not mixing hg38 and hg19 in your project or the results will be incomparable.
If you have .bam files aligned by the core, you can run the ChIP-seq pipeline on them after first renaming them. All of your .bam files should be renamed into a folder called "rename". You will need to supply the path to the "from_bam.yaml" in the snakepipes folder downloaded with this repository. You will also need to supply a "sample_config.yaml" file, telling the program your sample names, the control for each sample, and whether to look for broad peaks (e.g. histone marks) or narrow peaks (e.g. transcription factors). See the example sample_config.yaml file in the snakepipes folder downloaded with this repository.
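For orientation, the general shape of a sample_config.yaml is sketched below. The sample names here are made up, and the exact keys and layout should be taken from the example file shipped in the snakepipes folder:

```yaml
chip_dict:
  AR_ChIP_wz3909:          # sample name (matches the renamed bam/fastq)
    control: input_wz3910  # the input/control sample for this ChIP
    broad: False           # narrow peaks (e.g. a transcription factor)
  H3K27ac_wz3911:
    control: input_wz3910
    broad: True            # broad peaks (e.g. a histone mark)
```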
For single-end reads aligned to hg19 the command to run is:
ChIP-seq -d /PATH/TO/OUTPUT/DIR --fromBam /PATH/TO/bam/rename --configfile /PATH/TO/snakepipes/from_bam.yaml --local -j 10 --single-end hg19 sample_config.yaml
Here -d specifies the path to the output directory of your choice, --fromBam is the path to your rename folder containing the renamed bams, and hg19 specifies the genome build.
For paired-end reads aligned to hg38 the command to run is:
ChIP-seq -d /PATH/TO/OUTPUT/DIR --fromBam /PATH/TO/bam/rename --configfile /PATH/TO/snakepipes/from_bam.yaml --local -j 10 hg38 sample_config.yaml
There will be various folder outputs, including some QC, and the peak files will be in the MACS2 folder. For narrow peaks, the MACS2 output will end in ".narrowPeaks", and we have added chr to the chromosome numbers in the file ending in ".chr.narrowPeaks" for your convenience. For hg38, the ".chr.narrowPeaks" (or ".chr.broadPeaks") files have had the blacklist regions removed, while the ".narrowPeaks" (or ".broadPeaks") files have not. For hg19 these regions are not removed. For paired-end reads, the pipeline will run MACS2 in both single-end and paired-end mode for your comparison.
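For reference, the "chr" prefixing amounts to something like the following (a hypothetical sketch on a one-line dummy file, not the pipeline's actual code):

```shell
# Demo: prepend "chr" to the chromosome column of a narrowPeak-style file.
printf '1\t100\t500\tpeak_1\t50\n' > demo.narrowPeak

# OFS="\t" keeps the output tab-separated after rewriting column 1
awk 'BEGIN{OFS="\t"} {$1 = "chr" $1; print}' demo.narrowPeak > demo.chr.narrowPeak

cat demo.chr.narrowPeak    # chr1  100  500  peak_1  50
```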
Running pipelines will take some time, so you will want to run in screen to avoid interruptions. To do this, just add screen -dm before your command, like this:
screen -dm ChIP-seq -d /PATH/TO/OUTPUT/DIR --fromBam /PATH/TO/bam/rename --configfile /PATH/TO/snakepipes/from_bam.yaml --local -j 10 --single-end hg19 sample_config.yaml
It will look like nothing is happening, but it is running in detached mode and will not be interrupted if your session disconnects. Furthermore, it will disconnect automatically when it is finished. You can see what screens you have running with:
screen -ls
If you run screen -ls immediately after executing your screen -dm ChIP-seq... command and you do not see an output for your running screen, then something went wrong (or your environment isn't activated). You can check the log files or seek help.
If you have .fastq files you would like to perform ChIP-seq analysis on, you will first need to run the DNA-mapping pipeline. For DNA mapping, we generally recommend using BWA. To do this, supply the path to the bwa_mapping.yaml downloaded with this repository. After the renaming step above, all of your fastq files should be in a folder called rename. Be sure you know the appropriate genome build for your project (i.e. hg19 or hg38). For example, to run DNA mapping with BWA against hg38, run the following command:
DNA-mapping -i /PATH/TO/FASTQ/rename -o /PATH/TO/OUTPUT/DIRECTORY --configfile /PATH/TO/snakepipes/bwa_mapping.yaml --local -j 10 --mapq 20 --trim --trim_prg cutadapt --fastqc hg38
Here, -i specifies the input folder containing the fastq files, -o is the output directory of your choosing, and hg38 specifies the genome build (adjust to hg19 if necessary for your specific project). The rest of the parameters should not be altered for standard ChIP-seq experiments.
**Note - Previous hg19 projects, as well as many existing hg19 projects in the Zwart lab, have been mapped using the bwa-backtrack algorithm. For legacy reasons, if you need your peak-calling results to match previous results EXACTLY, we recommend using the bam files supplied by the core and taking them through the ChIP-seq from-bam pipeline. The BWA option in this DNA-mapping pipeline uses the bwa-mem algorithm, which will produce very similar but not exactly identical results. For hg38 the core is using the bwa-mem algorithm, so this pipeline should produce the same results as the core facility.
The ChIP-seq pipeline is designed to take the output directly from the DNA-mapping pipeline. The only additional file you will need is a "sample_config.yaml" file, telling the program your sample names, the control for each sample, and whether to look for broad peaks (e.g. histone marks) or narrow peaks (e.g. transcription factors). See the example sample_config.yaml file above.
If you have run the DNA-mapping pipeline first, then (for single-end reads) run:
ChIP-seq -d /PATH/TO/DNA-mapping/OUTPUT --local -j 10 --single-end hg38 sample_config.yaml
Here -d is the directory with the output of the DNA-mapping pipeline, and it will also direct the output of the ChIP-seq pipeline there.
If you have paired-end reads, then run:
ChIP-seq -d /PATH/TO/DNA-mapping/OUTPUT --local -j 10 hg38 sample_config.yaml
**Note - New projects should be mapped to the hg38 genome build, while ongoing projects that were previously mapped to hg19 should stay with hg19. Ensure you are not mixing hg38 and hg19 in your project or the results will be inconsistent.
**Note - The new Novaseq runs are paired-end reads. If you have paired-end reads simply remove the --single-end option.
The other modules have remained untouched and should work according to the original pipeline https://github.com/maxplanck-ie/snakepipes
When your run is complete, check the MACS2 folder to ensure you have peak files for all your samples, as well as the QC_report folder for the "QC_report_all.tsv" and "all_samples_FRiP.tsv" files.
You should also check the run log in the output folder to ensure the run finished successfully with:
tail output/*.log
You can also check for errors within the log file with:
grep 'error' output/*.log
If this returns nothing, then you have no errors. If it returns an error, see the troubleshooting section below.
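Note that grep is case-sensitive by default, so a message spelled "Error" would slip past the command above. The -i flag makes the search case-insensitive, demonstrated here on a dummy log file:

```shell
# Demo: case-insensitive error search on a sample log.
printf 'Finished job 42.\nError in rule macs2_qc\n' > demo_run.log

# -i matches "error", "Error", and "ERROR" alike
grep -i 'error' demo_run.log    # prints: Error in rule macs2_qc
```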
When you are finished you should deactivate your conda session to leave the environment with:
conda deactivate
This is a good practice so that you don't unintentionally alter the environment.
Never install anything else within your snakepipes environment.
Every time you want to run more analysis you can simply activate your environment again with:
conda activate snakepipes
All the previously created environments and indices will still be there and you can proceed directly to the pipelines.
If your run did not successfully finish, the easiest first step is simply to run the pipeline again using the exact same command. Often this will pick up where it left off and finish analyzing the remaining samples.
If you rerun the pipeline and receive an error that contains:

IncompleteFilesException: The files below seem to be incomplete. If you are sure that certain files are not incomplete, mark them as complete with
snakemake --cleanup-metadata <filenames>
To re-generate the files rerun your command with the --rerun-incomplete flag. Incomplete files:

then remove the incomplete files and rerun the pipeline. You may also need to remove any temporary folders "tmp.snakemake.*" before re-running.
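The cleanup before rerunning can look like this, demonstrated in a throwaway directory with made-up filenames; in practice, run the rm commands inside your real output directory, substituting the files snakemake actually reported as incomplete:

```shell
# Demo: simulate an output dir with a snakemake temp folder and an
# incomplete file, then remove both before rerunning the pipeline.
outdir=$(mktemp -d)
mkdir -p "$outdir/tmp.snakemake.abc123"
touch "$outdir/sample1.filtered.bam"    # stand-in for a reported incomplete file

rm -rf "$outdir"/tmp.snakemake.*        # clear snakemake temp folders
rm -f  "$outdir/sample1.filtered.bam"   # clear the incomplete files

ls -A "$outdir"    # prints nothing: the directory is now empty
```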
If you continue to receive errors, check the log files associated with the step where the error occurred. The more information we have about why the error is occurring, the easier it is to fix.
If the run finished but files are missing, you may need to delete the folder where the files should be and run the pipeline again. Or you can create a new output folder and run only the samples with missing files. Worst case, you can always delete the entire output folder and start the run from scratch.
If things start to go wrong in your environment, of course feel free to reach out to the computational team for help. Sometimes things can be solved by logging out and logging back into the server. If things REALLY go wrong in your environment, you can remove it and reinstall. This is a last-resort scenario.
To do this, first deactivate your env with:
conda deactivate
Then remove the env with:
conda remove -n snakepipes --all
You can verify that the environment has been removed by checking for it in the environment list:
conda env list
Then start from the beginning of the instructions for a fresh install.
For detailed documentation on setup and usage, please visit our read the docs page.
If you adopt/run snakePipes for your analysis, cite it as follows:
Bhardwaj V, Heyne S, Sikora K, Rabbani L, Rauer M, Kilpert F, et al. snakePipes enable flexible, scalable and integrative epigenomic analysis. bioRxiv. 2018. p. 407312. doi:10.1101/407312
snakePipes is under active development. We appreciate your help in improving it further. Please open issues on the GitHub repository for feature requests or bug reports.