The following Snakefile and supporting scripts process FASTQ files from CUT&Tag experiments. Briefly, they perform the following:
- Initial quality check with FastQC and MultiQC.
- Alignment with Bowtie2, using parameters based on the recommendations of Henikoff et al.
- File conversion, sorting, and indexing with samtools.
- Removal of low-quality reads (MAPQ < 30).
- Removal of reads mapped to blacklist regions.
- Peak calling with MACS2.
- Generation of a 500 kb bin count matrix.
- Read normalization with TMM (the normalization method from edgeR).
- Generation of coverage files (bigWig).
The quality-filtering and normalization steps correct for differences in read depth and signal-to-noise ratio.
!! IMPORTANT !!
Duplicate removal with Picard is not performed in this workflow, as CUT&Tag duplicates are likely to be real reads.
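The MAPQ and blacklist filtering steps above can be sketched with standard samtools/bedtools calls. The file names below are hypothetical placeholders for illustration, not the pipeline's actual rule outputs:

```shell
# Post-alignment filtering sketch (hypothetical file names).
BAM=sample.sorted.bam                           # coordinate-sorted Bowtie2 output
BLACKLIST=data/blacklist/hg38-blacklist.v2.bed  # ENCODE hg38 blacklist (in the repo)
MAPQ_MIN=30                                     # MAPQ threshold used by the pipeline

if command -v samtools >/dev/null 2>&1 && [ -f "$BAM" ]; then
  # Keep only reads with MAPQ >= 30
  samtools view -b -q "$MAPQ_MIN" "$BAM" > sample.mapq30.bam
  # Drop reads overlapping blacklist regions
  bedtools intersect -v -abam sample.mapq30.bam -b "$BLACKLIST" > sample.filtered.bam
  samtools index sample.filtered.bam
else
  echo "samtools or the input BAM is unavailable; commands shown for illustration"
fi
```

The actual commands live inside the snakefile rules; this is only the conceptual order of operations.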
Clone the repository to your local folder with:
git clone https://github.com/ZainulArifin1/CUTandTag-Primary-Analysis.git
or with an SSH key:
git clone git@github.com:ZainulArifin1/CUTandTag-Primary-Analysis.git
To ensure a consistent, isolated environment for this project, install the required packages and libraries with either the Conda or Mamba package manager. Both work, but I recommend Mamba for its superior speed and reliability.
Follow these steps to create and activate your environment using Mamba:
1. Open your Linux terminal (or WSL for Windows users).
2. Navigate to the directory containing the project's environment.yml file.
3. Run the following commands to create and activate the environment:
mamba env create -f environment.yml
mamba activate cutandtag
If you prefer to use Conda, you can achieve the same environment setup with these steps:
1. Open your Linux terminal (or WSL for Windows users).
2. Navigate to the directory containing the project's environment.yml file.
3. Run the following commands to create and activate the environment:
conda env create -f environment.yml
conda activate cutandtag
The cutandtag environment pins the following key packages:
- bowtie2=2.5.1
- fastqc=0.12.1
- multiqc=1.15
- deeptools=3.5.1
- bedtools=2.31.0
- subread=2.0.6
- macs2=2.2.9.1
- samtools=1.17
- bioconductor-edger=3.40.0
- r-tidyverse=1.3.2
- snakemake=6.0.5
- snakemake-minimal=6.0.5
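Once the environment is activated, a quick way to confirm the main tools are on your PATH is a loop like this (a convenience check of my own, not part of the repository):

```shell
# Report whether each main tool from environment.yml is on PATH.
for tool in fastqc multiqc bowtie2 samtools bedtools macs2 snakemake; do
  if command -v "$tool" >/dev/null 2>&1; then
    printf '%s: found\n' "$tool"
  else
    printf '%s: MISSING\n' "$tool"
  fi
done > tool_check.txt
cat tool_check.txt
```

If anything is reported MISSING, re-check that the environment was created and activated successfully.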
Grab a tea or whatever you want while waiting, because this is going to take a while.
Please note that the GitHub repository does not include the indexed hg38 reference files required for the analysis. You will need to download and prepare them separately. Follow the steps below to obtain and set up the reference files:
wget ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/GRCh38_no_alt.zip
Do not forget to unzip the archive, then put the indexed reference files inside the folder "GRCh38_noalt_as".
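After unzipping, it is worth confirming that all six .bt2 files ended up in the expected folder. This small helper is my own addition, not one of the repository's scripts:

```shell
# Verify that all six Bowtie2 index files exist for a given index prefix.
check_bt2_index() {
  prefix=$1
  status=0
  for suf in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
    if [ ! -f "$prefix.$suf" ]; then
      echo "missing: $prefix.$suf"
      status=1
    fi
  done
  return $status
}

# Expected location given the folder structure used by this workflow
check_bt2_index "data/reference/GRCh38_noalt_as/GRCh38_noalt_as" \
  || echo "index incomplete -- re-check where you unzipped the files"
```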
If you have your FASTQ files ready, ensure that your folder follows the structure shown in the schematic below (VERY IMPORTANT):
├── data
│ ├── blacklist
│ │ └── hg38-blacklist.v2.bed
│ ├── raw_fastq
│ │ ├── SRAEXAMPLE_R1.fastq
│ │ └── SRAEXAMPLE_R2.fastq
│ └── reference
│ ├── GRCh38_noalt_as
│ │ ├── GRCh38_noalt_as.1.bt2
│ │ ├── GRCh38_noalt_as.2.bt2
│ │ ├── GRCh38_noalt_as.3.bt2
│ │ ├── GRCh38_noalt_as.4.bt2
│ │ ├── GRCh38_noalt_as.rev.1.bt2
│ │ └── GRCh38_noalt_as.rev.2.bt2
│ ├── hg38_chrom_sizes
│ └── hg38_chrom_sizes_binned_500k.bed
├── environment.yml
├── job.s
├── LICENSE
├── README.md
├── scripts
│ ├── count_effective_genome_size.sh
│ ├── count_reads_in_bam.sh
│ ├── mergeCount.R
│ ├── remove_dup_column.R
│ ├── search_files_run64.sh
│ └── TMM.R
└── snakefile_k27me3_k4
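For reference, data/reference/hg38_chrom_sizes_binned_500k.bed can be regenerated from the chrom-sizes file. The awk one-liner below mimics `bedtools makewindows -g <chrom_sizes> -w 500000`, shown here on a toy input so the output shape is visible:

```shell
# Toy chrom-sizes input (name <TAB> length); the real input is
# data/reference/hg38_chrom_sizes.
printf 'chrTest\t1200000\n' > chrom_sizes.example

# Tile each chromosome into 500 kb windows, clipping the last bin.
awk 'BEGIN { OFS = "\t"; w = 500000 }
     { for (s = 0; s < $2; s += w) { e = s + w; if (e > $2) e = $2; print $1, s, e } }' \
    chrom_sizes.example > bins.example.bed
cat bins.example.bed
# chrTest 0        500000
# chrTest 500000   1000000
# chrTest 1000000  1200000
```

Running the same command on the full hg38 chrom-sizes file reproduces the binned BED shipped in the repository layout above.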
Before running the snakefile, make sure of the following:
1. In the example snakefile, "data/h3k27me3_h3k4/raw_fastq/" is the directory of the raw FASTQ files. Adjust EVERY instance accordingly; find-and-replace will save time.
2. Adjust the naming of the FASTQ files. Example: "data/h3k27me3_h3k4/raw_fastq/{sra}_1.fastq". Here the forward read is denoted by "_1" and the file extension is .fastq (fastq.gz also works). This instance is in rule bowtie2 (line 72).
3. Perform a dry run (test run) with the following command:
snakemake -np -s <snakefile_name>
That's it! You have done all the hard work, and now you can just run the following command and check back in a few hours (or days, depending on your data and resources).
snakemake --cores <num_of_cores> -s <snakefile_name>
If you are running the program on a cluster, you can modify the sbatch file and run it with:
sbatch job.s
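The repository ships its own job.s, and its exact contents depend on your cluster; a minimal SLURM script along these lines illustrates the idea (resources, file names, and core counts are placeholders to adapt):

```shell
#!/bin/bash
#SBATCH --job-name=cutandtag
#SBATCH --output=outLog
#SBATCH --error=errLog
#SBATCH --cpus-per-task=8
#SBATCH --time=48:00:00

# Activate the environment and run the pipeline (adapt names and cores)
source activate cutandtag
snakemake --cores 8 -s snakefile_k27me3_k4
```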
If there is an error, do not be afraid! Check the file called errLog; it will tell you where the error is. Should you require any help, please raise an issue on GitHub or contact me by email at muhammad.arifin@ucdconnect.ie.
- Commit the workflow to GitHub
- Full pipeline with a minimal reproducible workflow
To customize the Snakefile for the CUT&Tag workflow, you can make adjustments based on the CUT&Tag tutorial available on this website.
Distributed under the MIT License. See LICENSE.txt for more information.
Muhammad Zainul Arifin
PhD Student, University College Dublin
- Twitter: @SaintZainn
- LinkedIn: Muhammad Zainul Arifin
- Email: muhammad.arifin@ucdconnect.ie
I am grateful to the Dey Lab for affording me the opportunity to contribute to the project that has led to the establishment of this GitHub repository.