ZainulArifin1 / CUTandTag-Primary-Analysis

CUT&Tag Primary Analysis Data

MIT License

Table of Contents

  1. About The Project
  2. Getting Started
  3. References Download
  4. Usage
  5. Roadmap
  6. License
  7. Contact
  8. Acknowledgments

About The Project

The following snakefile and supporting scripts process FASTQ files from CUT&Tag experiments. Briefly, the workflow performs the following steps:

  • Initial quality check with FastQC and MultiQC.
  • Alignment with Bowtie2, using parameters recommended by Henikoff et al.
  • File conversion, sorting, and indexing with samtools.
  • Removal of low-quality reads (MAPQ < 30).
  • Removal of reads mapped to blacklisted regions.
  • Peak calling with MACS2.
  • Construction of a 500 kb bin count matrix.
  • Read-count normalization with TMM (the normalization method from edgeR).
  • Generation of coverage files (bigWig).
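As a sketch of the 500 kb binning step, the snippet below derives fixed-width BED bins from a two-column chrom-sizes file. The file names here are illustrative only; the repository ships a pre-built hg38_chrom_sizes_binned_500k.bed, which may have been generated differently.

```shell
# Build 500 kb BED bins from a two-column chrom-sizes file
# (chromosome, length). File names below are illustrative only.
printf 'chr1 1200000\nchr2 700000\n' > chrom_sizes.example

awk -v BIN=500000 'BEGIN { OFS = "\t" }
{
  # Emit [start, end) windows of BIN bp; the last bin is truncated
  # at the chromosome end.
  for (start = 0; start < $2; start += BIN) {
    end = (start + BIN < $2) ? start + BIN : $2
    print $1, start, end
  }
}' chrom_sizes.example > chrom_sizes_binned_500k.example.bed
```

Reads falling in each bin can then be counted (e.g. with bedtools or featureCounts from subread) to form the count matrix.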

The quality-filtering and normalization steps correct for read-depth and signal-to-noise-ratio biases.

!! IMPORTANT !!
Duplicate removal with Picard is not performed in this workflow, as CUT&Tag duplicates are likely to be genuine reads.

Getting Started

Clone the repository to your local folder with:

git clone https://github.com/ZainulArifin1/CUTandTag-Primary-Analysis.git

or with an SSH key:

git clone git@github.com:ZainulArifin1/CUTandTag-Primary-Analysis.git

Prerequisites

To ensure a consistent and isolated environment for your bioinformatics project, install the required packages and libraries using either the Conda or Mamba package manager. Both options work, but I recommend Mamba for its superior speed and reliability.

Using Mamba (Recommended)

Mamba is a faster and more efficient alternative to Conda. Follow these steps to create and activate your environment using Mamba:

  1. Open your Linux terminal (or WSL for Windows users).

  2. Navigate to the directory containing the project's environment.yml file.

  3. Run the following command to create the environment:

mamba env create -f environment.yml
mamba activate cutandtag

Using Conda (Alternative)

If you prefer to use Conda, you can achieve the same environment setup using these steps:

  1. Open your Linux terminal (or WSL for Windows users).

  2. Navigate to the directory containing the project's environment.yml file.

  3. Run the following command to create the environment:

conda env create -f environment.yml
conda activate cutandtag

Important packages used in this workflow:

  • bowtie2=2.5.1
  • fastqc=0.12.1
  • multiqc=1.15
  • deeptools=3.5.1
  • bedtools=2.31.0
  • subread=2.0.6
  • macs2=2.2.9.1
  • samtools=1.17
  • bioconductor-edger=3.40.0
  • r-tidyverse=1.3.2
  • snakemake=6.0.5
  • snakemake-minimal=6.0.5

Grab a tea (or whatever you want) while waiting, because this is going to take a while.

References Download

Please note that the GitHub repository does not include the indexed hg38 reference file required for your bioinformatics analysis. You will need to download and prepare this reference file separately. Follow the steps below to obtain and set up the reference files:

wget ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/GRCh38_no_alt.zip

Do not forget to unzip the archive and place the indexed reference files inside the folder "data/reference/GRCh38_noalt_as" (see the directory layout under Usage).
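The download and unpacking steps can be combined into a small guarded snippet. The zip and folder names follow the commands in this section; adjust them if your download differs.

```shell
# Unpack the Bowtie2 index into the folder the workflow expects.
# Guarded so it is a no-op if the download has not finished yet.
mkdir -p data/reference
if [ -f GRCh38_no_alt.zip ]; then
  unzip -o GRCh38_no_alt.zip -d data/reference/GRCh38_noalt_as
fi
```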

(back to top)

Usage

If you have your FASTQ files ready, ensure that your folder follows the structure shown in the schematic below (VERY IMPORTANT):

├── data
│   ├── blacklist
│   │   └── hg38-blacklist.v2.bed
│   ├── raw_fastq
│   │   ├── SRAEXAMPLE_R1.fastq
│   │   └── SRAEXAMPLE_R2.fastq
│   └── reference
│       ├── GRCh38_noalt_as
│       │   ├── GRCh38_noalt_as.1.bt2
│       │   ├── GRCh38_noalt_as.2.bt2
│       │   ├── GRCh38_noalt_as.3.bt2
│       │   ├── GRCh38_noalt_as.4.bt2
│       │   ├── GRCh38_noalt_as.rev.1.bt2
│       │   └── GRCh38_noalt_as.rev.2.bt2
│       ├── hg38_chrom_sizes
│       └── hg38_chrom_sizes_binned_500k.bed
├── environment.yml
├── job.s
├── LICENSE
├── README.md
├── scripts
│   ├── count_effective_genome_size.sh
│   ├── count_reads_in_bam.sh
│   ├── mergeCount.R
│   ├── remove_dup_column.R
│   ├── search_files_run64.sh
│   └── TMM.R
└── snakefile_k27me3_k4
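A quick way to scaffold the directories above (the reference index, blacklist BED, and FASTQ files still need to be added manually; names follow the schematic):

```shell
# Create the expected folder layout; populate each directory afterwards.
mkdir -p data/blacklist \
         data/raw_fastq \
         data/reference/GRCh38_noalt_as \
         scripts
```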

Before running the snakefile, check the following:

  1. In the example snakefile, "data/h3k27me3_h3k4/raw_fastq/" is the directory of the raw FASTQ files. Please adjust every instance accordingly; a find-and-replace will save time.

  2. Please adjust the naming of the FASTQ files. Example: "data/h3k27me3_h3k4/raw_fastq/{sra}_1.fastq". Here the forward read is denoted by "_1" and the file extension is .fastq (it can also be .fastq.gz). This instance is in rule bowtie2 (line 72 of the snakefile).

  3. Perform a dry run (test run) with the following command:

snakemake -np -s <snakefile_name>
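The path adjustment in step 1 can be done in one pass with sed. The snippet below demonstrates this on a stand-in file; in practice, point sed at your snakefile, and note that the replacement directory name here is hypothetical.

```shell
# Demonstration on a stand-in file; in practice, run sed on your snakefile.
printf 'fastq_dir = "data/h3k27me3_h3k4/raw_fastq/"\n' > snakefile_demo
# Replace every occurrence of the raw FASTQ directory with your own path.
sed -i 's|data/h3k27me3_h3k4/raw_fastq/|data/my_fastq/|g' snakefile_demo
```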

Running the snakefile

That's it! You have done all the hard work; now you can just run the following command and check back in a few hours (or days, depending on your data and resources).

snakemake --cores <num_of_cores> -s <snakefile_name>

If you are running the program on a cluster, you can modify the sbatch file and run it with:

sbatch job.s
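If the repository's job.s does not fit your cluster, a minimal SLURM job script might look like the sketch below. All directives and resource values are illustrative assumptions, not the contents of the actual job.s.

```shell
#!/bin/bash
# Hypothetical SLURM header; adjust resources to your cluster and data.
#SBATCH --job-name=cutandtag
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=24:00:00
#SBATCH --error=errLog   # the error log referred to in this README

snakemake --cores "$SLURM_CPUS_PER_TASK" -s snakefile_k27me3_k4
```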

If there is an error, do not be afraid! Check the file called errLog; it will tell you where the error occurred. Should you require any help, please raise an issue on GitHub or contact me by email at muhammad.arifin@ucdconnect.ie.

(back to top)

Roadmap

  • Commit the workflow to GitHub
  • Full pipeline with minimum reproducible workflow

To customize the Snakefile for the CUT&Tag workflow, you can make adjustments based on the CUT&Tag tutorial available online.

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

Muhammad Zainul Arifin
PhD Student, University College Dublin

(back to top)

Acknowledgments

I am grateful to the Dey Lab for giving me the opportunity to contribute to the project that led to the creation of this GitHub repository.

(back to top)
