r3fang/snATAC

NOTE: This project is no longer under maintenance, please find our latest pipeline SnapATAC here

Introduction

snATAC is a Ren-lab in-house bioinformatics pipeline for single-nucleus ATAC-seq (snATAC-seq). Download the our full data set from here.

Dependency

softwares (samtools, Python 2.7, bedtools, MACS2, bwa)
python packages (pysam, pybedtools)
R packages (tsne, GenomicRanges, densityClust)

Quick Install

> git clone --depth=1 https://github.com/r3fang/snATAC.git
> cd snATAC
> chmod u+x bin/*
> gcc -g -O2 src/calculate_jacard_index_matrix.c -o bin/snATAC_jacard
> export PATH=$PATH:./bin/
> ./bin/snATAC

Program:  snATAC (snATAC-seq analysis pipeline)
Version:  12.24.2017
Contact:  Rongxin Fang <r3fang@ucsd.edu>
          Sebastian Preissl <spreissl@ucsd.edu>
          Bing Ren <biren@ucsd.edu>

Usage:    snATAC <command> [options]

Command:  pre           preprocessing
          filter        filter reads with unselected barcodes
          bmat          binary accessible matrix
          jacard        jaccard index matrix

Optional (under development):
          decomplex     decomplex the fastq file
          bstat         simple statistics for barcode

Note: To use snATAC pipeline, you need to first decomplex barcode
combination and integrate barcodes to the beginning of the
read name in both R1 and R2 fastq files.

Requirement

Before alignment, you need to first decomplex the FASTQ file by adding the barcode to the beginning of each read. Below is one example of decomplexed R1 and R2 for snATAC-seq. Each read name follows this format: "@" + "barcode" + ":" + "original read_name". Unfortunately, because barcode design may be different between each experiment, we decided not to include this part in the pipeline.

> zcat p56.R1.fastq.gz | head
@TCCGGAGATAAGGCGAAAGGAGTAATAGAGGC:SN1113732HY737BCXX2110113481882 1N00
GTGTTGTTCTAGCTGGACAGGACAACTTCCTATCCTCCCCTTTAGCCCTA
+
DDDDDIIIIIIIIIIIIIIIHIIIIIIIHHIHIIIIIIIIHIIIHIIIHI

> zcat p56.R2.fastq.gz | head
@TCCGGAGATAAGGCGAAAGGAGTAATAGAGGC:SN1113732HY737BCXX2110113481882 2N00
CGGGCTCCTCGGCCGATATGTATGAGTAGGAAGGTGTCCTGTCCAGCTAG
+
<0000/11110000/<///<<11111<111<110<DCH1G<<<11111<1

Pipeline

snATAC has following steps:

Alignment by bwa or bowtie2 (output.bam);
Pre-processing (output.bed/output.qc);
- Alignment filteration
- Speration of single cell
- PCR duplicates removal
- Mitochondrial reads removal
- Adjusting position of Tn5 insertion
- Create a master .bed file
- Create a .qc file
Identify feature candidates (output.txt);
- Call peaks using MACS2 from ensemble data
Calculate barcode statistics;
- Reads per barcode
- Consecutive promoter coverage
- Reads in peak ratio
Barcode selection (output.xgi);
- Reads per barcode >= 1000
- Consecutive promoter coverage > 5%
- Reads in peak ratio >= 20%
- NOTE: above cutoff can vary singificant between different samples
- Filter potential doublets
Feature selection (output.ygi);
- Filter top 5% peaks
- Filter promoters
Binary accesibility matrix (output.mat);
Jaccard index matrix (output.jacard);
Dimentionality reduction (output.tsne);
Density-based clustering (output.cluster);

A Complete Example

Step 0. Download sample data.

> wget http://renlab.sdsc.edu/r3fang/projects/Preissl_Nat_Neuro/data_raw/p56.rep1.R1.decomplex.fastq.gz
> wget http://renlab.sdsc.edu/r3fang/projects/Preissl_Nat_Neuro/data_raw/p56.rep1.R2.decomplex.fastq.gz
> wget http://renlab.sdsc.edu/r3fang/snATAC/mm10.tar.gz
> tar -xvzf mm10.tar.gz

Step 1. Alignment using bwa (or bowtie2) in pair-end.

> bwa mem -t 1 mm10.fa \
          p56.rep1.R1.decomplex.fastq.gz \
          p56.rep1.R2.decomplex.fastq.gz \
          | samtools view -bS - > p56.rep1.bam
> samtools sort -t 5 -n p56.rep1.bam -o p56.rep1.nsrt.bam

Step 2. Pre-processing.

> ./bin/snATAC pre -t 5 -m 30 -f 2000 -e 75 \
                   -i p56.rep1.nsrt.bam \
                   -o p56.rep1.bed.gz 2> p56.rep1.pre.log

./bin/snATAC pre will output two files p56.pre.bed.gz and p56.rep1.pre.log. p56.pre.bed.gz contains read and corresponding barcode, p56.rep1.pre.log contains basic quality control. Below is one example of p56.rep1.pre.log.

p56.pre.log
number of totol reads	217064143
number of uniquely mapped reads	191813608
number of properly paired reads	189561202
number of chrM reads	5334929
number of usable reads	184310563
number of distinct reads	52036692
estimated duplicate rate	0.71

Step 3. Identify feature candidates (output.txt)

> macs2 callpeak -t p56.rep1.bed.gz \
                 -f BED -n p56.rep1 \
                 -g mm -p 0.05 \
                 --nomodel --shift 150 \
                 --keep-dup all

Step 4. Calculate barcode statistics

##################################################################
# calculate 3 major barcode stats
# 1) number of reads
# 2) consecutive promoter coverage 
# 3) reads in peak ratio
##################################################################
# count number of reads per barcode
> zcat p56.rep1.bed.gz | awk '{print $4}' \
               | sort \
               | uniq -c \
               | awk '{print $2, $1}' \
               | sort -k1,1 > p56.rep1.reads_per_cell
# consecutive promoter coverage 
> intersectBed -wa -wb \
               -a p56.rep1.bed.gz \
               -b mm10_consecutive_promoters.bed \
               | awk '{print $4, $8}' \
               | sort \
               | uniq \ 
               | awk '{print $1}' \
               | uniq -c \
               | awk '{print $2, $1}' \
               | sort -k1,1 > p56.rep1.promoter_cov 
# reads in peak ratio
> intersectBed -a p56.rep1.bed.gz -b p56.rep1_peaks.narrowPeak -u \
               | awk '{print $4}' \
               | sort \
               | uniq -c \
               | awk '{print $2, $1}' \
               | sort -k1,1 - > p56.rep1.reads_in_peak

Step 5. Cell selection (output.xgi)

> R
##################################################################
# 1) reads per barcode >= 1000
# 2) consecutive promoter coverage > 3%
# 3) reads in peak ratio >= 20%
# NOTE: The cutoff can vary singificantly between different samples
# 4) filter potential doublets (OPTIONAL NOT SUGGUESTED, UNSTABLE)
##################################################################

consecutive_promoters <- read.table("mm10/mm10_consecutive_promoters.bed")
num_of_reads = read.table("p56.rep1.reads_per_cell")
promoter_cov = read.table("p56.rep1.promoter_cov")
read_in_peak = read.table("p56.rep1.reads_in_peak")
qc = num_of_reads; 
colnames(qc) <- c("barcode", "num_of_reads")
qc$promoter_cov = 0; 
qc$read_in_peak = 0;
qc$promoter_cov[match(promoter_cov$V1, qc$barcode)] = promoter_cov$V2/nrow(consecutive_promoters)
qc$read_in_peak[match(read_in_peak$V1, qc$barcode)] = read_in_peak$V2
qc$ratio = qc$read_in_peak/qc$num_of_reads
idx <- which(qc$promoter_cov > 0.09 & qc$ratio > 0.2 & qc$num_of_reads > 1000)
qc_sel <- qc[idx,]

# among these cells, further filter PUTATIVE DOUBLETS ((OPTIONAL NOT SUGGUESTED, UNSTABLE))
#pvalues <- sapply(qc_sel$num_of_reads, function(x) 
#           poisson.test(x, mean(qc_sel$num_of_reads), 
#           alternative="greater")$p.value)
#fdrs <- p.adjust(pvalues, "BH")
#idx <- which(fdrs < 1e-2)

write.table(qc_sel[,1], file = "p56.rep1.xgi", append = FALSE, 
			  quote = FALSE, sep = "\t", eol = "\n", 
			  na = "NA", dec = ".", row.names = FALSE,
			  col.names = FALSE, qmethod = c("escape", "double"),
			  fileEncoding = "")

Step 5. Feature selection (output.xgi);

> R
##################################################################
# 1) Filter top 5% peaks
# 2) Filter promoters
# 3) extend and merge
##################################################################
library(GenomicRanges)
peaks.df <- read.table("p56.rep1_peaks.narrowPeak")
# remove top 5% peaks
cutoff <- quantile((peaks.df$V5), probs = 0.95)
peaks.df <- peaks.df[which(peaks.df$V5 < cutoff),]
proms.df <- read.table("mm10/mm10.refSeq_promoter.bed")
peaks.gr <- GRanges(peaks.df[,1], IRanges(peaks.df[,2], peaks.df[,3]))
proms.gr <- GRanges(proms.df[,1], IRanges(proms.df[,2], proms.df[,3]))

peaks.sel.gr <- peaks.gr[-queryHits(findOverlaps(peaks.gr, proms.gr))]
peaks.sel.ex.gr <- resize(reduce(resize(peaks.sel.gr, 1000, 
                          fix="center")), 1000, fix="center")

peaks.sel.ex.df <- as.data.frame(peaks.sel.ex.gr)[,1:3]
write.table(peaks.sel.ex.df, file = "p56.rep1.ygi", 
			  append = FALSE, quote = FALSE, sep = "\t", 
			  eol = "\n", na = "NA", dec = ".", 
			  row.names = FALSE, col.names = FALSE, 
			  qmethod = c("escape", "double"),
			  fileEncoding = "")

Step 6. Generate binary accessibility matrix (NOTE: this may require large RAM)

> ./bin/snATAC bmat -i p56.rep1.bed.gz \
                    -x p56.rep1.xgi \
                    -y p56.rep1.ygi \
                    -o p56.rep1.mat

Step 7. Calculate jaccard index (NOTE: snATAC jacard may require large memory when you have more cells, "Segmentation fault" means you need larger RAM)

# mannually count number of rows (1,464)
> wc -l p56.rep1.xgi
# mannually count number of columns (184,519)
> wc -l p56.rep1.ygi
# calculate jaccard index matrix
> ./bin/snATAC jacard -i p56.rep1.mat -x 1464 -y 184519 -o p56.rep1.jacard

Step 8. Dimentionality reduction (R) (NOTE: tSNE is a heristic method, it is expected that different run can have slightly different result)

> R
##################################################################
# Dimentionality reduction using t-SNE
# Please refer to https://lvdmaaten.github.io/tsne/ for how to
# tune parameters such as perplexity. We find most of time, it 
# takes less than 500 iteration to converge.
##################################################################

library(tsne)

data <- as.matrix(read.table("p56.rep1.jacard"))
diag(data) <- 0
b = tsne(data/sum(data), initial_config = NULL, 
		  k = 2, perplexity = 30, max_iter = 500, 
		  min_cost = 0, epoch_callback = NULL, 
		  whiten = TRUE, epoch=100)

plot(b, cex=0.7, xlab="tsne1", ylab="tsne2")
write.table(b, file = "p56.rep1.tsne", append = FALSE, 
            quote = FALSE, sep = "\t", eol = "\n", 
            na = "NA", dec = ".", row.names = FALSE,
            col.names = FALSE, qmethod = c("escape", "double"),
            fileEncoding = "")

Step 9. Density-based clustering

> R
##################################################################
# NOTE: Choosing cluster number is absolutely an art! In our paper, 
# we adopted a method originally published from Habib et al. that
# determines the best cluster number by optimizing Dunn Index.
# However, such approach is extremely slow with time complexity
# O(n^3), without optimization and better implementation, it is
# almost impossible to run on a sample of thousands of cells.
# Therefore, we decided to switch to a heuristic approach as shown
# below. Though, it does not guarantee the best result (in fact, 
# none of the methods do), from our experience, it gives satisfactory
# result. At the end of the day, choosing the optimal cluster number
# is very tricky task and it is hard to be fully automatic. We still
# post our code for Dunn Index approach as described in our paper. 
##################################################################

library(densityClust)
MaxStep <- function(D){
	D_hat <- D
	n = nrow(D)
	for(k in 1:n){
		for(i in 1:(n-1)){
			for(j in (i+1):n){
				D_hat[i,j] <- min(D_hat[i,j], 
				max(D_hat[k,j], D_hat[i,k]))
			}
		}
	}
	return(D_hat)
}
find_center <- function(p, clust){
	centers <- data.frame()
	cols <- c()
	for(i in as.numeric(names(table(clust)))){
		ind <- which(clust== i)
		if(length(ind) > 10){
			centers <- rbind(centers, colMeans(p[ind,1:3]))
			cols <- c(cols, i)
		}
	}
	res <- cbind(centers, cols)
	colnames(res) <- c("x", "y", "z", "col")
	return(res)
}
DB_index <- function(D, CL){
	cl_uniq <- unique(CL)
	n <- length(cl_uniq)
	d_k <- rep(0, n)
	d_ij <- matrix(0, n, n)
	for(i in 1:n){
		ii <- which(CL == i);
		d_k[i] <- max(MaxStep(D[ii,ii]))
	}
	for(i in 1:(n-1)){
		for(j in (i+1):n){
			ii <- which(CL == i | CL == j)
			d_ij <- max(MaxStep(D[ii,ii]))
		}
	}
	return(min(d_ij)/max(d_k))
}

# perform cluster
points <- read.table("p56.rep1.tsne")

set.seed(10)
dis <- dist(points)
irisClust <- densityClust(dis, gaussian=TRUE)
# plot decision graph
plot(irisClust)

rho_cutoff <- irisClust$dc
delta_cutoff <- 20
irisClust <- findClusters(irisClust, 
                          rho= rho_cutoff, 
                          delta= delta_cutoff)
clusters <- irisClust$cluster

plot(points, col=clusters, cex=0.5, pch=19)
dev.off()

write.table(data.frame(cluster), file = "p56.rep1.cluster",
			append = FALSE, quote = FALSE, sep = "\t",
            eol = "\n", na = "NA", dec = ".", row.names = FALSE,
            col.names = FALSE, qmethod = c("escape", "double"),
            fileEncoding = "")

Cite us

Preissl S.*, Fang R.*, Zhao Y., Raviram R., Zhang Y., Brandon C.S., Huang H., Gorkin D.U., Afzal V., Dickel D.E., Kuan S., Visel A., Pennacchio L.A., Zhang K., Ren B. Single-nucleus analysis of accessible chromatin in developing mouse forebrain reveals cell-type-specific transcriptional regulation. Nat Neurosci. 2018 Feb 12. doi: 10.1038/s41593-018-0079-3.(* contributed equally)

File Organization

.
# raw data
|____p56.rep1.R1.decomplex.fastq.gz
|____p56.rep1.R2.decomplex.fastq.gz
# pre-defined features for mm10
|____mm10
| |____mm10.genome.size
| |____mm10_consecutive_promoters.bed
| |____mm10.refSeq_promoter.bed
# output from alignment
|____p56.rep1.bam
# output from pre-proessing
|____p56.rep1.log
|____p56.rep1.bed.gz
# output from MACS2
|____p56.rep1_peaks.xls
|____p56.rep1_peaks.narrowPeak
|____p56.rep1_summits.bed
# output from barcode stats
|____p56.rep1.reads_in_peak
|____p56.rep1.promoter_cov
|____p56.rep1.reads_per_cell
# output from barcode selection
|____p56.rep1.xgi
# output from feature selection
|____p56.rep1.ygi
# output from binary accessibility matrix
|____p56.rep1.mat
# output from jaccard matrix
|____p56.rep1.jacard
# output from dimentionality reduction
|____p56.rep1.tsne
# output from clustering
|____p56.rep1.cluster

r3fang / snATAC