zhyanlin / HiCSampler

HiCSampler: Posterior inference of Hi-C contact frequency through sampling

Hi-C is one of the most widely used approaches to study three-dimensional genome conformations. Contacts captured by a Hi-C experiment are represented in a contact frequency matrix. Due to limited sequencing depth and other factors, Hi-C contact frequency matrices are only approximations of the true interaction frequencies, and they are reported without any quantification of uncertainty. Hence downstream analyses based on Hi-C contact maps (e.g., TAD and loop annotation) are themselves point estimates. Here, we present the Hi-C interaction frequency sampler (HiCSampler), which reliably infers the posterior distribution of interaction frequency for a given Hi-C contact map by exploiting dependencies between neighboring loci. Posterior predictive checks demonstrate that HiCSampler is able to infer highly predictive chromosomal interaction frequencies. Summary statistics calculated by HiCSampler provide a measure of the uncertainty of a Hi-C experiment. Samples inferred by HiCSampler can be used off the shelf by most downstream analysis tools, and permit uncertainty measurements in these analyses without modifications.

1. Installation

Please run the following commands to install HiCSampler:

git clone https://github.com/zhyanlin/HiCSampler
cd HiCSampler
make

You can now run HiCSampler as follows: ./HiCSampler

2. Example run:

./HiCSampler sampledata/test.RC.txt sampledata/output.txt.gz --bias=./sampledata/test.bias.txt --it=5000 -w=8 --threads=10

This runs HiCSampler for 5000 iterations after the burn-in phase to sample posterior contact maps from the read count matrix sampledata/test.RC.txt. It uses a 17x17 window (17 = 2*8+1) to estimate the variance in pairwise potentials. Summary statistics for the posterior distribution are written to sampledata/output.txt.gz. In addition, the samples are saved to the folder sampledata.
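If you want to launch the example run from a script, the command line above can be assembled programmatically. The following is a minimal sketch in Python; the helper name build_hicsampler_command is hypothetical and not part of HiCSampler, and it uses only the positional arguments and flags that appear in the example run.

```python
import shlex


def build_hicsampler_command(rc_path, out_path, bias_path,
                             iterations=5000, w=8, threads=10,
                             binary="./HiCSampler"):
    """Assemble the argument list for the example run.

    Hypothetical helper: it only mirrors the flags shown in the example
    command (--bias, --it, -w, --threads) and does not validate the paths.
    """
    return [
        binary,
        rc_path,                  # read count matrix (input)
        out_path,                 # summary statistics (output)
        f"--bias={bias_path}",    # ICE bias vector
        f"--it={iterations}",     # number of MCMC iterations
        f"-w={w}",                # window is (2*w+1) x (2*w+1)
        f"--threads={threads}",   # number of threads
    ]


cmd = build_hicsampler_command(
    "sampledata/test.RC.txt",
    "sampledata/output.txt.gz",
    "./sampledata/test.bias.txt",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root once make has produced the ./HiCSampler binary.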

3. Parameters:

--it: number of MCMC iterations.

--threads: number of threads.

--bias: bias vector in ICE normalization.

-w: window parameter for variance estimation in the pairwise potential. The window spans 2*w+1 bins on each side [default 8].

--stepSize: step size for sub-dividing Hi-C contact maps into blocks [default 200].
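The relationship between -w and the window can be illustrated with a short Python sketch. The function names are hypothetical, and the list of offsets only illustrates which neighboring matrix entries fall inside a (2*w+1) x (2*w+1) window centred on an entry; how HiCSampler uses the window internally is not described here.

```python
def window_side(w):
    """Side length of the window for a given -w value."""
    return 2 * w + 1


def window_offsets(w):
    """(row, column) offsets covered by a window centred on one matrix entry.

    Illustrative only: this assumes a square window centred on the entry.
    """
    return [(di, dj) for di in range(-w, w + 1) for dj in range(-w, w + 1)]


side = window_side(8)            # default -w=8
n_entries = len(window_offsets(8))
print(side, n_entries)
```

With the default -w=8, the side length is 17, which matches the 17x17 window mentioned in the example run.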

4. Input and Output format:

4.1 Input

HiCSampler takes two files as input:

4.1.1. read count matrix:

The read count matrix file is a list of read counts (RC). Each row represents a pair of contact bins and contains:

chrom1	bin1 chrom2 bin2 RC
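To check a read count file before running HiCSampler, the five-column layout above can be parsed as follows. This is a minimal sketch under stated assumptions: columns are separated by tabs or spaces, bin indices are integers, and there is no header line. The Contact type, the parse_read_counts function, and the two example rows are made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class Contact(NamedTuple):
    chrom1: str
    bin1: int
    chrom2: str
    bin2: int
    rc: float


def parse_read_counts(lines: Iterable[str]) -> Iterator[Contact]:
    """Yield one Contact per non-empty row of 'chrom1 bin1 chrom2 bin2 RC'."""
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()  # assumption: any whitespace separates columns
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError(f"line {line_no}: expected 5 columns, got {len(fields)}")
        chrom1, bin1, chrom2, bin2, rc = fields
        yield Contact(chrom1, int(bin1), chrom2, int(bin2), float(rc))


# Made-up rows, only to show the layout.
example = ["chr1\t0 chr1 1 12", "chr1\t0 chr1 2 3"]
contacts = list(parse_read_counts(example))
print(contacts[0])
```

For a real file, pass an open file handle, for example parse_read_counts(open("sampledata/test.RC.txt")).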

4.1.2. bias vector:

The ith row holds the bias for bin i-1 (that is, bins are numbered from 0):

 bin_num   bias
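The bias file can be read in the same spirit. The sketch below assumes two whitespace-separated columns per row, an integer bin number, and a numeric bias value; the function name parse_bias and the example rows are hypothetical.

```python
def parse_bias(lines):
    """Return a dict mapping bin number to bias from 'bin_num bias' rows."""
    bias = {}
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()  # assumption: whitespace-separated columns
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(fields)}")
        bin_num, value = fields
        bias[int(bin_num)] = float(value)
    return bias


# Made-up rows: row 1 describes bin 0, row 2 describes bin 1, and so on.
example_bias = [" 0   1.02", " 1   0.97", " 2   1.10"]
bias = parse_bias(example_bias)
print(bias)
```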

4.2 Output

The output of HiCSampler is a list of posterior interaction frequency (IF) matrices, sampled every 50 steps. Each matrix is stored in its own file. Each row in such a file represents a pair of contact bins and contains:

chrom1	bin1 chrom2 bin2 IF
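Because each sample uses the same five-column layout, per-pair uncertainty can be estimated by collecting the IF values of each bin pair across sample files. The following Python sketch is one way to do that and is not HiCSampler's own summary computation. It assumes whitespace-separated columns, and the function name summarize_samples and the example values are made up.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize_samples(samples):
    """Compute (mean, population std) of IF per bin pair across samples.

    `samples` is an iterable of line iterables, one per sampled matrix.
    A pair that is absent from a sample is skipped, not counted as zero.
    """
    values = defaultdict(list)
    for lines in samples:
        for line in lines:
            fields = line.split()
            if len(fields) != 5:
                continue  # ignore blank or malformed rows in this sketch
            chrom1, bin1, chrom2, bin2, freq = fields
            values[(chrom1, int(bin1), chrom2, int(bin2))].append(float(freq))
    return {pair: (mean(v), pstdev(v)) for pair, v in values.items()}


# Two made-up samples for the same bin pair.
sample_a = ["chr1\t0 chr1 1 2.0"]
sample_b = ["chr1\t0 chr1 1 4.0"]
summary = summarize_samples([sample_a, sample_b])
print(summary[("chr1", 0, "chr1", 1)])
```

In practice each element of samples would be an open file handle for one saved sample file.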

5. Preparing input from data in .[m]cool format

You can use the following commands to convert Hi-C data in .[m]cool format into HiCSampler's input format. Both 5.1 and 5.2 write the read counts to output.RC.tsv and the bias vector to output.bias.tsv.

5.1 Convert from a .mcool file:

bash ./script/convertfromCool.sh Your_HiC_File.mcool::/resolutions/resol region resol output

5.2 Convert from a .cool file:

bash ./script/convertfromCool.sh Your_HiC_File.cool region resol output
