wangjuan001 / HiCPlus_pytorch

HiCPlus Implemented in PyTorch

Dependency

Installation

Clone the repo to your local folder.

$ git clone https://github.com/wangjuan001/HiCPlus_pytorch.git

Usage

Prediction

If you do not want to train your own model, use runHiCPlus.py with one of the provided models to generate the enhanced Hi-C interaction matrix.

Training

The training stage needs both high-resolution and low-resolution Hi-C samples. The two sample sets should have the same shape, (N, 1, n, n), where N is the number of samples and n is the size of each sample. A given sample index should correspond to the same genomic location in both input data sets.
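The shape and index requirement above can be illustrated with a minimal NumPy sketch. The function name `split_into_samples`, the tile size `n=40`, the step of 20, and the synthetic matrices are all assumptions made for illustration; they are not taken from this repository's code.

```python
import numpy as np

def split_into_samples(matrix, n, step):
    """Cut a square matrix into n x n tiles.

    Returns the tiles shaped (N, 1, n, n) and an (N, 2) array holding the
    (row, column) bin offset of each tile.
    """
    size = matrix.shape[0]
    tiles, index = [], []
    for i in range(0, size - n + 1, step):
        for j in range(0, size - n + 1, step):
            tiles.append(matrix[i:i + n, j:j + n])
            index.append((i, j))
    # Add the singleton channel axis expected by a 2-D convolutional model.
    samples = np.asarray(tiles, dtype=np.float32)[:, np.newaxis, :, :]
    return samples, np.asarray(index)

# Synthetic stand-ins for a high-resolution matrix and its low-resolution pair.
rng = np.random.default_rng(0)
high = rng.poisson(5.0, size=(100, 100)).astype(np.float32)
low = np.floor(high / 16)

# Using identical n and step on both matrices keeps sample k of each set
# at the same genomic location.
high_samples, high_index = split_into_samples(high, n=40, step=20)
low_samples, low_index = split_into_samples(low, n=40, step=20)
print(high_samples.shape, low_samples.shape)
```

Because both matrices are tiled with the same parameters, the two index arrays are identical, which is the alignment the training stage relies on.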

For convenience, we provide a training pipeline, dataGenerator.py.

The pipeline also covers the steps that generate the training data set. You only need to provide the high-resolution matrix data (either a sparse or a dense matrix is fine) and specify the chromosome number, the downsampling rate, and the name of the output model.

example:

python dataGenerator.py --input_file chr22.10k.obs.gm12878.matrix --chrN 22 --scale_factor 60 --out_model 80M_model

Prediction

Only low-resolution Hi-C samples are needed. The samples should have the same shape as in the training stage. Prediction produces the enhanced Hi-C samples, and the user should recombine this output to obtain the entire Hi-C matrix.

example:

python runHiCplus.py --input_matrix Matrixfile --model ../model/pytorch_gm12878_chr21_model_3900 --chr 1

Models

The models provided here are all suited to predicting from Hi-C data at a sequencing depth of 200M-400M, because they were trained on the ~4.6B GM12878 data (Rao et al.) at a downsampling rate of 16. We suggest that you generate your own model based on your own needs.
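If the ~4.6B figure counts reads, a downsampling rate of 16 corresponds to roughly 4.6B / 16 ≈ 290M, which sits inside the 200M-400M range quoted above. The sketch below shows one common way to simulate such a downsampling on a raw count matrix: binomial thinning, where each contact is kept with probability 1/rate. This is an assumption for illustration and not necessarily what dataGenerator.py does; the function name `downsample_counts` is hypothetical, and the approach only applies to integer counts, not to normalized values.

```python
import numpy as np

def downsample_counts(matrix, rate, seed=0):
    """Keep each contact with probability 1/rate (binomial thinning).

    Only the upper triangle is sampled and then mirrored, so the result
    stays symmetric like the input Hi-C matrix.
    """
    rng = np.random.default_rng(seed)
    upper = np.triu(np.asarray(matrix).astype(np.int64))
    thinned = rng.binomial(upper, 1.0 / rate)
    return thinned + np.triu(thinned, 1).T

# Synthetic symmetric count matrix standing in for high-depth data.
base = np.random.default_rng(1).poisson(160, size=(50, 50))
high = np.triu(base) + np.triu(base, 1).T

low = downsample_counts(high, rate=16)
print(high.sum() / low.sum())  # close to the requested rate
```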

Input file generation

You can use either a dense or a sparse matrix file as input. An easy way to generate test data is to use juicer, e.g.


java -jar juicer_tools.jar dump observed KR https://hicfiles.s3.amazonaws.com/hiseq/gm12878/in-situ/combined.hic 1:20480000:40960000 1:20480000:40960000 BP 10000 combined_10Kb.txt
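For readers who want a dense matrix from a sparse dump like the one above, here is a minimal sketch. It assumes the dump is whitespace-separated text with three columns (bin start 1, bin start 2, value); the function name `sparse_to_dense` and the in-memory example data are hypothetical, and the scripts in this repository may read their input differently.

```python
import io
import numpy as np

def sparse_to_dense(handle, start, end, resolution):
    """Build a symmetric dense matrix from 'pos1 pos2 value' lines."""
    n_bins = (end - start) // resolution
    dense = np.zeros((n_bins, n_bins), dtype=np.float32)
    for line in handle:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pos1, pos2, value = int(fields[0]), int(fields[1]), float(fields[2])
        i = (pos1 - start) // resolution
        j = (pos2 - start) // resolution
        # Entries outside [start, end) and NaN values are ignored.
        if 0 <= i < n_bins and 0 <= j < n_bins and not np.isnan(value):
            dense[i, j] = value
            dense[j, i] = value
    return dense

# Tiny made-up example in place of a real dump file.
text = "20480000\t20480000\t12.5\n20480000\t20490000\t3.0\n20490000\t20510000\tNaN\n"
dense = sparse_to_dense(io.StringIO(text), start=20480000, end=20520000, resolution=10000)
print(dense)
```

With a real file, pass an open file handle, for example `open("combined_10Kb.txt")`, together with the region bounds and resolution used in the dump command.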

Suggested way to generate samples

When you cut the matrix into n x n samples, we suggest also writing a file that records the location of each sample. With that file, once the enhanced samples have been predicted, it is easy to recombine them into the full high-resolution Hi-C matrix.
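The suggestion above can be sketched as follows: save each sample's (row, column) bin offset to a text file, then use it to place the tiles back into a full matrix. The file name `sample_locations.txt`, the function `recombine`, and the averaging of overlapping tiles are assumptions for illustration; the sketch also assumes the enhanced tiles have the same n x n size as the offsets imply.

```python
import os
import tempfile
import numpy as np

def recombine(samples, locations, size):
    """Place (N, 1, n, n) tiles at their recorded offsets, averaging overlaps."""
    n = samples.shape[-1]
    total = np.zeros((size, size), dtype=np.float64)
    weight = np.zeros((size, size), dtype=np.float64)
    for tile, (i, j) in zip(samples[:, 0], locations):
        total[i:i + n, j:j + n] += tile
        weight[i:i + n, j:j + n] += 1
    # Bins not covered by any tile stay at zero.
    return np.divide(total, weight, out=np.zeros_like(total), where=weight > 0)

# Cut a small synthetic matrix into tiles and record where each one came from.
matrix = np.arange(64, dtype=np.float64).reshape(8, 8)
n = 4
locations = [(i, j) for i in range(0, 8, n) for j in range(0, 8, n)]
samples = np.stack([matrix[i:i + n, j:j + n] for i, j in locations])[:, np.newaxis]

# Write the location file alongside the samples, then read it back later.
path = os.path.join(tempfile.mkdtemp(), "sample_locations.txt")
np.savetxt(path, locations, fmt="%d")
loaded = np.loadtxt(path, dtype=int)

rebuilt = recombine(samples, loaded, size=8)
print(np.array_equal(rebuilt, matrix))
```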

Normalization and experimental condition

Hi-C experiments differ in the cutting enzyme used and in the normalization method applied. Our model can handle any of these conditions, as long as training and testing are done under the same condition. For example, if KR-normalized samples are used in the training stage, the trained model only works for KR-normalized low-resolution samples.

About

License:MIT License

