# DeepCT inference helper

A Snakemake-based pipeline to obtain DeepCT predictions for a VCF file.
## How to run
- Install git LFS before cloning this repo:

  ```shell
  sudo apt install git-lfs
  ```

- Install the necessary software:

  ```shell
  conda env create -f environment.yml
  conda activate deepct-inference-helper
  ```

  (Yes, this is not the proper Snakemake way, but currently not every program used here has a Snakemake wrapper implemented. DeepCT also has very fragile requirements.)

- Pull DeepCT:

  ```shell
  git submodule init
  git submodule update
  ```
- Provide paths to genome files and liftover chains in `config/config.yml` (there's an example). You'll need faidx'ed hg38 and hg19 FASTAs, and chains for both hg19→hg38 and hg38→hg19.

- Put your hg38 VCF in the `input/` directory.

- Run Snakemake:

  ```shell
  snakemake --cores all output/file.vcf
  ```

  Note that you put the file in `input/` but request it from `output/`.
## What this does
TL;DR: you put your hg38 VCF in `input/`; the pipeline converts the variants to hg19, runs the model on them, adds three new annotation fields under `INFO` (`DEEPCT_CHANGE`, `DEEPCT_ORGANS`, `DEEPCT_CELLS`) to variants with significant hits, converts them back to hg38, and writes the hg38 VCF to `output/`.
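For illustration only, an annotated output record might look like the sketch below. The values, `Number`/`Type` declarations, and descriptions are hypothetical, not taken from the pipeline:

```
##INFO=<ID=DEEPCT_CHANGE,Number=.,Type=Float,Description="Hypothetical example value">
##INFO=<ID=DEEPCT_ORGANS,Number=.,Type=String,Description="Hypothetical example value">
##INFO=<ID=DEEPCT_CELLS,Number=.,Type=String,Description="Hypothetical example value">
chr1	12345	.	A	G	.	.	DEEPCT_CHANGE=0.42;DEEPCT_ORGANS=brain;DEEPCT_CELLS=neuron
```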
In detail:
1. Two-stage liftover with CrossMap: one pass as usual; variants that failed due to an allele mismatch with hg19 then have their REF and ALT swapped, after which most of them lift successfully.
2. Filtering: indels, variants on chrM, and variants with non-ACGT alleles are removed.
3. Cache check: variants are checked against the cache (simply via `bedtools intersect`) and separated into two parts.
4. The uncached part is converted to TSV (`CHROM POS REF ALT`), the format our fork of Selene uses.
5. A configuration file for Selene is created.
6. Selene is run.
7. Predictions are extracted (from a NumPy array) and added to the VCF.
8. The old cache and the newly predicted variants are copied into a new cache, which then replaces the old one.
9. Meanwhile, the second (cached) part is annotated from the cache.
10. Both parts are merged back together.
11. Two-stage liftover just as in step 1, only this time back from hg19 to hg38.
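The allele-swap retry and the filtering rules above can be sketched in Python. This is a minimal illustration under assumed names; the pipeline itself performs these steps with CrossMap and its own rules, and does not ship these helpers:

```python
def swap_alleles(record):
    """Swap REF and ALT for a liftover retry after an allele mismatch.

    `record` is an assumed (CHROM, POS, REF, ALT) tuple, mirroring the
    TSV format mentioned above.
    """
    chrom, pos, ref, alt = record
    return (chrom, pos, alt, ref)


def passes_filter(record):
    """Drop indels, chrM variants, and variants with non-ACGT alleles."""
    chrom, pos, ref, alt = record
    if chrom in ("chrM", "MT"):
        return False                      # mitochondrial variant
    if len(ref) != 1 or len(alt) != 1:
        return False                      # indel
    return ref in "ACGT" and alt in "ACGT"


variants = [
    ("chr1", 12345, "A", "G"),   # kept
    ("chrM", 100, "A", "G"),     # mitochondrial: dropped
    ("chr2", 555, "AT", "A"),    # indel: dropped
    ("chr3", 777, "N", "G"),     # non-ACGT allele: dropped
]
kept = [v for v in variants if passes_filter(v)]
```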
## Computing resources
- Inference is slow. One Tesla V100 processes ~6.5 variants per second; 24 cores of a Xeon E5-2690 process 1 variant in 20 seconds.
- That's why a cache is implemented. All your processed variants are added to your local cache at `cache/cache.vcf.gz`. The cache is another potential bottleneck by itself: at ~1M cached variants its delays already reach single minutes.
- RAM requirements are ~1 GB per 2,000 variants. It's better to split your files into chunks (of ~50K variants each). Just remember to keep the VCF header in all of them.
- GPU memory usage is stable at 2800 MB.
- You can change the CUDA device in `templates/inference.template.yml` at line 27, or use the CPU only (comment out that line). Please refer to the Selene documentation.
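Splitting a VCF into header-preserving chunks, as recommended above, can be sketched like this. The helper is not part of the pipeline; it is a minimal illustration assuming the file fits in memory:

```python
def split_vcf(lines, chunk_size):
    """Split VCF lines into chunks of at most `chunk_size` records,
    repeating the header lines (those starting with '#') in every chunk."""
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if not line.startswith("#")]
    return [
        header + records[i:i + chunk_size]
        for i in range(0, len(records), chunk_size)
    ]


# Example with a tiny in-memory VCF; for real files use ~50K records per chunk.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\t.\t.",
    "chr1\t200\t.\tC\tT\t.\t.\t.",
    "chr1\t300\t.\tG\tA\t.\t.\t.",
]
chunks = split_vcf(vcf_lines, chunk_size=2)
```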
## General notes
- You might want to clean `tmp/` manually afterwards.
- You might (and probably will) lose some variants due to lifting to hg19 and back.
- Indels are forcefully filtered out; Selene would have to be modified further to support them. Mitochondrial variants are also skipped.
- The cache is in hg19. Your input and output files are in hg38.
## Authors & License

DeepCT by Sindeeva et al., *Cell type-specific interpretation of noncoding variants using deep learning-based methods.*

Provided under the Apache License 2.0.

© 2022 Autonomous Non-Profit Organization "Artificial Intelligence Research Institute" (AIRI). All rights reserved.