DeNovoCNN

A deep learning approach to calling de novo mutations (DNMs) in whole-exome sequencing (WES) and whole-genome sequencing (WGS) data. DeNovoCNN uses trio BAM/CRAM files plus VCF files (or a tab-separated list of variants) to generate image-like representations of genomic sequence and detect DNMs with high accuracy.

DeNovoCNN combines three models, one each for calling substitution, deletion and insertion DNMs. Each model is a 9-layer CNN with squeeze-and-excitation blocks. DeNovoCNN was trained on ~50k manually curated DNM and IV (inherited and non-DNM) variants from sequencing data generated with Illumina sequencing and SureSelect Human All Exon V5 / SureSelect Human All Exon V4 capture kits.
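
Because there is one model per variant type, each candidate has to be routed to the matching model. The sketch below shows one common way to make that decision, by comparing allele lengths. The function name `variant_type` and the length-based rule are assumptions for illustration, not necessarily how DeNovoCNN itself decides.

```python
def variant_type(ref: str, alt: str) -> str:
    """Classify a variant by comparing allele lengths (assumed routing rule)."""
    if len(ref) == len(alt):
        return "substitution"
    if len(ref) > len(alt):
        return "deletion"
    return "insertion"


if __name__ == "__main__":
    # Made-up alleles, one per variant type.
    for ref, alt in [("A", "G"), ("ATG", "A"), ("C", "CTT")]:
        print(ref, alt, variant_type(ref, alt))
```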

DeNovoCNN returns a tab-separated file with the following columns:

Chromosome | Start position | End position | Reference | Variant | DNM posterior probability | Mean coverage

We used a DNM posterior probability threshold of >= 0.5 to create a filtered tab-separated file listing the variants that are likely to be de novo.
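
A minimal sketch of that filtering step, assuming the column order listed above, so that the posterior probability is the sixth column. Whether the file starts with a header line is not stated here, so lines whose probability field is not numeric are skipped. The function name `filter_likely_dnms` is hypothetical.

```python
import csv
import io

# Assumed position of "DNM posterior probability" (0-based), from the column list.
PROBABILITY_COLUMN = 5


def filter_likely_dnms(tsv_text, threshold=0.5):
    """Return the rows whose DNM posterior probability is >= threshold."""
    kept = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) <= PROBABILITY_COLUMN:
            continue  # blank or truncated line
        try:
            probability = float(row[PROBABILITY_COLUMN])
        except ValueError:
            continue  # non-numeric field, e.g. a header line if one exists
        if probability >= threshold:
            kept.append(row)
    return kept
```

To apply it to a real result, read the file first, for example `filter_likely_dnms(open("predictions.csv").read())`.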

Versions

Version 1.0 is the version used in the publication.

How does it work?

DeNovoCNN reads the BAM files and iterates through potential DNM locations taken from the input VCF files, generating a snapshot of the genomic region around each one. It stacks the snapshots from the trio BAM files into a single RGB image representation, which is passed into a CNN with squeeze-and-excitation blocks that classifies each image as either DNM or IV (inherited variant, non-DNM).
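
The stacking idea can be illustrated with plain Python lists: three equally sized single-channel grids become one grid of three-valued pixels. This is only a sketch. The function name `stack_trio_as_rgb`, the assignment of child, father and mother to the R, G and B channels, and the meaning of each cell value are all assumptions; the actual pixel encoding is not described here.

```python
def stack_trio_as_rgb(child, father, mother):
    """Combine three equally sized 2D grids into one grid of (R, G, B) pixels."""
    if not (len(child) == len(father) == len(mother)):
        raise ValueError("all three grids need the same number of rows")
    image = []
    for child_row, father_row, mother_row in zip(child, father, mother):
        if not (len(child_row) == len(father_row) == len(mother_row)):
            raise ValueError("all three grids need the same row width")
        # Assumed channel order: child -> R, father -> G, mother -> B.
        image.append(list(zip(child_row, father_row, mother_row)))
    return image
```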

Manual installation

We advise using our Docker container (see the Usage section). If that is not possible, the easiest way to install DeNovoCNN is to create an Anaconda environment.

# Create the environment
cd .../DeNovoCNN
conda env create -f environment.yml
conda activate tensorflow_env

Usage

Docker

DeNovoCNN is available as a Docker container.

An example of using DeNovoCNN for prediction (to use the pretrained models, leave the model arguments unchanged):

docker run \
  -v "YOUR_INPUT_DIRECTORY":"/input" \
  -v "YOUR_OUTPUT_DIRECTORY":"/output" \
  gelana/denovocnn:1.0 \
  /app/apply_denovocnn.sh \
    --workdir=/output \
    --child-vcf=/input/<CHILD_VCF> \
    --father-vcf=/input/<FATHER_VCF> \
    --mother-vcf=/input/<MOTHER_VCF> \
    --child-bam=/input/<CHILD_BAM> \
    --father-bam=/input/<FATHER_BAM> \
    --mother-bam=/input/<MOTHER_BAM> \
    --snp-model=/app/models/snp \
    --in-model=/app/models/ins \
    --del-model=/app/models/del \
    --genome=/input/<REFERENCE_GENOME> \
    --output=predictions.csv

The parameters correspond to those used in the Manual prediction section.

Singularity

singularity build denovocnn.sif docker://gelana/denovocnn:1.0

singularity run -B YOUR_INPUT_DIRECTORY:/input,YOUR_OUTPUT_DIRECTORY:/output \
    denovocnn.sif \
    /app/apply_denovocnn.sh \
    --workdir=/output \
    --child-vcf=/input/<CHILD_VCF> \
    --father-vcf=/input/<FATHER_VCF> \
    --mother-vcf=/input/<MOTHER_VCF> \
    --child-bam=/input/<CHILD_BAM> \
    --father-bam=/input/<FATHER_BAM> \
    --mother-bam=/input/<MOTHER_BAM> \
    --snp-model=/app/models/snp \
    --in-model=/app/models/ins \
    --del-model=/app/models/del \
    --genome=/input/<REFERENCE_GENOME> \
    --output=predictions.csv

Manual prediction

To use the pretrained models, provide the paths to the models in the 'models' folder.

If you are running DeNovoCNN on WGS data, we recommend splitting the VCF files or variants of interest into 10 or more parts and running each part separately, in parallel if possible. The split can be done with the following commands:

   bcftools isec -C $BGZIPPED_CHILD_VCF $BGZIPPED_FATHER_VCF $BGZIPPED_MOTHER_VCF > all_variants.txt
   split -d -l 10000 --additional-suffix=.txt all_variants.txt part_variants

Each resulting list of variants can then be passed as the -v parameter.
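
The `split` command above cuts the list into fixed 10000-line files. If you would rather choose the number of parts directly, the chunking can be sketched in Python as follows. The function name `split_into_parts` is hypothetical and the sketch only splits a list of lines in memory; writing each chunk to its own file is left out.

```python
import math


def split_into_parts(lines, n_parts=10):
    """Split lines into at most n_parts consecutive chunks of near-equal size."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    if not lines:
        return []
    chunk_size = math.ceil(len(lines) / n_parts)
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
```

The chunks keep the original order, so concatenating them gives back the full list.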

To run DeNovoCNN on all possible locations:

cd .../DeNovoCNN

./apply_denovocnn.sh \
-w=<WORKING_DIRECTORY> \
-cv=<CHILD_VCF> \
-fv=<FATHER_VCF> \
-mv=<MOTHER_VCF> \
-cb=<CHILD_BAM> \
-fb=<FATHER_BAM> \
-mb=<MOTHER_BAM> \
-sm=<SNP_MODEL> \
-im=<INSERTION_MODEL> \
-dm=<DELETION_MODEL> \
-g=<REFERENCE_GENOME> \
-o=predictions.csv

To run DeNovoCNN on a specified list (VARIANT_LIST_TSV) of locations:

./apply_denovocnn.sh \
-w=<WORKING_DIRECTORY> \
-v=<VARIANT_LIST_TSV> \
-cb=<CHILD_BAM> \
-fb=<FATHER_BAM> \
-mb=<MOTHER_BAM> \
-sm=<SNP_MODEL> \
-im=<INSERTION_MODEL> \
-dm=<DELETION_MODEL> \
-g=<REFERENCE_GENOME> \
-o=predictions.csv
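
When the variant list has been split into several parts, the same command has to be assembled once per part. The sketch below builds the argument list using only the flags shown above. The function name `build_denovocnn_command` and every file name in the usage example are placeholders, and giving each part its own output file is an assumption.

```python
def build_denovocnn_command(variant_list, output, workdir, bams, models, genome):
    """Assemble the argument list for one run on a variant list."""
    child_bam, father_bam, mother_bam = bams
    snp_model, insertion_model, deletion_model = models
    return [
        "./apply_denovocnn.sh",
        f"-w={workdir}",
        f"-v={variant_list}",
        f"-cb={child_bam}",
        f"-fb={father_bam}",
        f"-mb={mother_bam}",
        f"-sm={snp_model}",
        f"-im={insertion_model}",
        f"-dm={deletion_model}",
        f"-g={genome}",
        f"-o={output}",
    ]


if __name__ == "__main__":
    # Placeholder paths; replace them with your own files.
    command = build_denovocnn_command(
        "part_variants00.txt", "predictions_00.csv", "work",
        ("child.bam", "father.bam", "mother.bam"),
        ("models/snp", "models/ins", "models/del"),
        "reference.fa",
    )
    print(" ".join(command))
```

Each list can be handed to `subprocess.run(command, check=True)` from the DeNovoCNN directory.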

VARIANT_LIST_TSV is a tab-separated file with the following columns:

Chromosome | Start position | Reference | Variant | Additional info

It can be generated by running the following command and then filtering its result for the locations of interest:

   bcftools isec -C $BGZIPPED_CHILD_VCF $BGZIPPED_FATHER_VCF $BGZIPPED_MOTHER_VCF > all_variants_list.txt
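
One possible filter is to keep only the variants that fall inside a region of interest. The sketch below assumes that each line of the list is tab-separated with the chromosome in the first column and the position in the second, matching the VARIANT_LIST_TSV layout above. The function name `filter_by_region` is hypothetical.

```python
def filter_by_region(lines, chromosome, start, end):
    """Keep the lines on the given chromosome with start <= position <= end."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # not a complete variant line
        try:
            position = int(fields[1])
        except ValueError:
            continue  # position field is not a number
        if fields[0] == chromosome and start <= position <= end:
            kept.append(line)
    return kept
```

The kept lines are returned unchanged, so they can be written straight to a new file and passed as the -v parameter.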

Citation

If you use any of the materials in the repository, we would appreciate it if you cited our manuscript.

License

GNU GPLv3
