CNV-PG: a machine-learning framework for accurate copy number variation predicting and genotyping

CNV-PG is an open-source application written in Python. It has two parts: CNV predicting (CNV-P) and CNV genotyping (CNV-G). For CNV-P, we trained a separate classifier for each CNV caller on a subset of validated CNVs from that caller; each classifier is then used to identify the true CNVs among that caller's candidates. CNV-G is a genotyper that is compatible with existing CNV callers and generates a uniform set of high-confidence genotypes.

Prerequisites:

python3
sklearn
matplotlib
pysam
pandas
numpy
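
To confirm that the prerequisites are installed, one option is to ask the interpreter that will be passed to "-p" whether each module can be located. The following is a minimal sketch; the helper name `missing_modules` is hypothetical and is not part of CNV-PG.

```python
import importlib.util

# Import names of the prerequisites listed above
# ("sklearn" is the import name of the scikit-learn package).
REQUIRED_MODULES = ["sklearn", "matplotlib", "pysam", "pandas", "numpy"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All prerequisites found.")
```

Run it with the same python binary that is given to the scripts, since a different interpreter may have a different set of installed packages.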

Getting started

1. CNV-P

Running:

Run "$HOME/CNV-PG/CNV-P/CNV-P_predict.sh -h" to see the usage information. The following options are required:

-i BAMFILE, the path of the BAM file (commonly generated by bwa)  
-b BASFILE, the path of the BAS file (provided by the user)  
-v VCFFILE, the path of the VCF file (the output of a CNV caller: breakdancer, Delly, Lumpy, Manta or Pindel)  
-p PYTHON, the path of the python interpreter  
-o OUTDIR, the output directory  
-n SAMPLENAME, the prefix of the output files  
-c CODE_PATH, the path of the CNV-P code ($HOME/CNV-PG/CNV-P/)  
-s CNVCALLER, the name of the CNV caller (breakdancer, Delly, Lumpy, Manta or Pindel)

In the above command, "BASFILE" must be created separately by the user. Its format is shown in the following example:

bam_filename    md5     study   sample  platform        library readgroup       #_total_bases   #_mapped_bases  #_total_reads   #_mapped_reads  #_mapped_reads_paired_in_sequencing     #_mapped_reads_properly_paired  %_of_mismatched_bases   average_quality_of_mapped_bases mean_insert_size        insert_size_sd  median_insert_size      insert_size_median_absolute_deviation   #_duplicate_reads       coverage
HG002   -       HG002   HG002   ILLUMINA        HG002   HG002   -       -       -       -       -       -       -       -       569     95      568.177944      163.819637      -       35.41

The columns "sample", "mean_insert_size", "insert_size_sd" and "coverage" are required by the feature extraction step.
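
Before running CNV-P it can help to check that a hand-written BAS file actually carries the four required columns. The sketch below is one way to do that; `read_bas_required` is a hypothetical helper, not part of CNV-PG, and treating "-" as "no value" is an assumption based on the example row above. Fields are split on any whitespace, so tab-separated and space-separated files are both accepted.

```python
# Columns that the feature extraction step needs, as named above.
REQUIRED_COLUMNS = ("sample", "mean_insert_size", "insert_size_sd", "coverage")


def read_bas_required(lines):
    """Return the required BAS fields for each data row.

    `lines` is any iterable of text lines, for example an open file.
    Raises ValueError when a required column is absent or left as "-".
    """
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        raise ValueError("BAS input is empty")
    header, records = rows[0], rows[1:]
    absent = [name for name in REQUIRED_COLUMNS if name not in header]
    if absent:
        raise ValueError("BAS header lacks: " + ", ".join(absent))
    result = []
    for fields in records:
        if len(fields) != len(header):
            raise ValueError(f"expected {len(header)} fields, found {len(fields)}")
        record = dict(zip(header, fields))
        # Assumption: "-" marks a field that was not filled in.
        unset = [name for name in REQUIRED_COLUMNS if record[name] == "-"]
        if unset:
            raise ValueError("required value left as '-': " + ", ".join(unset))
        result.append({
            "sample": record["sample"],
            "mean_insert_size": float(record["mean_insert_size"]),
            "insert_size_sd": float(record["insert_size_sd"]),
            "coverage": float(record["coverage"]),
        })
    return result


# A reduced header is used here for brevity; a real BAS file has all 21 columns.
example = [
    "sample\tmean_insert_size\tinsert_size_sd\tcoverage",
    "HG002\t569\t95\t35.41",
]
print(read_bas_required(example))
```

For a file on disk, pass the open handle directly: `read_bas_required(open(path))`.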

Run "$HOME/CNV-PG/CNV-P/CNV-P_predict.sh" to classify candidate CNVs:

$HOME/CNV-PG/CNV-P/CNV-P_predict.sh \
    -i $HOME/CNV-PG/test_data/BAMFILE \
    -b $HOME/CNV-PG/test_data/BASFILE \
    -v $HOME/CNV-PG/test_data/VCFFILE \
    -p $HOME/python/bin/python3 \
    -o $HOME/OUTDIR \
    -n SAMPLENAME \
    -c $HOME/CNV-PG/CNV-P \
    -s CNVCALLER
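
When CNV-P is driven from a Python pipeline, the same command can be assembled as an argument list. This is only a sketch: `build_cnvp_command` is a hypothetical helper, the option letters are copied from the list of required options above (check them against the "-h" output before relying on them), and whether the script treats caller names case-sensitively is not stated, so the check here ignores case.

```python
import os

# CNV callers named in the option list above.
SUPPORTED_CALLERS = ("breakdancer", "Delly", "Lumpy", "Manta", "Pindel")


def build_cnvp_command(script, bam, bas, vcf, python, outdir, sample, code_path, caller):
    """Assemble the argument list for CNV-P_predict.sh."""
    known = {name.lower() for name in SUPPORTED_CALLERS}
    if caller.lower() not in known:
        raise ValueError(f"unsupported CNV caller: {caller}")
    # No shell is involved when a list is executed, so variables such as
    # $HOME have to be expanded here.
    expand = os.path.expandvars
    return [
        expand(script),
        "-i", expand(bam),
        "-b", expand(bas),
        "-v", expand(vcf),
        "-p", expand(python),
        "-o", expand(outdir),
        "-n", sample,
        "-c", expand(code_path),
        "-s", caller,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which raises an error if the script exits with a non-zero status.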

Outputs:

CNVCALLER.SAMPLENAME.fil.mer.bed # the candidate CNVs extracted from the VCF file
CNVCALLER.SAMPLENAME.feature.txt # the feature matrix
CNVCALLER.SAMPLENAME.pre.prop.txt # the prediction results, giving the category and probability for each CNV
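
The output names follow directly from the "-s" and "-n" values, so a pipeline can check that a run produced all three files. The sketch below assumes the files are written directly into OUTDIR, which this README does not state explicitly; `cnvp_output_paths` and `missing_outputs` are hypothetical helpers.

```python
from pathlib import Path


def cnvp_output_paths(outdir, caller, sample):
    """Return the three expected CNV-P output paths, keyed by role."""
    prefix = f"{caller}.{sample}"
    base = Path(outdir)  # assumption: outputs sit directly in OUTDIR
    return {
        "candidates": base / f"{prefix}.fil.mer.bed",
        "features": base / f"{prefix}.feature.txt",
        "predictions": base / f"{prefix}.pre.prop.txt",
    }


def missing_outputs(outdir, caller, sample):
    """Return the roles whose output file does not exist yet."""
    paths = cnvp_output_paths(outdir, caller, sample)
    return [role for role, path in paths.items() if not path.is_file()]
```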

2. CNV-G

Running:

Similar to CNV-P, run "$HOME/CNV-PG/CNV-P/CNV-G_predict.sh -h" to see the usage information.

$HOME/CNV-PG/CNV-P/CNV-G_predict.sh \
    -i $HOME/CNV-PG/test_data/BAMFILE \
    -b $HOME/CNV-PG/test_data/BASFILE \
    -e $HOME/CNV-PG/test_data/BEDFILE \
    -p $HOME/python/bin/python3 \
    -o $HOME/OUTDIR \
    -n SAMPLENAME \
    -c $HOME/CNV-PG/CNV-P

The "BEDFILE" should have 5 columns: chromosome, start, end, size of CNV, type of CNV (DUP:1, DEL:0). It can also be generated by CNV-P (for example, CNVCALLER.SAMPLENAME.fil.mer.bed).
For example:

chr1    10482480        10483779        1300    0
chr1    16151940        16155439        3500    1
chr1    35101421        35111976        10556   0
chr1    39998214        40001244        3031    1
chr1    58743909        58744822        914     0
chr1    60048636        60049661        1026    0
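
A BED file prepared by hand can be checked against this 5-column layout before it is passed to CNV-G. The sketch below uses a hypothetical helper, `parse_cnv_bed_line`. In the example rows above the size equals end - start + 1; the README does not state this as a rule, so the sketch only reports whether it holds instead of rejecting the line.

```python
# Type codes given above: DUP is 1, DEL is 0.
CNV_TYPES = {0: "DEL", 1: "DUP"}


def parse_cnv_bed_line(line):
    """Parse one line of the 5-column CNV BED file into a dict."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, found {len(fields)}")
    chrom = fields[0]
    start, end, size, code = (int(value) for value in fields[1:])
    if code not in CNV_TYPES:
        raise ValueError(f"CNV type must be 0 (DEL) or 1 (DUP), found {code}")
    if end < start:
        raise ValueError("end is smaller than start")
    return {
        "chrom": chrom,
        "start": start,
        "end": end,
        "size": size,
        "type": CNV_TYPES[code],
        # Observation from the example rows, not a documented rule.
        "size_matches_span": size == end - start + 1,
    }


print(parse_cnv_bed_line("chr1    10482480        10483779        1300    0"))
```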

Outputs:

SAMPLENAME.feature.txt # the feature matrix
SAMPLENAME.pre.prop.txt # the genotype and probability for each CNV

Please help us improve CNV-PG by reporting bugs or ideas on how to make things better.
