A simple tool to prioritize candidate pathogenic variants using mainly CAPICE scores (see: https://www.medrxiv.org/content/10.1101/19012229v1). Report variants of potential clinical interest for a specific case sample identifier in a VCF, while allowing for additional control samples to drop non-relevant case genotypes.
- Simple to operate with only a few command-line arguments.
- Allows any number of control samples to filter out non-relevant genotypes.
- Drops 99.9% to 99.99% of variants even under relatively sensitive settings.
- Output is (very close to proper) VCF but suitable for human interpretation.
- It doesn't take ages to run (~20,000 variants/sec).
- For now, genome build GRCh37/hg19 only.
- Requires input VCF to be pre-annotated with CAPICE and GnomAD, which may not always be a trivial task. See details below.
- Popular VCF annotation tools don't always report missing annotation values (e.g. CAPICE, GnomAD) for multi-allelic variants correctly, so we cannot rely on exact allele matching for these criteria. A more relaxed strategy is applied instead, matching any allele annotation, which leads to overdetection.
- Similarly, due to the complexity of matching multi-allelic variant genotypes, any non-reference, non-missing genotype is treated as an alternative genotype, leading to overdetection.
- Control samples do not need to be parents. Potential compound heterozygote detection is therefore simplified to two or more heterozygous candidate variants within the same gene for the case, while controls are also allowed to have heterozygous genotypes for these variants. This leads to overdetection.
- No SVs or CNVs are considered even if present in the VCF file, so a heterozygous variant that is compounded with an SV/CNV may be missed. Of course, if this variant is de novo, it will be detected anyhow.
- Non-autosomal detection is simplified to avoid problems with hemizygous calls and no sex information is considered. Any interesting variant located on an allosome is reported, leading to overdetection.
- Variants with missing annotations (e.g. CAPICE, GnomAD) cannot be excluded based on score or frequency and therefore pass these checks. This leads to overdetection. Of course, genotype matching is done as usual.
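The permissive handling of annotations described above can be sketched as a small shell function. This is only an illustration, not the code CapiceQuickFilter runs: the function name `passes_annotation_checks` is hypothetical, missing values are assumed to be written as `.` or left empty, and a value exactly equal to a threshold is assumed to pass.

```shell
# Hypothetical helper, not part of CapiceQuickFilter.
# Usage: passes_annotation_checks CAPICE_VALUES GNOMAD_VALUES CAPICE_MIN GNOMAD_MAX
# Values are comma-separated (one per allele); "." or an empty string means missing.
# Exit status 0 means the variant is kept, 1 means it is dropped.
passes_annotation_checks() {
  awk -v capice="$1" -v gnomad="$2" -v cmin="$3" -v gmax="$4" '
    # True if ANY value in the list is missing or satisfies the threshold.
    function any_ok(list, limit, keep_high,    n, parts, i, v) {
      n = split(list, parts, ",")
      if (n == 0) return 1                     # no annotation at all: cannot exclude
      for (i = 1; i <= n; i++) {
        v = parts[i]
        if (v == "." || v == "") return 1      # missing value: cannot exclude
        if (keep_high && v + 0 >= limit + 0) return 1
        if (!keep_high && v + 0 <= limit + 0) return 1
      }
      return 0
    }
    BEGIN { exit !(any_ok(capice, cmin, 1) && any_ok(gnomad, gmax, 0)) }
  '
}
```

Note that the CAPICE check and the GnomAD check may be satisfied by different alleles of the same multi-allelic variant, which is exactly the kind of overdetection listed above.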
Compile using Java 8+ with these dependencies:
com.github.samtools:htsjdk:2.22.0
org.molgenis:vcf-io:1.1.1
Download the JAR, a demo file, and run.
wget https://github.com/joerivandervelde/capice-quick-filter/releases/download/v0.0.1/capice-quick-filter-0.0.1.jar
wget https://github.com/joerivandervelde/capice-quick-filter/releases/download/demo-data-v0.0.1/CapiceQuickFilter_Demo_1000G.vep.vcfanno.vcf.gz
java -jar capice-quick-filter-0.0.1.jar CapiceQuickFilter_Demo_1000G.vep.vcfanno.vcf.gz CapiceQuickFilter_Demo_HG00096.vcf 0.2 0.05 HG00096 HG00171,HG00403
This section explains how to prepare any VCF in order to be analysed by CapiceQuickFilter.
The input VCF file must be annotated by Ensembl VEP. The GnomAD allele frequencies and gene symbols are used by CapiceQuickFilter. Shown here is how to run VEP version 94 with these recommended settings:
vep \
--offline \
--cache \
--dir_cache /my/installation/Ensembl/VEP/94 \
--fasta /my/installation/Ensembl/VEP/94/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz \
-i MyGenomes.vcf.gz --format vcf \
-o MyGenomes.vep.vcf.gz --vcf --compress_output bgzip --force_overwrite \
--species homo_sapiens \
--assembly GRCh37 \
--use_given_ref \
--merged \
--hgvs \
--af_gnomad \
--verbose
These settings ensure that the CSQ fields per allele/transcript contain exactly:
Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|REFSEQ_MATCH|SOURCE|HGVS_OFFSET|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|CLIN_SIG|SOMATIC|PHENO
It is essential that gnomAD_AF is at index 26 and SYMBOL is at index 3 (0-based). The same annotations may also be produced by the VEP web service, but this is untested.
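Before running the filter, it may be worth checking these positions in your own VEP output. The sketch below is a hypothetical helper (`csq_index` is not part of CapiceQuickFilter or VEP); it assumes the usual VEP header line of the form `##INFO=<ID=CSQ,...,Description="... Format: Allele|Consequence|...">`.

```shell
# Hypothetical helper: print the 0-based index of a CSQ subfield.
# Reads VCF text (at least the header) on stdin; the subfield name is $1.
# Exit status is 1 if the CSQ header or the subfield is not found.
csq_index() {
  awk -v field="$1" '
    /^##INFO=<ID=CSQ,/ {
      fmt = $0
      sub(/^.*Format: /, "", fmt)   # keep everything after "Format: "
      sub(/">.*$/, "", fmt)         # drop the closing quote and bracket
      n = split(fmt, names, "|")
      for (i = 1; i <= n; i++)
        if (names[i] == field) { print i - 1; found = 1 }
      exit                          # only the first CSQ header line is needed
    }
    END { exit !found }
  '
}
```

For example, `gunzip -c MyGenomes.vep.vcf.gz | csq_index gnomAD_AF` should print 26 and `gunzip -c MyGenomes.vep.vcf.gz | csq_index SYMBOL` should print 3 when the CSQ layout matches the one listed above.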
The input VCF file must also be annotated by CAPICE. CAPICE can be installed locally via EasyBuild or from source code. The easiest way (though perhaps the most labor-intensive in the long run) is to use the openly available CAPICE web service, found at:
https://molgenis.org/capice
The input it expects has the 5-column VCF layout:
1 216497582 . C A
5 157217708 . G GT
Beware that multi-allelic variants get discarded, so check 'discardedInput' and run these again.
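One possible way to produce this 5-column input from an existing VCF, while avoiding discarded multi-allelic lines, is to write one line per ALT allele. The function below is a hypothetical sketch (`vcf_to_capice_input` is not an existing tool). It splits ALT alleles naively, without trimming or left-aligning them, so a dedicated normalization tool may be preferable for indels. The ID column is written as `.` to match the example above.

```shell
# Hypothetical helper: VCF text on stdin -> 5-column lines, one per ALT allele.
vcf_to_capice_input() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { next }                      # skip header lines
       {
         n = split($5, alts, ",")         # multi-allelic ALT -> separate lines
         for (i = 1; i <= n; i++) print $1, $2, ".", $4, alts[i]
       }'
}
```

For example: `gunzip -c MyGenomes.vcf.gz | vcf_to_capice_input > MyCAPICE_Input.txt`, where the output file name is just a placeholder.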
CAPICE web service output looks like this:
1 216497582 C A NON_SYNONYMOUS 0.9883755445480347
5 157217708 G GT INTRONIC 5.081834751763381e-05
This output is easily converted to VCF using a regex replace:
(.+?)\t(.+?)\t(.+?)\t(.+?)\t(.+?)\t(.+)
to
\1\t\2\t.\t\3\t\4\t.\t.\tCAPICE=\6
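If you prefer the command line over an editor's regex replace, the same rearrangement of columns can be sketched with awk. This is an equivalent of the substitution above rather than the regex itself, and `capice_to_vcf_body` is a hypothetical name. It assumes tab-separated input with exactly six columns and no header line; lines with a different number of columns are skipped.

```shell
# Hypothetical helper: CAPICE web service output on stdin -> VCF body lines.
# Input columns:  CHROM, POS, REF, ALT, consequence, CAPICE score
# Output columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
capice_to_vcf_body() {
  awk 'BEGIN { FS = OFS = "\t" }
       NF == 6 { print $1, $2, ".", $3, $4, ".", ".", "CAPICE=" $6 }'
}
```

The consequence column (e.g. NON_SYNONYMOUS) is dropped, just as in the regex replacement.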
Of course, prepend a suitable header, so the final VCF file looks like this:
##fileformat=VCFv4.3
##fileDate=20200120
##CapiceVersion="1.0"
##contig=<ID=1,length=249250621,assembly=b37>
##contig=<ID=2,length=243199373,assembly=b37>
##contig=<ID=3,length=198022430,assembly=b37>
##contig=<ID=4,length=191154276,assembly=b37>
##contig=<ID=5,length=180915260,assembly=b37>
##contig=<ID=6,length=171115067,assembly=b37>
##contig=<ID=7,length=159138663,assembly=b37>
##contig=<ID=8,length=146364022,assembly=b37>
##contig=<ID=9,length=141213431,assembly=b37>
##contig=<ID=10,length=135534747,assembly=b37>
##contig=<ID=11,length=135006516,assembly=b37>
##contig=<ID=12,length=133851895,assembly=b37>
##contig=<ID=13,length=115169878,assembly=b37>
##contig=<ID=14,length=107349540,assembly=b37>
##contig=<ID=15,length=102531392,assembly=b37>
##contig=<ID=16,length=90354753,assembly=b37>
##contig=<ID=17,length=81195210,assembly=b37>
##contig=<ID=18,length=78077248,assembly=b37>
##contig=<ID=19,length=59128983,assembly=b37>
##contig=<ID=20,length=63025520,assembly=b37>
##contig=<ID=21,length=48129895,assembly=b37>
##contig=<ID=22,length=51304566,assembly=b37>
##contig=<ID=X,length=155270560,assembly=b37>
##contig=<ID=Y,length=59373566,assembly=b37>
##contig=<ID=MT,length=16569,assembly=b37>
##INFO=<ID=CAPICE,Number=1,Type=Float,Description="CAPICE score">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 216497582 . C A . . CAPICE=0.9883755445480347
5 157217708 . G GT . . CAPICE=5.081834751763381e-05
And be sure to sort the chromosomes and positions correctly using:
cat MyUnsortedCAPICE_Scores.vcf | awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1V -k2,2n"}' > MyCAPICE_Scores.vcf
And then compress this file:
bgzip MyCAPICE_Scores.vcf
Now it is ready to be used by VCFAnno. A suitable configuration may look like this:
[[annotation]]
file="MyCAPICE_Scores.vcf.gz"
fields = ["CAPICE"]
ops=["self"]
names=["CAPICE"]
Finally, run VCFAnno using:
vcfanno ../vcfanno/CAPICE_conf.toml MyGenomes.vep.vcf.gz 2> vcfanno.log | bgzip > MyGenomes.vep.vcfanno.vcf.gz
When running CapiceQuickFilter, please supply 5 or 6 arguments:
- File location of your input .VCF.GZ file.
- Output file location. The file must not exist yet.
- CAPICE score threshold. Lower scoring variants are dropped. A value of 0.2 is suggested for 90% sensitivity.
- GnomAD allele frequency threshold. Higher frequency variants are dropped. A value of 0.05 is suggested to be safe.
- Case sample ID (i.e. the proband, or index).
- [optional] Control sample ID(s), comma-separated if multiple.
So, for MyGenomes.vep.vcfanno.vcf.gz containing, for example, a sample quartet of unaffected mother, unaffected father, affected child, and an unaffected sibling:
java -jar capice-quick-filter-0.0.1.jar MyGenomes.vep.vcfanno.vcf.gz MyGenomes_AffChild01.vcf 0.2 0.05 AffChild01 Father01,Mother01,Sib01
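The case and control IDs have to match the sample columns of the VCF. A quick pre-flight check could look like the sketch below. The helper `samples_present` is hypothetical and not part of CapiceQuickFilter, and how the tool itself reacts to an unknown sample ID is not covered here.

```shell
# Hypothetical helper: succeed only if every comma-separated sample ID in $1
# occurs in the #CHROM header line of the VCF text on stdin.
samples_present() {
  awk -v wanted="$1" 'BEGIN { FS = "\t" }
    /^#CHROM/ {
      for (i = 10; i <= NF; i++) have[$i] = 1   # sample columns start at column 10
      n = split(wanted, ids, ",")
      for (i = 1; i <= n; i++)
        if (!(ids[i] in have)) { print "missing sample: " ids[i]; bad = 1 }
      seen = 1
      exit
    }
    END { exit (bad || !seen) }'
}
```

For example: `gunzip -c MyGenomes.vep.vcfanno.vcf.gz | samples_present AffChild01,Father01,Mother01,Sib01 && echo "all samples found"`.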
- Unit and integration testing
- Proper dependency management
- Proper cmdline option parsing
- Address issues mentioned in manual
- Address issues mentioned in code