████████╗██████╗ ██╗ ██╗██╗ ██╗ █████╗ ██████╗ ██╗
╚══██╔══╝██╔══██╗██║ ██║██║ ██║██╔══██╗██╔══██╗██║
██║ ██████╔╝██║ ██║██║ ██║███████║██████╔╝██║
██║ ██╔══██╗██║ ██║╚██╗ ██╔╝██╔══██║██╔══██╗██║
██║ ██║ ██║╚██████╔╝ ╚████╔╝ ██║ ██║██║ ██║██║
╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚═══╝ ╚═╝ ╚═╝╚═╝ ╚═╝╚═╝
Structural variant comparison tool for VCFs
Given benchmark and comparison sets of SVs, calculate the sensitivity/specificity/f-measure.
Spiral Genetics, 2018
Installation
Truvari uses Python 2 or 3 and requires the following modules:
$ pip install pyvcf python-Levenshtein swalign intervaltree progressbar2 pysam
Note that `--use-swalign` is not compatible with Python 3.
Quick start
$ ./truvari.py -b base_calls.vcf -c compare_calls.vcf -o output_dir/
Outputs
- tp-call.vcf -- annotated true positive calls from the COMP
- tp-base.vcf -- annotated true positive calls from the BASE
- fn.vcf -- false negative calls from BASE
- fp.vcf -- false positive calls from COMP
- base-filter.vcf -- size filtered calls from BASE
- call-filter.vcf -- size filtered calls from COMP
- summary.txt -- json output of performance stats
- log.txt -- run log
- giab_report.txt -- (optional) Summary of GIAB benchmark calls. See below for details.
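The record counts in the TP, FN and FP files are enough to recompute the headline statistics by hand. The sketch below assumes the standard definitions of recall (sensitivity), precision and F-measure. The function name `performance` and the returned keys are illustrative and are not claimed to match the keys written to summary.txt.

```python
def performance(tp_base, tp_call, fn, fp):
    """Derive recall, precision and F-measure from record counts.

    tp_base / tp_call: number of records in tp-base.vcf / tp-call.vcf
    fn / fp:           number of records in fn.vcf / fp.vcf
    """
    base_total = tp_base + fn
    call_total = tp_call + fp
    recall = tp_base / float(base_total) if base_total else 0.0
    precision = tp_call / float(call_total) if call_total else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}


# Made-up counts, for illustration only.
stats = performance(tp_base=90, tp_call=90, fn=10, fp=30)
print(stats)
```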
Methodology
Input:
    BaseCall - Benchmark TruthSet of SVs
    CompCalls - Comparison SVs from another program
Build IntervalTree of CompCalls
For each BaseCall:
    Fetch CompCalls overlapping within *refdist*.
        If typematch and LevDistRatio >= *pctsim* \
        and SizeRatio >= *pctsize* and PctRecOvl >= *pctovl*:
            Add CompCall to list of Neighbors
    Sort list of Neighbors by TruScore ((2*sim + 1*size + 1*ovl) / 3.0)
    Take CompCall with highest TruScore and BaseCall as TPs
    Only use a CompCall once
    If no neighbors: BaseCall is FN
For each CompCall:
    If not used: mark as FP
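The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Truvari's actual code: the `Call` record is hypothetical, a linear scan stands in for the IntervalTree, `difflib.SequenceMatcher` stands in for the Levenshtein ratio so that only the standard library is needed, and "only use a CompCall once" is read as skipping comparison calls that an earlier base call already claimed.

```python
from collections import namedtuple
from difflib import SequenceMatcher

# Hypothetical minimal record; a real VCF entry carries many more fields.
Call = namedtuple("Call", "name start end svtype size seq")


def seq_sim(a, b):
    # Stand-in for the Levenshtein ratio (assumption: stdlib only).
    return SequenceMatcher(None, a.seq, b.seq, autojunk=False).ratio()


def size_sim(a, b):
    biggest = max(a.size, b.size)
    return min(a.size, b.size) / float(biggest) if biggest else 0.0


def rec_ovl(a, b):
    # (overlapping bases) / (longest span)
    ovl = max(0, min(a.end, b.end) - max(a.start, b.start))
    longest = max(a.end - a.start, b.end - b.start)
    return ovl / float(longest) if longest else 0.0


def match(base_calls, comp_calls, refdist=500, pctsim=0.7, pctsize=0.7,
          pctovl=0.0, typeignore=False):
    used = set()  # indices of comparison calls already paired
    tps, fns = [], []
    for base in base_calls:
        neighbors = []
        for idx, comp in enumerate(comp_calls):
            if idx in used:
                continue
            # Linear scan in place of an IntervalTree query.
            if comp.end < base.start - refdist or comp.start > base.end + refdist:
                continue
            if not typeignore and comp.svtype != base.svtype:
                continue
            sim, size, ovl = seq_sim(base, comp), size_sim(base, comp), rec_ovl(base, comp)
            if sim >= pctsim and size >= pctsize and ovl >= pctovl:
                score = (2 * sim + 1 * size + 1 * ovl) / 3.0
                neighbors.append((score, idx))
        if neighbors:
            score, idx = max(neighbors)  # highest TruScore wins
            used.add(idx)
            tps.append((base, comp_calls[idx], score))
        else:
            fns.append(base)
    fps = [c for i, c in enumerate(comp_calls) if i not in used]
    return tps, fns, fps


base = [Call("b1", 1000, 1100, "DEL", 100, "A" * 100),
        Call("b2", 90000, 90100, "DEL", 100, "C" * 100)]
comp = [Call("c1", 1010, 1105, "DEL", 95, "A" * 95),
        Call("c2", 50000, 50080, "DEL", 80, "G" * 80)]
tps, fns, fps = match(base, comp)
```

In this toy input, `b1` pairs with `c1`, `b2` has no neighbor and becomes a FN, and `c2` is never used and becomes a FP.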
Matching Parameters
Parameter | Default | Definition |
---|---|---|
refdist | 500 | Maximum distance a comparison call's start/end may be from the base call's start/end |
pctsim | 0.7 | Minimum Levenshtein distance ratio between the REF or ALT sequence of the base and comparison calls. The longer of the two sequences is used. |
pctsize | 0.7 | Ratio of min(base_size, comp_size)/max(base_size, comp_size) |
pctovl | 0.0 | Ratio of two calls' (overlapping bases)/(longest span) |
typeignore | False | Types don't need to match to compare calls. |
Comparing VCFs without sequence resolved calls
If the base or comp VCFs do not have sequence-resolved calls, simply set `--pctsim=0` to turn off sequence comparison.
Difference between --sizemin and --sizefilt
`--sizemin` is the minimum size of a base call or comparison call to be considered.
`--sizefilt` is the minimum size of a call to be added into the IntervalTree for searching. It should be less than `sizemin` for edge case variants.
For example: `sizemin` is 50 and `sizefilt` is 30. A 50bp base call is 98% similar to a 49bp call at the same position.
These two calls should be considered matching. If we instead removed calls less than `sizemin`, we'd incorrectly classify the 50bp base call as a false negative.
This does have the side effect of artificially inflating specificity. If that same 49bp call in the
example above were below the similarity threshold, it would not be classified as a FP because it falls
below the sizemin threshold. So we're giving the call a better chance to be useful and less chance to
be detrimental to final statistics.
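The interaction of the two thresholds can be illustrated with a toy classifier that matches on size alone. It is a sketch, not Truvari's implementation: the function `classify` is hypothetical, every call is assumed to sit at the same position, and the only matching criterion is the size ratio.

```python
def classify(base_sizes, comp_sizes, sizemin=50, sizefilt=30, pctsize=0.7):
    """Toy classification by size only; all calls are assumed co-located."""
    # Comparison calls below sizefilt never become searchable.
    searchable = [s for s in comp_sizes if s >= sizefilt]
    used = set()
    tp, fn = [], []
    for b in base_sizes:
        if b < sizemin:
            continue  # base call too small to be considered
        best = None
        for i, c in enumerate(searchable):
            if i in used:
                continue
            ratio = min(b, c) / float(max(b, c))
            if ratio >= pctsize and (best is None or ratio > best[0]):
                best = (ratio, i)
        if best:
            used.add(best[1])
            tp.append((b, searchable[best[1]]))
        else:
            fn.append(b)
    # Unmatched comparison calls below sizemin are dropped, not counted as FPs.
    fp = [c for i, c in enumerate(searchable) if i not in used and c >= sizemin]
    return tp, fn, fp


print(classify([50], [49]))               # -> ([(50, 49)], [], [])
print(classify([50], [49], sizefilt=50))  # -> ([], [50], [])
print(classify([100], [49]))              # -> ([], [100], [])
```

The second call shows the false negative that appears when the 49bp call is removed before matching. The third shows the side effect: the unmatched 49bp call is not reported as a FP.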
Definition of annotations added to TP vcfs
Anno | Definition |
---|---|
TruScore | Truvari score for similarity of match. `((2*sim + 1*size + 1*ovl) / 3.0)` |
PctSeqSimilarity | Pct sequence similarity between this variant and its closest match |
PctSizeSimilarity | Pct size similarity between this variant and its closest match |
PctRecOverlap | Percent reciprocal overlap of the two calls' coordinates |
StartDistance | Distance of this call's start from matching call's start |
EndDistance | Distance of this call's end from matching call's end |
SizeDiff | Difference in size(basecall) and size(compcall) |
NumNeighbors | Number of comparison calls that were in the neighborhood (REFDIST) of the base call |
NumThresholdNeighbors | Number of comparison calls that passed threshold matching of the base call |
NumNeighbors and NumThresholdNeighbors are also added to the FN vcf.
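As a worked example of the numeric annotations, the sketch below computes them for one matched pair. The function `annotate` is hypothetical, the sequence similarity is passed in as a precomputed value, and the sign convention for StartDistance, EndDistance and SizeDiff (base minus comparison) is an assumption that this document does not specify.

```python
def annotate(base_start, base_end, base_size,
             comp_start, comp_end, comp_size, seq_sim):
    """Compute match annotations for one base/comparison pair."""
    biggest = max(base_size, comp_size)
    size_sim = min(base_size, comp_size) / float(biggest) if biggest else 0.0
    # (overlapping bases) / (longest span)
    ovl = max(0, min(base_end, comp_end) - max(base_start, comp_start))
    longest = max(base_end - base_start, comp_end - comp_start)
    rec_ovl = ovl / float(longest) if longest else 0.0
    return {
        "TruScore": (2 * seq_sim + 1 * size_sim + 1 * rec_ovl) / 3.0,
        "PctSeqSimilarity": seq_sim,
        "PctSizeSimilarity": size_sim,
        "PctRecOverlap": rec_ovl,
        # Assumed sign convention: base minus comparison.
        "StartDistance": base_start - comp_start,
        "EndDistance": base_end - comp_end,
        "SizeDiff": base_size - comp_size,
    }


anno = annotate(1000, 1100, 100, 1010, 1105, 95, seq_sim=0.96)
for key in sorted(anno):
    print(key, anno[key])
```

Because the weights sum to 4 while the divisor is 3.0, a TruScore computed with this formula can exceed 1; a perfect match scores 4/3.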
Using the GIAB Report
When running against the GIAB SV v0.5 benchmark (link below), you can create a detailed report of calls summarized by the GIAB VCF's SVTYPE, SVLEN, Technology, and Repeat annotations.
To create this report:
- Run Truvari with the flag `--giabreport`.
- In your output directory, you will find a file named `giab_report.txt`.
- Next, make a copy of the Truvari Report Template Google Sheet.
- Finally, paste ALL of the information inside `giab_report.txt` into the "RawData" tab. Be careful not to alter the report text in any way. If successful, the "Formatted" tab will contain a fully formatted report.
This currently only works with GIAB SV v0.5. Work will need to be done to ensure Truvari can parse future GIAB SV releases.
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_UnionSVs_12122017/
Include Bed & VCF Header Contigs
If an `--includebed` is provided, only base and comp calls overlapping the defined regions are used
for comparison. This is equivalent to pre-filtering your base/comp calls with:
(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz
If an `--includebed` is not provided, the comparison is restricted to only the contigs present in the base VCF
header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary
statistics and will not be present in any output vcfs.
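The two cases can be summarized as a single keep/drop decision per call. The sketch below is illustrative only: `overlaps_any` and `keep_call` are hypothetical helpers, regions are plain `(chrom, start, end)` tuples, and half-open coordinates are assumed.

```python
def overlaps_any(chrom, start, end, regions):
    """True if [start, end) overlaps any (chrom, start, end) region."""
    return any(c == chrom and start < r_end and end > r_start
               for c, r_start, r_end in regions)


def keep_call(chrom, start, end, include_regions=None, base_contigs=None):
    """Decide whether a call takes part in the comparison."""
    if include_regions is not None:
        # Include bed given: keep only calls overlapping a listed region.
        return overlaps_any(chrom, start, end, include_regions)
    # No include bed: restrict to contigs named in the base VCF header.
    return base_contigs is None or chrom in base_contigs


regions = [("chr1", 1000, 2000)]
print(keep_call("chr1", 1500, 1600, include_regions=regions))      # True
print(keep_call("chr1", 5000, 5100, include_regions=regions))      # False
print(keep_call("chrUn", 10, 60, base_contigs={"chr1", "chr2"}))   # False
```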