Major updates in v2.3
drpatelh opened this issue · comments
Please see below for a summary of changes.
Major enhancements
Included strand-bias annotation for ivar
NGS data are prone to certain types of artifactual variant calls, and strand bias is a clear example: all but one of the variant-supporting reads lie on the reverse strand, whereas the reference-supporting reads are equally represented on both strands, which gives rise to a false positive [1].
Most current variant callers support strand-bias filtering, but iVar still lacks this functionality (andersen-lab/ivar#5).
This viralrecon release adds it: the ivar_variants_to_vcf.py script now accounts for the artifact while converting the iVar variants tsv file to vcf format. A Fisher exact test is performed, and variants with a significant strand-bias p-value (< 0.05) are tagged with the SB filter annotation. A new INFO field holding the p-value is also added (e.g. SB_pvalue=1e-05).
Note that variants are not removed; only an annotation is added to the FILTER field. If you want to filter the variants, you need to do it afterwards using this tag.
Input tsv:
MN908947.3 17615 A G 6 3 52 8487 3406 56 0.999176 8494 0 TRUE cds-QHD43415.1 AAG K AGG R
MN908947.3 18653 G A 6302 1779 48 2757 903 41 0.297604 9264 0 TRUE cds-QHD43415.1 CGC R CAC H
Output vcf:
MN908947.3 17615 . A G . PASS DP=8494:SB_pvalue=0.81896 GT:REF_DP:REF_RV:REF_QUAL:ALT_DP:ALT_RV:ALT_QUAL:ALT_FREQ 1:6:3:52:8487:3406:56:0.999176
MN908947.3 18653 . G A . SB DP=9264:SB_pvalue=1e-05 GT:REF_DP:REF_RV:REF_QUAL:ALT_DP:ALT_RV:ALT_QUAL:ALT_FREQ 1:6302:1779:48:2757:903:41:0.297604
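Because variants are only tagged, any removal has to happen downstream. The following is a minimal sketch of such a step, assuming a tab-delimited VCF in which FILTER is the seventh column; the function name `drop_strand_biased` is hypothetical and not part of viralrecon.

```python
def drop_strand_biased(vcf_lines, tag="SB"):
    """Yield header lines and records whose FILTER column does not contain `tag`."""
    for line in vcf_lines:
        if line.startswith("#"):
            # Header and meta-information lines are passed through untouched.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th VCF column; several filters are separated by ';'.
        if tag not in fields[6].split(";"):
            yield line


# Two shortened records modelled on the example output above.
records = [
    "\t".join(["MN908947.3", "17615", ".", "A", "G", ".", "PASS", "DP=8494"]),
    "\t".join(["MN908947.3", "18653", ".", "G", "A", ".", "SB", "DP=9264"]),
]
kept = list(drop_strand_biased(records))
```

Established tools can do the same job; the sketch only shows that the SB tag is enough to separate the two example records.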
The Fisher exact test is based on a contingency table, as stated in the GATK literature [2]:
Allele | Forward Strand | Reverse Strand | Total |
---|---|---|---|
Reference Allele | 4523 | 1779 | 6302 |
Alternate Allele | 1854 | 903 | 2757 |
Total | 6377 | 2682 | 9059 |
Code - contingency table legend:
- REF_FW - Reference Allele Forward Strand
- REF_RV - Reference Allele Reverse Strand
- ALT_FW - Alternate Allele Forward Strand
- ALT_RV - Alternate Allele Reverse Strand
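To make the test concrete, here is a minimal standard-library sketch of a two-sided Fisher exact test on the four counts in the legend above. It is an illustration under stated assumptions, not the code used by ivar_variants_to_vcf.py: the function names are hypothetical, and the forward-strand counts are assumed to be derived as REF_DP - REF_RV and ALT_DP - ALT_RV, which is consistent with the contingency table shown.

```python
from math import exp, lgamma


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(ref_fw, ref_rv, alt_fw, alt_rv):
    """Two-sided Fisher exact p-value for a 2x2 strand contingency table."""
    row_ref = ref_fw + ref_rv
    row_alt = alt_fw + alt_rv
    col_fw = ref_fw + alt_fw
    total = row_ref + row_alt
    if total == 0:
        return 1.0
    log_denominator = log_choose(total, col_fw)

    def log_prob(x):
        # Hypergeometric log-probability of x reference reads on the forward strand.
        return (log_choose(row_ref, x)
                + log_choose(row_alt, col_fw - x)
                - log_denominator)

    observed = log_prob(ref_fw)
    p_value = 0.0
    # Sum every table that is at most as likely as the observed one.
    for x in range(max(0, col_fw - row_alt), min(row_ref, col_fw) + 1):
        lp = log_prob(x)
        if lp <= observed + 1e-7:  # small tolerance for floating-point ties
            p_value += exp(lp)
    return min(p_value, 1.0)


# Position 18653 from the example: REF_FW = 6302 - 1779, ALT_FW = 2757 - 903.
p_18653 = fisher_exact_two_sided(4523, 1779, 1854, 903)
vcf_filter = "SB" if p_18653 < 0.05 else "PASS"
```

Working in log space avoids overflow in the binomial coefficients at the read depths typical of amplicon data.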
Strand-bias filtering is not recommended for every type of experiment. Amplicon data, because of the PCR-based enrichment procedure, are prone to strand-bias artifacts that do not necessarily imply a greater probability of a false positive. Moreover, amplicon experiments normally generate deep-coverage data that do not need this type of filtering. That is why ivar_variants_to_vcf.py provides a new option, --ignore-strand-bias, to skip the Fisher test; this option is set by default when --protocol amplicon is used.
Consecutive variants called by iVar belonging to the same codon are now collapsed into one line in order to fix annotation
During variant analysis of SARS-CoV-2, some complex variants, such as the triplet nucleotide change that alters an entire codon in the B.1.1.7 VOC, are reported by variant callers as three separate single-nucleotide changes instead of one change spanning the three nucleotides, which results in wrong amino acid annotation. This is also a known problem in iVar (andersen-lab/ivar#92), which we have fixed in this new viralrecon release, again through the ivar_variants_to_vcf.py script.
Input tsv file with three variant lines and wrong aa annotation:
REGION | POS | REF | ALT | REF_CODON | REF_AA | ALT_CODON | ALT_AA |
---|---|---|---|---|---|---|---|
MN908947.3 | 28280 | G | C | GAT | D | CAT | H |
MN908947.3 | 28281 | A | T | GAT | D | GTT | V |
MN908947.3 | 28282 | T | A | GAT | D | GAA | E |
Output vcf with three variants belonging to the same codon merged in just one line:
MN908947.3 28280 . GAT CTA . PASS DP=7610 GT:REF_DP:REF_RV:REF_QUAL:ALT_DP:ALT_RV:ALT_QUAL:ALT_FREQ 1:6:3:34:7602:3852:35:0.998949
Fixed annotation with snpeff:
CHROM | POS | REF | ALT | GENE | EFFECT | HGVS_C | HGVS_P |
---|---|---|---|---|---|---|---|
MN908947.3 | 28280 | GAT | CTA | N | missense_variant | c.7_9delGATinsCTA | p.Asp3Leu |
As with the strand-bias implementation, the script provides the --ignore-merge-codons parameter if you want the previous ivar_variants_to_vcf.py behaviour.
Script logic for consecutive and same codon variants detection.
The ivar_variants_to_vcf.py script iterates through the .tsv file line by line. It saves the informative fields of each line in a dictionary that holds up to three positions. Once the dictionary is full, we check whether the positions are consecutive and whether they belong to the same codon. The dictionary acts as a queue of size three and is evaluated whenever it is full.
dict = {
    'CHROM': ['MN908947.3', 'MN908947.3', 'MN908947.3'],
    'POS': [28280, 28281, 28282],
    'REF': ['G', 'A', 'T'],
    'ALT': ['C', 'T', 'A'],
    'REF_CODON': ['GAT', 'GAT', 'GAT'],
    'ALT_CODON': ['CAT', 'GTT', 'GAA'],
}
Once the dict is full we evaluate as follows:
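A minimal sketch of this kind of evaluation is given here. It is not the code of ivar_variants_to_vcf.py: the function names are hypothetical, rows are assumed to be dictionaries with integer POS values, and "same codon" is approximated as adjacent positions that report the same REF_CODON.

```python
from collections import deque


def same_codon(previous, current):
    """Heuristic: adjacent positions that report the same reference codon."""
    return (current["POS"] == previous["POS"] + 1
            and current["REF_CODON"] == previous["REF_CODON"])


def merge_codon_variants(rows):
    """Collapse runs of up to three adjacent same-codon SNVs into one record."""
    merged = []
    window = deque()  # plays the role of the queue of at most three rows

    def flush():
        # Emit the queued rows as a single record anchored at the first position.
        if window:
            merged.append({
                "CHROM": window[0]["CHROM"],
                "POS": window[0]["POS"],
                "REF": "".join(row["REF"] for row in window),
                "ALT": "".join(row["ALT"] for row in window),
            })
            window.clear()

    for row in rows:
        if window and (len(window) == 3 or not same_codon(window[-1], row)):
            flush()
        window.append(row)
    flush()
    return merged


rows = [
    {"CHROM": "MN908947.3", "POS": 28280, "REF": "G", "ALT": "C", "REF_CODON": "GAT"},
    {"CHROM": "MN908947.3", "POS": 28281, "REF": "A", "ALT": "T", "REF_CODON": "GAT"},
    {"CHROM": "MN908947.3", "POS": 28282, "REF": "T", "ALT": "A", "REF_CODON": "GAT"},
    {"CHROM": "MN908947.3", "POS": 28883, "REF": "G", "ALT": "C", "REF_CODON": "GGA"},
]
merged = merge_codon_variants(rows)
```

The REF_CODON comparison is a simplification: two adjacent identical codons would be treated as one, so a real implementation would also need the position of each base within its codon.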
Option to generate consensus with BCFTools / BEDTools using iVar variants
Another new feature is that viralrecon now lets you choose which software to use for variant calling (iVar or Bcftools) and for consensus genome generation (iVar or Bcftools), so you can combine them (#246).
Previous viralrecon versions had iVar as default for both variant calling and consensus genome generation. This combination had some drawbacks related to known iVar issues (andersen-lab/ivar#103, andersen-lab/ivar#97, andersen-lab/ivar#85).
Now viralrecon performs variant calling with iVar, processes those variants as explained above (strand bias and merged codons), and finally generates the consensus genome from the filtered iVar variants. This produces the following differences in the final consensus fasta files:
- iVar includes low-frequency deletions in the consensus even when the filter is applied (andersen-lab/ivar#83).
This is fixed when creating the consensus with iVar filtered variants:
First sequence is reference, second sequence is the consensus generated by Bcftools and third sequence is consensus generated with iVar.
iVar's tsv file will look like this:
REGION POS REF ALT REF_DP ALT_DP ALT_FREQ
MN908947.3 25497 A -CCGATACAAGCCTCACTCCCTTTCGGATGGCTT 19179 8470 0.434314
The deletion has a frequency lower than the 0.75 threshold set in the consensus filter, yet it is added to the iVar consensus; it is not added to the Bcftools consensus.
- iVar includes Ns when the position has enough coverage:
First sequence is reference, second sequence is the consensus generated by bcftools and third sequence is consensus generated with iVar.
iVar's .tsv file will look like this:
REGION POS REF ALT REF_DP ALT_DP ALT_FREQ
MN908947.3 28361 G -GAGAACGCA 378 279 0.734211
It was supposed to be a deletion, not an N nucleotide; the N does not appear when creating the consensus with Bcftools.
- iVar calls ambiguous nucleotides, thereby adding low-frequency variants:
As explained in iVar's manual, if one base is not enough to reach a given frequency, an ambiguous nucleotide is called at that position, which means low-frequency variants are included. Example:
REGION POS REF ALT REF_DP REF_RV REF_QUAL ALT_DP ALT_RV ALT_QUAL ALT_FREQ TOTAL_DP PVAL PASS GFF_FEATURE REF_CODON REF_AA ALT_CODON ALT_AA
MN908947.3 27665 A G 7 4 35 4 0 36 0.363636 11 0.0551378 FALSE cds-QHD43421.1 GAG E GGG G
MN908947.3 27666 G C 7 4 35 4 0 32 0.363636 11 0.0551378 FALSE cds-QHD43421.1 GAG E GAC D
These variants have an AF of about 0.36, so the reference nucleotide's AF does not reach the 0.75 minimum, and both are included in the consensus as ambiguous nucleotides:
First sequence is reference, second sequence is the consensus generated by Bcftools and third sequence is consensus generated with iVar.
The iVar consensus introduces R (A or G) at position 27665 and S (G or C) at position 27666, when only the reference should be included. This is fixed when creating the consensus with Bcftools.
- iVar includes Ns for low-frequency variants that should not be included in the consensus:
When iVar's tsv file contains deletions with low allele frequency, the reference should be included, but iVar introduces Ns instead:
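The difference between the two behaviours can be illustrated with a simplified sketch. The first function approximates the ambiguity rule described above (bases are added, most frequent first, until their combined frequency reaches the threshold). The second keeps the reference unless the variant itself reaches the threshold. Both functions are hypothetical illustrations, not the code of iVar or Bcftools.

```python
# IUPAC nucleotide codes keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def ambiguity_call(depths, threshold=0.75):
    """Add bases by decreasing depth until their summed frequency reaches the threshold."""
    total = sum(depths.values())
    if total == 0:
        return "N"
    chosen, covered = set(), 0
    for base, depth in sorted(depths.items(), key=lambda item: item[1], reverse=True):
        chosen.add(base)
        covered += depth
        if covered / total >= threshold:
            break
    return IUPAC[frozenset(chosen)]


def filtered_variant_call(ref, alt, alt_freq, threshold=0.75):
    """Apply the variant only when its own frequency reaches the threshold."""
    return alt if alt_freq >= threshold else ref


# Position 27665 from the example: REF_DP = 7 (A), ALT_DP = 4 (G).
ambiguous = ambiguity_call({"A": 7, "G": 4})
reference_kept = filtered_variant_call("A", "G", 0.363636)
```

With the depths from the example, the first rule needs both bases to reach 0.75 and therefore emits an ambiguity code, while the second leaves the reference base in place.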
First sequence is reference, second sequence is the consensus generated by Bcftools and third sequence is consensus generated with iVar.
The tsv file looks like this:
REGION POS REF ALT REF_DP ALT_DP ALT_FREQ
MN908947.3 11287 G -TCTGGTTTT 115 41 0.356522
In the consensus, the reference nucleotides should be included, as they are with Bcftools.
New variants and lineage report table
viralrecon now provides a new variants report table that unifies variant calling, annotation and, if desired, lineage. This table can be really useful for inspecting variants, detecting co-infections, or analysing metagenomics data such as SARS-CoV-2 sewage sequencing.
SAMPLE | CHROM | POS | REF | ALT | FILTER | DP | REF_DP | ALT_DP | AF | GENE | EFFECT | HGVS_C | HGVS_P | HGVS_P_1LETTER | CALLER | LINEAGE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
218976 | MN908947.3 | 23063 | A | T | PASS | 1150 | 5 | 1141 | 0.99 | S | missense_variant | c.1501A>T | p.Asn501Tyr | p.N501Y | ivar | B.1.1.7 |
218976 | MN908947.3 | 23271 | C | A | PASS | 12288 | 6 | 12196 | 0.99 | S | missense_variant | c.1709C>A | p.Ala570Asp | p.A570D | ivar | B.1.1.7 |
218976 | MN908947.3 | 23403 | A | G | PASS | 12982 | 24 | 12954 | 1.0 | S | missense_variant | c.1841A>G | p.Asp614Gly | p.D614G | ivar | B.1.1.7 |
218976 | MN908947.3 | 23604 | C | A | PASS | 4845 | 0 | 4829 | 1.0 | S | missense_variant | c.2042C>A | p.Pro681His | p.P681H | ivar | B.1.1.7 |
218976 | MN908947.3 | 23709 | C | T | PASS | 5083 | 8 | 5071 | 1.0 | S | missense_variant | c.2147C>T | p.Thr716Ile | p.T716I | ivar | B.1.1.7 |
218976 | MN908947.3 | 24506 | T | G | PASS | 829 | 0 | 829 | 1.0 | S | missense_variant | c.2944T>G | p.Ser982Ala | p.S982A | ivar | B.1.1.7 |
218976 | MN908947.3 | 24914 | G | C | PASS | 10641 | 3 | 10621 | 1.0 | S | missense_variant | c.3352G>C | p.Asp1118His | p.D1118H | ivar | B.1.1.7 |
218976 | MN908947.3 | 26013 | C | T | PASS | 260 | 2 | 258 | 0.99 | ORF3a | synonymous_variant | c.621C>T | p.Phe207Phe | p.F207F | ivar | B.1.1.7 |
218976 | MN908947.3 | 26060 | C | T | PASS | 333 | 1 | 332 | 1.0 | ORF3a | missense_variant | c.668C>T | p.Thr223Ile | p.T223I | ivar | B.1.1.7 |
218976 | MN908947.3 | 27972 | C | T | PASS | 978 | 2 | 975 | 1.0 | ORF8 | stop_gained | c.79C>T | p.Gln27* | p.Q27* | ivar | B.1.1.7 |
218976 | MN908947.3 | 28048 | G | T | PASS | 908 | 0 | 905 | 1.0 | ORF8 | missense_variant | c.155G>T | p.Arg52Ile | p.R52I | ivar | B.1.1.7 |
218976 | MN908947.3 | 28111 | A | G | PASS | 3745 | 10 | 3734 | 1.0 | ORF8 | missense_variant | c.218A>G | p.Tyr73Cys | p.Y73C | ivar | B.1.1.7 |
218976 | MN908947.3 | 28270 | TA | T | PASS | 8843 | 8740 | 7788 | 0.88 | N | upstream_gene_variant | c.-3delA | . | . | ivar | B.1.1.7 |
218976 | MN908947.3 | 28280 | GAT | CTA | PASS | 7610 | 6 | 7602 | 1.0 | N | missense_variant | c.7_9delGATinsCTA | p.Asp3Leu | p.D3L | ivar | B.1.1.7 |
218976 | MN908947.3 | 28881 | GG | AA | PASS | 1011 | 7 | 1003 | 0.99 | N | missense_variant | c.608_609delGGinsAA | p.Arg203Lys | p.R203K | ivar | B.1.1.7 |
218976 | MN908947.3 | 28883 | G | C | PASS | 1028 | 0 | 1028 | 1.0 | N | missense_variant | c.610G>C | p.Gly204Arg | p.G204R | ivar | B.1.1.7 |
218976 | MN908947.3 | 28931 | G | T | PASS | 1142 | 0 | 1135 | 0.99 | N | missense_variant | c.658G>T | p.Ala220Ser | p.A220S | ivar | B.1.1.7 |
218976 | MN908947.3 | 28977 | C | T | PASS | 1041 | 9 | 1029 | 0.99 | N | missense_variant | c.704C>T | p.Ser235Phe | p.S235F | ivar | B.1.1.7 |
218987 | MN908947.3 | 210 | G | T | PASS | 2192 | 0 | 2184 | 1.0 | orf1ab | upstream_gene_variant | c.-56G>T | . | . | ivar | AY.127 |
218987 | MN908947.3 | 241 | C | T | PASS | 2059 | 10 | 2046 | 0.99 | orf1ab | upstream_gene_variant | c.-25C>T | . | . | ivar | AY.127 |
218987 | MN908947.3 | 1385 | C | T | PASS | 4852 | 16 | 4833 | 1.0 | orf1ab | missense_variant | c.1120C>T | p.His374Tyr | p.H374Y | ivar | AY.127 |
218987 | MN908947.3 | 1875 | C | T | PASS | 3791 | 1779 | 1890 | 0.5 | orf1ab | missense_variant | c.1610C>T | p.Ala537Val | p.A537V | ivar | AY.127 |
218987 | MN908947.3 | 2265 | G | A | PASS | 1445 | 912 | 515 | 0.36 | orf1ab | missense_variant | c.2000G>A | p.Cys667Tyr | p.C667Y | ivar | AY.127 |
218987 | MN908947.3 | 3037 | C | T | PASS | 463 | 1 | 462 | 1.0 | orf1ab | synonymous_variant | c.2772C>T | p.Phe924Phe | p.F924F | ivar | AY.127 |
218987 | MN908947.3 | 4181 | G | T | PASS | 2568 | 1 | 2557 | 1.0 | orf1ab | missense_variant | c.3916G>T | p.Ala1306Ser | p.A1306S | ivar | AY.127 |
218987 | MN908947.3 | 6402 | C | T | PASS | 13115 | 68 | 13037 | 0.99 | orf1ab | missense_variant | c.6137C>T | p.Pro2046Leu | p.P2046L | ivar | AY.127 |
218987 | MN908947.3 | 7124 | C | T | PASS | 1438 | 6 | 1428 | 0.99 | orf1ab | missense_variant | c.6859C>T | p.Pro2287Ser | p.P2287S | ivar | AY.127 |
218987 | MN908947.3 | 8986 | C | T | PASS | 1025 | 8 | 1015 | 0.99 | orf1ab | synonymous_variant | c.8721C>T | p.Asp2907Asp | p.D2907D | ivar | AY.127 |
218987 | MN908947.3 | 9053 | G | T | PASS | 1318 | 0 | 1317 | 1.0 | orf1ab | missense_variant | c.8788G>T | p.Val2930Leu | p.V2930L | ivar | AY.127 |
218987 | MN908947.3 | 10029 | C | T | PASS | 1852 | 1 | 1851 | 1.0 | orf1ab | missense_variant | c.9764C>T | p.Thr3255Ile | p.T3255I | ivar | AY.127 |
218987 | MN908947.3 | 10039 | C | T | PASS | 1860 | 2 | 1856 | 1.0 | orf1ab | synonymous_variant | c.9774C>T | p.Thr3258Thr | p.T3258T | ivar | AY.127 |
218987 | MN908947.3 | 10106 | G | A | PASS | 2877 | 14 | 2859 | 0.99 | orf1ab | missense_variant | c.9841G>A | p.Val3281Ile | p.V3281I | ivar | AY.127 |
218987 | MN908947.3 | 11201 | A | G | PASS | 747 | 4 | 741 | 0.99 | orf1ab | missense_variant | c.10936A>G | p.Thr3646Ala | p.T3646A | ivar | AY.127 |
218987 | MN908947.3 | 11332 | A | G | PASS | 905 | 1 | 904 | 1.0 | orf1ab | synonymous_variant | c.11067A>G | p.Val3689Val | p.V3689V | ivar | AY.127 |
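As one way to use this table when looking for possible co-infections, the sketch below lists variants at intermediate allele frequency per sample. It assumes the table is available as CSV with the column names shown above; the function name and the 0.25 to 0.75 window are arbitrary choices for illustration, not viralrecon defaults.

```python
import csv
import io


def mixed_frequency_variants(table, low=0.25, high=0.75):
    """Group variants whose AF lies in [low, high] by sample."""
    by_sample = {}
    for row in csv.DictReader(table):
        allele_freq = float(row["AF"])
        if low <= allele_freq <= high:
            by_sample.setdefault(row["SAMPLE"], []).append(
                (int(row["POS"]), row["HGVS_P_1LETTER"], allele_freq)
            )
    return by_sample


# Three rows taken from the table above, reduced to the columns used here.
example = io.StringIO(
    "SAMPLE,CHROM,POS,REF,ALT,FILTER,DP,REF_DP,ALT_DP,AF,HGVS_P_1LETTER\n"
    "218976,MN908947.3,23063,A,T,PASS,1150,5,1141,0.99,p.N501Y\n"
    "218987,MN908947.3,1875,C,T,PASS,3791,1779,1890,0.5,p.A537V\n"
    "218987,MN908947.3,2265,G,A,PASS,1445,912,515,0.36,p.C667Y\n"
)
hits = mixed_frequency_variants(example)
```

Intermediate-frequency variants are only a hint of a mixed sample and need manual inspection.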
Pipeline validation and benchmarking
The pipeline has been validated using 54 SARS-CoV-2 samples sequenced with the Artic amplicon scheme v4. These samples have a mixed composition of SARS-CoV-2 lineages, including B.1.1.7, AY.* and BA.*, which are known to have problematic deletions and triplets.
Bibliography:
[1] Koboldt, D.C. Best practices for variant calling in clinical sequencing. Genome Med 12, 91 (2020).
[2] GATK Team. Fisher’s Exact Test (2020).
Special acknowledgement for this documentation to:
@svarona
@ErikaKvalem
@Alema91
@saramonzon
@drpatelh