Major updates in v2.3

drpatelh opened this issue 2 years ago · comments

Harshil Patel commented 2 years ago

Please see below for a summary of changes.

Sara Monzón · Answer 1 · Wed Feb 09 2022 20:33:46 GMT+0800 (China Standard Time)

Major enhancements

Included strand-bias annotation for ivar

NGS data are prone to certain types of artifactual variant calls, and strand bias is a clear example. For instance, all but one of the variant-supporting reads may be on the reverse strand while the reference-supporting reads are equally represented on both strands, giving rise to a false-positive scenario known as strand bias [1].

Most current variant callers support strand-bias filtering, but iVar still lacks this functionality (andersen-lab/ivar#5).

The new viralrecon release now offers this functionality, taking the artifact into consideration when converting the iVar variants tsv file to vcf format inside the ivar_variants_to_vcf.py script. To do so, a Fisher exact test is performed and an SB annotation in the FILTER field tags variants with a significant strand-bias p-value (< 0.05). In addition, a new INFO field is added with the p-value (e.g. SB_pvalue=1e-05).

Note that variants are not filtered; only an annotation is added to the FILTER field. If you want to filter the variants, you need to do so afterwards using this tag.
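As a rough illustration of that post-filtering step, the Python sketch below drops records whose FILTER column carries the SB tag. It is not part of viralrecon: the function name drop_strand_bias is hypothetical, and the only assumption is the standard VCF layout in which FILTER is the seventh tab-separated column.

```python
def drop_strand_bias(vcf_lines, tag="SB"):
    """Yield header lines and records whose FILTER column does not contain `tag`."""
    for line in vcf_lines:
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            yield line  # header lines are passed through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th VCF column; several filters are separated by ';'
        if tag not in fields[6].split(";"):
            yield line


records = [
    "##fileformat=VCFv4.2\n",
    "MN908947.3\t17615\t.\tA\tG\t.\tPASS\tDP=8494\n",
    "MN908947.3\t18653\t.\tG\tA\t.\tSB\tDP=9264\n",
]
kept = list(drop_strand_bias(records))
```

Here only the header and the 17615 record are kept, because the 18653 record is tagged SB.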

Input tsv:

MN908947.3      17615   A       G       6       3       52      8487    3406    56      0.999176        8494    0       TRUE    cds-QHD43415.1  AAG     K       AGG     R
MN908947.3      18653   G       A       6302    1779    48      2757    903     41      0.297604        9264    0       TRUE    cds-QHD43415.1  CGC     R       CAC     H

Output vcf:

MN908947.3      17615   .       A       G       .       PASS    DP=8494:SB_pvalue=0.81896 GT:REF_DP:REF_RV:REF_QUAL:ALT_DP:ALT_RV:ALT_QUAL:ALT_FREQ       1:6:3:52:8487:3406:56:0.999176
MN908947.3      18653   .       G       A       .       SB      DP=9264:SB_pvalue=1e-05 GT:REF_DP:REF_RV:REF_QUAL:ALT_DP:ALT_RV:ALT_QUAL:ALT_FREQ       1:6302:1779:48:2757:903:41:0.297604

The Fisher exact test is based on a contingency table, as described in the GATK documentation [2]:

	Forward Strand	Reverse Strand	Total
Reference Allele	4523	1779	6302
Alternate Allele	1854	903	2757
Total	6377	2682	9059

Code - contingency table legend:

REF_FW - Reference Allele Forward Strand
REF_RV - Reference Allele Reverse Strand
ALT_FW - Alternate Allele Forward Strand
ALT_RV - Alternate Allele Reverse Strand
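To make the test concrete, here is a minimal pure-Python sketch of a Fisher exact test on this contingency table. It is an illustration and not the code from ivar_variants_to_vcf.py: the names log_comb and strand_bias_pvalue are hypothetical, and the one-sided alternative (REF_FW at least as large as observed) is an assumption, chosen because it appears consistent with the SB_pvalue values in the example output.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def strand_bias_pvalue(ref_dp, ref_rv, alt_dp, alt_rv):
    """One-sided Fisher exact p-value for P(REF_FW >= observed REF_FW)."""
    ref_fw = ref_dp - ref_rv  # forward counts are derived from depth minus reverse
    alt_fw = alt_dp - alt_rv
    total = ref_dp + alt_dp
    fw_total = ref_fw + alt_fw
    log_denominator = log_comb(total, fw_total)
    p_value = 0.0
    # With all margins fixed, REF_FW follows a hypergeometric distribution;
    # sum the upper tail starting at the observed value.
    for k in range(ref_fw, min(ref_dp, fw_total) + 1):
        p_value += exp(
            log_comb(ref_dp, k) + log_comb(alt_dp, fw_total - k) - log_denominator
        )
    return min(p_value, 1.0)


# Counts from the 18653 G>A example row (assumed one-sided test, see above)
p = strand_bias_pvalue(ref_dp=6302, ref_rv=1779, alt_dp=2757, alt_rv=903)
filter_tag = "SB" if p < 0.05 else "PASS"
```

With the counts of the 18653 G>A row the p-value falls well below 0.05, so the record would be tagged SB, while the 17615 A>G row stays far above the threshold.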

Strand-bias filtering is not recommended for every type of experiment. Amplicon data, because of the PCR-based enrichment procedure, are prone to strand-bias artifacts that do not necessarily imply a greater probability of a false positive. Moreover, amplicon experiments normally generate deep-coverage data that do not need this type of filtering. That is why ivar_variants_to_vcf.py provides a new option, --ignore-strand-bias, to skip the Fisher test; this parameter is set by default when --protocol amplicon is used.

Consecutive variants called by iVar belonging to the same codon are now collapsed into one line in order to fix annotation

During variant analysis of SARS-CoV-2, some complex variants, such as the triplet nucleotide change that alters an entire codon in the B.1.1.7 VOC, are reported by variant callers as three separate nucleotide changes instead of a single change spanning the three nucleotides, which leads to a wrong amino acid annotation. This is also a known problem in iVar (andersen-lab/ivar#92), which we have fixed in this new viralrecon release, again through the ivar_variants_to_vcf.py script.

Input tsv file with three variant lines and wrong amino acid annotation:

REGION	POS	REF	ALT	REF_CODON	REF_AA	ALT_CODON	ALT_AA
MN908947.3	28280	G	C	GAT	D	CAT	H
MN908947.3	28281	A	T	GAT	D	GTT	V
MN908947.3	28282	T	A	GAT	D	GAA	E

Output vcf with three variants belonging to the same codon merged in just one line:

MN908947.3      28280   .       GAT     CTA     .       PASS    DP=7610 GT:REF_DP:REF_RV:REF_QUAL:ALT_DP:ALT_RV:ALT_QUAL:ALT_FREQ       1:6:3:34:7602:3852:35:0.998949

Fixed annotation with snpeff:

CHROM	POS	REF	ALT	GENE	EFFECT	HGVS_C	HGVS_P
MN908947.3	28280	GAT	CTA	N	missense_variant	c.7_9delGATinsCTA	p.Asp3Leu

As with the strand-bias implementation, the script comes with the parameter --ignore-merge-codons if you want the previous ivar_variants_to_vcf.py behaviour.

Script logic for consecutive and same codon variants detection.

The script ivar_variants_to_vcf.py iterates through the .tsv file, reading each line. It saves the information of each line in a dictionary, which is filled with the informative fields for up to three positions. Once the dictionary meets this requirement, we check for consecutive positions and evaluate whether they belong to the same codon. The dictionary acts as a queue of size three and is always evaluated when it is full.

dict = {
	'CHROM': ['MN908947.3', 'MN908947.3', 'MN908947.3'],
	'POS': [28280, 28281, 28282],
	'REF': ['G', 'A', 'T'],
	'ALT': ['C', 'T', 'A'],
	'REF_CODON': ['GAT', 'GAT', 'GAT'],
	'ALT_CODON': ['CAT', 'GTT', 'GAA'],
}

Once the dict is full we evaluate as follows:
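One way to express this kind of evaluation is sketched below in Python. It is a simplified illustration rather than the logic of ivar_variants_to_vcf.py itself: the helper names codon_start and merge_same_codon are hypothetical, a plain list is scanned instead of a size-three queue, and a variant's codon is inferred (as an assumption) from the single position at which REF_CODON and ALT_CODON differ.

```python
def codon_start(rec):
    """Genomic start of the codon hit by a SNV, or None if it cannot be inferred."""
    ref_codon, alt_codon = rec["REF_CODON"], rec["ALT_CODON"]
    if len(rec["REF"]) != 1 or len(rec["ALT"]) != 1:
        return None  # only single-nucleotide changes are considered
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        return None  # no usable codon annotation
    diffs = [i for i in range(3) if ref_codon[i] != alt_codon[i]]
    if len(diffs) != 1:
        return None
    return rec["POS"] - diffs[0]


def merge_same_codon(records):
    """Collapse consecutive SNVs that fall in the same codon into one record."""
    merged, starts = [], []
    for rec in records:
        start = codon_start(rec)
        if (
            merged
            and start is not None
            and starts[-1] == start
            and merged[-1]["CHROM"] == rec["CHROM"]
            and rec["POS"] == merged[-1]["POS"] + len(merged[-1]["REF"])
        ):
            # Same codon and directly adjacent: extend the previous record
            merged[-1]["REF"] += rec["REF"]
            merged[-1]["ALT"] += rec["ALT"]
        else:
            merged.append({k: rec[k] for k in ("CHROM", "POS", "REF", "ALT")})
            starts.append(start)
    return merged


rows = [
    {"CHROM": "MN908947.3", "POS": 28280, "REF": "G", "ALT": "C", "REF_CODON": "GAT", "ALT_CODON": "CAT"},
    {"CHROM": "MN908947.3", "POS": 28281, "REF": "A", "ALT": "T", "REF_CODON": "GAT", "ALT_CODON": "GTT"},
    {"CHROM": "MN908947.3", "POS": 28282, "REF": "T", "ALT": "A", "REF_CODON": "GAT", "ALT_CODON": "GAA"},
]
collapsed = merge_same_codon(rows)
```

For the three example lines this returns a single record at position 28280 with REF GAT and ALT CTA, matching the merged vcf line shown earlier.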

Option to generate consensus with BCFTools / BEDTools using iVar variants

Another new functionality is that viralrecon now lets you choose which software to use for variant calling (iVar or Bcftools) and for consensus genome generation (iVar or Bcftools), so you can combine them (#246).

Previous viralrecon versions used iVar by default for both variant calling and consensus genome generation. This combination had some drawbacks related to known iVar issues (andersen-lab/ivar#103, andersen-lab/ivar#97, andersen-lab/ivar#85).

Now, viralrecon performs variant calling using iVar, then processes those variants as explained above (strand-bias annotation and merged codons), and finally generates the consensus genome using the filtered variants called by iVar. This produces the following differences in the final consensus fasta files:

iVar includes low-frequency deletions in the consensus even when the filter is applied (andersen-lab/ivar#83)

This is fixed when creating the consensus with the filtered iVar variants:

First sequence is reference, second sequence is the consensus generated by Bcftools and third sequence is consensus generated with iVar.

iVar's tsv file will look like this:

REGION	POS	REF	ALT	REF_DP	ALT_DP	ALT_FREQ
MN908947.3	25497	A	-CCGATACAAGCCTCACTCCCTTTCGGATGGCTT	19179	8470	0.434314

The deletion has a frequency lower than the 0.75 threshold set in the consensus filter, yet it is added to the iVar consensus. It is not added to the Bcftools consensus.
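The frequency check described here can be sketched in a few lines of Python. This is only an illustration and not viralrecon code: the function name variants_for_consensus is hypothetical, the 0.75 threshold is the value mentioned in the text, and the input is assumed to be a tab-separated table with an ALT_FREQ column as in the excerpts above.

```python
import csv
import io


def variants_for_consensus(tsv_text, min_af=0.75):
    """Return the rows whose ALT_FREQ reaches the consensus threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row["ALT_FREQ"]) >= min_af]


example = (
    "REGION\tPOS\tREF\tALT\tREF_DP\tALT_DP\tALT_FREQ\n"
    "MN908947.3\t17615\tA\tG\t6\t8487\t0.999176\n"
    "MN908947.3\t25497\tA\t-CCGATACAAGCCTCACTCCCTTTCGGATGGCTT\t19179\t8470\t0.434314\n"
)
selected = variants_for_consensus(example)
```

Only the 17615 variant passes, so under this rule the low-frequency deletion at 25497 would be left out of the consensus.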

iVar includes Ns even when the position has enough coverage:

First sequence is reference, second sequence is the consensus generated by bcftools and third sequence is consensus generated with iVar.

iVar's .tsv file will look like this:

REGION	POS	REF	ALT	REF_DP	ALT_DP	ALT_FREQ
MN908947.3	28361	G	-GAGAACGCA	378	279	0.734211

It was supposed to be a deletion, not an N nucleotide, and the N does not appear when creating the consensus with Bcftools.

iVar calls ambiguous nucleotides, thereby adding low-frequency variants:

As explained in iVar's manual, if one base is not enough to reach a given frequency, an ambiguous nucleotide is called at that position, which means low-frequency variants are included. Example:

REGION	POS	REF	ALT	REF_DP	REF_RV	REF_QUAL	ALT_DP	ALT_RV	ALT_QUAL	ALT_FREQ	TOTAL_DP	PVAL	PASS	GFF_FEATURE	REF_CODON	REF_AA	ALT_CODON	ALT_AA
MN908947.3	27665	A	G	7	4	35	4	0	36	0.363636	11	0.0551378	FALSE	cds-QHD43421.1	GAG	E	GGG	G
MN908947.3	27666	G	C	7	4	35	4	0	32	0.363636	11	0.0551378	FALSE	cds-QHD43421.1	GAG	E	GAC	D

These variants are at about 0.36 AF, so the reference nucleotide AF (about 0.64) does not reach the minimum of 0.75, and both positions are included in the consensus as ambiguous nucleotides:

First sequence is reference, second sequence is the consensus generated by Bcftools and third sequence is consensus generated with iVar.

The iVar consensus introduces R (A or G) at position 27665 and S (G or C) at position 27666, when only the reference should be included. This is fixed when creating the consensus with Bcftools.
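The way an ambiguous code can arise from a frequency threshold is sketched below. This is a simplified Python illustration and not iVar's implementation: the function name consensus_base is hypothetical, and it assumes that bases are added in order of decreasing depth until their cumulative frequency reaches the threshold, after which the selected set is mapped to its IUPAC code.

```python
# IUPAC nucleotide codes keyed by the set of bases they represent
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus_base(depths, threshold=0.75):
    """Pick the consensus symbol for one position from per-base depths."""
    total = sum(depths.values())
    if total == 0:
        return "N"  # no coverage at all
    chosen, cumulative = set(), 0
    # Add bases from most to least supported until the threshold is reached
    for base, depth in sorted(depths.items(), key=lambda kv: kv[1], reverse=True):
        chosen.add(base)
        cumulative += depth
        if cumulative / total >= threshold:
            break
    return IUPAC[frozenset(chosen)]


pos_27665 = consensus_base({"A": 7, "G": 4})  # REF_DP=7, ALT_DP=4
pos_27666 = consensus_base({"G": 7, "C": 4})
```

With the depths from the two example rows, neither reference base reaches 0.75 on its own (7 of 11 reads), so the sketch yields R and S, the same ambiguous codes described for the iVar consensus.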

iVar includes Ns at low-frequency variants that should not be included in the consensus:

When there are deletions in iVar's tsv file with low allele frequency, the reference should be included, but iVar introduces Ns instead:

First sequence is reference, second sequence is the consensus generated by Bcftools and third sequence is consensus generated with iVar.

The tsv file looks like this:

REGION	POS	REF	ALT	REF_DP	ALT_DP	ALT_FREQ
MN908947.3	11287	G	-TCTGGTTTT	115	41	0.356522

In the consensus, the reference nucleotides should be included as with Bcftools.

New variants and lineage report table

viralrecon now provides a new variants report table that unifies variant calling, annotation and, if desired, lineage. This table can be really useful for variant inspection, co-infection analysis or metagenomic data such as SARS-CoV-2 sewage sequencing.

SAMPLE	CHROM	POS	REF	ALT	FILTER	DP	REF_DP	ALT_DP	AF	GENE	EFFECT	HGVS_C	HGVS_P	HGVS_P_1LETTER	CALLER	LINEAGE
218976	MN908947.3	23063	A	T	PASS	1150	5	1141	0.99	S	missense_variant	c.1501A>T	p.Asn501Tyr	p.N501Y	ivar	B.1.1.7
218976	MN908947.3	23271	C	A	PASS	12288	6	12196	0.99	S	missense_variant	c.1709C>A	p.Ala570Asp	p.A570D	ivar	B.1.1.7
218976	MN908947.3	23403	A	G	PASS	12982	24	12954	1.0	S	missense_variant	c.1841A>G	p.Asp614Gly	p.D614G	ivar	B.1.1.7
218976	MN908947.3	23604	C	A	PASS	4845	0	4829	1.0	S	missense_variant	c.2042C>A	p.Pro681His	p.P681H	ivar	B.1.1.7
218976	MN908947.3	23709	C	T	PASS	5083	8	5071	1.0	S	missense_variant	c.2147C>T	p.Thr716Ile	p.T716I	ivar	B.1.1.7
218976	MN908947.3	24506	T	G	PASS	829	0	829	1.0	S	missense_variant	c.2944T>G	p.Ser982Ala	p.S982A	ivar	B.1.1.7
218976	MN908947.3	24914	G	C	PASS	10641	3	10621	1.0	S	missense_variant	c.3352G>C	p.Asp1118His	p.D1118H	ivar	B.1.1.7
218976	MN908947.3	26013	C	T	PASS	260	2	258	0.99	ORF3a	synonymous_variant	c.621C>T	p.Phe207Phe	p.F207F	ivar	B.1.1.7
218976	MN908947.3	26060	C	T	PASS	333	1	332	1.0	ORF3a	missense_variant	c.668C>T	p.Thr223Ile	p.T223I	ivar	B.1.1.7
218976	MN908947.3	27972	C	T	PASS	978	2	975	1.0	ORF8	stop_gained	c.79C>T	p.Gln27*	p.Q27*	ivar	B.1.1.7
218976	MN908947.3	28048	G	T	PASS	908	0	905	1.0	ORF8	missense_variant	c.155G>T	p.Arg52Ile	p.R52I	ivar	B.1.1.7
218976	MN908947.3	28111	A	G	PASS	3745	10	3734	1.0	ORF8	missense_variant	c.218A>G	p.Tyr73Cys	p.Y73C	ivar	B.1.1.7
218976	MN908947.3	28270	TA	T	PASS	8843	8740	7788	0.88	N	upstream_gene_variant	c.-3delA	.	.	ivar	B.1.1.7
218976	MN908947.3	28280	GAT	CTA	PASS	7610	6	7602	1.0	N	missense_variant	c.7_9delGATinsCTA	p.Asp3Leu	p.D3L	ivar	B.1.1.7
218976	MN908947.3	28881	GG	AA	PASS	1011	7	1003	0.99	N	missense_variant	c.608_609delGGinsAA	p.Arg203Lys	p.R203K	ivar	B.1.1.7
218976	MN908947.3	28883	G	C	PASS	1028	0	1028	1.0	N	missense_variant	c.610G>C	p.Gly204Arg	p.G204R	ivar	B.1.1.7
218976	MN908947.3	28931	G	T	PASS	1142	0	1135	0.99	N	missense_variant	c.658G>T	p.Ala220Ser	p.A220S	ivar	B.1.1.7
218976	MN908947.3	28977	C	T	PASS	1041	9	1029	0.99	N	missense_variant	c.704C>T	p.Ser235Phe	p.S235F	ivar	B.1.1.7
218987	MN908947.3	210	G	T	PASS	2192	0	2184	1.0	orf1ab	upstream_gene_variant	c.-56G>T	.	.	ivar	AY.127
218987	MN908947.3	241	C	T	PASS	2059	10	2046	0.99	orf1ab	upstream_gene_variant	c.-25C>T	.	.	ivar	AY.127
218987	MN908947.3	1385	C	T	PASS	4852	16	4833	1.0	orf1ab	missense_variant	c.1120C>T	p.His374Tyr	p.H374Y	ivar	AY.127
218987	MN908947.3	1875	C	T	PASS	3791	1779	1890	0.5	orf1ab	missense_variant	c.1610C>T	p.Ala537Val	p.A537V	ivar	AY.127
218987	MN908947.3	2265	G	A	PASS	1445	912	515	0.36	orf1ab	missense_variant	c.2000G>A	p.Cys667Tyr	p.C667Y	ivar	AY.127
218987	MN908947.3	3037	C	T	PASS	463	1	462	1.0	orf1ab	synonymous_variant	c.2772C>T	p.Phe924Phe	p.F924F	ivar	AY.127
218987	MN908947.3	4181	G	T	PASS	2568	1	2557	1.0	orf1ab	missense_variant	c.3916G>T	p.Ala1306Ser	p.A1306S	ivar	AY.127
218987	MN908947.3	6402	C	T	PASS	13115	68	13037	0.99	orf1ab	missense_variant	c.6137C>T	p.Pro2046Leu	p.P2046L	ivar	AY.127
218987	MN908947.3	7124	C	T	PASS	1438	6	1428	0.99	orf1ab	missense_variant	c.6859C>T	p.Pro2287Ser	p.P2287S	ivar	AY.127
218987	MN908947.3	8986	C	T	PASS	1025	8	1015	0.99	orf1ab	synonymous_variant	c.8721C>T	p.Asp2907Asp	p.D2907D	ivar	AY.127
218987	MN908947.3	9053	G	T	PASS	1318	0	1317	1.0	orf1ab	missense_variant	c.8788G>T	p.Val2930Leu	p.V2930L	ivar	AY.127
218987	MN908947.3	10029	C	T	PASS	1852	1	1851	1.0	orf1ab	missense_variant	c.9764C>T	p.Thr3255Ile	p.T3255I	ivar	AY.127
218987	MN908947.3	10039	C	T	PASS	1860	2	1856	1.0	orf1ab	synonymous_variant	c.9774C>T	p.Thr3258Thr	p.T3258T	ivar	AY.127
218987	MN908947.3	10106	G	A	PASS	2877	14	2859	0.99	orf1ab	missense_variant	c.9841G>A	p.Val3281Ile	p.V3281I	ivar	AY.127
218987	MN908947.3	11201	A	G	PASS	747	4	741	0.99	orf1ab	missense_variant	c.10936A>G	p.Thr3646Ala	p.T3646A	ivar	AY.127
218987	MN908947.3	11332	A	G	PASS	905	1	904	1.0	orf1ab	synonymous_variant	c.11067A>G	p.Val3689Val	p.V3689V	ivar	AY.127
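The HGVS_P_1LETTER column is a shortened form of HGVS_P. A conversion of that kind can be sketched as follows; this is an illustrative Python snippet and not the code viralrecon uses, and the names THREE_TO_ONE and hgvs_p_to_one_letter are hypothetical.

```python
import re

# Standard three-letter to one-letter amino acid codes ("Ter" is a stop)
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}


def hgvs_p_to_one_letter(hgvs_p):
    """Replace three-letter residue names in an HGVS protein string."""
    return re.sub(
        r"[A-Z][a-z]{2}",
        lambda match: THREE_TO_ONE.get(match.group(0), match.group(0)),
        hgvs_p,
    )


short = hgvs_p_to_one_letter("p.Asn501Tyr")
```

Applied to p.Asn501Tyr this gives p.N501Y, as in the first row of the table; values without residue names, such as ".", are returned unchanged.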

Pipeline validation and benchmarking

The pipeline has been validated using 54 SARS-CoV-2 samples sequenced with the ARTIC amplicon scheme v4. These samples have a mixed composition of SARS-CoV-2 lineages, including B.1.1.7, AY.* and BA.*, which are known to have problematic deletions and triplets.

Bibliography:

[1] Koboldt, D.C. Best practices for variant calling in clinical sequencing. Genome Med 12, 91 (2020).

[2] GATK Team. Fisher's Exact Test (2020).

Special acknowledgement for this documentation to:
@svarona
@ErikaKvalem
@Alema91
@saramonzon
@drpatelh