nf-core / viralrecon

Assembly and intrahost/low-frequency variant calling for viral samples

Home Page:https://nf-co.re/viralrecon

Major updates in v2.3

drpatelh opened this issue · comments

Please see below for a summary of changes.

Major enhancements

Included strand-bias annotation for ivar

NGS data are prone to certain types of artifactual variant calls, and strand bias is a clear example. For instance, when all but one of the variant-supporting reads are on the reverse strand while the reference-supporting reads are equally represented on both strands, the result is a false-positive scenario known as strand bias [1].

Most current variant callers support strand-bias filtering, but iVar still lacks this functionality (andersen-lab/ivar#5).

This viralrecon release adds it: the artifact is taken into account while converting the iVar variants tsv file to vcf format inside the ivar_variants_to_vcf.py script. A Fisher exact test is performed, and variants with a significant strand-bias p-value (< 0.05) are tagged with the SB annotation in the FILTER field. In addition, a new INFO field holds the p-value (e.g. SB_pvalue=1e-05).

Note that variants are not removed; only an annotation is added to the FILTER field. If you want to filter the variants, you need to do it afterwards using this tag.

Input tsv:

MN908947.3      17615   A       G       6       3       52      8487    3406    56      0.999176        8494    0       TRUE    cds-QHD43415.1  AAG     K       AGG     R
MN908947.3      18653   G       A       6302    1779    48      2757    903     41      0.297604        9264    0       TRUE    cds-QHD43415.1  CGC     R       CAC     H

Output vcf:

MN908947.3      17615   .       A       G       .       PASS    DP=8494:SB_pvalue=0.81896 GT:REF_DP:REF_RV:REF_QUAL:ALT_DP:ALT_RV:ALT_QUAL:ALT_FREQ       1:6:3:52:8487:3406:56:0.999176
MN908947.3      18653   .       G       A       .       SB      DP=9264:SB_pvalue=1e-05 GT:REF_DP:REF_RV:REF_QUAL:ALT_DP:ALT_RV:ALT_QUAL:ALT_FREQ       1:6302:1779:48:2757:903:41:0.297604
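Because the SB tag is only an annotation, removing strand-biased calls is a downstream step. The following is a minimal sketch of such a step, assuming a standard VCF column order with FILTER in the seventh column; the function name drop_strand_biased is hypothetical and is not part of viralrecon.

```python
def drop_strand_biased(vcf_lines):
    """Keep header lines and records whose FILTER column does not contain SB."""
    kept = []
    for line in vcf_lines:
        if not line.strip():
            continue  # skip empty lines
        if line.startswith("#"):
            kept.append(line)  # header lines are passed through untouched
            continue
        fields = line.split()
        # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        filters = fields[6].split(";")
        if "SB" not in filters:
            kept.append(line)
    return kept


# Shortened versions of the two records shown above (FORMAT columns omitted).
records = [
    "MN908947.3\t17615\t.\tA\tG\t.\tPASS\tDP=8494",
    "MN908947.3\t18653\t.\tG\tA\t.\tSB\tDP=9264",
]
passing = drop_strand_biased(records)
print([r.split()[1] for r in passing])  # -> ['17615']
```

Splitting the FILTER value on ";" keeps the check correct if a record carries more than one filter tag.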

The Fisher exact test is based on a contingency table, as described in the GATK documentation [2]:

                   Forward Strand   Reverse Strand   Total
Reference Allele   4523             1779             6302
Alternate Allele   1854             903              2757
Total              6377             2682             9059

Code - contingency legend:

  • REF_FW - Reference Allele Forward Strand
  • REF_RV - Reference Allele Reverse Strand
  • ALT_FW - Alternate Allele Forward Strand
  • ALT_RV - Alternate Allele Reverse Strand
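The test on this 2x2 table can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code used in ivar_variants_to_vcf.py: the function name strand_bias_pvalue is hypothetical, the two-sided p-value is defined as the sum of the probabilities of all tables that are no more likely than the observed one, and the forward counts are derived as REF_DP - REF_RV and ALT_DP - ALT_RV. Values from a library routine may differ in the last digits.

```python
from math import exp, lgamma


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def strand_bias_pvalue(ref_fw, ref_rv, alt_fw, alt_rv):
    """Two-sided Fisher exact test p-value for a 2x2 strand table."""
    ref_total = ref_fw + ref_rv
    alt_total = alt_fw + alt_rv
    fw_total = ref_fw + alt_fw
    n = ref_total + alt_total

    def log_prob(x):
        # Hypergeometric probability of x reference-forward reads
        # when all row and column totals are held fixed.
        return (log_choose(ref_total, x)
                + log_choose(alt_total, fw_total - x)
                - log_choose(n, fw_total))

    lowest = max(0, fw_total - alt_total)
    highest = min(ref_total, fw_total)
    observed = log_prob(ref_fw)
    # Sum every table that is at most as probable as the observed one.
    total = sum(exp(log_prob(x))
                for x in range(lowest, highest + 1)
                if log_prob(x) <= observed + 1e-7)
    return min(1.0, total)


# Position 18653 from the example above: REF_DP=6302, REF_RV=1779,
# ALT_DP=2757, ALT_RV=903.
p_value = strand_bias_pvalue(6302 - 1779, 1779, 2757 - 903, 903)
print("SB" if p_value < 0.05 else "PASS")
```

Working in log space avoids overflow in the factorials at the read depths typical of amplicon data.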

Strand-bias filtering is not recommended for every type of experiment. Amplicon data, because the enrichment procedure is PCR-based, are prone to strand-bias artifacts that do not necessarily imply a greater probability of a false positive. Moreover, amplicon experiments normally generate deep-coverage data that do not need this type of filtering. That is why ivar_variants_to_vcf.py provides a new option, --ignore-strand-bias, to skip the Fisher test; this parameter is set by default when --protocol amplicon is used.

Consecutive variants called by ivar belonging to the same codon are now collapsed in one line in order to fix annotation

During variant analysis of SARS-CoV-2, some complex variants, such as the triplet nucleotide change that alters an entire codon in the B.1.1.7 VOC, are reported by variant callers as three separate single-nucleotide changes instead of one change spanning the three nucleotides, which leads to a wrong amino acid annotation. This is also a known problem in iVar (andersen-lab/ivar#92), which this viralrecon release fixes, again through the ivar_variants_to_vcf.py script.

Input tsv file with three variant lines and wrong amino acid annotation:

REGION POS REF ALT REF_CODON REF_AA ALT_CODON ALT_AA
MN908947.3 28280 G C GAT D CAT H
MN908947.3 28281 A T GAT D GTT V
MN908947.3 28282 T A GAT D GAA E

Output vcf with three variants belonging to the same codon merged in just one line:

MN908947.3      28280   .       GAT     CTA     .       PASS    DP=7610 GT:REF_DP:REF_RV:REF_QUAL:ALT_DP:ALT_RV:ALT_QUAL:ALT_FREQ       1:6:3:34:7602:3852:35:0.998949

Fixed annotation with snpeff:

CHROM POS REF ALT GENE EFFECT HGVS_C HGVS_P
MN908947.3 28280 GAT CTA N missense_variant c.7_9delGATinsCTA p.Asp3Leu

As with the strand-bias implementation, the script provides the parameter --ignore-merge-codons if you want the previous ivar_variants_to_vcf.py behaviour.

Script logic for detecting consecutive variants in the same codon.

The script ivar_variants_to_vcf.py iterates through the .tsv file line by line. It saves the information from each line in a dictionary, which holds all the informative fields for up to three positions. Once the dictionary is full, we check whether the positions are consecutive and evaluate whether they belong to the same codon. The dictionary acts as a queue of size three that is always evaluated when it is full.

dict = {
    'CHROM': ['MN908947.3', 'MN908947.3', 'MN908947.3'],
    'POS': [28280, 28281, 28282],
    'REF': ['G', 'A', 'T'],
    'ALT': ['C', 'T', 'A'],
    'REF_CODON': ['GAT', 'GAT', 'GAT'],
    'ALT_CODON': ['CAT', 'GTT', 'GAA'],
}

Once the dict is full we evaluate as follows:
image
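The original flowchart is not reproduced here, so the following is only a sketch of one way to detect consecutive variants in the same codon, not the logic of ivar_variants_to_vcf.py. It assumes that each SNV row changes exactly one base of its codon, so the codon start can be derived from the position at which REF_CODON and ALT_CODON differ. The function names codon_offset and merge_same_codon are hypothetical, and the sketch walks the rows in order instead of using an explicit queue of size three.

```python
def codon_offset(ref_codon, alt_codon):
    """Index (0-2) of the single base that differs, or None if not exactly one."""
    diffs = [i for i, (r, a) in enumerate(zip(ref_codon, alt_codon)) if r != a]
    return diffs[0] if len(diffs) == 1 else None


def merge_same_codon(rows):
    """Merge runs of adjacent SNVs that fall in the same codon.

    rows: dicts with POS, REF, ALT, REF_CODON, ALT_CODON, sorted by POS.
    Returns a list of (POS, REF, ALT) tuples.
    """
    merged = []
    prev_start = None  # codon start of the record currently being extended
    for row in rows:
        offset = codon_offset(row["REF_CODON"], row["ALT_CODON"])
        start = row["POS"] - offset if offset is not None else None
        adjacent = bool(merged) and row["POS"] == merged[-1][0] + len(merged[-1][1])
        if adjacent and start is not None and start == prev_start:
            pos, ref, alt = merged[-1]
            merged[-1] = (pos, ref + row["REF"], alt + row["ALT"])
        else:
            merged.append((row["POS"], row["REF"], row["ALT"]))
            prev_start = start
    return merged


# The three rows from the input tsv shown earlier.
rows = [
    {"POS": 28280, "REF": "G", "ALT": "C", "REF_CODON": "GAT", "ALT_CODON": "CAT"},
    {"POS": 28281, "REF": "A", "ALT": "T", "REF_CODON": "GAT", "ALT_CODON": "GTT"},
    {"POS": 28282, "REF": "T", "ALT": "A", "REF_CODON": "GAT", "ALT_CODON": "GAA"},
]
print(merge_same_codon(rows))  # -> [(28280, 'GAT', 'CTA')]
```

Comparing codon start positions instead of codon strings avoids merging two adjacent variants that sit in neighbouring codons with the same sequence.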

Option to generate consensus with BCFTools / BEDTools using iVar variants

Another new feature is that viralrecon now lets you choose which software to use for variant calling (iVar or Bcftools) and for consensus genome generation (iVar or Bcftools), so you can combine them (#246).

Previous viralrecon versions used iVar by default for both variant calling and consensus genome generation. This combination had some drawbacks related to known iVar issues (andersen-lab/ivar#103, andersen-lab/ivar#97, andersen-lab/ivar#85).

Now, viralrecon performs variant calling with iVar, filters those variants as explained above (strand bias and merged codons), and finally generates the consensus genome from the filtered iVar variants. This produces the following differences in the final consensus fasta files:

  • iVar includes low-frequency deletions in the consensus even when the filter is applied: andersen-lab/ivar#83

This is fixed when creating the consensus from the filtered iVar variants:

ivar_deletion

First sequence is reference, second sequence is the consensus generated by Bcftools and third sequence is consensus generated with iVar.

iVar's tsv file will look like this:

REGION	POS	REF	ALT	REF_DP	ALT_DP	ALT_FREQ
MN908947.3	25497	A	-CCGATACAAGCCTCACTCCCTTTCGGATGGCTT	19179	8470	0.434314

The deletion has a frequency lower than the 0.75 threshold set in the consensus filter, yet it is added to the iVar consensus; it is not added to the Bcftools consensus.

  • iVar includes Ns when the position has enough coverage:

ivar_del_lofreq

First sequence is reference, second sequence is the consensus generated by bcftools and third sequence is consensus generated with iVar.

iVar's .tsv file will look like this:

REGION	POS	REF	ALT	REF_DP	ALT_DP	ALT_FREQ
MN908947.3	28361	G	-GAGAACGCA	378	279	0.734211

This was supposed to be a deletion, not an N nucleotide; the N does not appear when the consensus is created with Bcftools.

  • iVar calls ambiguous nucleotides, thereby adding low-frequency variants:

As explained in iVar's manual, if one base is not enough to reach a given frequency, an ambiguous nucleotide is called at that position, which means low-frequency variants are included. Example:

REGION       POS   REF ALT REF_DP REF_RV  REF_QUAL ALT_DP ALT_RV ALT_QUAL ALT_FREQ TOTAL_DP        PVAL    PASS    GFF_FEATURE     REF_CODON       REF_AA  ALT_CODON       ALT_AA
MN908947.3	27665	A	G	7	4	35	4	0	36	0.363636	11	0.0551378	FALSE	cds-QHD43421.1	GAG	E	GGG	G
MN908947.3	27666	G	C	7	4	35	4	0	32	0.363636	11	0.0551378	FALSE	cds-QHD43421.1	GAG	E	GAC	D

These variants are at about 0.36 AF, so the reference nucleotide AF (about 0.64) does not reach the minimum 0.75 AF, and both are included in the consensus as ambiguous nucleotides:

ivar_ambiguous

First sequence is reference, second sequence is the consensus generated by Bcftools and third sequence is consensus generated with iVar.

The iVar consensus introduces R (A or G) at position 27665 and S (G or C) at position 27666, when only the reference should be included. This is fixed when creating the consensus with Bcftools.
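The behaviour described above can be illustrated with a simplified sketch. It is an assumption about the general approach (add bases from most to least supported until the cumulative frequency reaches the threshold, then emit the IUPAC code for that set), not iVar's actual code; the function name consensus_base is hypothetical, and ties and indels are ignored.

```python
# IUPAC nucleotide codes keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus_base(depths, threshold=0.75):
    """Pick a consensus symbol from per-base depths, e.g. {'A': 7, 'G': 4}."""
    total = sum(depths.values())
    if total == 0:
        return "N"  # no coverage at this position
    chosen = set()
    covered = 0
    # Add bases from most to least supported until the threshold is met.
    for base, depth in sorted(depths.items(), key=lambda kv: kv[1], reverse=True):
        chosen.add(base)
        covered += depth
        if covered / total >= threshold:
            break
    return IUPAC[frozenset(chosen)]


# Positions 27665 and 27666 from the table above (REF_DP=7, ALT_DP=4).
print(consensus_base({"A": 7, "G": 4}))  # -> R
print(consensus_base({"G": 7, "C": 4}))  # -> S
```

With 7 of 11 reads (about 0.64) the reference base alone stays below 0.75, so the second base is added and the ambiguity code is emitted.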

  • iVar is including Ns in low frequency variants that should not be included in the consensus:

When there are deletions in iVar's tsv file with low allele frequency, the reference should be included, but iVar introduces Ns instead:

ivar_ns

First sequence is reference, second sequence is the consensus generated by Bcftools and third sequence is consensus generated with iVar.

The tsv file looks like this:

REGION	POS	REF	ALT	REF_DP	ALT_DP	ALT_FREQ
MN908947.3	11287	G	-TCTGGTTTT	115	41	0.356522

In the consensus, the reference nucleotides should be included as with Bcftools.

New variants and lineage report table

viralrecon now provides a new variants report table that unifies variant calling, annotation and, if desired, lineage. This table can be very useful for inspecting variants and for analysing co-infections or metagenomics data such as SARS-CoV-2 sewage sequencing.

SAMPLE CHROM POS REF ALT FILTER DP REF_DP ALT_DP AF GENE EFFECT HGVS_C HGVS_P HGVS_P_1LETTER CALLER LINEAGE
218976 MN908947.3 23063 A T PASS 1150 5 1141 0.99 S missense_variant c.1501A>T p.Asn501Tyr p.N501Y ivar B.1.1.7
218976 MN908947.3 23271 C A PASS 12288 6 12196 0.99 S missense_variant c.1709C>A p.Ala570Asp p.A570D ivar B.1.1.7
218976 MN908947.3 23403 A G PASS 12982 24 12954 1.0 S missense_variant c.1841A>G p.Asp614Gly p.D614G ivar B.1.1.7
218976 MN908947.3 23604 C A PASS 4845 0 4829 1.0 S missense_variant c.2042C>A p.Pro681His p.P681H ivar B.1.1.7
218976 MN908947.3 23709 C T PASS 5083 8 5071 1.0 S missense_variant c.2147C>T p.Thr716Ile p.T716I ivar B.1.1.7
218976 MN908947.3 24506 T G PASS 829 0 829 1.0 S missense_variant c.2944T>G p.Ser982Ala p.S982A ivar B.1.1.7
218976 MN908947.3 24914 G C PASS 10641 3 10621 1.0 S missense_variant c.3352G>C p.Asp1118His p.D1118H ivar B.1.1.7
218976 MN908947.3 26013 C T PASS 260 2 258 0.99 ORF3a synonymous_variant c.621C>T p.Phe207Phe p.F207F ivar B.1.1.7
218976 MN908947.3 26060 C T PASS 333 1 332 1.0 ORF3a missense_variant c.668C>T p.Thr223Ile p.T223I ivar B.1.1.7
218976 MN908947.3 27972 C T PASS 978 2 975 1.0 ORF8 stop_gained c.79C>T p.Gln27* p.Q27* ivar B.1.1.7
218976 MN908947.3 28048 G T PASS 908 0 905 1.0 ORF8 missense_variant c.155G>T p.Arg52Ile p.R52I ivar B.1.1.7
218976 MN908947.3 28111 A G PASS 3745 10 3734 1.0 ORF8 missense_variant c.218A>G p.Tyr73Cys p.Y73C ivar B.1.1.7
218976 MN908947.3 28270 TA T PASS 8843 8740 7788 0.88 N upstream_gene_variant c.-3delA . . ivar B.1.1.7
218976 MN908947.3 28280 GAT CTA PASS 7610 6 7602 1.0 N missense_variant c.7_9delGATinsCTA p.Asp3Leu p.D3L ivar B.1.1.7
218976 MN908947.3 28881 GG AA PASS 1011 7 1003 0.99 N missense_variant c.608_609delGGinsAA p.Arg203Lys p.R203K ivar B.1.1.7
218976 MN908947.3 28883 G C PASS 1028 0 1028 1.0 N missense_variant c.610G>C p.Gly204Arg p.G204R ivar B.1.1.7
218976 MN908947.3 28931 G T PASS 1142 0 1135 0.99 N missense_variant c.658G>T p.Ala220Ser p.A220S ivar B.1.1.7
218976 MN908947.3 28977 C T PASS 1041 9 1029 0.99 N missense_variant c.704C>T p.Ser235Phe p.S235F ivar B.1.1.7
218987 MN908947.3 210 G T PASS 2192 0 2184 1.0 orf1ab upstream_gene_variant c.-56G>T . . ivar AY.127
218987 MN908947.3 241 C T PASS 2059 10 2046 0.99 orf1ab upstream_gene_variant c.-25C>T . . ivar AY.127
218987 MN908947.3 1385 C T PASS 4852 16 4833 1.0 orf1ab missense_variant c.1120C>T p.His374Tyr p.H374Y ivar AY.127
218987 MN908947.3 1875 C T PASS 3791 1779 1890 0.5 orf1ab missense_variant c.1610C>T p.Ala537Val p.A537V ivar AY.127
218987 MN908947.3 2265 G A PASS 1445 912 515 0.36 orf1ab missense_variant c.2000G>A p.Cys667Tyr p.C667Y ivar AY.127
218987 MN908947.3 3037 C T PASS 463 1 462 1.0 orf1ab synonymous_variant c.2772C>T p.Phe924Phe p.F924F ivar AY.127
218987 MN908947.3 4181 G T PASS 2568 1 2557 1.0 orf1ab missense_variant c.3916G>T p.Ala1306Ser p.A1306S ivar AY.127
218987 MN908947.3 6402 C T PASS 13115 68 13037 0.99 orf1ab missense_variant c.6137C>T p.Pro2046Leu p.P2046L ivar AY.127
218987 MN908947.3 7124 C T PASS 1438 6 1428 0.99 orf1ab missense_variant c.6859C>T p.Pro2287Ser p.P2287S ivar AY.127
218987 MN908947.3 8986 C T PASS 1025 8 1015 0.99 orf1ab synonymous_variant c.8721C>T p.Asp2907Asp p.D2907D ivar AY.127
218987 MN908947.3 9053 G T PASS 1318 0 1317 1.0 orf1ab missense_variant c.8788G>T p.Val2930Leu p.V2930L ivar AY.127
218987 MN908947.3 10029 C T PASS 1852 1 1851 1.0 orf1ab missense_variant c.9764C>T p.Thr3255Ile p.T3255I ivar AY.127
218987 MN908947.3 10039 C T PASS 1860 2 1856 1.0 orf1ab synonymous_variant c.9774C>T p.Thr3258Thr p.T3258T ivar AY.127
218987 MN908947.3 10106 G A PASS 2877 14 2859 0.99 orf1ab missense_variant c.9841G>A p.Val3281Ile p.V3281I ivar AY.127
218987 MN908947.3 11201 A G PASS 747 4 741 0.99 orf1ab missense_variant c.10936A>G p.Thr3646Ala p.T3646A ivar AY.127
218987 MN908947.3 11332 A G PASS 905 1 904 1.0 orf1ab synonymous_variant c.11067A>G p.Val3689Val p.V3689V ivar AY.127
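The HGVS_P_1LETTER column mirrors HGVS_P with one-letter amino acid codes. A minimal sketch of that conversion follows, assuming a plain three-letter to one-letter substitution; the function name hgvs_p_to_one_letter is hypothetical and this is not necessarily how viralrecon builds the column.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def hgvs_p_to_one_letter(hgvs_p):
    """Replace every known three-letter residue code; leave the rest as is."""
    return re.sub(
        r"[A-Z][a-z]{2}",
        lambda match: THREE_TO_ONE.get(match.group(0), match.group(0)),
        hgvs_p,
    )


print(hgvs_p_to_one_letter("p.Asn501Tyr"))  # -> p.N501Y
print(hgvs_p_to_one_letter("p.Gln27*"))     # -> p.Q27*
```

Unknown tokens and the "." placeholder used for non-coding variants pass through unchanged.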

Pipeline validation and benchmarking

The pipeline has been validated with 54 SARS-CoV-2 samples sequenced using the ARTIC amplicon scheme v4. These samples have a mixed composition of SARS-CoV-2 lineages, including B.1.1.7, AY.* and BA.*, which are known to have problematic deletions and triplets.

Bibliography:

[1] Koboldt, D.C. Best practices for variant calling in clinical sequencing. Genome Med 12, 91 (2020).

[2] GATK Team. Fisher's Exact Test (2020).

Special acknowledgement for this documentation to:
@svarona
@ErikaKvalem
@Alema91
@saramonzon
@drpatelh