aquaskyline / Clairvoyante

Clairvoyante: a multi-task convolutional deep neural network for variant calling in Single Molecule Sequencing

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Allele frequency greater than 1

opened this issue · comments

Hello,

for some of my calls, Clairvoyante reports af greater than 1

1.3333
1.3333
1.3333
1.1667
1.1667
1.1667
1.1667
1.1667
1.1429
1.1111

It's a bit difficult to interpret. How are the af computed?
Here is my command
clairvoyante.py callVarBamParallel --chkpnt_fn trainedModels/fullv3-illumina-novoalign-hg002-hg38/learningRate1e-3.epoch999 --ref_fn A_set.fa --bam_fn A.sorted.bam --output_prefix Aset --threshold 0 --minCoverage 1 --includingAllContigs --tensorflowThreads 4

Thanks for your help.

EDIT: here is the line from the vcf
A_Seg295 5570899 . T TC 138 . . GT:GQ:DP:AF 1/1:138:7:1.1429

EDIT2: from another vcf, with more lines

bcftools view Aset.clairvoyante.vcf.gz |awk -F"\t|:" '{if($16 > 1) print $0}'     
A_Seg257	743430	.	C	CA	999	.	.	GT:GQ:DP:AF	1/1:999:138:1.0072
A_Seg281	115501	.	C	CT	999	.	.	GT:GQ:DP:AF	1/1:999:79:1.0127
A_Seg281	165545	.	T	TC	295	.	.	GT:GQ:DP:AF	1/1:295:15:1.0667
A_Seg283	28	.	A	AC	292	.	.	GT:GQ:DP:AF	1/1:292:16:1.0625
A_Seg295	15309556	.	T	TC	199	.	.	GT:GQ:DP:AF	1/1:199:99:1.0101
A_Seg295	15391885	.	A	AAC	198	.	.	GT:GQ:DP:AF	1/1:198:7:1.1429
A_Seg296	13980837	.	A	AC	417	.	.	GT:GQ:DP:AF	1/1:417:132:1.0076
A_Seg48	1768346	.	G	GC	372	.	.	GT:GQ:DP:AF	1/1:372:164:1.0061
A_Seg72	211872	.	A	AT	999	.	.	GT:GQ:DP:AF	1/1:999:36:1.0278

I also notived most of my variants actually have an af of 0.

A_Seg192	40932	.	A	C	69	.	.	GT:GQ:DP:AF	1/1:69:2:0
A_Seg192	40933	.	GGT	G	48	.	.	GT:GQ:DP:AF	1/1:48:2:0
A_Seg192	40933	.	GGTGC	G	8	.	.	GT:GQ:DP:AF	0/1:8:2:0
A_Seg192	40935	.	T	G	112	.	.	GT:GQ:DP:AF	1/1:112:2:0
A_Seg192	40937	.	C	A	93	.	.	GT:GQ:DP:AF	1/1:93:2:0
A_Seg192	40939	.	CGA	C	11	.	.	GT:GQ:DP:AF	1/1:11:2:0
A_Seg192	40940	.	GAC	G	28	.	.	GT:GQ:DP:AF	1/1:28:2:0
A_Seg192	40941	.	A	T	117	.	.	GT:GQ:DP:AF	1/1:117:2:0
A_Seg192	40943	.	C	T	97	.	.	GT:GQ:DP:AF	1/1:97:2:0
A_Seg192	40944	.	T	G	155	.	.	GT:GQ:DP:AF	1/1:155:2:0
A_Seg192	40945	.	G	T	91	.	.	GT:GQ:DP:AF	1/1:91:2:0

In the vcf output, most problems come from counting AF for insertion.
After digging into the code, turns out there is a bug in counting insertion and deletion read depth.
Hope the problem is fixed now, welcome to try again and see if any weird things happen.

Hello,
I reinstalled Clairvoyante. I don't get any more allele frequencies bigger than 1, so this seems to be fixed.
However, I am still getting af of exactly 0 .

bcftools view GC031600.160503.sorted.bam.vcf.gz| grep -v "#"|awk -F"\t|:" '{print $16}'|sort -n|head
0
0
0

This is not necessarily in INDELS regions, as filtering for SNPs only gives

bcftools view GC031600.160503.sorted.bam.vcf.gz|grep -v "#"| awk -F"\t|:" '{if (length($3) == 1 && length($4) == 1) print $16}'|sort -n|head
0
0
0

Here are sample complete lines

A_Seg100	871	.	G	C	101	.	.	GT:GQ:DP:AF	1/1:101:1:0
A_Seg100	878	.	T	G	55	.	.	GT:GQ:DP:AF	1/1:55:1:0
A_Seg100	881	.	T	G	65	.	.	GT:GQ:DP:AF	1/1:65:1:0
A_Seg100	882	.	C	A	70	.	.	GT:GQ:DP:AF	1/1:70:1:0
A_Seg100	883	.	G	A	89	.	.	GT:GQ:DP:AF	1/1:89:1:0
A_Seg100	902	.	T	G	213	.	.	GT:GQ:DP:AF	1/1:213:44:0
A_Seg100	916	.	T	C	89	.	.	GT:GQ:DP:AF	1/1:89:44:0
A_Seg100	922	.	G	T	83	.	.	GT:GQ:DP:AF	1/1:83:44:0
A_Seg100	1479	.	C	T	234	.	.	GT:GQ:DP:AF	1/1:234:116:0
A_Seg100	1480	.	C	G	411	.	.	GT:GQ:DP:AF	1/1:411:117:0

Is this expected behaviour in such cases?
Interestingly it seems to happen only for homozygous site for the ALT allele. So is this a computation like
REF/(ALT+REF)?

Hello I looked into Tablet ...
So first, it doesn't seem to be linked with 1/1 GT only
A_Seg100 1536 . T C 999 . . GT:GQ:DP:AF 1/1:999:201:0
However, when looking into Tablet at this position ... there is actually no variant visible!
as you can see just next to the 2 yellow boxes is the position 1536 ... Clairvoyante predicted a C however all the reads have a T there. So indeed, the frequency of "C" is 0 ... But then, why was it called, and why with a score of 999? The following screenshot doesn't show all the reads at the position but they are all exactly the same, I show only a few for clarity.
Is this a side effect of having put the threshold to "0" in the command? By doing this, I wanted to see all variant irrespectively of their frequency, I did not literally wanted non existing alleles :D

Screenshot from 2019-04-12 18-36-20

It is possible to get 0 AF. Clairvoyante calls variant not only based on the observation at the variant position but also the flanking 16 bases. But AF only calculates the variant at the position. For your case, I suggest you filter those variants with AF under your expectation.

Ok, thanks for your answer. I will do that.
Just out of curiosity, what kind of information could one get from an af of 0? Isn't it systematically indicating the call is wrong?

For models only considering the observation at a single genome position, it's true that AF 0 almost always leads to "the allele is not a variant". I use "almost" because it is still possible to have no observation of the true variant allele at low depth.

Clairvoyante also considers the observations of the flanking 16bp. We know that PacBio and ONT are often erroneous at the homopolymer regions, and it's not impossible that the true variant alleles are all left shifted due to a homopolymer region to the left of a true variant. Clairvoyante tackles these cases and will try to provide the right answer. But for these cases, AF will be 0 because all true variant alleles are left-shifted.

Thanks for the explanation and the support in using Clairvoyante (really cool piece of software) :)