aquaskyline / Skyhawk

An Artificial Neural Network-based discriminator for validating clinically significant genomic variants

Skyhawk writes output on the same line in a loop in certain contexts

AndrewCarroll opened this issue · comments

I am experiencing an issue where Skyhawk seems to be stuck writing the same line of output in a loop, producing a file many GB in size. This happens in the "decoy" contigs of hg38.

To reproduce this, run with the following inputs:

VCF - https://dl.dnanex.us/F/D/XxZ3BB0jpxK21X4v8f3Z0jJZQq2f6zQ1YKyZBp2Y/HG001.gatk.chrUn.vcf

BAM (this is ~60GB, sorry) -
https://dl.dnanex.us/F/D/2f9b396g6xgz57pB7vf91f1zXPP43KqxBPx48Xgx/hiseq2500.plus0.0.R1.bam

Reference ( ~1 GB) -
https://dl.dnanex.us/F/D/ZKqk8Kfj62p8pgGqyPg5kfzpXVK39VKQ6Ygp812q/GRCh38.no_alt_analysis_set.fa.gz

pypy skyhawk/validateVar.py --chkpnt_fn ./trainedModels/illumina-novoalign-2500-tspcrfree-hg001+hg002+hg003+hg004+hg005-hg38/learningRate1e-3.epoch100.learningRate1e-4.epoch200 --ref_fn ref.fa --bam_fn input.bam --vcf_fn other.vcf --val_fn skyhawk.other.txt 2>log.other.txt

In this case, I have isolated the decoy regions. If I run with the full VCF, it will process the main chromosomes correctly and then hang on the decoy.

Although I can work around this problem by simply running on chr1-22,X,Y and not the decoys, it is likely that many users will have sequences which include decoys and will have to discover and work around this issue in a similar way. It's not clear to me whether some other unusual property causes this issue only in my sample.

Thanks for sharing the files. I've updated Skyhawk to work on only the primary chromosomes by default; use "--allChrom" to work on all chromosomes. The rationale behind this update is that Skyhawk targets clinical genomics, so variants identified in decoys or other non-primary chromosomes are not of interest. Later I will work on your uploaded files and find out what caused the infinite loop. I suspect the very high depth in the decoys caused the problem.
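
For readers who want to apply the same restriction to a VCF before running the tool, here is a minimal sketch of a primary-chromosome filter. It is not Skyhawk's actual implementation: the names `is_primary` and `filter_vcf_lines` and the `all_chrom` parameter are hypothetical, and the definition of "primary" (chr1-22, X, Y, with or without the "chr" prefix) is an assumption taken from the contigs mentioned in this thread.

```python
import re

# Assumption: "primary" means chr1-22, chrX and chrY, with or without "chr".
# Decoy, unplaced and alternate contigs do not match this pattern.
PRIMARY_RE = re.compile(r"^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y)$")


def is_primary(contig):
    """Return True if the contig name looks like a primary chromosome."""
    return PRIMARY_RE.match(contig) is not None


def filter_vcf_lines(lines, all_chrom=False):
    """Yield header lines plus records on primary contigs.

    With all_chrom=True every line is passed through unchanged
    (hypothetical analogue of an "all chromosomes" switch).
    """
    for line in lines:
        if all_chrom or line.startswith("#"):
            yield line
            continue
        # CHROM is the first whitespace-delimited column of a VCF record.
        fields = line.split(None, 1)
        if fields and is_primary(fields[0]):
            yield line
```

Passing the lines of a VCF through `filter_vcf_lines` keeps the header intact and drops records on contigs such as the hg38 decoys.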

I am experiencing the same issue with the last variant in my VCF. In my case it is located at chrY:22751501, not on a decoy. All of my variants are located on chr1-22,X,Y (hg19).

Skyhawk finished running on your dataset without problem on my computer, the result is available at http://bio8.cs.hku.hk/share/andrewcarroll.skyhawk.out
I've made several modifications to the code, but I'm not sure whether they solve your problem, @przemekl. Please try the new code on your data again.

validateVar.py exits with the message:
"Should not reach here:
['chrY', '22751501', '.', 'G', 'A', '937.77', '.', 'AC=2;AF=1;AN=2;DP=22;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=59.95;QD=28.28;SOR=0.874', 'GT:AD:DP:GQ:PL', '1/1:0,22:22:66:966,66,0']
['chr14', '107283150', '.', 'G', 'T', '281', '.', '.', 'GT:GQ:DP', '0/1:281:31']"

chrY:22751501 is the last variant in the VCF file.
chr14:107283150 is somewhere in the middle, but the line looks different:
chr14 107283150 . G T 489.77 . AC=1;AF=0.5;AN=2;BaseQRankSum=-0.307;ClippingRankSum=-1.266;DP=32;ExcessHet=3.0103;FS=1.428;MLEAC=1;MLEAF=0.5;MQ=58.42;MQRankSum=-0.23;QD=15.31;ReadPosRankSum=2.647;SOR=0.625 GT:AD:DP:GQ:PL 0/1:19,13:32:99:518,0,1646

@przemekl Would you be able to share your VCF file with me? Thanks.

Unfortunately, this is not possible, but I will help as much as I can if you can suggest another way to determine the cause of the problem.

I've changed the logic a bit. Would you please try the latest code and show me the log if it fails again?

This time the message is:
"Please make sure your VCF input is sorted. Skyhawk exited.
['chr1', '899989', '.', 'A', 'C', '511.77', '.', 'AC=2;AF=1;AN=2;DP=12;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=54.59;QD=26.16;SOR=1.022', 'GT:AD:DP:GQ:PL', '1/1:0,12:12:36:540,36,0']
['chr1', '899928', '.', 'G', 'C', '999', '.', '.', 'GT:GQ:DP', '1/1:999:31']"

The relevant fragment of the VCF file looks as follows:
chr1 899928 . G C 1112.77 <...>
chr1 899937 . G T 1060.77 <...>
chr1 899938 . G C 1029.77 <...>
chr1 899942 . G A 961.77 <...>
chr1 899989 . A C 511.77 <...>

Chromosomes have the natural sort order: chr1,chr2,chr3 ... chr22,chrX,chrY.
Should I sort them in a different way?
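
To check independently whether a VCF is sorted in the sense described above, something like the following sketch could be run over its (contig, position) pairs. It is an illustration only, not Skyhawk's own check: the function name `first_unsorted_record` is hypothetical, and the rule it applies (each contig forms one contiguous block, positions within a contig never decrease, equal positions are allowed) is an assumption about what "sorted" should mean here.

```python
def first_unsorted_record(records):
    """Return the 0-based index of the first out-of-order record, or None.

    records: iterable of (contig, pos) tuples in file order.
    Assumed rule: every contig appears as a single contiguous block, and
    positions inside a block are non-decreasing (duplicates are accepted).
    """
    seen = set()
    prev_contig, prev_pos = None, None
    for i, (contig, pos) in enumerate(records):
        if contig != prev_contig:
            # A contig that comes back after another one breaks the block rule.
            if contig in seen:
                return i
            seen.add(contig)
        elif pos < prev_pos:
            # Same contig, but the position went backwards.
            return i
        prev_contig, prev_pos = contig, pos
    return None
```

Applied to the five chr1 records listed above, this check returns `None`, which is consistent with the file itself being in order.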

@przemekl Thanks for your message. With it I've found and fixed an insidious bug that extracted repeated BAM records in some boundary cases. Would you please be so kind as to try the latest commit and let me know how it goes?

Please make sure your VCF input is sorted. Skyhawk exited (3).
['chr1', '899989', '.', 'A', 'C', '511.77', '.', 'AC=2;AF=1;AN=2;DP=12;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=54.59;QD=26.16;SOR=1.022', 'GT:AD:DP:GQ:PL', '1/1:0,12:12:36:540,36,0']
['chr1', '899928', '.', 'G', 'C', '999', '.', '.', 'GT:GQ:DP', '1/1:999:31']

Could you please use the "--debug" option and show me the last 10 lines?

Or, if you can send me a list of the positions in your VCF file, that would make things easier. You can extract the positions using the command
awk '!/^#/{print $1"\t"$2}' your.vcf > extracted.positions
Note that the extracted list contains no genotype information, so it is practically impossible to associate it with any individual.

I think I know what could be causing the problem.
The VCF file contains multiallelic sites which have been split into biallelic records with the command 'bcftools norm -m-both'. What do you think?
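
Splitting a multiallelic site this way produces several records that share the same CHROM and POS. A minimal sketch for listing such repeated positions in a VCF follows; the function name `duplicated_positions` is hypothetical and the code is only an illustration, not part of Skyhawk.

```python
from collections import Counter


def duplicated_positions(vcf_lines):
    """Map (contig, pos) to its record count for positions seen more than once.

    vcf_lines: iterable of text lines from a VCF file.
    Header lines (starting with '#') and blank lines are skipped.
    """
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # CHROM and POS are the first two whitespace-delimited columns.
        fields = line.split(None, 2)
        counts[(fields[0], int(fields[1]))] += 1
    return {key: n for key, n in counts.items() if n > 1}
```

For example, `duplicated_positions(open("your.vcf"))` returns an empty dict when every position occurs exactly once.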

Debug output:
2 chr1 752894 chr1 69511
2 chr1 762273 chr1 752894
2 chr1 762589 chr1 762273
2 chr1 762592 chr1 762589
2 chr1 762601 chr1 762592
2 chr1 762632 chr1 762601
2 chr1 792263 chr1 762632
2 chr1 792480 chr1 792263
2 chr1 866319 chr1 792480
2 chr1 866511 chr1 866319
2 chr1 871334 chr1 866511
2 chr1 876499 chr1 871334
2 chr1 877715 chr1 876499
2 chr1 877831 chr1 877715
2 chr1 879148 chr1 877831
2 chr1 879317 chr1 879148
2 chr1 879676 chr1 879317
2 chr1 879687 chr1 879676
2 chr1 880238 chr1 879687
2 chr1 880390 chr1 880238
2 chr1 881627 chr1 880390
2 chr1 883625 chr1 881627
2 chr1 884091 chr1 883625
2 chr1 886788 chr1 884091
2 chr1 886817 chr1 886788
2 chr1 886817 chr1 886817
2 chr1 887560 chr1 886817
2 chr1 887801 chr1 887560
2 chr1 888639 chr1 887801
2 chr1 888659 chr1 888639
2 chr1 889158 chr1 888659
2 chr1 889159 chr1 889158
2 chr1 889638 chr1 889159
2 chr1 892460 chr1 889638
2 chr1 892745 chr1 892460
2 chr1 894573 chr1 892745
2 chr1 896064 chr1 894573
2 chr1 897325 chr1 896064
2 chr1 897475 chr1 897325
1 chr1 897564 chr1 897564
2 chr1 897730 chr1 897564
2 chr1 898323 chr1 897730
2 chr1 899928 chr1 898323
1 chr1 899937 chr1 899937
2 chr1 899938 chr1 899937
2 chr1 899942 chr1 899938
2 chr1 899989 chr1 899942

After several rounds of email with @przemekl, the problems were fixed and two potential enhancements are scheduled: 1) handling VCF input with duplicated positions; 2) adding Skyhawk's decisions to the VCF output.