fritzsedlazeck / SURVIVOR

Toolset for SV simulation, comparison and filtering

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

VCF header fields missing

Krannich479 opened this issue · comments

Hello Fritz & SURVIVOR dev team,

What.

I attempted to use the VCF file generated by SURVIVOR with bcftools, as for instance recommended in issue #173. However, bcftools suffers from a bug that originates from SURVIVOR I think.

Error.

When using bcftools (sort+index) on SURVIVOR's truthset VCF, I get warnings and an error regarding the VCF header not matching the FORMAT fields.

bcftools sort -o simulated.sorted.vcf.gz -O z simulated.vcf; bcftools index -t simulated.sorted.vcf.gz

Writing to /tmp/bcftools-sort.9eyTIT
[W::vcf_parse_format] FORMAT 'GL' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'GQ' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'FT' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'RC' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'DR' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'DV' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'RR' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'RV' is not defined in the header, assuming Type=String
[E::bcf_write] Unchecked error (2), exiting

I looked into the code at

convert << "\t.\tPASS\tPRECISE;SVMETHOD=SURVIVOR_sim;SVLEN=1\tGT:GL:GQ:FT:RC:DR:DV:RR:RV\t";
where the FORMAT field for SNP variant records is written. In case the print_vcf_header2 function above is the corresponding header than the FORMAT fields indeed do not match.

Solution.

I propose two trivial solutions here:

  1. Removing the FORMAT fields from the SNP variant records.
  2. Adding the FORMAT fields to the function that generates the header.

I am voting for solution 1 here because:
a) I think FT, RC, DR, DV, RR, RV are ancient relics of Lumpy unrelated to SNPs. Also these FORMAT fields have been commented out throughout most of SURVIVOR.
b) The missing fields are not part of the VCF4.2 standard and should not be present if not defined and used.
c) I tested that the VCF file generated by SURVIVOR works flawlessly with bcftools if the fields are removed from SNP records. (see hotfix below)

Hotfix.

sed -i 's/GT:GL:GQ:FT:RC:DR:DV:RR:RV/GT/g ' simulated.vcf where simulated.vcf is the VCF by SURVIVOR.

commented

I am suffering the exact same problem here as you were. But I did not set SNP cuz I am not intended to study SNP, only SVs are of my interest.
So does it mean that I can use the:

sed -i 's/GT:GL:GQ:FT:RC:DR:DV:RR:RV/GT/g ' simulated.vcf

To fix it?
Thanks

commented

OK I figured it out, besides that line of code, we also need to add headers like this manually:

##FILTER=<ID=LowQual,Description="Low quality">
##INFO=<ID=PRECISE,Number=0,Type=Flag,Description="Precise variant">

Hi @rl4940,
the issue I had, the hotfix I proposed (in this issue) as well as the PR #190 aim for a format correction of SNV records only. However, I hypothesize that if your callset does not include the additional fields (GQ:FT:RC:DR:DV:RR:RV) it's probably safe to use the sed command from above.


(disclaimer: I am not a maintainer of SURVIVOR, I just tried to fix a bug here)