vcfgo seems to be stricter than the VCF spec
carbocation opened this issue
The VCF spec Section 1.2 appears to allow arbitrary meta-information lines (starting with ##).
vcfgo fails to open VCFs that contain unexpected meta-information fields that nevertheless seem spec compliant (e.g., "##filtering_status=These calls have been filtered by FilterMutectCalls to label false positives with a list of failed filters and true positives with PASS.").
It seems that this check should either be liberalized or perhaps should go away.
Lines 132 to 142 in bdb8e83
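For illustration only (this is not the vcfgo code referenced above), a liberal reader could accept any line of the form ##key=value and store it without checking the key against a known list. The sketch below uses only the Go standard library; parseMetaLine is a hypothetical helper name, not a vcfgo function.

```go
package main

import (
	"fmt"
	"strings"
)

// parseMetaLine is a hypothetical helper (not part of vcfgo). It splits a
// "##key=value" meta-information line at the first '=' and accepts any
// non-empty key, so unknown keys are kept rather than rejected.
func parseMetaLine(line string) (key, value string, ok bool) {
	if !strings.HasPrefix(line, "##") {
		return "", "", false
	}
	rest := line[2:]
	i := strings.IndexByte(rest, '=')
	if i <= 0 {
		// No '=' at all, or an empty key: not a key=value line.
		return "", "", false
	}
	return rest[:i], rest[i+1:], true
}

func main() {
	line := "##filtering_status=These calls have been filtered by FilterMutectCalls"
	key, value, ok := parseMetaLine(line)
	fmt.Printf("ok=%v key=%q value=%q\n", ok, key, value)
}
```

Splitting at the first '=' keeps structured values such as <ID=...,Number=...> intact in the value, so a caller can still parse them further if it recognizes the key.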
Is this just a theoretical concern? For example, vcfanno has probably been used on hundreds of thousands of VCF files, and it has not failed on lines like these. You can see in reader.go, where it calls parseHeaderExtraKV(line), that it is saving these lines.
If you have a test-case where this is not working, I'd be happy to take a look.
Currently, I'm getting the following error when trying to parse some mutect output:
2019/05/07 10:10:32 'unexpected header line: '
However, I don't yet have enough of a handle on this error to know whether I've identified the correct root of the problem. It just came up as I was digging through possible causes. (Perhaps, for example, there is a carriage return or some other invalid character that could explain why the line appears to be empty.)
Will keep digging and get back to you if I can nail this down as the definitive issue.
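One way to test the empty-line or carriage-return hypothesis is to scan the decompressed header directly. The following is a minimal sketch using only the Go standard library; findHeaderProblems and checkHeaderString are hypothetical names, and the sample header in main is made up for the demo. It reads with ReadString rather than bufio.Scanner because the default line splitter silently drops a trailing carriage return.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// findHeaderProblems (hypothetical helper) reads lines until the first
// non-'#' line and reports empty lines and lines containing a carriage
// return, with 1-based line numbers.
func findHeaderProblems(r io.Reader) ([]string, error) {
	br := bufio.NewReader(r)
	var problems []string
	for n := 1; ; n++ {
		line, err := br.ReadString('\n')
		if err != nil && err != io.EOF {
			return problems, err
		}
		if line == "" && err == io.EOF {
			return problems, nil // nothing left to read
		}
		text := strings.TrimSuffix(line, "\n")
		switch {
		case strings.TrimSpace(text) == "":
			problems = append(problems, fmt.Sprintf("line %d: empty or whitespace-only", n))
		case !strings.HasPrefix(text, "#"):
			return problems, nil // first data record: the header has ended
		case strings.ContainsRune(text, '\r'):
			problems = append(problems, fmt.Sprintf("line %d: contains a carriage return", n))
		}
		if err == io.EOF {
			return problems, nil
		}
	}
}

// checkHeaderString is a convenience wrapper for in-memory input.
func checkHeaderString(s string) ([]string, error) {
	return findHeaderProblems(strings.NewReader(s))
}

func main() {
	// Made-up header with one blank line and one CRLF-terminated line.
	header := "##fileformat=VCFv4.2\n\n##source=demo\r\n#CHROM\tPOS\n1\t100\n"
	problems, err := checkHeaderString(header)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	for _, p := range problems {
		fmt.Println(p)
	}
}
```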
What does bcftools view $vcf > /dev/null show on the VCF?
From that message, it seems like you might have an empty line in the header.
-bash:uger-r7-c005:~ $ use .bcftools-1.8
Prepending: .bcftools-1.8 (ok)
-bash:uger-r7-c005:~ $ bcftools view sample-filtered.vcf.gz > /dev/null
-bash:uger-r7-c005:~ $
ok. if you can upload the vcf header, I'll have a look.
Thanks. Emailed.
I changed that highlighted segment of code to the following:
else {
verr.Add(fmt.Errorf("unexpected header line: %s", line), LineNumber)
break
}
When I make that change, the err returned from rdr, err = vcfgo.NewReader is now nil and rdr is no longer nil, so I can try to parse the VCF. When I do so, rdr.Error() is now populated with an informative message:
2019/05/07 10:27:05 flate: corrupt input before offset 4090. [line: 50]
INFO error: ##INFO=<ID=NCount,Number=1,Type=Integer,D, []. [line: 50]
unexpected header line: . [line: 51]
So, it looks like the header is not in a format that vcfgo likes (I'll dig into it to see if it's even valid). But the lack of an informative error message in the current version, at least, seems to be unintended, and I wonder if a modification like this would be helpful?
(It looks like this is an issue with my gzip decompression upstream, but the error is getting hidden by the way vcfgo handles meta-information headers. At least, that's my current interpretation.)
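To rule the decompression step in or out independently of any VCF parsing, the stream can be read to the end with the standard library and the error inspected. This is a minimal sketch, not vcfgo code: verifyGzip, verifyGzipBytes, and gzipString are hypothetical helper names, and the input is generated in memory for the demo rather than taken from the file in this issue.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// verifyGzip (hypothetical helper) decompresses r to the end, discards the
// output, and returns the number of decompressed bytes and any error.
// gzip.Reader reads concatenated members by default.
func verifyGzip(r io.Reader) (int64, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return 0, err
	}
	defer zr.Close()
	return io.Copy(io.Discard, zr)
}

// verifyGzipBytes is a convenience wrapper for in-memory input.
func verifyGzipBytes(b []byte) (int64, error) {
	return verifyGzip(bytes.NewReader(b))
}

// gzipString compresses s in memory; it exists only to build demo input.
func gzipString(s string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(s)) // writes to a bytes.Buffer do not fail
	zw.Close()
	return buf.Bytes()
}

func main() {
	good := gzipString(strings.Repeat("##source=demo\n", 200))
	n, err := verifyGzipBytes(good)
	fmt.Printf("intact stream: %d bytes, err=%v\n", n, err)

	// Dropping the last bytes simulates an incomplete upstream copy.
	truncated := good[:len(good)-4]
	n, err = verifyGzipBytes(truncated)
	fmt.Printf("truncated stream: %d bytes, err=%v\n", n, err)
}
```

If a check like this fails on the real file, the problem lies upstream of the header parser, which would be consistent with the flate error in the log above.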