brentp / vcfgo

a golang library to read, write and manipulate files in the variant call format.

vcfgo seems to be stricter than the VCF spec

carbocation opened this issue · comments

The VCF spec Section 1.2 appears to allow arbitrary meta-information lines (starting with ##).

vcfgo fails to open VCFs that contain unexpected meta-information fields that nevertheless seem spec compliant (e.g., "##filtering_status=These calls have been filtered by FilterMutectCalls to label false positives with a list of failed filters and true positives with PASS.").

It seems that this check should either be liberalized or perhaps should go away.

vcfgo/reader.go

Lines 132 to 142 in bdb8e83

		} else if strings.HasPrefix(line, "#CHROM") {
			var err error
			h.SampleNames, err = parseSampleLine(line)
			verr.Add(err, LineNumber)
			//h.Validate(verr)
			break
		} else {
			e := fmt.Errorf("unexpected header line: %s", line)
			return nil, e
		}
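
For reference, a liberal reading of Section 1.2 (any line of the form ##key=value is acceptable, whatever the key) could be sketched as follows. This is only an illustration: parseMetaLine is a hypothetical helper written for this thread, not a function in vcfgo, and it does not try to parse structured values such as ##INFO=<...>.

```go
package main

import (
	"fmt"
	"strings"
)

// parseMetaLine is a hypothetical helper (not part of vcfgo). It accepts any
// "##key=value" meta-information line and splits it at the first '='.
func parseMetaLine(line string) (key, value string, err error) {
	if !strings.HasPrefix(line, "##") {
		return "", "", fmt.Errorf("not a meta-information line: %q", line)
	}
	body := line[2:]
	i := strings.IndexByte(body, '=')
	if i <= 0 {
		// No '=' at all, or an empty key.
		return "", "", fmt.Errorf("meta-information line lacks key=value: %q", line)
	}
	return body[:i], body[i+1:], nil
}

func main() {
	key, value, err := parseMetaLine("##filtering_status=These calls have been filtered.")
	fmt.Println(key, value, err)

	// An empty line is rejected, and %q makes the emptiness visible.
	_, _, err = parseMetaLine("")
	fmt.Println(err)
}
```

Splitting at the first '=' keeps any later '=' characters inside the value, which matters for free-text values.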

Is this just a theoretical concern? For example, vcfanno has probably been used on hundreds of thousands of VCF files, and it has not failed on lines like these. You can see in reader.go, where it calls parseHeaderExtraKV(line), that it saves these lines.
If you have a test case where this is not working, I'd be happy to take a look.

Currently, I'm getting the following error when trying to parse some mutect output:

2019/05/07 10:10:32 'unexpected header line: '

However, I don't yet have enough of a handle on this error to know whether I've identified the correct root of the problem. It just came up as I was digging through possible causes. (Perhaps, for example, there is a carriage return or some other invalid character that could explain why the line appears to be empty.)
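
One way to check the carriage-return hypothesis is to print the offending line with Go's %q verb, which escapes invisible bytes instead of rendering them. A minimal sketch (describeLine is a hypothetical helper name, not something from vcfgo):

```go
package main

import "fmt"

// describeLine is a hypothetical debugging helper. It quotes the line with %q
// so that control characters such as "\r" show up as escapes, and it reports
// the byte length so a truly empty line can be told apart from one that only
// looks empty.
func describeLine(line string) string {
	return fmt.Sprintf("%q (%d bytes)", line, len(line))
}

func main() {
	fmt.Println(describeLine(""))   // a genuinely empty line
	fmt.Println(describeLine("\r")) // a line holding only a carriage return
}
```

Using %q in the error message itself (instead of %s) would make the same distinction visible in the log line quoted above.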

Will keep digging and get back to you if I can nail this down as the definitive issue.

What does bcftools view $vcf > /dev/null show on the VCF?
From that message, it seems like you might have an empty line in the header.

-bash:uger-r7-c005:~ $ use .bcftools-1.8
Prepending: .bcftools-1.8 (ok)
-bash:uger-r7-c005:~ $ bcftools view sample-filtered.vcf.gz > /dev/null
-bash:uger-r7-c005:~ $ 

OK. If you can upload the VCF header, I'll have a look.

Thanks. Emailed.

I changed that highlighted segment of code to the following:

		 else {
			verr.Add(fmt.Errorf("unexpected header line: %s", line), LineNumber)
			break
		}

When I make that change, err from rdr, err = vcfgo.NewReader is now nil and rdr is no longer nil, so I can try to parse the vcf. When I do so, rdr.Error() now populates with an informative message:

2019/05/07 10:27:05 flate: corrupt input before offset 4090. [line: 50]
INFO error: ##INFO=<ID=NCount,Number=1,Type=Integer,D, []. [line: 50]
unexpected header line: . [line: 51]

So, it looks like the header is not of a format that vcfgo likes (I'll dig into it to see if it's even valid). But, the lack of error message in the current version, at least, seems to be unintended, and I wonder if a modification like this would be helpful?
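
To make the idea concrete, here is a rough, self-contained sketch of the "collect instead of return early" pattern. The lineErrors type and checkHeader function are hypothetical and only loosely modeled on the verr.Add(err, LineNumber) calls quoted earlier; they are not vcfgo's actual types.

```go
package main

import (
	"fmt"
	"strings"
)

// lineErrors is a hypothetical accumulator (not vcfgo's type).
type lineErrors struct {
	msgs []string
}

// Add records err together with its line number; nil errors are ignored.
func (e *lineErrors) Add(err error, line int) {
	if err != nil {
		e.msgs = append(e.msgs, fmt.Sprintf("%v. [line: %d]", err, line))
	}
}

// Err returns nil when nothing was recorded, otherwise one combined error.
func (e *lineErrors) Err() error {
	if len(e.msgs) == 0 {
		return nil
	}
	return fmt.Errorf("%s", strings.Join(e.msgs, "\n"))
}

// checkHeader walks the header lines and records every unexpected line
// rather than returning at the first one.
func checkHeader(lines []string) error {
	var errs lineErrors
	for i, line := range lines {
		if strings.HasPrefix(line, "#CHROM") {
			break // end of header
		}
		if !strings.HasPrefix(line, "##") {
			errs.Add(fmt.Errorf("unexpected header line: %s", line), i+1)
		}
	}
	return errs.Err()
}

func main() {
	header := []string{"##fileformat=VCFv4.2", "", "garbage", "#CHROM\tPOS"}
	fmt.Println(checkHeader(header))
}
```

With this shape the caller sees every problem line at once, including earlier ones that explain why a later line looks empty.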

(It looks like this is an issue with my gzip decompression upstream, but the error is getting hidden by the way vcfgo handles meta-information headers. At least, that's my current interpretation.)
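
A small standalone experiment along these lines, using only the Go standard library: when a gzip stream is damaged, a line scanner can stop early and the real cause is only visible through the scanner's error. The sketch below simulates damage by truncating the compressed bytes, which is an assumption made for illustration (the file in this issue reported corrupt input, not truncation); readLines and gzipBytes are hypothetical helper names.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
)

// gzipBytes compresses s in memory. Writes to a bytes.Buffer do not fail,
// so the returned errors are ignored in this sketch.
func gzipBytes(s string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(s))
	zw.Close()
	return buf.Bytes()
}

// readLines decompresses data and returns whatever lines could be read,
// plus the stream error, if any. A damaged stream can yield some lines
// followed by a non-nil error.
func readLines(data []byte) ([]string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	var lines []string
	sc := bufio.NewScanner(zr)
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	// sc.Err() is nil on a clean EOF and non-nil if decompression failed.
	return lines, sc.Err()
}

func main() {
	data := gzipBytes("##fileformat=VCFv4.2\n##source=example\n#CHROM\tPOS\n")

	lines, err := readLines(data)
	fmt.Println("intact:", len(lines), "lines, err =", err)

	// Simulate upstream damage by dropping the second half of the stream.
	lines, err = readLines(data[:len(data)/2])
	fmt.Println("truncated:", len(lines), "lines, err =", err)
}
```

If a header parser only looks at the lines and never at the underlying read error, the damaged case can surface as a confusing "unexpected header line" instead of a decompression failure.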