VCF FORMAT missing values.
cassimons opened this issue
Hi Brent,
I am trying to understand how missing FORMAT values are handled for VCFs, specifically for int fields.
I am not able to follow the htslib code, but as far as I can figure out, missing values in the VCF (".") should be represented as int32.low (is this correct?).
Assuming that is correct, in some circumstances I have also observed these fields to be encoded as int32.low + 1. Specifically I have seen this for format values with multiple values per sample. For example:
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
Where there are multiple fields per sample, the first missing value for each sample is -2147483648 (int32.low) and the rest seem to be -2147483647 (int32.low + 1).
Is this expected or undefined behaviour? Or is there a better way I should be testing for missing data?
I have attached an example of what I see.
Thanks for your help and thanks for this and your other great tools.
Cheers.
I think I see what is happening; htslib defines these:
#define bcf_int32_missing (-2147483647-1) /* INT32_MIN */
#define bcf_int32_vector_end (-2147483647) /* INT32_MIN + 1 */
so that jibes with what you see. you could write a helper function like:
proc reshape*[T](values: seq[T], n_samples:int): seq[seq[T]] =
that uses vector_end to find where each sample's values stop,
and which would return something like @[[missing], [missing], ..., [255, 255, 0], ..., [255, 151, 0], ...]
if it were general enough, I'd accept a PR.
does this resolve your issue?
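To make the two sentinels concrete, here is a minimal sketch in plain Nim that does not use hts-nim at all. The constant names, the `ValueKind` enum, the `classify` proc, and the flat buffer are hypothetical and exist only for illustration; the numeric values are the ones from the htslib defines quoted above.

```nim
# Sentinel values mirroring htslib's bcf_int32_missing and bcf_int32_vector_end.
const
  Int32Missing = int32.low        # -2147483648, written "." in the VCF
  Int32VectorEnd = int32.low + 1  # -2147483647, padding after the last value

type ValueKind = enum
  vkValue, vkMissing, vkVectorEnd

proc classify(v: int32): ValueKind =
  ## Label a raw FORMAT integer as a real value, a missing value, or padding.
  if v == Int32Missing: vkMissing
  elif v == Int32VectorEnd: vkVectorEnd
  else: vkValue

when isMainModule:
  # Hypothetical flat PL buffer: two samples, three slots per sample.
  # Sample 1 is "." in the VCF, sample 2 has three real likelihoods.
  let pls = @[Int32Missing, Int32VectorEnd, Int32VectorEnd,
              255'i32, 151'i32, 0'i32]
  for v in pls:
    echo v, " -> ", classify(v)
```

A sample that is "." therefore shows up as one missing value followed by vector_end padding, which matches the pattern described in the question.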
Yes thanks, that makes sense. Feel free to close.
With regard to a more general solution, I will have to have a think about it as I imagine how missing values should be handled will be project specific. Given there is a similar challenge with missing values from each of the INFO and FORMAT types, would implementing something like an is_missing proc for each of the relevant types be the way to go?
I am happy to have a go and see if I can figure out how to do something like this if you think it is worth exploring.
Thanks for your help.
yes, I guess:
proc is_missing[T:int32|int8|int16|int64](v:T): bool {.inline.} =
  v == T.low
could work, and then something similar for is_vector_end. but you're right, I'm not sure what is needed beyond this. I'll think on it.
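Float fields need a different test, because htslib encodes both float sentinels as NaN bit patterns (0x7F800001 for missing, 0x7F800002 for vector_end), and a NaN never compares equal to anything. The following is a sketch only, assuming the values arrive as float32; the constant names are hypothetical and the procs are not part of hts-nim.

```nim
# Bit patterns taken from htslib's bcf_float_missing and bcf_float_vector_end.
const
  FloatMissingBits = 0x7F800001'u32
  FloatVectorEndBits = 0x7F800002'u32

proc is_missing(v: float32): bool {.inline.} =
  ## Both sentinels are NaNs, so compare the raw bits, not the values.
  cast[uint32](v) == FloatMissingBits

proc is_vector_end(v: float32): bool {.inline.} =
  cast[uint32](v) == FloatVectorEndBits

when isMainModule:
  let missing = cast[float32](FloatMissingBits)
  let vectorEnd = cast[float32](FloatVectorEndBits)
  doAssert missing != missing  # a NaN is never equal to itself
  doAssert is_missing(missing) and not is_vector_end(missing)
  doAssert is_vector_end(vectorEnd) and not is_missing(vectorEnd)
  doAssert not is_missing(0.5'f32)
```

Comparing bits keeps the two sentinels distinguishable from each other and from an ordinary NaN, which a plain NaN check would not do.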
something like this should be pretty close:
import hts/vcf
import strutils

proc is_missing[T:int32|int8|int16|int64](v:T): bool {.inline.} =
  v == T.low

proc is_vector_end[T:int32|int8|int16|int64](v:T): bool {.inline.} =
  v == T.low + 1

proc show*[T:int32|int8|int16|int64](reshaped:seq[seq[T]]): string =
  ## Render per-sample values, printing "." for missing entries.
  result = newStringOfCap(255)
  result.add('[')
  for i in 0..<reshaped.len:
    result.add('[')
    for j in 0..<reshaped[i].len:
      # stop at vector_end padding before it gets printed
      if reshaped[i][j].is_vector_end: break
      if j > 0: result.add(',')
      if reshaped[i][j].is_missing:
        result.add('.')
      else:
        result.add($reshaped[i][j])
    result.add(']')
    if i < reshaped.high:
      result.add(",\n")
  result.add(']')

proc reshape*[T](values: seq[T], n_samples:int): seq[seq[T]] =
  ## Split the flat per-record array into one seq per sample,
  ## dropping the vector_end padding.
  result = newSeq[seq[T]](n_samples)
  let n_per = values.len div n_samples
  for i in 0..<n_samples:
    let off = i * n_per
    for j in off..<off+n_per:
      if values[j].is_vector_end: break
      result[i].add(values[j])

when isMainModule:
  var v:VCF
  doAssert(open(v, "missing_value.vcf"))
  var pls = new_seq[int32](0)
  for rec in v:
    doAssert rec.format.get("PL", pls) == Status.OK
    var r = pls.reshape(v.n_samples)
    echo r
    echo "show:"
    echo r.show
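Since how to treat missing values is likely to be project specific, one option is to surface them in the type rather than as a magic number. The sketch below is plain Nim with no hts-nim dependency; `toOptions` is a hypothetical helper, not an existing hts-nim proc, and the flat buffer is made-up data laid out like the PL example in this thread.

```nim
import options

proc toOptions(values: seq[int32], nSamples: int): seq[seq[Option[int32]]] =
  ## Hypothetical helper: split a flat FORMAT buffer per sample, drop the
  ## vector_end padding, and turn missing entries into none(int32).
  if nSamples <= 0: return
  result = newSeq[seq[Option[int32]]](nSamples)
  let nPer = values.len div nSamples
  for i in 0 ..< nSamples:
    for j in i * nPer ..< (i + 1) * nPer:
      let v = values[j]
      if v == int32.low + 1: break  # vector_end: no more values for this sample
      if v == int32.low:
        result[i].add(none(int32))  # "." in the VCF
      else:
        result[i].add(some(v))

when isMainModule:
  # Two samples, three slots each; the first sample is "." in the VCF.
  let flat = @[int32.low, int32.low + 1, int32.low + 1,
               255'i32, 151'i32, 0'i32]
  let perSample = flat.toOptions(2)
  doAssert perSample[0] == @[none(int32)]
  doAssert perSample[1] == @[some(255'i32), some(151'i32), some(0'i32)]
  echo perSample
```

Callers then have to decide explicitly what a missing value means for them, for example by skipping the sample or substituting a default with `get(opt, default)`.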