brentp / hts-nim

nim wrapper for htslib for parsing genomics data files

Home Page: https://brentp.github.io/hts-nim/

Vcf FORMAT missing values.

cassimons opened this issue

Hi Brent,

I am trying to understand how missing FORMAT values are handled for vcfs, specifically for int fields.

I can't fully follow the htslib code, but as far as I can tell, missing values in the VCF (".") should be represented as int32.low (is this correct?).

Assuming that is correct, in some circumstances I have also observed these fields to be encoded as int32.low + 1. Specifically I have seen this for format values with multiple values per sample. For example:
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">

Where there are multiple values per sample, the first missing value for each sample is -2147483648 (int32.low) and the rest seem to be -2147483647 (int32.low + 1).

Is this an expected or undefined behaviour? Or is there a better way I should be testing for missing data?

I have attached an example of what I see.

Thanks for your help and thanks for this and your other great tools.

Cheers.

missing_value.vcf.gz
missing_test.nim.gz

I think I see what is happening; htslib defines these:

#define bcf_int32_missing    (-2147483647-1) /* INT32_MIN */

#define bcf_int32_vector_end (-2147483647)  /* INT32_MIN + 1 */

so that jibes with what you see. You could write a helper function like:

proc reshape*[T](values: seq[T], n_samples:int): seq[seq[T]] =

that uses vector_end to find where each sample's values stop, and which would return something like @[[missing], [missing], ..., [255, 255, 0], ..., [255, 151, 0], ...]

if it were general enough, I'd accept a PR.
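To make the layout concrete, here is a minimal sketch of why a sample written as "." shows up as int32.low followed by int32.low + 1. It models the flat buffer as one fixed-width block per sample, with short blocks padded by the end-of-vector sentinel. `pad_samples` is a hypothetical helper written for illustration; it is not htslib or hts-nim code.

```nim
# Hypothetical model of the flat FORMAT buffer; not htslib's own code.
const
  missing = int32.low          # same value as bcf_int32_missing
  vector_end = int32.low + 1   # same value as bcf_int32_vector_end

proc pad_samples(samples: seq[seq[int32]]): seq[int32] =
  ## Flatten per-sample vectors into one buffer, padding short ones with vector_end.
  var n_per = 0
  for s in samples:
    n_per = max(n_per, s.len)
  for s in samples:
    for j in 0..<n_per:
      let v = if j < s.len: s[j] else: vector_end
      result.add(v)

# sample 0 is "." (one missing value), sample 1 has three PL values
let flat = pad_samples(@[@[missing], @[255'i32, 255, 0]])
echo flat
# prints @[-2147483648, -2147483647, -2147483647, 255, 255, 0]
```

The first slot of the missing sample holds the missing sentinel and the remaining slots are padding, which matches the pattern described above.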

does this resolve your issue?

Yes thanks, that makes sense. Feel free to close.

With regard to a more general solution, I will have to have a think about it as I imagine how missing values should be handled will be project specific. Given there is a similar challenge with missing values from each of the INFO and FORMAT types, would implementing something like an is_missing proc for each of the relevant types be the way to go?

I am happy to have a go and see if I can figure out how to do something like this if you think it is worth exploring.

Thanks for your help.

yes, I guess:

proc is_missing[T:int32|int8|int16|int64](v:T): bool {.inline.} =
    v == T.low

could work. then similar for is_vector_end. but you're right, not sure what beyond this. I'll think on it.
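For Float fields the same idea needs a different test, because htslib marks missing and end-of-vector floats with two specific NaN bit patterns (bcf_float_missing and bcf_float_vector_end in vcf.h) rather than with the lowest value of the type. A rough sketch, assuming the values arrive as float32; the float32 overloads below are hypothetical and not part of hts-nim:

```nim
# Bit patterns as defined in htslib's vcf.h.
const
  float_missing_bits = 0x7F800001'u32      # bcf_float_missing
  float_vector_end_bits = 0x7F800002'u32   # bcf_float_vector_end

proc is_missing(v: float32): bool {.inline.} =
  # compare raw bits: both sentinels are NaNs, so `==` on the floats is always false
  cast[uint32](v) == float_missing_bits

proc is_vector_end(v: float32): bool {.inline.} =
  cast[uint32](v) == float_vector_end_bits

# a missing value, a real value, and an end-of-vector marker
let vals = @[cast[float32](float_missing_bits), 0.5'f32,
             cast[float32](float_vector_end_bits)]
for v in vals:
  echo v.is_missing, " ", v.is_vector_end
```

A plain NaN check would not be enough here, since it cannot tell the two sentinels apart from each other or from an ordinary NaN.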

something like this should be pretty close:

import hts/vcf
import strutils


proc is_missing[T:int32|int8|int16|int64](v:T): bool {.inline.} =
    v == T.low

proc is_vector_end[T:int32|int8|int16|int64](v:T): bool {.inline.} =
    v == T.low + 1


proc show*[T:int32|int8|int16|int64](reshaped:seq[seq[T]]): string =
  result = newStringOfCap(255)
  result.add('[')
  for i in 0..<reshaped.len:
    result.add('[')
    for j in 0..<reshaped[i].len:
      # stop at the end-of-vector sentinel without printing it
      if reshaped[i][j].is_vector_end: break
      if j > 0: result.add(',')
      if reshaped[i][j].is_missing:
        result.add('.')
      else:
        result.add($reshaped[i][j])
    result.add(']')
    if i < reshaped.high:
      result.add(",\n")

  result.add(']')


proc reshape*[T](values: seq[T], n_samples:int): seq[seq[T]] =
  result = newSeq[seq[T]](n_samples)
  # integer division: every sample owns the same number of slots in `values`
  let n_per = values.len div n_samples
  for i in 0..<n_samples:
    let off = i * n_per
    for j in off..<off+n_per:
      if values[j].is_vector_end: break
      result[i].add(values[j])

when isMainModule:

  var v:VCF
  doAssert(open(v, "missing_value.vcf"))

  var pls = new_seq[int32](0)

  for rec in v:
    doAssert rec.format.get("PL", pls) == Status.OK
    var r = pls.reshape(v.n_samples)
    echo r
    echo "show:"
    echo r.show
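
If all you need is a yes/no answer per sample, the flat buffer can also be tested directly without reshaping. A small sketch under the same layout assumption (equal-width blocks per sample, int32.low for missing, int32.low + 1 for padding); `sample_is_missing` is a hypothetical name, not an hts-nim proc:

```nim
proc sample_is_missing(values: seq[int32], n_samples, i: int): bool =
  ## True when sample i holds no real value, only missing markers and padding.
  let n_per = values.len div n_samples
  let lo = i * n_per
  let hi = lo + n_per
  for j in lo..<hi:
    if values[j] == int32.low + 1: break      # padding: nothing more to read
    if values[j] != int32.low: return false   # found a real value
  true

# hand-built buffer: sample 0 is ".", sample 1 has three PL values
let pls = @[int32.low, int32.low + 1, int32.low + 1, 255'i32, 151, 0]
for i in 0..<2:
  echo "sample ", i, " missing: ", sample_is_missing(pls, 2, i)
# prints:
# sample 0 missing: true
# sample 1 missing: false
```

Whether a sample with only some values missing should count as missing is project specific, so this sketch only flags samples with no real values at all.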