brentp / hts-nim

nim wrapper for htslib for parsing genomics data files

Home Page: https://brentp.github.io/hts-nim/

expose more fai stuff? (and some slowness with Fai.get())

kmhernan opened this issue

I really enjoy this package and it has brought me to use Nim, so thanks for doing this. Say I have the task of looping over every record in a fasta (with a .fai index) and processing each contig to compute some metrics. First, I think more of htslib's fai functionality should be exposed:

proc faidx_iseq*(fai: ptr faidx_t; i: cint): cstring {.cdecl,
    importc: "faidx_iseq", dynlib: "libhts.so".}

proc faidx_seq_len*(fai: ptr faidx_t; seq: cstring): cint {.cdecl,
    importc: "faidx_seq_len", dynlib: "libhts.so".}

Now I can do something like:

  # fragment from inside a proc (windowSize and result come from the enclosing
  # scope); toUpper needs `import strutils`
  for x in 0..<len(fai):
    let res = faidx_iseq(fai.cptr, cint(x))        # contig name; owned by the index, do not free
    let refLen = faidx_seq_len(fai.cptr, res).int  # contig length
    let refBases = toUpper(fai.get($res))
    let lastWindowStart = refLen - windowSize
    var state = new CalculateGcState
    state.init = true

    for i in 1..<lastWindowStart:
      var windowEnd = i + windowSize
      var gcBin = calculateGc(refBases, i, windowEnd, state)
      if gcBin != -1:
        result[gcBin] += 1
    echo result
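(calculateGc and CalculateGcState above are my own helpers and aren't shown; roughly, they are a rolling-window GC counter along these lines — this is just a minimal sketch, not the exact code:)

type CalculateGcState = ref object
  init: bool
  gc: int   # G/C count in the current window
  ns: int   # N count in the current window

# returns the window's GC percentage (0..100) as a bin index, or -1 when the
# window contains an N; the state lets each slide of the window reuse the
# previous counts instead of rescanning windowSize bases
proc calculateGc(bases: string, start, stop: int, state: CalculateGcState): int =
  if state.init:
    # first window: count from scratch
    state.gc = 0
    state.ns = 0
    for j in start..<stop:
      case bases[j]
      of 'G', 'C': inc(state.gc)
      of 'N': inc(state.ns)
      else: discard
    state.init = false
  else:
    # slid by one base: drop the base that left the window, add the one that entered
    case bases[start - 1]
    of 'G', 'C': dec(state.gc)
    of 'N': dec(state.ns)
    else: discard
    case bases[stop - 1]
    of 'G', 'C': inc(state.gc)
    of 'N': inc(state.ns)
    else: discard
  if state.ns > 0:
    return -1
  return int(100 * state.gc / (stop - start))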

(OK, I've used Nim for less than a month, so don't judge me here.) Since I don't know what contigs are in the fasta, exposing faidx_iseq lets me get the contig name as a string so I can use your get function, and then faidx_seq_len gives me its length.

Now, I have found that fai.get() is extremely slow... I mean, even the Python htslib bindings are faster with FastaFile.fetch(). Is there something I can do better here? Would it be better for me to do a lot of smaller fetches instead? Yes, they are large contigs, but it still seems crazy slow for the fetch.

That's great to hear that it's getting use!
Just yesterday I added fai.chrom_len(), which exposes faidx_seq_len, and I just now pushed a way to expose faidx_iseq via []. You can see the tests in the commit for an example.

With these changes you can write something like:

for x in 0..fai.len-1:
  var L = fai.chrom_len(fai[x])
  ...
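As a fuller (untested) sketch putting the new pieces together — the path example.fa is just a placeholder and needs an example.fa.fai alongside it:

import hts

var fai: Fai
discard open(fai, "example.fa")   # placeholder path; expects example.fa.fai next to it

for x in 0..<fai.len:
  let chrom = fai[x]              # contig name via the new []
  echo chrom, "\t", fai.chrom_len(chrom)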

As for why/how it is slow, can you show an example so I can test? It should be a similar speed to pysam, but it does have to make a copy of the string because (as you know) a cstring is not the same as a string in Nim.
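Roughly, the extra cost beyond what htslib itself does is just this conversion:

# the sequence comes back from htslib as a cstring (a raw char pointer);
# turning it into a Nim string with `$` allocates and copies the bytes
var cs: cstring = "ACGTACGT"   # stands in for the pointer htslib returns
var s: string = $cs            # allocation + copy, proportional to the sequence length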

Please let me know any additional issues you encounter.

And also, re the performance: you are compiling with -d:release, yes?
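(e.g. nim c -d:release yourprog.nim, where yourprog.nim is just a placeholder name; a debug build keeps runtime checks on and optimization off, so it can easily be several times slower.)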

Thanks, I just saw chrom_len() after I posted. My test was just chr1 from GRCh38 (https://api.gdc.cancer.gov/data/62f23fad-0f24-43fb-8844-990d531947cf), so fa.get("chr1") in Nim and fa.fetch(region="chr1") in pysam...

OK, after doing a much better "test" I actually see similar stats:

  • python: 169.11user 41.20system 3:30.34elapsed 99%CPU (0avgtext+0avgdata 990340maxresident)k

import pysam

fa = pysam.FastaFile("/mnt/SCRATCH/refdata/GRCh38.d1.vd1.fa")

for i in range(100):
    seq = fa.fetch(region="chr1")

fa.close()

  • nim: 168.89user 17.64system 3:06.55elapsed 99%CPU (0avgtext+0avgdata 493344maxresident)k

import hts

when isMainModule:
  var
    fafil = "/mnt/SCRATCH/refdata/GRCh38.d1.vd1.fa"
    fa: Fai
    chrom = "chr1"

  discard open(fa, fafil)
  for i in 0..<100:
    var bases = fa.get(chrom)

So, I think you can ignore that statement.

Great! Let me know if you hit any more issues.