expose more fai stuff? (and some slowness with Fai.get())
kmhernan opened this issue
I really enjoy this package and it has brought me to use Nim, so thanks for doing this. Say I want to loop over every record in a FASTA file (with a .fai index) and process each contig to get some metrics. First, I think more of the htslib fai API should be exposed:
    proc faidx_iseq*(fai: ptr faidx_t; i: cint): cstring {.cdecl,
        importc: "faidx_iseq", dynlib: "libhts.so".}
    proc faidx_seq_len*(fai: ptr faidx_t; seq: cstring): cint {.cdecl,
        importc: "faidx_seq_len", dynlib: "libhts.so".}
Now I can do something like:
    for x in 0..<len(fai):
      # faidx_iseq returns a name owned by the index, so it must not be freed
      res = faidx_iseq(fai.cptr, cint(x))
      refLen = faidx_seq_len(fai.cptr, res).int
      refBases = toUpper(fai.get($res))
      lastWindowStart = refLen - windowSize
      var state = new CalculateGcState
      state.init = true
      for i in 1..<lastWindowStart:
        var windowEnd = i + windowSize
        var gcBin = calculateGc(refBases, i, windowEnd, state)
        if gcBin != -1:
          result[gcBin] += 1
    echo result
(OK, I've used Nim for less than a month, so don't judge me here.) Since I don't know which contigs are in the FASTA, exposing faidx_iseq lets me get the string name of each contig so I can use your get function, and faidx_seq_len then gives me its length.
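For illustration, the per-window binning in the loop above could look something like the minimal sketch below. calculateGc and CalculateGcState were not shown in the post, so gcBin, gcHistogram, the maxNs threshold and the truncating percentage bins are all assumptions made for this example. This naive version recounts every window instead of keeping running state.

```nim
import std/strutils

const maxNs = 4  # assumption: windows with more Ns than this are skipped

proc gcBin(bases: string; start, stop: int): int =
  ## GC percentage (0..100, truncated) of bases[start ..< stop],
  ## or -1 when the window holds more than `maxNs` N bases.
  var gc, n = 0
  for i in start ..< stop:
    case bases[i]
    of 'G', 'C': inc gc
    of 'N': inc n
    else: discard
  if n > maxNs:
    return -1
  result = (gc * 100) div (stop - start)

proc gcHistogram(bases: string; windowSize: int): array[101, int] =
  ## Counts how many windows fall into each GC-percentage bin.
  let upper = bases.toUpperAscii
  for start in 0 .. upper.len - windowSize:
    let bin = gcBin(upper, start, start + windowSize)
    if bin != -1:
      inc result[bin]

when isMainModule:
  let hist = gcHistogram("ggccaatt", 4)
  echo hist[100], " ", hist[50], " ", hist[0]
```

In real use, the string passed to gcHistogram would be the sequence returned by fai.get for one contig.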
Now I have found that fai.get() is extremely slow... I mean, even the Python htslib is faster for FastaFile.fetch()... Is there something I can do better here? Would it be better for me to do a lot of fetches instead? Yes, they are large contigs, but it still seems crazy slow for the fetch.
That's great to hear that it's getting use!
Just yesterday, I added fai.chrom_len(), which exposes faidx_seq_len, and I just now pushed a way to expose faidx_iseq by using []. You can see the tests in the commit for an example.
With these changes you can write something like:

    for i in 0..fai.len-1:
      var L = fai.chrom_len(fai[i])
      ...
As for why/how it is slow, can you show an example so I can test? It should be similar in speed to pysam, but it does have to make a copy of the string because (as you know) a cstring is not the same as a string in Nim.
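To illustrate the copy mentioned above, here is a small standalone sketch using only the Nim standard library. A string literal stands in for the buffer that htslib would return (an assumption, so the example needs no external library). Converting a cstring with $ allocates a new Nim string and copies the bytes, so the cost grows with the length of the sequence.

```nim
# Stand-in for a C buffer such as the one htslib hands back
# (assumption: a literal is used so no external library is needed).
let raw: cstring = "ACGT"

# `$` allocates a new Nim string and copies the bytes up to the NUL.
var copied = $raw

# Changing the copy leaves the original buffer untouched.
copied[0] = 'N'

echo raw     # ACGT
echo copied  # NCGT
```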
Please let me know any additional issues you encounter.
And also, re the performance: you are compiling with -d:release, yes?
Thanks, I just saw chrom_len() after I posted. My test was just chr1 from GRCh38 (https://api.gdc.cancer.gov/data/62f23fad-0f24-43fb-8844-990d531947cf), so fa.get("chr1") in Nim and fa.fetch(region="chr1") in pysam...
OK, after doing a much better "test", I actually see similar stats:
- python:

    169.11user 41.20system 3:30.34elapsed 99%CPU (0avgtext+0avgdata 990340maxresident)k

    import pysam
    fa = pysam.FastaFile("/mnt/SCRATCH/refdata/GRCh38.d1.vd1.fa")
    for i in range(100):
        seq = fa.fetch(region="chr1")
    fa.close()

- nim:

    168.89user 17.64system 3:06.55elapsed 99%CPU (0avgtext+0avgdata 493344maxresident)k

    import hts
    when isMainModule:
      var
        fafil = "/mnt/SCRATCH/refdata/GRCh38.d1.vd1.fa"
        fa: Fai
        chrom = "chr1"
      discard open(fa, fafil)
      for i in 0..<100:
        var bases = fa.get(chrom)

So I think you can ignore my statement about fai.get() being slow.
great! let me know if you hit any more issues.