brentp / hts-nim

nim wrapper for htslib for parsing genomics data files

Home Page: https://brentp.github.io/hts-nim/

Question about multiple nalts for single record

KyleStiers opened this issue

Hello,

In the toy example I've spun up based on your example code in the 'don't use pileup' post, I'm having trouble understanding how a single record can produce multiple nalt entries. I'm nearly positive I'm missing something obvious, but I would expect a single entry from the BAM to add either +1 to nalt or +1 to nref. Using exactly that code, I get +3 nalt when there are multiple cigar operations.

Does this perhaps happen when there are cigar operations beyond the one that consumes the position of interest?

Note: I had to cast int(qoff - over) to get it to compile, and for now I just compare base == 'T' because I didn't see an immediate way to pull the reference base for ref_allele in the toy example. I also added a limit on printing so I could still play with large BAMs easily.

import hts

var b:Bam
open(b, "path/to/file.bam", index=true)

var 
    position = 10
    LIMIT = 200
    count = 0

for aln in b.query(0, position, position + 1):
    count += 1
    var 
        off = aln.start
        qoff = 0
        roff_only = 0
        nalt = 0
        nref = 0
    for event in aln.cigar:
        var cons = event.consumes
        if cons.query:
            qoff += event.len
        if cons.reference:
            off += event.len
            if not cons.query:
                roff_only += event.len
        # continue until we get to the genomic position
        if off <= position: continue
        # since each cigar op can consume many bases
        # calc how far past the requested position
        var over = off - position - roff_only
        # get the base 
        var base = aln.base_at(int(qoff - over))
        # var ref_allele = aln.
        if base == 'T':
            nref += 1
        else:
            nalt += 1
        
    if count < LIMIT:
        echo "nref: ", nref, " nalt: ",nalt

Hi Kyle, you need to get the reference allele from the fasta.
It looks like you need to move this:

        nalt = 0
        nref = 0

to outside of the outer for loop. The outer loop is over reads.
Then you can dedent the final if statement.

You can use the code in hileup or use it as a nim or python library.

Thanks for the quick reply! Moving them outside the outer for loop did capture the total for the whole file, but I am more interested in figuring out whether there is any situation where it is correct for nalt/nref to be > 1 in a single record. If there isn't, then I think it might be worth updating the blog post, since even moving them out of the for loop scope will give you an inflated number. I put a break after the incrementing of nref and nalt to eliminate this, but I'm not sure yet whether that's correct.

I'll check out hileup, looks very cool. Thanks! I'll close out the question with that.

The outer loop in the blog post is over alignments that cover a single base, so alt / (ref + alt) should be the allele balance. You are correct that I have written it incorrectly in the blog post. I'll see about updating. But yeah, just use hileup.
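
For anyone landing here later, the over-count discussed above can be reproduced without a BAM file. This is only a minimal sketch and not the code from the blog post or from hileup: it does not import hts, the `CigarOp` tuple is a hypothetical stand-in for the library's cigar events, and the procs `talliesWithoutBreak` and `queryOffsetAt` are names made up for illustration. It assumes the usual SAM convention for which operations consume query and reference bases. The first proc counts how often a loop guarded only by `if off <= position: continue` reaches the tally step for one read; the second stops at the first operation that spans the position, which is the effect of the `break` Kyle added.

```nim
# Self-contained sketch: no hts import, cigar ops are modelled as plain tuples.
# CigarOp is a hypothetical stand-in for the cigar events of a real alignment.
type CigarOp = tuple[kind: char, length: int]

proc consumesQuery(kind: char): bool =
  kind in {'M', 'I', 'S', '=', 'X'}

proc consumesReference(kind: char): bool =
  kind in {'M', 'D', 'N', '=', 'X'}

proc talliesWithoutBreak(start: int, cigar: openArray[CigarOp], position: int): int =
  ## Number of times the tally step runs for ONE read when the loop only
  ## skips ops that end at or before `position` and never breaks.
  var off = start
  for ev in cigar:
    if consumesReference(ev.kind):
      off += ev.length
    if off <= position: continue
    result += 1  # reached for the spanning op AND every op after it

proc queryOffsetAt(start: int, cigar: openArray[CigarOp], position: int): int =
  ## 0-based offset into the read that aligns to `position`, or -1 when the
  ## read has no base there (deletion, skip, or position not covered).
  var
    refPos = start  # reference coordinate at the start of the current op
    qPos = 0        # query offset at the start of the current op
  for ev in cigar:
    let
      consQ = consumesQuery(ev.kind)
      consR = consumesReference(ev.kind)
    if consR:
      if position >= refPos and position < refPos + ev.length:
        # first (and only) op that spans the position: answer and stop
        if consQ: return qPos + (position - refPos)
        return -1
      refPos += ev.length
    if consQ:
      qPos += ev.length
  return -1

# Toy read starting at reference position 5 with cigar 3S4M2I3M2D4M.
let cigar = @[(kind: 'S', length: 3), (kind: 'M', length: 4),
              (kind: 'I', length: 2), (kind: 'M', length: 3),
              (kind: 'D', length: 2), (kind: 'M', length: 4)]

echo talliesWithoutBreak(5, cigar, 10)  # 3: the spanning 3M plus the 2D and 4M after it
echo queryOffsetAt(5, cigar, 10)        # 10
echo queryOffsetAt(5, cigar, 12)        # -1, position falls inside the deletion
```

With this toy read, the position is spanned by the fourth operation and two more operations follow it, so the unguarded loop tallies three times for a single record. Each read should contribute at most one tally, and none when the position falls in a deletion.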

Great, thanks. You cautioned in the article that it was written quickly as an example; I was just trying to work through understanding it. Thanks again, I appreciate your replies.