biogo / hts

biogo high throughput sequencing repository

tabix misses intervals at start of file

brentp opened this issue · comments

With this setup:

echo $'chr1\t1\t100' | bgzip -c > t.bed.gz; tabix t.bed.gz

And with the script below, tabix misses all intervals. Even if I have thousands of intervals at positions below 10 kb, it returns empty chunks. This change seems to fix it:

diff --git a/internal/index.go b/internal/index.go
index 1287db0..baaf656 100644
--- a/internal/index.go
+++ b/internal/index.go
@@ -293,6 +293,7 @@ func OverlappingBinsFor(beg, end int) []uint32 {
        for _, r := range []struct {
                offset, shift uint32
        }{
+               {level0, level0Shift},
                {level1, level1Shift},
                {level2, level2Shift},
                {level3, level3Shift},
The script:

package main

import (
    "compress/gzip"
    "io/ioutil"
    "log"
    "os"

    "github.com/biogo/hts/bgzf"
    "github.com/biogo/hts/bgzf/index"
    "github.com/biogo/hts/tabix"
)

func check(err error) {
    if err != nil {
        panic(err)
    }
}

// location satisfies the query interface expected by idx.Chunks (RefName, Start, End).
type location struct {
    chrom string
    start int
    end   int
}

func (s location) RefName() string {
    return s.chrom
}
func (s location) Start() int {
    return s.start
}
func (s location) End() int {
    return s.end
}

func main() {

    // The bgzf-compressed, tabix-indexed data file named on the command line.
    path := os.Args[1]

    // Open the gzip-compressed tabix index (.tbi) that accompanies it.
    fh, err := os.Open(path + ".tbi")
    check(err)

    gz, err := gzip.NewReader(fh)
    check(err)
    defer gz.Close()

    // Parse the tabix index.
    idx, err := tabix.ReadFrom(gz)
    check(err)

    // Open the data file with a bgzf reader (two decompression workers).
    b, err := os.Open(path)
    check(err)
    bgz, err := bgzf.NewReader(b, 2)
    check(err)

    // Look up the chunks overlapping chr1:1-19,999,999.
    chunks, err := idx.Chunks(location{"chr1", 1, 19999999})
    check(err)
    log.Println(chunks)

    // Read everything covered by the returned chunks; an empty buffer reproduces the problem.
    cr, err := index.NewChunkReader(bgz, chunks)
    check(err)
    buf, err := ioutil.ReadAll(cr)
    check(err)

    log.Println(len(buf))

}
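For completeness, the script is run against the files produced by the setup above, e.g. (assuming it is saved as repro.go):

go run repro.go t.bed.gz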

That change doesn't fix it for me. Looking at the return values in the current code, BinFor(1, 100) gives 4681 and OverlappingBinsFor(1, 19999999) includes 4681, so bin selection is not the problem.
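For reference, a minimal sketch of the standard UCSC binning arithmetic from the SAM/tabix specification reproduces those values (the reg2bin/reg2bins names below follow the spec pseudocode and are not biogo's API): a record spanning 1-100 lands in bin 4681, and that bin is among the bins overlapping 1-19999999 even without the level-0 bin.

package main

import "fmt"

// reg2bin returns the smallest bin fully containing the 0-based half-open
// interval [beg, end) in the standard six-level binning scheme.
func reg2bin(beg, end int) int {
    end--
    switch {
    case beg>>14 == end>>14:
        return ((1<<15)-1)/7 + (beg >> 14)
    case beg>>17 == end>>17:
        return ((1<<12)-1)/7 + (beg >> 17)
    case beg>>20 == end>>20:
        return ((1<<9)-1)/7 + (beg >> 20)
    case beg>>23 == end>>23:
        return ((1<<6)-1)/7 + (beg >> 23)
    case beg>>26 == end>>26:
        return ((1<<3)-1)/7 + (beg >> 26)
    }
    return 0
}

// reg2bins returns every bin that may overlap [beg, end): the level-0 bin
// plus the bin ranges at levels 1-5.
func reg2bins(beg, end int) []int {
    end--
    bins := []int{0}
    for _, l := range []struct{ offset, shift int }{
        {1, 26}, {9, 23}, {73, 20}, {585, 17}, {4681, 14},
    } {
        for k := l.offset + (beg >> l.shift); k <= l.offset+(end>>l.shift); k++ {
            bins = append(bins, k)
        }
    }
    return bins
}

func main() {
    fmt.Println(reg2bin(1, 100)) // 4681

    in := false
    for _, b := range reg2bins(1, 19999999) {
        if b == 4681 {
            in = true
            break
        }
    }
    fmt.Println(in) // true: bin 4681 is among the overlapping bins
}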

The issue is in the Chunks method during tile interval filtering.

It should be true that if you write more than one bgzf block the intervals are found. Also, it is not that the records are below the 10k position, but rather that all the records map to the same bin. There is an optimisation in Chunks that avoids looking at unproductive tiles; it was being overly aggressive.
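For background, a hypothetical sketch of the conventional linear-index optimisation that this kind of filtering is based on (the vOffset, chunk and filterChunks names are illustrative, not biogo's types): each 16 kb tile records the virtual offset of the first record overlapping it, and any candidate chunk ending at or before that offset cannot contain an overlapping record, so it can be skipped. A filter that is too aggressive about which tiles or offsets it uses will discard valid chunks near the start of the file.

package main

import "fmt"

// vOffset is a bgzf virtual file offset.
type vOffset uint64

// chunk is a half-open range of virtual offsets holding candidate records.
type chunk struct{ start, end vOffset }

const tileShift = 14 // each linear-index tile covers 2^14 = 16384 bases

// filterChunks keeps only the chunks that can hold records overlapping a query
// beginning at beg. tiles[i] is the virtual offset of the first record
// overlapping tile i; chunks ending at or before that offset are pruned.
func filterChunks(chunks []chunk, tiles []vOffset, beg int) []chunk {
    var minOff vOffset
    if tile := beg >> tileShift; tile < len(tiles) {
        minOff = tiles[tile]
    }
    var kept []chunk
    for _, c := range chunks {
        if c.end > minOff {
            kept = append(kept, c)
        }
    }
    return kept
}

func main() {
    // A single bgzf block at the start of the file holding all the records.
    chunks := []chunk{{start: 0, end: 1 << 16}}
    tiles := []vOffset{0}
    fmt.Println(filterChunks(chunks, tiles, 1)) // the chunk must be kept
}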

PR coming.