biogo / hts

biogo high throughput sequencing repository

tabix misses intervals at start of file

brentp opened this issue · comments

With this setup:

echo $'chr1\t1\t100' | bgzip -c > t.bed.gz; tabix t.bed.gz

And with the script below, tabix misses all intervals. Even if I have thousands of intervals at positions below 10 kb, it returns empty chunks. This change seems to fix it:

diff --git a/internal/index.go b/internal/index.go
index 1287db0..baaf656 100644
--- a/internal/index.go
+++ b/internal/index.go
@@ -293,6 +293,7 @@ func OverlappingBinsFor(beg, end int) []uint32 {
        for _, r := range []struct {
                offset, shift uint32
        }{
+               {level0, level0Shift},
                {level1, level1Shift},
                {level2, level2Shift},
                {level3, level3Shift},
The script:

package main

import (
    "compress/gzip"
    "io/ioutil"
    "log"
    "os"

    "github.com/biogo/hts/bgzf"
    "github.com/biogo/hts/bgzf/index"
    "github.com/biogo/hts/tabix"
)

func check(err error) {
    if err != nil {
        panic(err)
    }
}

// location satisfies the query interface expected by idx.Chunks (RefName, Start, End).
type location struct {
    chrom string
    start int
    end   int
}

func (s location) RefName() string {
    return s.chrom
}
func (s location) Start() int {
    return s.start
}
func (s location) End() int {
    return s.end
}

func main() {

    // The bgzf-compressed, tabix-indexed data file named on the command line.
    path := os.Args[1]

    // Open the gzip-compressed tabix index (.tbi) that accompanies it.
    fh, err := os.Open(path + ".tbi")
    check(err)

    gz, err := gzip.NewReader(fh)
    check(err)
    defer gz.Close()

    // Parse the tabix index.
    idx, err := tabix.ReadFrom(gz)
    check(err)

    // Open the data file with a bgzf reader (two decompression workers).
    b, err := os.Open(path)
    check(err)
    bgz, err := bgzf.NewReader(b, 2)
    check(err)

    // Look up the chunks overlapping chr1:1-19,999,999.
    chunks, err := idx.Chunks(location{"chr1", 1, 19999999})
    check(err)
    log.Println(chunks)

    // Read everything covered by the returned chunks; an empty buffer reproduces the problem.
    cr, err := index.NewChunkReader(bgz, chunks)
    check(err)
    buf, err := ioutil.ReadAll(cr)
    check(err)

    log.Println(len(buf))

}
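For completeness, the script is run against the files produced by the setup above, e.g. (assuming it is saved as repro.go):

go run repro.go t.bed.gz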

That change doesn't fix it for me. Looking at the return values in the current code, BinFor(1, 100) gives 4681 and OverlappingBinsFor(1, 19999999) includes 4681, so bin selection is not the problem.
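For reference, a minimal sketch of the standard UCSC binning arithmetic from the SAM/tabix specification reproduces those values (the reg2bin/reg2bins names below follow the spec pseudocode and are not biogo's API): a record spanning 1-100 lands in bin 4681, and that bin is among the bins overlapping 1-19999999 even without the level-0 bin.

package main

import "fmt"

// reg2bin returns the smallest bin fully containing the 0-based half-open
// interval [beg, end) in the standard six-level binning scheme.
func reg2bin(beg, end int) int {
    end--
    switch {
    case beg>>14 == end>>14:
        return ((1<<15)-1)/7 + (beg >> 14)
    case beg>>17 == end>>17:
        return ((1<<12)-1)/7 + (beg >> 17)
    case beg>>20 == end>>20:
        return ((1<<9)-1)/7 + (beg >> 20)
    case beg>>23 == end>>23:
        return ((1<<6)-1)/7 + (beg >> 23)
    case beg>>26 == end>>26:
        return ((1<<3)-1)/7 + (beg >> 26)
    }
    return 0
}

// reg2bins returns every bin that may overlap [beg, end): the level-0 bin
// plus the bin ranges at levels 1-5.
func reg2bins(beg, end int) []int {
    end--
    bins := []int{0}
    for _, l := range []struct{ offset, shift int }{
        {1, 26}, {9, 23}, {73, 20}, {585, 17}, {4681, 14},
    } {
        for k := l.offset + (beg >> l.shift); k <= l.offset+(end>>l.shift); k++ {
            bins = append(bins, k)
        }
    }
    return bins
}

func main() {
    fmt.Println(reg2bin(1, 100)) // 4681

    in := false
    for _, b := range reg2bins(1, 19999999) {
        if b == 4681 {
            in = true
            break
        }
    }
    fmt.Println(in) // true: bin 4681 is among the overlapping bins
}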

The issue is in the Chunks method during tile interval filtering.

It should be true that if you write more than one bgzf block the intervals are found. Also, it is not that the records are below the 10k position, but rather that all the records map to the same bin. There is an optimisation in Chunks that avoids looking at unproductive tiles; it was being overly aggressive.
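For background, a hypothetical sketch of the conventional linear-index optimisation that this kind of filtering is based on (the vOffset, chunk and filterChunks names are illustrative, not biogo's types): each 16 kb tile records the virtual offset of the first record overlapping it, and any candidate chunk ending at or before that offset cannot contain an overlapping record, so it can be skipped. A filter that is too aggressive about which tiles or offsets it uses will discard valid chunks near the start of the file.

package main

import "fmt"

// vOffset is a bgzf virtual file offset.
type vOffset uint64

// chunk is a half-open range of virtual offsets holding candidate records.
type chunk struct{ start, end vOffset }

const tileShift = 14 // each linear-index tile covers 2^14 = 16384 bases

// filterChunks keeps only the chunks that can hold records overlapping a query
// beginning at beg. tiles[i] is the virtual offset of the first record
// overlapping tile i; chunks ending at or before that offset are pruned.
func filterChunks(chunks []chunk, tiles []vOffset, beg int) []chunk {
    var minOff vOffset
    if tile := beg >> tileShift; tile < len(tiles) {
        minOff = tiles[tile]
    }
    var kept []chunk
    for _, c := range chunks {
        if c.end > minOff {
            kept = append(kept, c)
        }
    }
    return kept
}

func main() {
    // A single bgzf block at the start of the file holding all the records.
    chunks := []chunk{{start: 0, end: 1 << 16}}
    tiles := []vOffset{0}
    fmt.Println(filterChunks(chunks, tiles, 1)) // the chunk must be kept
}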

PR coming.