biogo / hts

biogo high throughput sequencing repository

csi: malformed dummy bin header

brentp opened this issue · comments

On csi files created by htslib (with bcftools index or with tabix), I see this error.

I think something is wrong with the check if bins[i].bin == statsDummyBin in csi_read.go: given enough data, it fires on a bin where n != 2 and the error is returned. I can't find in the spec or in the htslib code where that check comes from. Is there some other (currently missing) constraint it should include?

For most files, I do not have this problem, but for 1 large file I do.

This is 1 of 2 issues related to CSI that I've found. Opening the other issue presently.

If I add log.Printf("i: %d, n: %d, bins[i]: %+v", i, n, bins[i]) just before that error is returned, I see:

 i: 2462, n: 1, bins[i]: {bin:37450 left:{File:11295 Block:19731} records:0 chunks:[]}

This blind guess change fixes the problem and doesn't introduce other errors:

diff --git a/csi/csi_read.go b/csi/csi_read.go
index 97ab3de..f0ff846 100644
--- a/csi/csi_read.go
+++ b/csi/csi_read.go
@@ -132,7 +132,7 @@ func readBins(r io.Reader, version byte) ([]bin, *index.ReferenceStats, error) {
                if err != nil {
                        return nil, nil, fmt.Errorf("csi: failed to read bin count: %v", err)
                }
-               if bins[i].bin == statsDummyBin {
+               if bins[i].bin == statsDummyBin && bins[i].left.Block == 0 {
                        if n != 2 {
                                return nil, nil, errors.New("csi: malformed dummy bin header")
                        }

The origin of this would be explained if samtools/hts-specs#70 ever got addressed.

Maybe we should just remove that.

If you can find the code in htslib that handles this I'll look into it (I'm on leave and the last thing I feel like dealing with is htslib/samtools code).

In htslib, I can't find any reading of the stats bins. They are defined here in an unexported struct.

They are written here:

https://github.com/samtools/htslib/blob/8003166a059eb92f532cc64667160bc497a01b13/hts.c#L1470

with META_BIN defined as:

// Finds the special meta bin
//  ((1<<(3 * n_lvls + 3)) - 1) / 7 + 1
#define META_BIN(idx) ((idx)->n_bins + 1)
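To make the formula in that comment concrete, here is a minimal Go sketch that evaluates it for a given number of levels. The name metaBin and its depth parameter are illustrative only; they are not identifiers from htslib or biogo. With 5 levels the formula gives 37450, the same bin number that shows up in the log line above.

```go
package main

import "fmt"

// metaBin evaluates the formula from the htslib comment,
// ((1<<(3*n_lvls + 3)) - 1) / 7 + 1, for the given number of levels.
// The name and signature are hypothetical, chosen for this sketch.
func metaBin(depth uint) uint32 {
	return uint32(((uint64(1)<<(3*depth+3))-1)/7 + 1)
}

func main() {
	// Print the value for a couple of depths to see how it grows.
	for _, d := range []uint{5, 6} {
		fmt.Printf("depth %d: meta bin %d\n", d, metaBin(d))
	}
}
```

If the comment is accurate, a fixed constant would only identify the stats bin for one particular depth.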

I have also asked on samtools-help

IIUC, this would be the appropriate change:

diff --git a/csi/csi_read.go b/csi/csi_read.go
index 97ab3de..6db5a8c 100644
--- a/csi/csi_read.go
+++ b/csi/csi_read.go
@@ -132,7 +132,7 @@ func readBins(r io.Reader, version byte) ([]bin, *index.ReferenceStats, error) {
                if err != nil {
                        return nil, nil, fmt.Errorf("csi: failed to read bin count: %v", err)
                }
-               if bins[i].bin == statsDummyBin {
+               if bins[i].bin == uint32(len(bins))+1 {
                        if n != 2 {
                                return nil, nil, errors.New("csi: malformed dummy bin header")
                        }

Thanks for looking into this. I should be able to get to this in the next few days. Please ping me if I have not.

This is not fun. Having no coherent spec makes this way more difficult than it should be. The code above is not correct because it uses the number of bins rather than the maximum possible bin number.
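As a rough illustration of the distinction between the number of bins and the maximum possible bin number, the sketch below derives the dummy bin ID from the index depth instead of from how many bins happen to be present. It assumes the formula in the htslib comment quoted earlier is what htslib actually uses, which this thread has not confirmed. The names statsBinFor and isStatsBin are hypothetical and do not exist in biogo.

```go
package main

import "fmt"

// statsBinFor returns one past the largest regular bin ID for an index
// with the given depth, following the formula in the htslib comment.
// Hypothetical helper; assumes that comment is accurate.
func statsBinFor(depth uint) uint32 {
	return uint32(((uint64(1)<<(3*depth+3))-1)/7 + 1)
}

// isStatsBin reports whether id is the dummy stats bin for the given depth.
// It depends on the depth only, not on how many bins the reference holds.
func isStatsBin(id uint32, depth uint) bool {
	return id == statsBinFor(depth)
}

func main() {
	// Bin 37450 from the log line: the dummy bin at depth 5, but below
	// the depth-6 dummy bin ID and so inside the regular range there.
	fmt.Println(isStatsBin(37450, 5))
	fmt.Println(isStatsBin(37450, 6))
}
```

A check along these lines would need the depth read from the CSI header to be passed into the bin-reading code.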

Interestingly there appears to be a new counting system used in the htslib code. Good to keep people on their toes.