38 / d4-format

The D4 Quantitative Data Format

Bug in encoding data with pyd4 when using sparse

mrvollger opened this issue · comments

Hello,

I have attached an example where pyd4 doesn't encode the underlying data correctly when using the sparse builder (I think):

import pyd4
import pandas as pd
import numpy as np

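def chrom_bg(sts, ens, chrom_len):
    """Build a per-base coverage array from interval starts and ends."""
    chrom = np.zeros(chrom_len, dtype=np.int32)
    to_add = np.int32(1)
    # Each half-open interval [st, en) adds one unit of coverage per base.
    for st, en in zip(sts, ens):
        chrom[st:en] += to_add
    print(f"total_coverage = {chrom.sum()}")
    return chrom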


df = pd.read_csv("1.bed.gz", sep="\t", header=None, comment="#")
writer = (
    pyd4.D4Builder("1.d4")
    .add_chroms([("chr11",10_000_000)] )
    .for_sparse_data()
    .get_writer()
)
data = chrom_bg(df[1].to_numpy(), df[2].to_numpy(), 10_000_000)
writer.write_np_array("chr11", 0, data)

writer.close()

d4_sum = pyd4.D4File("1.d4")["chr11"].sum()
df_sum = (df[2] - df[1]).sum()

assert d4_sum == df_sum, "{} != {}".format(d4_sum, df_sum)

Error:

total_coverage = 3821
Traceback (most recent call last):
  File "/Users/mrvollger/Desktop/repos/fibertools/fibertools/test_pyd4.py", line 28, in <module>
    assert d4_sum == df_sum, "{} != {}".format(d4_sum, df_sum)
AssertionError: 0 != 3821

Data file:
1.bed.gz

However, if I comment out the .for_sparse_data() I get the correct results.

Am I somehow doing the sparse encoding wrong?

Thanks,
Mitchell
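
The assertion in the script above relies on the total per-base coverage being equal to the summed interval lengths. Here is a minimal sketch of that equivalence in plain Python. The function names and the toy intervals are hypothetical and chosen only for illustration, and the sketch assumes half-open intervals with 0 <= start <= end <= chrom_len:

```python
def coverage_total(starts, ends, chrom_len):
    """Sum of a per-base coverage track built from half-open intervals."""
    coverage = [0] * chrom_len
    for st, en in zip(starts, ends):
        for pos in range(st, en):
            coverage[pos] += 1
    return sum(coverage)


def interval_length_total(starts, ends):
    """Sum of interval lengths, the reference value used in the assertion."""
    return sum(en - st for st, en in zip(starts, ends))


# Toy intervals (assumed data); the last two overlap on purpose.
starts = [2, 5, 6]
ends = [4, 9, 8]

per_base = coverage_total(starts, ends, chrom_len=10)
by_length = interval_length_total(starts, ends)
assert per_base == by_length
```

Because overlapping intervals add to the coverage rather than being merged, the two totals agree even when intervals overlap, so a mismatch points at the encoding step rather than at the reference value.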

Hi Mitchell,
Thanks for reporting this bug. I have confirmed it: pyd4 doesn't flush the last data chunk, and your input is small enough that all of the data is lost.

I've published a fixed version of pyd4 on PyPI. Would you mind confirming that the fix works on your side?

Thanks,
Hao
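
The failure mode described above, a buffered writer that never flushes its final partial chunk, can be sketched with a toy class. This is a hypothetical illustration only: `ChunkedWriter`, `chunk_size`, and `flush_on_close` are invented names, and none of this is pyd4's actual implementation.

```python
class ChunkedWriter:
    """Toy writer that buffers values and emits them in fixed-size chunks.

    Hypothetical illustration; not pyd4's real code.
    """

    def __init__(self, chunk_size, flush_on_close):
        self.chunk_size = chunk_size
        self.flush_on_close = flush_on_close
        self.buffer = []  # values accepted but not yet written out
        self.stored = []  # stands in for data that reached the file

    def write(self, values):
        for value in values:
            self.buffer.append(value)
            # Only full chunks are written during normal operation.
            if len(self.buffer) == self.chunk_size:
                self._flush()

    def _flush(self):
        self.stored.extend(self.buffer)
        self.buffer = []

    def close(self):
        # The fix: write out whatever is still buffered before closing.
        if self.flush_on_close and self.buffer:
            self._flush()


def stored_sum(values, chunk_size, flush_on_close):
    """Write values, close the writer, and sum what was actually stored."""
    writer = ChunkedWriter(chunk_size, flush_on_close)
    writer.write(values)
    writer.close()
    return sum(writer.stored)


small_input = [1, 1, 1]  # fewer values than one chunk
buggy_total = stored_sum(small_input, chunk_size=8, flush_on_close=False)
fixed_total = stored_sum(small_input, chunk_size=8, flush_on_close=True)
```

In this sketch, an input smaller than one chunk never triggers a write unless `close()` flushes the remainder, which mirrors the symptom reported in this issue (a stored sum of 0).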

Hi Hao,

Thanks for the quick fix! Everything now looks good on my end.

Thanks again,
Mitchell