Bug in encoding data with pyd4 when using sparse
mrvollger opened this issue · comments
Hello,
I have attached an example where pyd4 doesn't encode the underlying data correctly when using the sparse builder (I think):
import pyd4
import pandas as pd
import numpy as np
def chrom_bg(sts, ens, chrom_len):
    chrom = np.zeros(chrom_len, dtype=np.int32)
    to_add = np.int32(1)
    for st, en in zip(sts, ens):
        chrom[st:en] += to_add
    print(f"total_coverage = {chrom.sum()}")
    return chrom
df = pd.read_csv("1.bed.gz", sep="\t", header=None, comment="#")
writer = (
    pyd4.D4Builder("1.d4")
    .add_chroms([("chr11", 10_000_000)])
    .for_sparse_data()
    .get_writer()
)
data = chrom_bg(df[1].to_numpy(), df[2].to_numpy(), 10_000_000)
writer.write_np_array("chr11", 0, data)
writer.close()
d4_sum = pyd4.D4File("1.d4")["chr11"].sum()
df_sum = (df[2] - df[1]).sum()
assert d4_sum == df_sum, "{} != {}".format(d4_sum, df_sum)
Error:
total_coverage = 3821
Traceback (most recent call last):
  File "/Users/mrvollger/Desktop/repos/fibertools/fibertools/test_pyd4.py", line 28, in <module>
    assert d4_sum == df_sum, "{} != {}".format(d4_sum, df_sum)
AssertionError: 0 != 3821
Data file:
1.bed.gz
However, if I comment out the .for_sparse_data() call, I get the correct results.
Am I somehow doing the sparse encoding wrong?
Thanks,
Mitchell
Hi Mitchell,
Thanks for reporting this bug. I have confirmed it: pyd4 doesn't flush the last data chunk, and your input is small enough that all of the data was lost.
I've published a fixed version of pyd4 on PyPI. Would you mind confirming that the fix works on your side?
Thanks,
Hao
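For readers wondering how a missing final flush can drop everything, here is a minimal toy sketch of the failure mode described above. It is not pyd4's actual implementation: the ChunkedWriter class, the chunk_size value, and the flush_on_close flag are all hypothetical names made up for illustration. It only assumes a writer that buffers values and stores them one full chunk at a time.

```python
class ChunkedWriter:
    """Toy buffered writer (hypothetical; NOT pyd4's real code)."""

    def __init__(self, chunk_size, flush_on_close):
        self.chunk_size = chunk_size
        self.flush_on_close = flush_on_close
        self.buffer = []  # values accepted but not yet stored
        self.stored = []  # stands in for the file on disk

    def write(self, values):
        for value in values:
            self.buffer.append(value)
            # Only full chunks are stored automatically.
            if len(self.buffer) == self.chunk_size:
                self._flush()

    def _flush(self):
        self.stored.extend(self.buffer)
        self.buffer = []

    def close(self):
        # The fix in this sketch: store the final, partially filled chunk.
        if self.flush_on_close and self.buffer:
            self._flush()


def stored_sum(flush_on_close, values, chunk_size=1000):
    """Write values, close the writer, and sum what was actually stored."""
    writer = ChunkedWriter(chunk_size, flush_on_close)
    writer.write(values)
    writer.close()
    return sum(writer.stored)


values = [1] * 300  # smaller than one chunk, like a small input
buggy_sum = stored_sum(False, values)  # last chunk never flushed
fixed_sum = stored_sum(True, values)   # last chunk flushed on close
print(buggy_sum, fixed_sum)
```

In this sketch, an input shorter than one chunk is lost entirely without the flush on close, while a longer input would only lose its tail. That is consistent with the report above, where the stored sum came back as 0 rather than a slightly smaller number.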
Hi Hao,
Thanks for a quick fix! It looks like everything on my end is now good.
Thanks again,
Mitchell