38 / d4-format

The D4 Quantitative Data Format

`d4::task::Histogram::with_bin_range()` returning distribution that is missing `0`s

pamelarussell opened this issue

Hi,

I am seeing some unexpected behavior in calls to d4::task::Histogram::with_bin_range(). Specifically, I am invoking the method on a region that includes some 0s and some positive values, but the returned histogram does not include the 0s. The other value counts are correct.

Here is our line of code where we are invoking this.

Here is a test where we try to use this code to compute the median of a histogram whose underlying values are 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 3 3.

The test fails because the median is returning 2, which is the median of the non-zero values in the region, instead of 0. I have verified via debugging that this can be traced to the histogram at the above line of code not counting the 0s.
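To illustrate why the missing 0s change the result, here is a minimal sketch of a median computed from histogram counts. It is not the code from our crate or from d4; `median_from_histogram` and the `(value, count)` representation are hypothetical names chosen for the example, and the counts (20 zeros, 8 twos, 2 threes) are read off the values listed above.

```rust
/// Lower median of a distribution given as (value, count) pairs.
/// Hypothetical helper for illustration only; not part of the d4 API.
/// Returns None when the histogram holds no observations.
fn median_from_histogram(hist: &[(i32, u64)]) -> Option<i32> {
    let total: u64 = hist.iter().map(|&(_, count)| count).sum();
    if total == 0 {
        return None;
    }

    // Walk the bins in ascending value order.
    let mut bins = hist.to_vec();
    bins.sort_by_key(|&(value, _)| value);

    // 1-based rank of the lower median.
    let target = (total + 1) / 2;
    let mut seen = 0u64;
    for (value, count) in bins {
        seen += count;
        if seen >= target {
            return Some(value);
        }
    }
    None
}
```

With the full counts `[(0, 20), (2, 8), (3, 2)]` the rank-15 element falls in the 0 bin, so the median is 0. With the 0 bin dropped, `[(2, 8), (3, 2)]`, the rank-5 element falls in the 2 bin, which matches the failing test.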

Here is the D4 file we are using for this test.
example2.d4.zip

Thanks in advance for any help you can suggest!

Seems related to #12. Does it evaluate successfully if reading from a d4 made with -S?

Thanks @snystrom. Unfortunately, rerunning with a file generated with -S did not solve the issue.

Hi @pamelarussell,

Thanks for reporting the issue. After digging into it, I found that this is due to how sparse D4 files are handled. For a sparse D4 file, all the stats are computed per interval instead of in the normal per-base mode.

However, we previously did not take into account values that are not defined by the secondary table. I've added code that handles the values defined in the primary table. The issue seems to be fixed with this change; please let me know if that is the case on your side.

Cheers,
Hao

> Seems related to #12. Does it evaluate successfully if reading from a d4 made with -S?

This should be related to enabling -S.

A recent change to d4tools makes the tool smarter at detecting the optimal parameters, so it automatically enables -S. In this case, however, the -S option makes the stat task run in per-interval mode, and a bug in the per-interval code drops all the values that are not defined in the secondary table.
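The per-interval behavior described here can be sketched with a simplified model. This is not the actual d4 implementation: `histogram_per_interval`, the tuple layout, and the single `default_value` standing in for the primary table are all assumptions made for illustration. The point is that the gaps between secondary-table intervals must be counted with the default value, otherwise those positions (the 0s in this issue) vanish from the histogram.

```rust
use std::collections::BTreeMap;

/// Simplified, hypothetical model of per-interval aggregation.
/// `region` is a half-open range [begin, end).
/// `intervals` are sorted, non-overlapping (start, end, value) triples,
/// standing in for secondary-table records.
/// Positions not covered by any interval take `default_value`,
/// standing in for the value defined by the primary table.
fn histogram_per_interval(
    region: (u32, u32),
    default_value: i32,
    intervals: &[(u32, u32, i32)],
) -> BTreeMap<i32, u64> {
    let (begin, end) = region;
    let mut hist: BTreeMap<i32, u64> = BTreeMap::new();
    let mut cursor = begin;

    for &(start, stop, value) in intervals {
        // Clip the interval to the queried region.
        let start = start.max(begin);
        let stop = stop.min(end);
        if start >= stop {
            continue;
        }
        // Count the uncovered gap before this interval with the default value.
        // Skipping this step reproduces the "missing 0s" symptom.
        if start > cursor {
            *hist.entry(default_value).or_insert(0) += u64::from(start - cursor);
        }
        *hist.entry(value).or_insert(0) += u64::from(stop - start);
        cursor = cursor.max(stop);
    }

    // Count the uncovered tail of the region.
    if cursor < end {
        *hist.entry(default_value).or_insert(0) += u64::from(end - cursor);
    }
    hist
}
```

For a 30-base region with a default of 0 and intervals `[(20, 28, 2), (28, 30, 3)]`, this yields counts of 20, 8, and 2 for the values 0, 2, and 3.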

Hi @38, I tried building the version from yesterday's commit, but d4tools create ran for an hour before I finally killed it. I'm not sure what is wrong; I'm trying to build from a pretty small bedgraph file. I tried both with and without the -S option. The latest stable version from conda works fine. Here are the bedgraph file and genome file. bedgraph.zip

Additionally, could you please clarify whether you believe the issue is with initial creation of the D4 file or later reading of the file? I ask because I am potentially using different versions of the tools for both. My failing test is referring to a crate we've built on top of the latest tagged version here (0.3.6), and that is where we are seeing that the histogram is missing 0s.

> The latest stable version from conda works fine.

How long does the stable version from conda take for this input? I think this is related to the fact that your bedgraph file contains huge 0-valued intervals. We should encode those values in interval mode, but unfortunately this is currently not the case. So I think that for your input, both the stable version and the conda version should be slow. (Also make sure you are testing against a release build.)

> Additionally, could you please clarify whether you believe the issue is with initial creation of the D4 file or later reading of the file?

For the original issue, I believe this is not related to create; you can use d4tools view to verify that. The problem is that the aggregation method doesn't handle the default value.

I have committed another fix to the branch for the long creation time issue. Please let me know if it works on your side.

Thanks!

The stable version from conda took a couple of minutes (which did seem unexpectedly long), while the latest from GitHub ran for over an hour before I killed it.

If the original issue is related to reading/processing the file rather than writing, then I could use a bit of help with how to test your latest fix. I am not a Rust user and am building our crate from a Cargo.toml (here) that @sstadick built which references d4 version 0.3.6 on crates.io. How can I instead build our project from your sources rather than the package repository?

Hi @pamelarussell! You should be able to specify a commit in cargo as follows:

[dependencies]
d4 = { git = "https://github.com/38/d4-format", rev="<commit-hash>" }
extendr-api = { git = "https://github.com/extendr/extendr", branch="master" }
ordered-float = "2.10.0"
serde = { version = "1.0.136", features = ["derive"] }

Cargo will look for the d4 crate in the d4-format workspace by default and use the commit specified in `rev`. You can also use `d4 = { git = "https://github.com/38/d4-format", branch="master" }` to just stay on the latest from @38.

Thanks all. The latest fix seems to have worked and I am getting correct histograms now! 👍