38 / d4-format

The D4 Quantitative Data Format

No apparent size reduction of multi-track files as a function of the number of samples

percyfal opened this issue

Hi,

I'm working on a large genome for which we have many samples (~1000). For downstream analyses I need to generate per-base sequence masks based on coverage. To save space, I thought I could use the merge feature to combine the sample-specific d4 coverage files generated by mosdepth into a single multi-track d4 file, in which the columns correspond to coverage from different samples (I know that this isn't the primary use case for multi-tracks; I'm simply playing around with the functionality). As expected, the growth in the number of regions slows as more samples are added:

[Figure: d4tools merge plot, number of entries]

My naïve hope (and here I probably don't understand the algorithmic underpinnings) was that the growth in file size would also slow down, but that does not seem to be the case:

[Figure: d4tools merge plot, file size]

Is this to be expected, i.e. is this a futile use of track merging as it is implemented right now?

In any case, thanks for a great tool that makes it possible to work with data sets this large at all.

Cheers,

Per

I don't know the answer, but I am curious in general about what docs exist for the multi-track features. My assumption is that no additional compression happens for multi-track files, but I can't find details about the feature in either the cargo docs or the other language API docs.
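
To make that assumption concrete, here is a toy Python sketch. It is not D4's actual on-disk layout, and every name and parameter in it (`merged_stats`, `pool_size`, `breaks_per_sample`) is hypothetical. It assumes that coverage breakpoints tend to fall at shared positions across samples, and that each track is stored independently of the others. Under those assumptions the number of distinct regions saturates while the total stored data keeps growing linearly, which would match the two plots above.

```python
import random


def merged_stats(n_samples, pool_size=10_000, breaks_per_sample=500, seed=0):
    """Toy model: return (k, distinct_breakpoints, stored_entries) for k = 1..n_samples.

    Assumptions (hypothetical, not taken from the D4 implementation):
      * each sample's coverage changes at positions drawn from a shared pool,
        so breakpoints increasingly coincide as samples are added;
      * each track is stored on its own, with no cross-track compression.
    """
    rng = random.Random(seed)
    pool = range(pool_size)
    union = set()       # distinct breakpoints across all merged tracks
    stored_entries = 0  # entries written if every track is stored separately
    rows = []
    for k in range(1, n_samples + 1):
        breaks = rng.sample(pool, breaks_per_sample)
        union.update(breaks)
        stored_entries += len(breaks)
        rows.append((k, len(union), stored_entries))
    return rows


if __name__ == "__main__":
    for k, regions, entries in merged_stats(50)[::10]:
        print(f"samples={k:3d}  distinct_breakpoints={regions:6d}  stored_entries={entries:6d}")
```

If the real format does share structure across tracks, the second column rather than the third would be the better predictor of file size, so comparing measured sizes against both curves could help tell the two cases apart.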