38 / d4-format

The D4 Quantitative Data Format

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

New stat function: % of positions in intervals with coverage >X

Jakob37 opened this issue · comments

This is something we are looking at over at the https://github.com/Clinical-Genomics/chanjo2 repo.

The current approach is to read the output from the d4 file using d4tools show. It would be much quicker to do this directly in d4tools.

The execution could be something like:

$ d4tools stat --stat perc_cov "10,20,30" --region "intervals.bed"

The desired output would be something like:

chr     start           end             10x  20x   30x
19      31030023        32034023        98.3 96.3  94.1
19      49624990        49625000        94.2 90.3  85.3
19      49625010        49625020        88.5 80.3  75.3

Feedbacks on this are welcome. If I have the time, and no one else jumps on it, I'll give it a go and see if I can implement it.

Funnily enough I've been working on something related to this using the d4 crate, though I hadn't considered implementing it within d4tools show. Unfortunately I haven't had time lately to finish it up.

Your idea should be relatively straightforward to implement using d4's Tasks. I'd be happy to take a stab when I have some more free time in the coming weeks. Also can chip in if you've already started!

Funnily enough I've been working on something related to this using the d4 crate, though I hadn't considered implementing it within d4tools show. Unfortunately I haven't had time lately to finish it up.

Your idea should be relatively straightforward to implement using d4's Tasks. I'd be happy to take a stab when I have some more free time in the coming weeks. Also can chip in if you've already started!

This is great to hear @cademirch I haven't gotten started with this yet. This would be extremely useful for us, as it would resolve our (hopefully) final performance bottleneck before we could use this in production.

I would be happy to help out with testing or in any other ways that might be helpful, just let me know!

@Jakob37 Ended up having some free time today and took a crack at this here: https://github.com/cademirch/d4-format/tree/perc_cov

I think it works as expected... I tested a little on some random d4 files:

d4tools show -R regions.bed input.d4:

JANIGQ010000002.1       0       5       0
JANIGQ010000002.1       5       29      1
JANIGQ010000002.1       29      80      2
JANIGQ010000002.1       80      100     3

Then, d4tools stat -H -s perc_cov=1,2,5 -r regions.bed input.d4:

#Chr    Start   End     1x      2x      5x
JANIGQ010000002.1       0       100     0.940   0.700   0.000

Currently this wont work with multi-track d4 files, so just fyi. Also, I'm super new to Rust so this is probably not merge worthy. Lmk what you think!

Nice @cademirch , I'll take a look during the day!

I am trying this out. It looks great!! Our previous way of processing 74 intervals with three thresholds took ~15 seconds. With your function it takes only 0.5s.

The results looks identical to the ones we produces as well. So it looks like it is both quick and doing what it should. Nice job 💪

It would be great to get this one into d4tools in some way actually. I looked at the diff, and the code looks well organized, but I am unfortunately not experienced in Rust either.

Still, could you put this up as a PR? And then we could see if anyone else could take a look at it.

Hi again! What do you think about making a PR of this @cademirch ? It worked fine for me when testing it some weeks ago. I can help out with further testing as well.

This is something we really need, so I am thinking about how we could drive this forward.

@Jakob37 Thanks for the reminder. Sorry I fell off this - got caught up with some deadlines. I can open a PR tomorrow, would be great to add some tests!

@Jakob37 Thanks for the reminder. Sorry I fell off this - got caught up with some deadlines. I can open a PR tomorrow, would be great to add some tests!

No worries, sounds great! Yes, I can do both some manual testing and look into how to unit test here .. Will look into it when the PR is up.

I am in office this week, away for two weeks, and then back again.