38 / d4-format

The D4 Quantitative Data Format


Efficiently summing coverage tracks from multiple d4 files

percyfal opened this issue · comments

Hi,

I'm running variant calling on non-model organisms, and for some of the downstream analyses (e.g., nucleotide diversity calculations), it is necessary to generate (possibly boolean) accessibility masks that classify sites as accessible for analysis at a single base-pair resolution. Accessibility masks can be generated by summing coverages over all samples and masking out sites with too low or too high coverage. In addition, one could mask sites based on the number/fraction of individuals having sufficient coverage, i.e., absence/presence calls (cf https://onlinelibrary.wiley.com/doi/full/10.1111/mec.16077, Table 3). The genomes in question are so large that it is not possible to generate variant files including monomorphic sites on which to perform filtering.

Until now I have been using the Python API to sum coverages and, for each site, count the number of individuals whose coverage falls within a threshold range. This is somewhat slow, so I was wondering whether this functionality could be added directly to the d4 Rust library. I gave it a try based on the merge function, but my Rust knowledge is somewhat limited.

I'm thinking of commands along the lines of

d4tools sum file1.d4 file2.d4 ... fileN.d4 outfile.d4

and

d4tools count file1.d4 file2.d4 ... fileN.d4 outfile.d4 --min-coverage 3
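To make the intended semantics concrete, here is a minimal sketch of the per-site logic in Python with NumPy. It assumes each sample's coverage for a region has already been loaded into an integer array of equal length (reading from the d4 files is left out), and the function names, parameters, and thresholds are hypothetical illustrations, not part of d4tools or the Python API.

```python
import numpy as np


def sum_and_count(coverages, min_coverage=3, max_coverage=None):
    """Return (per-site coverage sum, per-site number of samples in range).

    `coverages` is an iterable of equal-length 1-D integer arrays, one per
    sample. Samples are processed one at a time, so only two accumulator
    arrays are held in memory regardless of the number of samples.
    """
    total = None
    count = None
    for cov in coverages:
        cov = np.asarray(cov, dtype=np.int64)
        if total is None:
            total = np.zeros_like(cov)
            count = np.zeros(cov.shape, dtype=np.int64)
        # Presence/absence call for this sample at every site.
        ok = cov >= min_coverage
        if max_coverage is not None:
            ok &= cov <= max_coverage
        total += cov
        count += ok
    if total is None:
        raise ValueError("at least one coverage array is required")
    return total, count


def accessibility_mask(total, count, n_samples, min_total, max_total, min_fraction):
    """Boolean mask: True where a site is considered accessible.

    A site passes if its summed coverage lies in [min_total, max_total] and
    at least `min_fraction` of the samples have sufficient coverage.
    """
    in_range = (total >= min_total) & (total <= max_total)
    enough_samples = count >= min_fraction * n_samples
    return in_range & enough_samples


# Toy example: three samples, four sites.
samples = [[0, 3, 10, 5], [2, 4, 12, 0], [1, 3, 30, 6]]
total, count = sum_and_count(samples, min_coverage=3)
mask = accessibility_mask(total, count, n_samples=3,
                          min_total=5, max_total=40, min_fraction=0.5)
print(total.tolist())  # [3, 10, 52, 11]
print(count.tolist())  # [0, 3, 3, 2]
print(mask.tolist())   # [False, True, False, True]
```

In this sketch, `total` corresponds to what the proposed `sum` command would write and `count` to the output of `count --min-coverage 3`; the mask itself would be derived downstream from those two tracks.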

I'd be happy to submit a pull request if I could get pointers on where to start. What are your thoughts on this: do you prefer cases like these to be handled by external APIs (e.g., Python), or is this amenable to implementation in Rust?

Cheers,

Per

Hi @percyfal, this is something I've been working on for our workflow snpArcher. I'm glad to see there is interest in a function/tool for generating accessibility masks via coverage. I don't have a repo for this yet, but will soon and can let you know when it's available.

Thanks for the heads up @cademirch. FYI, I ended up drafting a Python package to perform the tasks detailed above. You can find the code at https://github.com/percyfal/d4utils. BTW, say hi to Erik, with whom I have previously collaborated.

Awesome - just took a quick peek and it looks great! I will definitely share with my colleagues, and perhaps we can integrate this into our workflow. I'd also be happy to contribute if you are open to it - we can discuss in your repo.

I'll let Erik know! Small world :)