Background distribution handling

Question

biodataganache opened this issue 2 years ago · comments

Both the score and model rules currently load all kmer matrices into memory for the model pipeline. To address this, we can create background distributions from all kmers in a dataset. These could be constructed ahead of time in a special 'build-dist' pipeline, or built on the fly, one thread per individual family. The score and model rules would then load these background distributions, each of which is only slightly larger than the list of kmers, combine them, and use the combined background to score and model*.

*The model step is the part I'm less clear about how to do.
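As a rough illustration of the idea above, here is a minimal sketch of building per-family kmer counts and combining them into one background distribution. The function names (`kmer_counts`, `combine_counts`, `to_distribution`) and the toy sequences are hypothetical and are not taken from the existing pipeline; it assumes a background is simply the normalized kmer frequency over all families.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping kmer of length k in one family's sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def combine_counts(per_family_counts):
    """Sum per-family count vectors; only one small Counter per family is held."""
    total = Counter()
    for counts in per_family_counts:
        total.update(counts)
    return total


def to_distribution(counts):
    """Normalize raw counts to relative frequencies (assumed background form)."""
    n = sum(counts.values())
    return {kmer: c / n for kmer, c in counts.items()} if n else {}


# Toy example: two hypothetical families, k = 2.
family_a = ["ACDA", "CDAC"]
family_b = ["ACAC"]
per_family = [kmer_counts(family_a, 2), kmer_counts(family_b, 2)]
background = to_distribution(combine_counts(per_family))
```

Because each family contributes only a count vector, the per-family step could run in its own thread or in a separate 'build-dist' step, and the full kmer matrices never need to be in memory together.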

Jason McDermott · Answer 1 · Mon Aug 29 2022 03:34:31 GMT+0800 (China Standard Time)

This would also allow generalized kmer background distribution files to be pre-constructed for particular k/alphabet combinations. Users then wouldn't have to worry about supplying a background and could train a model against the pre-built one. These files could be included in the repo.
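One way the pre-constructed files could be organized is sketched below, assuming one JSON file per k/alphabet combination. The file naming scheme (`background_k{k}_{alphabet}.json`), the function names, and the alphabet label `"standard"` are all hypothetical choices made for illustration, not an existing convention in the repo.

```python
import json
import tempfile
from pathlib import Path


def background_path(directory, k, alphabet):
    """Hypothetical naming scheme: one file per k/alphabet combination."""
    return Path(directory) / f"background_k{k}_{alphabet}.json"


def save_background(directory, k, alphabet, distribution):
    """Write a pre-built background distribution to disk."""
    path = background_path(directory, k, alphabet)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as fh:
        json.dump({"k": k, "alphabet": alphabet, "distribution": distribution}, fh)
    return path


def load_background(directory, k, alphabet):
    """Return the stored distribution, or None if no file exists for this combination."""
    path = background_path(directory, k, alphabet)
    if not path.exists():
        return None
    with path.open() as fh:
        return json.load(fh)["distribution"]


# Round trip in a temporary directory with a toy distribution.
with tempfile.TemporaryDirectory() as tmp:
    save_background(tmp, 2, "standard", {"AC": 0.5, "CA": 0.5})
    loaded = load_background(tmp, 2, "standard")
    missing = load_background(tmp, 3, "standard")
```

Returning `None` for a missing combination leaves room for the caller to fall back to building a background on the fly.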