PNNL-CompBio / Snekmer

Pipeline to apply encoded Kmer analysis to protein sequences

Background distribution handling

biodataganache opened this issue

To address the issue of loading all kmer matrices into memory for the model pipeline (both the score and model rules do this), we can create background distributions from all kmers in a dataset. These could be constructed ahead of time in a special pipeline ('build-dist'), or built on the fly for each individual family in its own thread. The score and model rules can then load these background distributions, which will only be a bit bigger than the length of the kmers, combine them, and use the result to score and model. (The model step is the part I'm not as clear about how to do.)
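A rough sketch of the per-family build and the combine step, just to make the idea concrete. This is not Snekmer's actual API: the function names (`count_kmers`, `family_background`, `combine_backgrounds`, `to_frequencies`) and the toy sequences are made up for illustration, and it counts raw kmers without any alphabet encoding.

```python
from collections import Counter
from typing import Dict, Iterable


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping kmer of length k in one sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def family_background(sequences: Iterable[str], k: int) -> Counter:
    """Hypothetical per-family step: sum kmer counts over a family's sequences.

    Only one Counter is kept per family, not a full sequence-by-kmer matrix.
    """
    total = Counter()
    for seq in sequences:
        total.update(count_kmers(seq, k))
    return total


def combine_backgrounds(backgrounds: Iterable[Counter]) -> Counter:
    """Hypothetical combine step: add the per-family counts together."""
    combined = Counter()
    for bg in backgrounds:
        combined.update(bg)
    return combined


def to_frequencies(counts: Counter) -> Dict[str, float]:
    """Normalize counts so they sum to 1 (empty input gives an empty dict)."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


# Toy example: two "families" built independently, then combined.
family_a = family_background(["MKVLA", "MKVLG"], k=2)
family_b = family_background(["GAVLA"], k=2)
background = combine_backgrounds([family_a, family_b])
freqs = to_frequencies(background)
```

Because the per-family counts are simply added, each family could be built in its own thread or job and only the small count tables need to be loaded by the score and model rules.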

This would also allow generalized kmer background distribution files to be pre-constructed for particular k/alphabet combinations. The user then wouldn't have to worry about supplying a background and could train a model against the pre-built one. These files could be included in the repo.
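One possible way to store and look up pre-built backgrounds by k/alphabet combination is sketched below. The file naming scheme, the JSON layout, the alphabet label "standard", and the helper names are all assumptions for illustration, not anything that exists in the repo.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def background_path(directory: Path, k: int, alphabet: str) -> Path:
    """Hypothetical naming scheme: one file per k/alphabet combination."""
    return directory / f"background_k{k}_{alphabet}.json"


def save_background(counts: Dict[str, int], directory: Path, k: int, alphabet: str) -> Path:
    """Write kmer counts plus the k/alphabet they were built with."""
    path = background_path(directory, k, alphabet)
    payload = {"k": k, "alphabet": alphabet, "counts": counts}
    path.write_text(json.dumps(payload, sort_keys=True))
    return path


def load_background(directory: Path, k: int, alphabet: str) -> Dict[str, int]:
    """Load the background for a k/alphabet combination, failing loudly if absent."""
    path = background_path(directory, k, alphabet)
    if not path.exists():
        raise FileNotFoundError(f"No pre-built background for k={k}, alphabet={alphabet}")
    return json.loads(path.read_text())["counts"]


# Round trip in a temporary directory (stand-in for a directory in the repo).
with tempfile.TemporaryDirectory() as tmp:
    saved = save_background({"MK": 2, "VL": 3}, Path(tmp), k=2, alphabet="standard")
    saved_name = saved.name
    loaded = load_background(Path(tmp), k=2, alphabet="standard")
```

Storing k and the alphabet inside the file as well as in the name would let a rule check that the background it loaded matches its own config before scoring.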