dkoslicki / MinHashMetagenomics

Fast approximation of similarity for sets of very different sizes

Further space improvement

dkoslicki opened this issue

The space required can be reduced significantly by only using the k-mers that are present in the union of the training/reference genomes. This will cut down the size of the sample's Bloom filter considerably, since k-mers that appear in no reference genome would never be stored. The catch is that the filter would then no longer see every k-mer in the sample, so we would need a more creative way to estimate the cardinality of the whole sample (e.g. HyperLogLog).
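A rough sketch of the idea, not code from this repo: every k-mer in the sample goes through a small HyperLogLog counter, but only k-mers found in the reference union are kept. The names `kmers`, `HyperLogLog`, and `sketch_sample` are hypothetical, a plain Python `set` stands in for the sample's Bloom filter, and the register count (`p=12`) and SHA-1 hashing are arbitrary choices for illustration.

```python
import hashlib
import math


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class HyperLogLog:
    """Minimal HyperLogLog counter with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash taken from the first 8 bytes of SHA-1.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)               # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return m * math.log(m / zeros)
        return raw


def sketch_sample(sample_reads, reference_kmers, k):
    """Keep only sample k-mers found in the reference union, while
    estimating the number of distinct k-mers in the whole sample."""
    kept = set()  # stand-in for the (now much smaller) Bloom filter
    hll = HyperLogLog()
    for read in sample_reads:
        for km in kmers(read, k):
            hll.add(km)                # every k-mer counts toward cardinality
            if km in reference_kmers:  # only reference k-mers are stored
                kept.add(km)
    return kept, hll.estimate()
```

With this split, memory for the stored k-mers scales with the reference union rather than with the sample, and the HyperLogLog adds only a fixed `2**p` registers on top.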