TutteInstitute / vectorizers

Vectorizers for a range of different data types

[Question] Vectorizing Terabyte-order data

cakiki opened this issue

Hello and thank you for a great package!

I was wondering whether (and how) ApproximateWassersteinVectorizer would be able to scale up to terabyte-order data, or whether you had any pointers for dealing with data of that scale.

Right now the work largely focusses on datasets that can fit in memory, since our interest has been in building (and hopefully popularising) algorithms for these sorts of tasks rather than in scaling them, which introduces a lot of extra problems and implementation work. So if you can fit your data into memory (or, at least, the sparse matrix representation produced by the ngram-vectorizer or similar), then it will scale; it will just be slower.

If you don't have a box with terabytes of RAM handy, the correct approach would be to use something like dask for out-of-core computation. None of that is built in right now, and enabling it would take some work while preserving the existing in-memory performance. It may be possible to take the code we have and daskify it yourself. You'll likely also want to take advantage of dask-compatible sparse matrices (so either pydata sparse, or building your own chunk type per some of the dask documentation).
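A minimal sketch of the out-of-core pattern suggested above (dask arrays with pydata/sparse chunks), assuming the terabyte-scale data has already been split into sparse count-matrix chunks saved with scipy.sparse.save_npz. The file names, chunk shape, and the reduction at the end are hypothetical illustrations, not part of the vectorizers package.

```python
import dask
import dask.array as da
import scipy.sparse
import sparse  # pydata/sparse: provides COO arrays that dask.array can use as chunks


@dask.delayed
def load_chunk(path):
    # Load one on-disk chunk lazily and convert it to a pydata/sparse COO array.
    return sparse.COO.from_scipy_sparse(scipy.sparse.load_npz(path))


# Hypothetical chunk layout: 1000 files, each 100k rows of a 1M-feature count matrix.
chunk_paths = [f"ngram_counts_{i:04d}.npz" for i in range(1000)]
rows_per_chunk, n_features = 100_000, 1_000_000

chunks = [
    da.from_delayed(
        load_chunk(p),
        shape=(rows_per_chunk, n_features),
        dtype="float64",
        meta=sparse.zeros((0, 0)),  # tells dask the chunks are sparse COO, not numpy
    )
    for p in chunk_paths
]
counts = da.concatenate(chunks, axis=0)  # lazy (n_rows x n_features) sparse array

# Reductions now stream chunk by chunk instead of requiring the whole matrix in RAM,
# e.g. per-feature totals of the kind needed for TF-IDF-style weighting.
feature_totals = counts.sum(axis=0).compute()
```

The same chunked array could then be fed, block by block, into whatever daskified version of the vectorizer code you put together; the snippet only shows the storage and lazy-loading side.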

Thank you so much for the detailed explanation @lmcinnes; super useful stuff!