RubixML / ML

A high-level machine learning and deep learning library for the PHP language.

Home Page: https://rubixml.com


WordCountVectorizer Memory Issue

Boorinio opened this issue · comments

Hello,
I noticed that while training a model for category classification with WordCountVectorizer (English stemmer, ~3k-word vocabulary) and a small MLP, my RAM usage jumps to 26 GB. Do we know what's causing this? To get similar RAM usage in Python I need a vocabulary of roughly 60-70k words. Am I doing something wrong, maybe?

Best regards, and thanks for the hard work!

I'll have to take another look at the scikit-learn CountVectorizer (I'm assuming that's the one you were using in Python), but at first glance there's a difference in the data structures used under the hood that has a big effect on memory usage.

https://github.com/scikit-learn/scikit-learn/blob/364c77e04/sklearn/feature_extraction/text.py#L931

Rubix ML's Word Count Vectorizer uses PHP arrays under the hood, whereas scikit-learn's CountVectorizer uses sparse SciPy matrices (backed by NumPy arrays). I'm pretty sure each int/float scalar in a PHP array occupies more than 128 bits once you account for the 64-bit float/int plus the extra zval metadata such as the reference count, and then another 64 bits to store the index. In contrast, NumPy arrays store neither a separate per-element index nor any extra metadata, and scalars need not be 64-bit; they can go as low as 8 bits, I believe. Combine that with a sparse implementation (zeros are not explicitly allocated in memory) and you get huge memory savings on the Python side when representing word count vectors.
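To make the dense-vs-sparse point concrete, here is a rough pure-Python sketch (no Rubix ML, NumPy, or SciPy involved). A plain list stands in for a dense per-document count vector where every zero slot is materialized, the way a PHP array would hold it, and a dict stands in for a sparse structure that stores only nonzero counts. The 3,000-term vocabulary matches the issue; the assumption of about 50 distinct words per document is made up for illustration, and `sys.getsizeof` is shallow (it ignores the stored objects themselves), so treat the numbers as indicative only:

```python
import sys

VOCAB_SIZE = 3000  # vocabulary size from the issue
NNZ = 50           # assumed distinct words per document (illustrative)

# Dense representation: one slot per vocabulary term, zeros included,
# analogous to a PHP array holding the full count vector.
dense = [0] * VOCAB_SIZE
for i in range(NNZ):
    dense[i * (VOCAB_SIZE // NNZ)] = 1

# Sparse representation: only nonzero counts are stored, analogous to
# scipy.sparse, which does not allocate explicit zeros.
sparse = {idx: count for idx, count in enumerate(dense) if count != 0}

dense_bytes = sys.getsizeof(dense)    # container only, ~8 bytes/slot
sparse_bytes = sys.getsizeof(sparse)  # container only

print(f"dense:  {dense_bytes} bytes for {VOCAB_SIZE} slots")
print(f"sparse: {sparse_bytes} bytes for {len(sparse)} nonzeros")
```

With 3,000 slots but only ~50 nonzeros, the dense container alone costs roughly 24 KB per document before any per-element zval-style overhead, while the sparse one stays a small fraction of that, which is the savings multiplied across every sample in a training set.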

Alright thanks for the answer!