GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/


fasttext is the lightest and bpemb is the most accurate?

freud14 opened this issue · comments

Hi,
If my memory serves, the fasttext model needs 9 GB of RAM to be loaded, so it would be the most accurate but not the lightest? Also, I think it should be mentioned somewhere in the doc that it takes that much memory to load.
Thank you.


Yes, this part is not clear.

Will rework that, and good point about the memory usage.

We will run some tests of RAM usage and compute time, on both GPU and CPU, to be more clear and precise.
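For the RAM part, one quick way to compare the two loadings is to record the process's peak resident memory around each constructor call. A minimal sketch, assuming the `AddressParser(model_type=...)` constructor discussed in this thread (the model download is large, so the deepparse calls are left commented out):

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Peak resident set size of this process (ru_maxrss is KB on Linux, bytes on macOS)."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / (1024 * 1024) if sys.platform == "darwin" else rss / 1024

def report_memory(label: str, load) -> None:
    """Print the approximate extra resident memory taken by calling load()."""
    before = peak_rss_mb()
    load()
    after = peak_rss_mb()
    print(f"{label}: ~{after - before:.0f} MB resident")

# Hypothetical usage against deepparse:
# from deepparse.parser import AddressParser
# report_memory("fasttext", lambda: AddressParser(model_type="fasttext", device="cpu"))
# report_memory("bpemb", lambda: AddressParser(model_type="bpemb", device="cpu"))
```

Note that `ru_maxrss` is a high-water mark, so the delta only captures growth beyond the previous peak; for precise numbers a sampling profiler would be more reliable, but this gives a quick order-of-magnitude check.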

Actually, fastText gives a better mean performance when taking multiple seeds into account, but with the two seeds that we've chosen for the package, BPEmb is more accurate. Also, what's lighter with fastText is the Seq2Seq, since it doesn't have an embedding_network and the embeddings are fully handled by the fastText package. I do agree that the fastText package + the Seq2Seq might be heavier. We'll check, as mentioned by @davebulaval.

After considerable research, I could not find a way to shrink the size of the .bin model. Also, since we use the subwords, we cannot simply remove the unseen words in a .txt format, since we would lose the mapping and everything.
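To make the subword issue concrete: fastText builds vectors for out-of-vocabulary words by summing character n-gram vectors (n from 3 to 6 by default), and those n-gram vectors live in the hash buckets stored in the .bin model. Pruning "unseen" rows from a .txt dump would break exactly that mapping. A toy sketch of the n-gram extraction fastText performs:

```python
def char_ngrams(word: str, minn: int = 3, maxn: int = 6) -> list[str]:
    """Character n-grams as fastText extracts them, with '<' and '>' boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(minn, maxn + 1) for i in range(len(w) - n + 1)]

# Even a word absent from the training vocabulary still gets a vector,
# built from the vectors of these n-grams:
print(char_ngrams("rue"))
```

This is why the embeddings cannot be reduced to a plain word-to-vector table: every possible n-gram of every future input word has to remain addressable.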