GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/


fasttext is the lightest and bpemb is the most accurate?

freud14 opened this issue · comments

Hi,
If my memory serves, the fasttext model needs 9 GB of RAM to be loaded, so it would be the most accurate but not the lightest? Also, I think it should be mentioned somewhere in the doc that it takes that much memory to load.
Thank you.


Yes, this part is not clear.

Will rework that, and good point about the memory usage.

We will run some tests of RAM usage and compute time, on both GPU and CPU, to be more clear and precise.
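For the RAM part, one quick way to compare the two loadings is to record the process's peak resident memory around each constructor call. A minimal sketch, assuming the `AddressParser(model_type=...)` constructor discussed in this thread (the model download is large, so the deepparse calls are left commented out):

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Peak resident set size of this process (ru_maxrss is KB on Linux, bytes on macOS)."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / (1024 * 1024) if sys.platform == "darwin" else rss / 1024

def report_memory(label: str, load) -> None:
    """Print the approximate extra resident memory taken by calling load()."""
    before = peak_rss_mb()
    load()
    after = peak_rss_mb()
    print(f"{label}: ~{after - before:.0f} MB resident")

# Hypothetical usage against deepparse:
# from deepparse.parser import AddressParser
# report_memory("fasttext", lambda: AddressParser(model_type="fasttext", device="cpu"))
# report_memory("bpemb", lambda: AddressParser(model_type="bpemb", device="cpu"))
```

Note that `ru_maxrss` is a high-water mark, so the delta only captures growth beyond the previous peak; for precise numbers a sampling profiler would be more reliable, but this gives a quick order-of-magnitude check.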

Actually, fastText gives a better mean performance when taking multiple seeds into account, but with the two seeds that we've chosen for the package, BPEmb is more accurate. Also, what's lighter with fastText is the Seq2Seq, since it doesn't have an embedding_network and the embeddings are fully handled by the fastText package. I do agree that the fastText package + the Seq2Seq might be heavier. We'll check, as mentioned by @davebulaval.

After considerable research, I could not find a way to shrink the size of the .bin model. Also, since we use the subwords, we cannot simply remove the unseen words in a .txt format, since we would lose the mapping and everything.
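To make the subword issue concrete: fastText builds vectors for out-of-vocabulary words by summing character n-gram vectors (n from 3 to 6 by default), and those n-gram vectors live in the hash buckets stored in the .bin model. Pruning "unseen" rows from a .txt dump would break exactly that mapping. A toy sketch of the n-gram extraction fastText performs:

```python
def char_ngrams(word: str, minn: int = 3, maxn: int = 6) -> list[str]:
    """Character n-grams as fastText extracts them, with '<' and '>' boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(minn, maxn + 1) for i in range(len(w) - n + 1)]

# Even a word absent from the training vocabulary still gets a vector,
# built from the vectors of these n-grams:
print(char_ngrams("rue"))
```

This is why the embeddings cannot be reduced to a plain word-to-vector table: every possible n-gram of every future input word has to remain addressable.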