allenai / peS2o

Pretraining Efficiently on S2ORC!

Consideration for fasttext

chris-ha458 opened this issue

I am wondering whether fasttext has been considered for language identification.

First, it supports many more languages.
The basic model available on the fastText site covers 176 languages, in either a roughly 150 MiB package or a roughly 1 MiB quantized one.
Second, it is more actively maintained than gcld3.
The codebase has also been forked for further development by other parties, for example fastertext.
Facebook also recently released a new model that extends language support to 200 languages.
Third-party models are available as well.

Hi @chris-ha458!

While I agree that fasttext is generally better than cld3, I did not notice any meaningful difference for this application when I compared the two. However, we might reconsider for the next release! Thanks.

I guess that for the purpose of rejecting non-English text, the difference is not that meaningful.
I hope to see this dataset extended to multiple languages, either by the original developers or by other open-source developers, so I wanted to leave the info here.
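
For anyone picking this up, here is a minimal sketch of how fastText language identification could be used to reject non-English documents. It is not the peS2o pipeline: the helper names (`load_fasttext_predict`, `detect_language`, `keep_english`) and the 0.5 confidence threshold are hypothetical, and it assumes the `fasttext` Python package plus a locally downloaded language-ID model such as `lid.176.ftz`.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# Shape of fastText's `model.predict(text, k=1)`: (labels, probabilities).
Predict = Callable[[str], Tuple[Sequence[str], Sequence[float]]]

LABEL_PREFIX = "__label__"  # fastText prefixes every predicted label with this


def load_fasttext_predict(model_path: str) -> Predict:
    """Wrap a fastText language-ID model file (assumed to exist locally)."""
    import fasttext  # imported lazily so the filtering logic runs without it

    model = fasttext.load_model(model_path)
    return lambda text: model.predict(text, k=1)


def detect_language(text: str, predict: Predict) -> Tuple[str, float]:
    """Return (language code, confidence) for the top prediction."""
    # fastText predicts one line at a time and rejects embedded newlines.
    labels, probs = predict(text.replace("\n", " "))
    if not labels:
        return "", 0.0
    return labels[0].replace(LABEL_PREFIX, "", 1), float(probs[0])


def keep_english(
    docs: Iterable[str],
    predict: Predict,
    min_confidence: float = 0.5,  # hypothetical threshold, tune per corpus
) -> List[str]:
    """Keep documents whose top predicted language is English."""
    kept = []
    for doc in docs:
        lang, confidence = detect_language(doc, predict)
        if lang == "en" and confidence >= min_confidence:
            kept.append(doc)
    return kept
```

Usage would look like `keep_english(docs, load_fasttext_predict("lid.176.ftz"))`. Because the predictor is passed in as a plain callable, swapping in a different fastText model (or another detector that returns the same shape) only changes the loader, which makes it easy to compare detectors on the same documents.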