EleutherAI / the-pile

(Natural) Languages in The PILE

suzyahyah opened this issue

Hi,

Has any language identification been run on the sentences in the Pile, and have the proportions of each language been quantified? We can get an idea from Europarl, but it is less clear with Common Crawl in the mix.

I have not seen this in any of the official documentation or the paper. If I missed something, please let me know.

Thanks!

commented

@suzyahyah Read their paper, page 9.
https://arxiv.org/abs/2101.00027
https://arxiv.org/pdf/2101.00027.pdf

A fully multilingual expansion of the Pile is in their future plans. I don't know whether that would include a way to tell apart the languages it contains.
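
For anyone who wants a rough estimate in the meantime, here is a minimal sketch of how per-language proportions could be tallied over a sample of documents. It is not the Pile authors' method. The `toy_detect` function, the stopword lists, and the sample sentences are all made up for illustration; in practice you would pass in a real language-ID model as the `detect` argument.

```python
from collections import Counter
from typing import Callable, Dict, Iterable

# Hypothetical, tiny stopword lists: a toy stand-in for a real language-ID model.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "die", "und", "ist", "das"},
    "fr": {"le", "la", "et", "est", "les"},
}


def toy_detect(text: str) -> str:
    """Guess a language by counting stopword hits (illustration only)."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def language_proportions(
    docs: Iterable[str],
    detect: Callable[[str], str] = toy_detect,
) -> Dict[str, float]:
    """Return the fraction of documents assigned to each language label."""
    counts = Counter(detect(doc) for doc in docs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items()}


# Made-up sample; real input would be text drawn from the dataset.
sample = [
    "the cat is on the mat",
    "der Hund ist im Garten",
    "le chat est sur la table",
    "the end of the story",
]
proportions = language_proportions(sample)
print(proportions)
```

Note that this counts documents, not sentences or bytes, so long documents and short ones weigh the same. Splitting into sentences first, or weighting by length, would give a different picture.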