(Natural) Languages in The PILE
suzyahyah opened this issue
suzyahyah commented
Hi,
Has anyone run language identification on the sentences in the Pile and quantified the proportion of each language? We can get an idea from the Europarl component, but it is less clear with Common Crawl in the mix.
I have not seen this in any of the official documentation or in the paper. If I missed something, please let me know.
Thanks!
Daryl commented
@suzyahyah Read their paper, page 9.
https://arxiv.org/abs/2101.00027
https://arxiv.org/pdf/2101.00027.pdf
A fully multilingual expansion of the Pile is in their future plans. I don't know whether that will include a way to differentiate between the languages it contains.
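
If you want a rough estimate yourself, here is a minimal sketch of counting documents per detected language. It assumes the data is JSON lines with a `text` field per record, which is worth checking against the actual files. `toy_detect` and `language_proportions` are hypothetical names, and `toy_detect` is only a stopword stub, not a real language identifier. For a real run you would pass in a proper detector, for example a small wrapper around `langid.classify` that returns just the language code.

```python
import json
from collections import Counter
from typing import Callable, Dict, Iterable

# Toy stopword lists, for illustration only (assumption: not a real language ID model).
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to"},
    "de": {"der", "die", "und", "ist", "nicht"},
    "fr": {"le", "la", "et", "est", "les"},
}


def toy_detect(text: str) -> str:
    """Return the language whose stopwords occur most often, or 'und' if none match."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "und"


def language_proportions(
    lines: Iterable[str],
    detect: Callable[[str], str] = toy_detect,
) -> Dict[str, float]:
    """Share of documents per detected language, given JSON-lines records."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        doc = json.loads(line)
        counts[detect(doc["text"])] += 1  # assumes a "text" field per record
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()} if total else {}


# Tiny made-up sample standing in for decompressed Pile records.
sample = [
    json.dumps({"text": "The cat is on the mat and the dog is outside."}),
    json.dumps({"text": "Der Hund ist nicht im Haus und die Katze ist weg."}),
    json.dumps({"text": "Le chat est sur la table et les chiens sont dehors."}),
    json.dumps({"text": "The history of the Pile is short."}),
]
proportions = language_proportions(sample)
print(proportions)
```

Note that this counts whole documents, not sentences or bytes, so a few very long documents in one language would be underweighted. Weighting each document by `len(doc["text"])` would be one way to get a size-based proportion instead.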