microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Home Page: https://aka.ms/GeneralAI

How to get the Tatoeba corpus used in DeepNet?

SefaZeng opened this issue · comments

Describe
I am collecting the corpora used in DeepNet, but I can't find where to download the Tatoeba corpus. Is this what you used in the paper?

Also, I see that the training data used in DeepNet amounts to about 13B sentences, while M2M-100 only uses 7.5B sentences, consisting of CCMatrix and CCAligned only. So, as I understand it, this isn't a fair comparison?