EleutherAI / the-pile

Exploiting bitexts

eritain opened this issue

GPT-3 has demonstrated that a massively pretrained Transformer language model can do an OK job of machine translation. Not state of the art, but not negligible. Safe to presume there will be interest in doing the same with any open-source equivalent trained on the Pile.
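For context, GPT-3 does this purely through few-shot prompting of an autoregressive LM, with no dedicated translation head. A minimal sketch of the idea; the example pairs and exact layout are illustrative, not the GPT-3 paper's actual prompts:

```python
# Build a few-shot translation prompt for an autoregressive LM.
# The example pairs and exact format are illustrative placeholders.
examples = [
    ("The cat sleeps.", "Le chat dort."),
    ("I like coffee.", "J'aime le café."),
]

def make_prompt(source_sentence: str) -> str:
    shots = "\n".join(f"English: {en}\nFrench: {fr}" for en, fr in examples)
    # The model is expected to continue after the final "French:"
    # with a translation of the new source sentence.
    return f"{shots}\nEnglish: {source_sentence}\nFrench:"

print(make_prompt("Where is the station?"))
```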

GPT-3's training data probably did pick up paired, translation-equivalent texts that happened to be collected in Common Crawl or WebText2. But there are high-quality parallel-text resources that target this area directly. Some are already in the Pile, or are pending: OpenSubtitles #10, Europarl #25, and the UN documents #38 #39.

What I don't know is whether the Pile includes them:

- with aligned sentences interleaved, which would make masked language modeling equivalent to the translation objective in XLM (sketched below);
- with equivalent documents adjacent, which would be a weak approximation of the same; or
- with documents grouped by language, which would entirely miss out on some valuable information.
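To make the first and third options concrete, here is a minimal sketch of the two layouts. The sentence pairs are illustrative (Europarl-style), and the `</s>` separator is an assumption, not the Pile's actual format:

```python
# Two ways a bitext could be laid out in a training corpus.
aligned_pairs = [
    ("Resumption of the session", "Reprise de la session"),
    ("Please rise, then, for this minute's silence.",
     "Je vous invite à vous lever pour cette minute de silence."),
]

def interleaved(pairs) -> str:
    # Layout 1: each sentence immediately followed by its translation,
    # so both always share one context window (XLM's TLM setting).
    return "\n".join(f"{src} </s> {tgt}" for src, tgt in pairs)

def grouped_by_language(pairs) -> str:
    # Layout 3: all of one language, then all of the other; equivalent
    # sentences can land in different context windows entirely.
    return (" ".join(src for src, _ in pairs) + "\n"
            + " ".join(tgt for _, tgt in pairs))

print(interleaved(aligned_pairs))
print("---")
print(grouped_by_language(aligned_pairs))
```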

With those resources (and some smaller ones) sentence-aligned, LASER did very well at learning to map 93 languages into a shared sentence-embedding space and at using it for zero-shot cross-lingual tasks. So there's definitely good data there.
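A rough illustration of why a shared space is useful: translations end up as nearest neighbors, enabling cross-lingual retrieval and bitext mining. The `embed` function below is a hypothetical stand-in for a multilingual encoder like LASER, not its actual API:

```python
import hashlib
import numpy as np

def embed(sentence: str) -> np.ndarray:
    # Hypothetical stand-in for a multilingual encoder such as LASER;
    # a real encoder would place translations near each other. This
    # fake just hashes the text so the example runs end to end.
    seed = int(hashlib.md5(sentence.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).standard_normal(64)
    return v / np.linalg.norm(v)

def nearest(query: str, candidates: list[str]) -> str:
    # With a genuinely shared space, the cosine-nearest candidate is
    # usually the translation of the query -- the basis of zero-shot
    # cross-lingual retrieval.
    q = embed(query)
    return max(candidates, key=lambda c: float(q @ embed(c)))

print(nearest("Where is the station?", ["Où est la gare ?", "Il pleut."]))
```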

For discussion's sake, the other resources LASER used were (a subset of) the Tatoeba database, Global Voices news stories, and Quran translations from Tanzil. Bible translations are another similar resource. Might be worth hitting up Duolingo and Glossika for sentence pairs also, to the extent that it can be done without giving away their whole product.

Thoughts?

This would be a good addition to the Pile.

Will translations be added?