EleutherAI / the-pile

Appending data to the Pile.

shankerabhigyan opened this issue

Hi,

I wanted to know whether the Pile plans to integrate multilingual data anytime soon.
Some organisations in India hold archives of scholarly articles and research work that haven't received the exposure they deserve because of language barriers in international research.

I would also like some clarity on the key steps that follow once the data has been converted to the jsonlines format.
It has also been mentioned that new data must follow the lm_dataset format before it can be appended. Could you please explain the key attributes of that format, and at what point in the overall process it feeds into the training of GPT-J?
Thank you.

commented

@shankerabhigyan Read their paper, page 9.
https://arxiv.org/abs/2101.00027
https://arxiv.org/pdf/2101.00027.pdf

A fully multilingual expansion of the Pile is among their future plans.
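
On the format question, here is a minimal sketch of what a jsonlines file looks like, using only the Python standard library. It assumes each record is a JSON object with a "text" string and a "meta" dict; those field names, the sample documents, and the helper names `write_jsonl` and `read_jsonl` are my assumptions for illustration, not something confirmed in this thread, so check the repository's README for the exact requirements.

```python
import io
import json


def write_jsonl(docs, fh):
    """Write one JSON object per line to the file-like object fh.

    Each item in docs is a (text, meta) pair. The "text" and "meta"
    field names are an assumption for illustration.
    """
    for text, meta in docs:
        record = {"text": text, "meta": meta}
        # ensure_ascii=False keeps non-Latin scripts readable in the file;
        # json.dumps escapes embedded newlines, so each record stays on one line.
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(fh):
    """Yield one parsed record per non-empty line."""
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample documents in two Indian languages.
docs = [
    ("नमस्ते दुनिया", {"lang": "hi", "source": "example-archive"}),
    ("வணக்கம் உலகம்\nஇரண்டாம் வரி", {"lang": "ta", "source": "example-archive"}),
]

buf = io.StringIO()  # stands in for a real file opened with encoding="utf-8"
write_jsonl(docs, buf)
buf.seek(0)
records = list(read_jsonl(buf))
```

With a real file, replace `buf` with `open(path, "w", encoding="utf-8")` so multilingual text round-trips correctly.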