EleutherAI / the-pile

United Nations Publications

cfoster0 opened this issue

Languages: English, French, Spanish, Arabic, Russian, Chinese (should have translations for all of these)
Date ranges: 1946-2020
Size: 700,000 publications
Link to UN digital library.

Outstanding questions:

  • How many of these are downloadable through the portal?
  • Are all of the documents available in all languages?
  • What is the total corpus size (in bytes) we should expect from this?

Splitting out the speeches into a separate issue, #39.

The PDF translations of a given document in the library are listed at https://digitallibrary.un.org/record/[NUMBER]/files/ (where [NUMBER] is the document's record number).

I'm not entirely sure yet how they're ordered.
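
For anyone who wants to poke at this, here is a minimal sketch of building the file-listing URL from the pattern above. The helper name `files_url` and the example record number are hypothetical, and the sketch only builds the URL; it does not assume anything about how the listing orders the languages.

```python
# URL pattern taken from the comment above; [NUMBER] becomes {record_id}.
FILES_URL_TEMPLATE = "https://digitallibrary.un.org/record/{record_id}/files/"


def files_url(record_id: int) -> str:
    """Return the file-listing URL for one UN Digital Library record.

    `record_id` is the [NUMBER] part of the URL. This helper is a
    hypothetical illustration, not part of any existing tooling.
    """
    if record_id <= 0:
        raise ValueError("record_id must be a positive integer")
    return FILES_URL_TEMPLATE.format(record_id=record_id)


if __name__ == "__main__":
    # 12345 is a made-up record number used only for illustration.
    print(files_url(12345))
```

Fetching and parsing the listing page is left out on purpose, since the page layout and the language ordering have not been confirmed yet.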

@StellaAthena Feel free to assign this to me.

I've completed the URL-collection portion of this. There are a bit over 1.8M downloadable PDFs in the database, spread fairly evenly across the 6 official languages.
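
As a rough sanity check on those numbers, and a way to approach the open corpus-size question, here is a back-of-the-envelope sketch. It treats 1.8M as a lower bound and assumes a perfectly even split, which is only approximately true. The average size per document is not known from this thread, so it is a parameter you would have to measure from a sample; the function names are hypothetical.

```python
TOTAL_PDFS = 1_800_000  # "a bit over 1.8M" from this thread, used as a lower bound
NUM_LANGUAGES = 6       # English, French, Spanish, Arabic, Russian, Chinese


def pdfs_per_language(total: int = TOTAL_PDFS, languages: int = NUM_LANGUAGES) -> int:
    """Approximate PDFs per language, assuming a perfectly even split."""
    return total // languages


def estimated_corpus_bytes(avg_bytes_per_doc: int, total: int = TOTAL_PDFS) -> int:
    """Estimate total size from a measured average size per document.

    `avg_bytes_per_doc` must come from sampling real downloads (or their
    extracted text); no value for it is given anywhere in this thread.
    """
    if avg_bytes_per_doc < 0:
        raise ValueError("avg_bytes_per_doc must be non-negative")
    return avg_bytes_per_doc * total


if __name__ == "__main__":
    print(pdfs_per_language())  # 300000
```

Once a few hundred documents have been downloaded and converted, plugging their mean size into `estimated_corpus_bytes` would give a first estimate for the total.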


@cfoster0 What ended up happening with this?

Nothing new. If someone is interested in downloading the docs and/or converting them to text, I'd be happy to share the URL list. Otherwise, I was waiting for the v1 work to finish.