Create dataset opus_100
albertvillanova opened this issue
- uid: opus_100
- type: processed
- description:
- name: OPUS-100
- description: OPUS-100 is an English-centric multilingual corpus covering 100 languages.
- homepage: https://github.com/EdinburghNLP/opus-100-corpus
- validated: True
- languages:
- language_names:
- Niger-Congo
- Arabic
- Catalan
- Chinese
- English
- French
- Indonesian
- Portuguese
- Spanish
- language_comments:
- language_locations:
- Netherlands
- Switzerland
- Scotland
- validated: False
- language_names:
- custodian:
- name: Jörg Tiedemann
- in_catalogue:
- type: A private individual
- location: Turkey
- contact_name:
- contact_email: jorg.tiedemann@helsinki.fi
- contact_submitter: False
- additional: https://opus.nlpl.eu/
- validated: False
- availability:
- procurement:
- for_download: Yes - it has a direct download link or links
- download_url: https://opus.nlpl.eu/opus-100.php
- download_email:
- licensing:
- has_licenses: Unclear
- license_text: The website states: "In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus." Based on this, I believe the data can be used.
- license_properties:
- license_list:
- pii:
- has_pii: Yes
- generic_pii_likely: very likely
- generic_pii_list:
- names
- physical addresses
- numeric_pii_likely: unlikely
- numeric_pii_list:
- sensitive_pii_likely: unlikely
- sensitive_pii_list:
- no_pii_justification_class:
- no_pii_justification_text:
- validated: False
- procurement:
- processed_from_primary:
- from_primary: Taken from primary source
- primary_availability: Yes - their documentation/homepage/description is available
- primary_license: Unclear / I don't know
- primary_types:
- news articles
- web | wiki
- web | other
- validated: False
- from_primary_entries:
- media:
- category:
- text
- text_format:
- .TXT
- audiovisual_format:
- image_format:
- database_format:
- .GZ
- .TAR
- text_is_transcribed: No
- instance_type: sentence pair
- instance_count: 1M<n<1B
- instance_size: 10<n<100
- validated: False
- category:
- fname: opus_100.json
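The dataset already exists on the Hugging Face Hub: https://huggingface.co/datasets/opus100
Loading it requires passing a language pair as the configuration name: English (en) plus one other language.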
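A minimal sketch of loading one pair with the `datasets` library. The helper names `opus100_config` and `load_pair` are hypothetical, and the code assumes the configuration names on the Hub put the two language codes in alphabetical order (for example `ar-en` and `en-fr`); check the dataset page if a name is rejected.

```python
def opus100_config(lang: str, pivot: str = "en") -> str:
    """Build the config name for an English-centric pair.

    Assumption: the Hub configs list the two codes alphabetically,
    so Arabic gives 'ar-en' while French gives 'en-fr'.
    """
    if lang == pivot:
        raise ValueError("the second language must differ from the pivot")
    return "-".join(sorted([lang, pivot]))


def load_pair(lang: str, split: str = "train"):
    """Load one OPUS-100 language pair (needs network access)."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    # "opus100" is the dataset id taken from the link above.
    return load_dataset("opus100", opus100_config(lang), split=split)
```

For example, `load_pair("fr")` would request the `en-fr` configuration.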
Done. The resulting per-language datasets:
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ar_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ca_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_es_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_eu_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_fr_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_id_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_pt_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_vi_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_zh_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-as_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-bn_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-gu_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-hi_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-kn_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-ml_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-mr_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-or_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-pa_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-ta_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-te_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-ur_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_nigercongo-ig_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_nigercongo-rw_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_nigercongo-xh_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_nigercongo-yo_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_nigercongo-zu_opus100
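The repository names above appear to follow the pattern `lm_<lang>_opus100`, with an optional group prefix such as `indic-` or `nigercongo-` before the language code. The sketch below builds and parses names under that assumption; the pattern is inferred from the list only, and the function names are hypothetical.

```python
from typing import Optional, Tuple

ORG = "bigscience-catalogue-lm-data"  # organization taken from the URLs above


def build_repo_name(lang: str, group: Optional[str] = None) -> str:
    """Return e.g. 'lm_fr_opus100' or 'lm_indic-hi_opus100'."""
    tag = f"{group}-{lang}" if group else lang
    return f"lm_{tag}_opus100"


def parse_repo_url(url: str) -> Tuple[Optional[str], str]:
    """Extract (group, lang) from one of the dataset URLs."""
    name = url.rstrip("/").rsplit("/", 1)[-1]  # e.g. 'lm_indic-hi_opus100'
    prefix, tag, suffix = name.split("_")
    if prefix != "lm" or suffix != "opus100":
        raise ValueError(f"unexpected repository name: {name}")
    if "-" in tag:
        group, lang = tag.split("-", 1)
        return group, lang
    return None, tag
```

For instance, `parse_repo_url` applied to the `lm_nigercongo-yo_opus100` link yields the group `nigercongo` and the language code `yo`.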