bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.

Create dataset opus_100

albertvillanova opened this issue

  • uid: opus_100
  • type: processed
  • description:
  • languages:
    • language_names:
      • Niger-Congo
      • Arabic
      • Catalan
      • Chinese
      • English
      • French
      • Indonesian
      • Portuguese
      • Spanish
    • language_comments:
    • language_locations:
      • Netherlands
      • Switzerland
      • Scotland
    • validated: False
  • custodian:
  • availability:
    • procurement:
    • licensing:
      • has_licenses: Unclear
      • license_text: Quoting the website "In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus.", I believe the data can be used.
      • license_properties:
      • license_list:
    • pii:
      • has_pii: Yes
      • generic_pii_likely: very likely
      • generic_pii_list:
        • names
        • physical addresses
      • numeric_pii_likely: unlikely
      • numeric_pii_list:
      • sensitive_pii_likely: unlikely
      • sensitive_pii_list:
      • no_pii_justification_class:
      • no_pii_justification_text:
    • validated: False
  • processed_from_primary:
    • from_primary: Taken from primary source
    • primary_availability: Yes - their documentation/homepage/description is available
    • primary_license: Unclear / I don't know
    • primary_types:
      • news articles
      • web | wiki
      • web | other
    • validated: False
    • from_primary_entries:
  • media:
    • category:
      • text
    • text_format:
      • .TXT
    • audiovisual_format:
    • image_format:
    • database_format:
      • .GZ
      • .TAR
    • text_is_transcribed: No
    • instance_type: sentence pair
    • instance_count: 1M<n<1B
    • instance_size: 10<n<100
    • validated: False
  • fname: opus_100.json

It already exists: https://huggingface.co/datasets/opus100

Loading it requires passing a language pair as the config name: en + the other language.
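A minimal sketch of what passing the language pair could look like with the `datasets` library. The dataset ID `opus100` is taken from the URL above; the helper names `opus100_config`, `load_pair`, and `target_side` are hypothetical. It assumes the config names join the two language codes in alphabetical order (so `ar-en` but `en-fr`) and that each record holds a `translation` dict keyed by language code; both are worth checking against the dataset card.

```python
def opus100_config(lang: str) -> str:
    """Build the config name that pairs `lang` with English.

    Assumption: the two codes are joined in alphabetical order.
    """
    if lang == "en":
        raise ValueError("OPUS-100 pairs English with another language")
    return "-".join(sorted(["en", lang]))


def load_pair(lang: str, split: str = "train"):
    """Load one English-centric pair from the Hub (needs network access)."""
    # Imported lazily so opus100_config works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("opus100", opus100_config(lang), split=split)


def target_side(example: dict, lang: str) -> str:
    """Return the non-English sentence of a pair.

    Assumption: records look like {"translation": {"en": ..., lang: ...}}.
    """
    return example["translation"][lang]
```

For example, `load_pair("fr")` would request the `en-fr` config, and mapping `target_side` over the result keeps only the French sentences, which is presumably how a monolingual subset per language would be derived.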

DONE:

https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ar_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ca_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_es_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_eu_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_fr_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_id_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_pt_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_vi_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_zh_opus100

https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-as_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-bn_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-gu_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-hi_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-kn_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-ml_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-mr_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-or_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-pa_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-ta_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-te_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-ur_opus100

https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_nigercongo-ig_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_nigercongo-rw_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_nigercongo-xh_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_nigercongo-yo_opus100
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_nigercongo-zu_opus100
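The URLs in the list above follow a visible naming pattern: `lm_<lang>_opus100`, with an `indic-` or `nigercongo-` prefix on the language code for those two groups. A small sketch that rebuilds the URLs from a language code, inferred only from the list itself; the function names are hypothetical and this is not taken from the project's tooling.

```python
from typing import Optional

# Organisation name as it appears in the URLs listed above.
HUB_ORG = "bigscience-catalogue-lm-data"


def catalogue_repo_id(lang: str, group: Optional[str] = None) -> str:
    """Return the repo ID for a language, e.g. ('hi', 'indic') -> '.../lm_indic-hi_opus100'."""
    prefix = f"{group}-{lang}" if group else lang
    return f"{HUB_ORG}/lm_{prefix}_opus100"


def catalogue_url(lang: str, group: Optional[str] = None) -> str:
    """Return the full Hub URL for the per-language dataset."""
    return f"https://huggingface.co/datasets/{catalogue_repo_id(lang, group)}"
```

Calling `catalogue_url("yo", "nigercongo")` reproduces the Yoruba entry in the list, which makes it easy to check that no expected language is missing.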