bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.

Create dataset wikihow_vietnamese_human_instructions

albertvillanova opened this issue

  • uid: wikihow_vietnamese_human_instructions
  • type: processed
  • description:
    • name: wikiHow Vietnamese Human Instructions
    • description: Step-by-step instructions in Vietnamese extracted from wikiHow and decomposed into a formal graph representation in RDF. For any queries and requests contact: Paolo Pareti
      To cite this dataset use:
      Paula Chocron, Paolo Pareti. Vocabulary Alignment for Collaborative Agents: a Study with Real-World Multilingual How-to Instructions.
    • homepage: https://www.kaggle.com/paolop/human-instructions-vietnamese-wikihow
    • validated: True
  • languages:
    • language_names:
      • Vietnamese
    • language_comments:
    • language_locations:
      • Asia
      • Vietnam
    • validated: False
  • custodian:
  • availability:
    • procurement:
    • licensing:
      • has_licenses: Yes
      • license_text: CC BY-NC-SA 4.0
      • license_properties:
      • license_list:
        • cc-by-nc-4.0: Creative Commons Attribution Non Commercial 4.0 International
    • pii:
      • has_pii: Yes
      • generic_pii_likely: somewhat likely
      • generic_pii_list:
        • names
        • website account name or handle
        • URLs
      • numeric_pii_likely: somewhat likely
      • numeric_pii_list:
      • sensitive_pii_likely: somewhat likely
      • sensitive_pii_list:
      • no_pii_justification_class:
      • no_pii_justification_text:
    • validated: False
  • processed_from_primary:
    • from_primary: Taken from primary source
    • primary_availability: Yes - their documentation/homepage/description is available
    • primary_license: Yes - the dataset has the same license as the source material
    • primary_types:
      • web | wiki
    • validated: False
    • from_primary_entries:
  • media:
    • category:
      • text
    • text_format:
      • other
      • RDF
    • audiovisual_format:
    • image_format:
    • database_format:
      • .ZIP
    • text_is_transcribed: No
    • instance_type: Sentences / instructions
    • instance_count: 1K<n<10K
    • instance_size: 10<n<100
    • validated: False
  • fname: wikihow_vietnamese_human_instructions.json
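Most sections of the entry above are still marked `validated: False`. As a rough illustration, the sketch below lists the sections that still await validation. It assumes that the JSON file named in `fname` mirrors the nesting of the bullets; the excerpt embedded in the code is a hand-written approximation and not the actual file contents, and `unvalidated_sections` is a hypothetical helper name.

```python
import json

# Hypothetical excerpt: assumes the catalogue JSON mirrors the bullet nesting
# shown in the issue. Only the "validated" flags are reproduced here.
ENTRY_JSON = """
{
  "uid": "wikihow_vietnamese_human_instructions",
  "type": "processed",
  "description": {"validated": true},
  "languages": {"validated": false},
  "availability": {"validated": false},
  "processed_from_primary": {"validated": false},
  "media": {"validated": false}
}
"""

def unvalidated_sections(entry: dict) -> list:
    """Return the names of top-level sections whose 'validated' flag is False."""
    return [
        name
        for name, section in entry.items()
        if isinstance(section, dict) and section.get("validated") is False
    ]

entry = json.loads(ENTRY_JSON)
pending = unvalidated_sections(entry)
print("Sections still to validate:", ", ".join(pending))
```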

#self-assign

Note: This is part of a multilingual resource: https://www.kaggle.com/paolop/human-instructions-multilingual-wikihow

Parent project: http://paolopareti.uk/homepage/prohow/index.htm

The multilingual dataset covers the following languages, with the number of articles in each:

  • English: 133,842
  • German: 57,533
  • Hindi: 6,519
  • Russian: 127,738
  • Korean: 7,606
  • Portuguese: 92,520
  • Italian: 79,656
  • French: 60,105
  • Spanish: 120,507
  • Chinese: 82,558
  • Czech: 10,619
  • Arabic: 15,589
  • Thai: 10,213
  • Vietnamese: 8,670
  • Indonesian: 39,246
  • Dutch: 19,318
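
The per-language figures above are whole article counts written with a thousands separator. As a rough illustration, the sketch below parses them into integers and computes the share of Vietnamese articles in the multilingual resource. `ARTICLE_COUNTS` and `parse_count` are names made up for this example.

```python
# Article counts from the list above, kept as strings with a thousands
# separator as written in the original issue.
ARTICLE_COUNTS = {
    "English": "133.842",
    "German": "57.533",
    "Hindi": "6.519",
    "Russian": "127.738",
    "Korean": "7.606",
    "Portuguese": "92.520",
    "Italian": "79.656",
    "French": "60.105",
    "Spanish": "120.507",
    "Chinese": "82.558",
    "Czech": "10.619",
    "Arabic": "15.589",
    "Thai": "10.213",
    "Vietnamese": "8.670",
    "Indonesian": "39.246",
    "Dutch": "19.318",
}

def parse_count(text: str) -> int:
    """Parse a count such as '8.670' or '8,670', where the separator groups thousands."""
    return int(text.replace(".", "").replace(",", ""))

counts = {language: parse_count(raw) for language, raw in ARTICLE_COUNTS.items()}
total = sum(counts.values())
vietnamese_share = counts["Vietnamese"] / total

print(f"{total} articles in total; Vietnamese share: {vietnamese_share:.2%}")
```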

The dataset is in RDF format.
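
Since the metadata above lists the download as a .ZIP archive containing RDF, a likely first step after downloading is to check which archive members hold the RDF data. The following is a minimal sketch using only Python's standard library. The function name `list_rdf_members` and the set of file extensions are assumptions, since the issue does not say how the files inside the archive are named or which RDF serialization they use.

```python
import zipfile
from pathlib import PurePosixPath

# Assumed extensions for common RDF serializations; the issue does not name one.
RDF_EXTENSIONS = {".rdf", ".ttl", ".nt", ".n3", ".owl"}

def list_rdf_members(archive) -> list:
    """Return names of archive members whose extension suggests RDF content.

    `archive` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return [
            name
            for name in zf.namelist()
            if PurePosixPath(name).suffix.lower() in RDF_EXTENSIONS
        ]
```

The returned member names could then be passed to an RDF parser of your choice, for example rdflib if it is installed, once the actual serialization has been confirmed.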