m0rp43us / dataset

darija <-> english dataset

Darija Open Dataset

Darija Open Dataset (DODa) is an open-source project for the Moroccan dialect. With more than 21,000 entries, DODa is arguably the largest open-source collaborative project for Darija <=> English translation built for Natural Language Processing purposes.

Beyond semantic categorization, DODa also categorizes entries syntactically, lists words under different spellings, offers verb-to-noun and masculine-to-feminine correspondences, and contains the conjugation of hundreds of verbs in different tenses as well as more than 10,000 translated sentences.

This open-source project aims to be a reference for Darija NLP. We hope the Moroccan IT community will contribute, so that DODa can serve as a foundation for future NLP applications that benefit Moroccans.


wordcloud of DODa.

Check out this introductory video about DODa.


How to contribute

We've made a tutorial for you on DODa's website.


Guidelines / Recommendations

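  1. 3ndk ح dir ح xD (shout-out to this guy 😆), often try to use:

| darija | 3 | 7 | 9 | 8 | 2 - 'a' - 'i' | 5 - 'kh' |
| --- | --- | --- | --- | --- | --- | --- |
| arabic | ع | ح | ق | ه | همزة | خ |

  2. Try to use capitalization to differentiate between the following letters:

| Latin | t | T | s | S | d | D |
| --- | --- | --- | --- | --- | --- | --- |
| Arabic | ت | ط | س | ص | د | ض |

  3. Arabic characters with two-letter Latin equivalents:

| Arabic alphabet | ش | غ | خ |
| --- | --- | --- | --- |
| Latin alphabet | ch | gh | kh |

  4. Double a character to mark emphasis ("الشدة"):

| darija | 7mam | 7mmam |
| --- | --- | --- |
| english | pigeons | bathroom |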
  5. We usually don't add "e" at the end of Darija words: louz instead of louze

  6. We usually don't use "Z" or "th" for ظ ، ذ ، ث , because we generally don't use these letters in Darija (except in northern Morocco, but for the sake of simplicity, we are focusing primarily on standard Darija)

  7. When an expression contains commas, don't forget to surround it with quotation marks (the files are CSV)

  8. We use spaces as word delimiters, not _ or -: thank you instead of thank_you

  9. Respect the number of columns in every row you add. If you don't have extra variations, use empty quotation marks "" or simply leave the field empty:

"sou9","souk","","market"

sou9,souk,,market

  10. In each row, always start with the most used form (in your opinion, of course) of the word in question

  11. Because this dataset may later be used to train deep neural networks, try to reserve each row for similar variations of the same word. For instance, "sou9" and "marchi" both translate to "market", yet it's better to separate them into two different rows:

sou9,souk,souq,market

marchi,,,market

  12. verbs.csv: The Darija translation is reserved for the past tense of the third-person masculine singular ("he"), whereas the other pronouns and tenses are handled in separate files. The English translation presents the base form (or root) of the English verb.

ghnna,ghenna,ghanna,,,,sing

  13. masculine_feminine_plural.csv: The feminine-plural column is for nouns, when such a form exists. For adjectives, feminine-plural = feminine.
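
The CSV guidelines above (quoting expressions that contain commas, keeping the column count constant, leaving unused variation fields empty) can be sketched with Python's standard `csv` module. This is only an illustration, not part of DODa's tooling: `format_row`, `parse_row`, and the default of three spelling columns are assumptions made for the example, and the comma-containing gloss is made up to show quoting.

```python
import csv
import io


def format_row(variants, english, n_variants=3):
    """Return one CSV line: n_variants Darija spellings, then the English gloss.

    n_variants is an assumption; each DODa file has its own column count.
    """
    if not variants:
        raise ValueError("at least one Darija spelling is required")
    if len(variants) > n_variants:
        raise ValueError(f"expected at most {n_variants} spellings, got {len(variants)}")
    # Pad missing variations with empty fields so every row has the same width.
    padded = list(variants) + [""] * (n_variants - len(variants))
    buffer = io.StringIO()
    # QUOTE_MINIMAL quotes only the fields that need it, e.g. those containing a comma.
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(padded + [english])
    return buffer.getvalue().rstrip("\n")


def parse_row(line):
    """Parse a single CSV line into a list of fields."""
    return next(csv.reader([line]))


print(format_row(["sou9", "souk"], "market"))
print(format_row(["marchi"], "market"))
# Made-up gloss, used only to show that a comma forces quotation marks.
print(format_row(["7mmam"], "bathroom, hammam"))

# The quoted and unquoted forms of a row parse to the same fields.
print(parse_row('"sou9","souk","","market"') == parse_row("sou9,souk,,market"))
```

Writing rows through a CSV writer rather than joining strings by hand means the quoting rule is applied automatically, and the padding step keeps the column count fixed.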

PyDODa - Python wrapper for DODa

Pydoda is a Python library that simplifies access to and analysis of the DODa dataset. It gives researchers, developers, and language enthusiasts an intuitive interface for accessing the dataset's categories and retrieving spellings and translations.

Integrating Pydoda into your Python workflow grants access to a wide range of functionalities, facilitating insights extraction from the DODa dataset, including semantic and syntactic analysis, translation retrieval, spelling variations exploration, and more.

Usage example

Pydoda can be installed using pip:

pip install pydoda

Here is a small code snippet:

from pydoda import Category

# Create an instance of Category
my_category = Category('semantic', 'animals')

# Get the Darija translation of a word
darija_translation = my_category.get_darija_translation('dog')
print(darija_translation)
# Output: klb

# Get the English translation of a word
english_translation = my_category.get_english_translation('mch')
print(english_translation)
# Output: 'cat'

For further details, visit the official Pydoda GitHub repository & official Pydoda documentation.
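
For readers who want to see the idea behind such a lookup without installing anything, here is a rough, library-free sketch using only the standard library. It is not how Pydoda is implemented. It assumes a file layout with no header row in which the English translation is the last column and the preceding columns are Darija spellings; `build_index` and `SAMPLE` are names invented for this example, and the sample rows are the "market" rows from the guidelines.

```python
import csv
import io
from collections import defaultdict

# Sample rows taken from the contribution guidelines. A real file would be read
# with open(path, newline="", encoding="utf-8") and may start with a header row,
# which this sketch does not skip.
SAMPLE = "sou9,souk,souq,market\nmarchi,,,market\n"


def build_index(csv_text):
    """Map each English gloss to a list of rows of non-empty Darija spellings."""
    index = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # ignore blank lines
        *spellings, english = row  # assumption: English is the last column
        index[english].append([s for s in spellings if s])
    return dict(index)


index = build_index(SAMPLE)
print(index["market"])
# [['sou9', 'souk', 'souq'], ['marchi']]
```

Each inner list corresponds to one row, so spellings of the same word stay grouped while distinct words with the same translation stay apart.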

Citation

@misc{outchakoucht2021moroccan,
      title={Moroccan Dialect -Darija- Open Dataset},
      author={Aissam Outchakoucht and Hamza Es-Samaali},
      year={2021},
      eprint={2103.09687},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
