This script converts the Four Arabic Dialects POS tagged Dataset (released under the Apache License 2.0; article: Darwish ea 2018) to UPOS tags.
It was used for the paper "Does manipulating tokenization aid cross-lingual transfer? A study on POS tagging for non-standardized languages" (Blaschke ea, VarDial 2023).
The resulting files have one token per line, with empty lines indicating sentence boundaries. Each token line consists of the following fields, in this order: the form, the tag, and (if applicable) the merged tag sequences.
Merged tag sequences are included when the flag --include_tag_details is used.
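A file in this layout could be read back roughly as sketched here. This is not part of the repository: the function name `read_sentences` is hypothetical, and the tab separator is an assumption based on the `.tsv` extension of the output files. The sample tokens are made-up placeholders.

```python
from typing import Iterable, Iterator, List, Tuple

# (form, tag) or, with --include_tag_details, (form, tag, merged tag sequences)
Token = Tuple[str, ...]


def read_sentences(lines: Iterable[str], sep: str = "\t") -> Iterator[List[Token]]:
    """Yield sentences as lists of token tuples.

    Assumes one token per line, fields separated by `sep` (assumed to be a
    tab), and empty lines between sentences.
    """
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Empty line: close the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(tuple(line.split(sep)))
    if sentence:  # the input may not end with an empty line
        yield sentence


# Placeholder data, only to illustrate the layout.
sample = "w1\tNOUN\nw2\tVERB\n\nw3\tPUNCT\n"
sentences = list(read_sentences(sample.splitlines()))
print(sentences)
```

For a converted file, the same function can be applied to an open file handle, for example `open("test_dar-egy_UPOS.tsv", encoding="utf-8")`.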
# Optional preliminary check:
python3 check_arabic_segmentation.py dialectal_arabic_resources/seg_plus_pos_egy.txt dialectal_arabic_resources/seg_plus_pos_lev.txt dialectal_arabic_resources/seg_plus_pos_glf.txt dialectal_arabic_resources/seg_plus_pos_mgr.txt > arabic_preprocessing.log
# The actual data conversion:
python3 convert.py --dir dialectal_arabic_resources/ --files seg_plus_pos_egy.txt --out test_dar-egy_UPOS.tsv
python3 convert.py --dir dialectal_arabic_resources/ --files seg_plus_pos_glf.txt --out test_dar-glf_UPOS.tsv
python3 convert.py --dir dialectal_arabic_resources/ --files seg_plus_pos_lev.txt --out test_dar-lev_UPOS.tsv
python3 convert.py --dir dialectal_arabic_resources/ --files seg_plus_pos_mgr.txt --out test_dar-mgr_UPOS.tsv
# Optional checks:
python3 validate_converted_file.py test_dar-egy_UPOS.tsv tagset_upos.txt
python3 validate_converted_file.py test_dar-glf_UPOS.tsv tagset_upos.txt
python3 validate_converted_file.py test_dar-lev_UPOS.tsv tagset_upos.txt
python3 validate_converted_file.py test_dar-mgr_UPOS.tsv tagset_upos.txt
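The conversion and validation calls above differ only in the dialect code, so they could also be generated in a loop. The following is a minimal sketch, not a script from this repository: `conversion_commands` is a hypothetical helper, and it assumes that --include_tag_details is a value-less flag of convert.py.

```python
import shlex

DATA_DIR = "dialectal_arabic_resources/"
DIALECTS = ["egy", "glf", "lev", "mgr"]


def conversion_commands(include_tag_details=False):
    """Build the convert/validate command lines for all four dialects."""
    commands = []
    for dialect in DIALECTS:
        out_file = f"test_dar-{dialect}_UPOS.tsv"
        convert = [
            "python3", "convert.py",
            "--dir", DATA_DIR,
            "--files", f"seg_plus_pos_{dialect}.txt",
            "--out", out_file,
        ]
        if include_tag_details:
            # Assumption: the flag belongs to convert.py and takes no value.
            convert.append("--include_tag_details")
        validate = ["python3", "validate_converted_file.py", out_file, "tagset_upos.txt"]
        commands.extend([convert, validate])
    return commands


# Print the commands instead of running them.
for command in conversion_commands():
    print(shlex.join(command))
```

To actually execute the commands, each list can be passed to `subprocess.run(command, check=True)` from the directory that contains the scripts.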
See also Appendix B of our paper TBD. The inverse table (sorted by UPOS tag) can also be found there.
Relevant documentation:
- Multi-Arabic POS tagging: A CRF approach (Darwish ea, LREC 2018) -- the paper describing the corpus
- Using stem-templates to improve Arabic POS and gender/number tagging (Darwish ea, LREC 2014) -- information on the Farasa tagset on which the corpus's tagset is based
- The corpus description pages for Arabic treebanks in general and for UD Arabic PADT in particular
- Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations (Sanguinetti ea, LRE 2022)
- Arabic Dialects Segmentation Guidelines -- the guidelines according to which the corpus was originally segmented
- "A reference grammar of Modern Standard Arabic" (Ryding 2005, Cambridge University Press)
- "The syntax of spoken Arabic" (Brustad 2000, Georgetown University Press)
Original tag | Description | UPOS | Note |
---|---|---|---|
ABBREV | Abbreviation | not in dataset | |
ADJ | Adjective | ADJ | restore merged sequences where relevant: (DET+)ADJ(+CASE/NSUFF) -> ADJ |
ADV | Adverb | ADV | -- |
CASE | Case (tanween) | merged with NOUN/ADJ morphemes where possible, otherwise X | |
CONJ | Conjunction | CCONJ, (SCONJ) | UD Arabic documentation: "subordinating and coordinating conjunctions are not distinguished (the CCONJ tag is used)", although there is an exception for a small group of subordinating conjunctions/particles (cf. Ryding pp. 422, 611, 673) |
DET | Determiner | DET | merged with NOUN/ADJ morphemes where possible |
EMOT | Emoji | SYM | -- |
FOREIGN | Foreign words, non-words | X | |
FUT_PART | Prefix or particle marking future tense | AUX | |
HASH | Hashtag | X | Using the actual POS tag, as recommended by Sanguinetti ea, is too difficult. We settle for X since this 1. matches what several other treebanks do, and 2. is consistent with non-Arabic tokens being tagged X, as many hashtags are non-Arabic too (although there are also many Arabic hashtags). |
JUS | Jussive form of verb | not in dataset, but would be VERB with Mood=Jus | |
MENTION | Mention | PROPN | per Sanguinetti ea's recommendation |
NEG_PART | Negation particle | PART | |
NOUN | Noun | NOUN | restore merged sequences where relevant: (DET+)NOUN(+CASE/NSUFF) -> NOUN |
NSUFF | Noun suffix | merged with NOUN/ADJ morphemes where possible, otherwise X | |
NUM | Number | NUM | -- |
PART | Particle | PART, SCONJ | see below |
PREP | Preposition | ADP | -- |
PROG_PART | Progressive particle | merged with VERB morphemes where possible, otherwise X | can be marked with Aspect=Prog when joined with a verb; a potentially tricky assignment as progressivity is handled differently in MSA and non-standard Arabic (Brustad pp. 142, 246/7) |
PRON | Pronoun | PRON | -- |
PUNC | Punctuation | PUNCT | -- |
URL | URL | SYM | |
V | Verb | VERB | -- |
VSUFF | Verbal suffix | not in dataset | |
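
The merging rules in the table can be illustrated with a small sketch. This is not the code of convert.py: the function `to_upos` and the representation of a word as a list of original morpheme tags are assumptions made for illustration. The sketch also simplifies in two places: CONJ and PART are always mapped to CCONJ and PART (the SCONJ cases are ignored), and any number of trailing CASE/NSUFF tags is merged into a preceding NOUN or ADJ.

```python
# One-to-one mappings from the table; tags marked "not in dataset" are omitted.
SIMPLE = {
    "ADJ": "ADJ", "ADV": "ADV", "CONJ": "CCONJ", "DET": "DET", "EMOT": "SYM",
    "FOREIGN": "X", "FUT_PART": "AUX", "HASH": "X", "MENTION": "PROPN",
    "NEG_PART": "PART", "NOUN": "NOUN", "NUM": "NUM", "PART": "PART",
    "PREP": "ADP", "PRON": "PRON", "PUNC": "PUNCT", "URL": "SYM", "V": "VERB",
    # Fallbacks for morphemes that could not be merged:
    "CASE": "X", "NSUFF": "X", "PROG_PART": "X",
}


def to_upos(tags):
    """Map the original morpheme tags of one word to UPOS tags, merging
    (DET+)NOUN/ADJ(+CASE/NSUFF) and PROG_PART+V sequences."""
    upos = []
    i, n = 0, len(tags)
    while i < n:
        # Skip an optional DET that directly precedes a NOUN or ADJ.
        starts_with_det = (
            tags[i] == "DET" and i + 1 < n and tags[i + 1] in ("NOUN", "ADJ")
        )
        head = i + 1 if starts_with_det else i
        if tags[head] in ("NOUN", "ADJ"):
            end = head + 1
            # Assumption: absorb any run of trailing CASE/NSUFF tags.
            while end < n and tags[end] in ("CASE", "NSUFF"):
                end += 1
            upos.append(tags[head])
            i = end
        elif tags[i] == "PROG_PART" and i + 1 < n and tags[i + 1] == "V":
            upos.append("VERB")
            i += 2
        else:
            upos.append(SIMPLE[tags[i]])
            i += 1
    return upos


print(to_upos(["PREP", "DET", "NOUN", "NSUFF"]))
print(to_upos(["PROG_PART", "V", "PRON"]))
```

Unknown tags raise a `KeyError` in this sketch, which makes unexpected input visible instead of silently mapping it to X.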