mainlp / convert-qcri-4dialects

Converts the Four Arabic Dialects POS tagged Dataset (Darwish et al. 2018) to UPOS

qcri_dialectal-arabic-resources_UPOS

This script converts the Four Arabic Dialects POS tagged Dataset (released under the Apache License 2.0; article: Darwish et al. 2018) to UPOS tags.

It was used for the paper "Does manipulating tokenization aid cross-lingual transfer? A study on POS tagging for non-standardized languages" (Blaschke et al., VarDial 2023).

The resulting files have one token per line, with empty lines indicating sentence boundaries. Tokens are annotated like this:

form	tag	merged tag sequences (if applicable)

Merged tag sequences are included when the flag --include_tag_details is used.
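The format described above can be read back with a few lines of Python. The following is only a minimal sketch, not part of this repository: the function name `read_sentences` and the sample tokens are made up for illustration, and the exact shape of the third column is an assumption.

```python
import io


def read_sentences(lines):
    """Group tab-separated token lines into sentences.

    Each token becomes a (form, tag, details) tuple; details is None when
    the optional third column (merged tag sequence) is absent.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        form, tag = cols[0], cols[1]
        details = cols[2] if len(cols) > 2 else None
        current.append((form, tag, details))
    if current:
        # The file may not end with an empty line.
        sentences.append(current)
    return sentences


# Made-up sample with placeholder forms (not taken from the dataset).
sample = "w1\tNOUN\tDET+NOUN\nw2\tVERB\n\nw3\tPUNCT\n"
sentences = read_sentences(io.StringIO(sample))
```

To read a converted file instead of the sample, pass an open file handle, for example `read_sentences(open("test_dar-egy_UPOS.tsv", encoding="utf-8"))`.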

Usage

# Optional preliminary check:
python3 check_arabic_segmentation.py dialectal_arabic_resources/seg_plus_pos_egy.txt dialectal_arabic_resources/seg_plus_pos_lev.txt dialectal_arabic_resources/seg_plus_pos_glf.txt dialectal_arabic_resources/seg_plus_pos_mgr.txt  > arabic_preprocessing.log

# The actual data conversion:
python3 convert.py --dir dialectal_arabic_resources/ --files seg_plus_pos_egy.txt --out test_dar-egy_UPOS.tsv
python3 convert.py --dir dialectal_arabic_resources/ --files seg_plus_pos_glf.txt --out test_dar-glf_UPOS.tsv
python3 convert.py --dir dialectal_arabic_resources/ --files seg_plus_pos_lev.txt --out test_dar-lev_UPOS.tsv
python3 convert.py --dir dialectal_arabic_resources/ --files seg_plus_pos_mgr.txt --out test_dar-mgr_UPOS.tsv

# Optional checks:
python3 validate_converted_file.py test_dar-egy_UPOS.tsv tagset_upos.txt
python3 validate_converted_file.py test_dar-glf_UPOS.tsv tagset_upos.txt
python3 validate_converted_file.py test_dar-lev_UPOS.tsv tagset_upos.txt
python3 validate_converted_file.py test_dar-mgr_UPOS.tsv tagset_upos.txt
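For readers who want a quick sanity check of their own, the sketch below shows one kind of test that could be run on a converted file: every tag in the second column must belong to the set of 17 UPOS tags. It is an illustration only and does not reproduce what validate_converted_file.py does; the function name `find_invalid_tags` and the sample lines are hypothetical, and the tag set is hard-coded here instead of being read from tagset_upos.txt, whose format is not described in this README.

```python
# The 17 universal POS tags defined by Universal Dependencies.
UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def find_invalid_tags(token_lines, allowed_tags):
    """Return (line_number, tag) pairs whose tag is not in allowed_tags."""
    problems = []
    for number, raw in enumerate(token_lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # empty line: sentence boundary, nothing to check
        cols = line.split("\t")
        # A line without a second column is reported with an empty tag.
        tag = cols[1] if len(cols) > 1 else ""
        if tag not in allowed_tags:
            problems.append((number, tag))
    return problems


# Made-up sample: PREP is an original tag that should have become ADP.
lines = ["w1\tNOUN\n", "w2\tPREP\n", "\n", "w3\tPUNCT\n"]
problems = find_invalid_tags(lines, UPOS)
```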

Details

See also Appendix B of our paper TBD. The inverse table (sorted by UPOS tag) can also be found there.

Relevant documentation:

| Original tag | Description | UPOS | Note |
|---|---|---|---|
| ABBREV | Abbreviation | | not in dataset |
| ADJ | Adjective | ADJ | restore merged sequences where relevant: (DET+)ADJ(+CASE/NSUFF) -> ADJ |
| ADV | Adverb | ADV | -- |
| CASE | Case (tanween) | merged with NOUN/ADJ morphemes where possible, otherwise X | |
| CONJ | Conjunction | CCONJ, (SCONJ) | UD Arabic documentation: "subordinating and coordinating conjunctions are not distinguished (the CCONJ tag is used)", although there is an exception for a small group of subordinating conjunctions/particles (cf. Ryding pp. 422, 611, 673) |
| DET | Determiner | DET | merged with NOUN/ADJ morphemes where possible |
| EMOT | Emoji | SYM | -- |
| FOREIGN | Foreign words, non-words | X | |
| FUT_PART | Prefix or particle marking future tense | AUX | |
| HASH | Hashtag | X | Using the actual POS tag as recommended by Sanguinetti et al. is too difficult. Settling for X since it 1. matches what several other treebanks are doing, and 2. matches that non-Arabic tokens are X, and many hashtags are non-Arabic too (although there also are many Arabic hashtags). |
| JUS | Jussive form of verb | | not in dataset, but would be VERB with Mood=Jus |
| MENTION | Mention | PROPN | per Sanguinetti et al.'s recommendation |
| NEG_PART | Negation particle | PART | |
| NOUN | Noun | NOUN | restore merged sequences where relevant: (DET+)NOUN(+CASE/NSUFF) -> NOUN |
| NSUFF | Noun suffix | merged with NOUN/ADJ morphemes where possible, otherwise X | |
| NUM | Number | NUM | -- |
| PART | Particle | PART, SCONJ | see below |
| PREP | Preposition | ADP | -- |
| PROG_PART | Progressive particle | merged with VERB morphemes where possible, otherwise X | can be marked with Aspect=Prog when joined with a verb; a potentially tricky assignment as progressivity is handled differently in MSA and non-standard Arabic (Brustad pp. 142, 246/7) |
| PRON | Pronoun | PRON | -- |
| PUNC | Punctuation | PUNCT | -- |
| URL | URL | SYM | |
| V | Verb | VERB | -- |
| VSUFF | Verbal suffix | | not in dataset |
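The mapping and merging rules in the table can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not the logic of convert.py: the names `SIMPLE_MAP` and `convert_word` are hypothetical, the input is assumed to be the list of original tags for one word's morphemes, any number of trailing CASE/NSUFF tags is absorbed into a NOUN/ADJ head, and the SCONJ exceptions for CONJ and PART are ignored.

```python
# One-to-one fallbacks taken from the table; tags marked "not in dataset"
# (ABBREV, JUS, VSUFF) are left out.
SIMPLE_MAP = {
    "ADJ": "ADJ", "ADV": "ADV", "CONJ": "CCONJ", "DET": "DET", "EMOT": "SYM",
    "FOREIGN": "X", "FUT_PART": "AUX", "HASH": "X", "MENTION": "PROPN",
    "NEG_PART": "PART", "NOUN": "NOUN", "NUM": "NUM", "PART": "PART",
    "PREP": "ADP", "PRON": "PRON", "PUNC": "PUNCT", "URL": "SYM", "V": "VERB",
    # Affix-like tags that could not be merged fall back to X.
    "CASE": "X", "NSUFF": "X", "PROG_PART": "X",
}


def convert_word(morph_tags):
    """Map the original tags of one word's morphemes to a list of UPOS tags."""
    out, i, n = [], 0, len(morph_tags)
    while i < n:
        j = i
        # Optional determiner directly before a NOUN/ADJ head.
        if morph_tags[j] == "DET" and j + 1 < n and morph_tags[j + 1] in ("NOUN", "ADJ"):
            j += 1
        if morph_tags[j] in ("NOUN", "ADJ"):
            head = morph_tags[j]
            j += 1
            # Absorb trailing case markers and noun suffixes (assumption: any number).
            while j < n and morph_tags[j] in ("CASE", "NSUFF"):
                j += 1
            out.append(head)  # NOUN and ADJ keep their names in UPOS
            i = j
        elif morph_tags[i] == "PROG_PART" and i + 1 < n and morph_tags[i + 1] == "V":
            # Progressive particle joined with the following verb.
            out.append("VERB")
            i += 2
        else:
            out.append(SIMPLE_MAP[morph_tags[i]])
            i += 1
    return out


merged_noun = convert_word(["DET", "NOUN", "NSUFF"])
prep_phrase = convert_word(["PREP", "DET", "NOUN"])
prog_verb = convert_word(["PROG_PART", "V", "PRON"])
```

A tag that is not in `SIMPLE_MAP` raises a `KeyError`, which makes unexpected input visible instead of silently passing it through.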
