CAMeL-Lab / camel_morph

Camel Morph’s goal is to build large open-source morphological models for Arabic and its dialects across many genres and domains.

Empty features `ud` and `catib`

mirkovogel opened this issue

The following observation concerns the LREC-Coling 2024 release (camel_morph/official_releases/lrec-coling2024_release/databases/camel-morph-msa):

The features catib6 and ud are always empty, e.g. in the following analysis of "فبسبب":

{
  'bw': 'فَ/CONJ+بِ/PREP+سَبَب/NOUN+ِ/CASE_DEF_GEN',
  'ud': '',
  'catib6': ''
}

The expected values are:

{
  'ud': 'CCONJ+ADP+NOUN',
  'catib6': 'PRT+PRT+NOM'
}

Comment from @christios by mail:

As you've rightly pointed out, ud and catib are missing as we did not include those in the release (it was not our focus). But you are right they should be included in the next release. It should not be very difficult, probably just a mapping between the CAPHI POS (or Catib) and UD.

I am currently transitioning my pipeline from the r13 morphological db to Camel Morph MSA and need both the catib6 and ud tags downstream, so I'd volunteer to help with this if I can.

Is there perhaps existing code for converting the database's "native" POS tags (https://camel-tools.readthedocs.io/en/latest/reference/camel_morphology_features.html?) to other tag sets that I could use in the meantime?
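
To make the idea concrete, here is a minimal sketch of the kind of mapping I have in mind as a stopgap. It derives both features from the tags in the bw string rather than from the pos feature. The two lookup tables and the helper names (BW_TO_UD, BW_TO_CATIB6, bw_tags, map_bw) are my own assumptions, not part of camel_morph or camel_tools, and the tables cover only the three tags in the example above, so a real mapping would need the full tag inventory.

```python
# Hypothetical lookup tables, inferred only from the single example in this issue.
BW_TO_UD = {"CONJ": "CCONJ", "PREP": "ADP", "NOUN": "NOUN"}
BW_TO_CATIB6 = {"CONJ": "PRT", "PREP": "PRT", "NOUN": "NOM"}


def bw_tags(bw):
    """Return the tag of each 'form/TAG' segment of a bw string."""
    return [seg.rsplit("/", 1)[1] for seg in bw.split("+") if "/" in seg]


def map_bw(bw, table):
    """Map each bw tag through `table` and join the results with '+'.

    Tags missing from the table (here: CASE_DEF_GEN) are skipped. That matches
    the expected values above, but it is an assumption, not documented behaviour.
    """
    return "+".join(table[tag] for tag in bw_tags(bw) if tag in table)


analysis = {
    "bw": "فَ/CONJ+بِ/PREP+سَبَب/NOUN+ِ/CASE_DEF_GEN",
    "ud": "",
    "catib6": "",
}

# Fill the empty features from the bw analysis.
analysis["ud"] = map_bw(analysis["bw"], BW_TO_UD)
analysis["catib6"] = map_bw(analysis["bw"], BW_TO_CATIB6)

print(analysis["ud"])
print(analysis["catib6"])
```

Silently skipping unmapped tags keeps the sketch short, but it would also hide genuinely missing entries; a stricter version could raise an error for any tag outside an explicit ignore list.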