cldf / pyigt

Handling Interlinear Glossed Text in python

Morphemes glossed with numbers should be grammatical, not lexical morphemes

fmatter opened this issue

Since the distinction is made between grammatical and lexical morphemes, a gloss like 1 or 1SG (not tested) or 1>3 should be categorized as grammatical, just like ERG. I noticed it in the results of corpus.get_wordlist(); I am not sure if this categorization happens elsewhere.

Depends on your perspective. For me, as one who wants to pull out a Swadesh list of the Qiang data, I count them as non-grammatical.

What counts as grammatical and what does not depends on a regex, which can be configured:

pyigt/src/pyigt/igt.py

Lines 43 to 44 in ed75e26

def is_grammatical_gloss_label(self, s):
    return bool((s in ABBRS) or self.label_pattern.match(s))

You can do this by adjusting the label_pattern:

label_pattern = attr.ib(default=re.compile('^([A-Z]+|([1-3](DL|PL|SG)))$'))

So 1SG is matched, but 1sg would not be matched, and 1 would not be matched either (for good reasons, as it can even be a number).
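The matching behaviour described here can be checked with the standard `re` module alone. The sketch below copies the default pattern quoted in this thread and leaves out the `ABBRS` lookup that `is_grammatical_gloss_label` also performs, so it only shows what the regex itself accepts. The helper name `matches_label_pattern` is made up for illustration.

```python
import re

# Default pattern as quoted in this thread.
LABEL_PATTERN = re.compile('^([A-Z]+|([1-3](DL|PL|SG)))$')


def matches_label_pattern(gloss):
    """Return True if the regex alone treats the gloss as grammatical.

    Hypothetical helper: the real method also checks membership in ABBRS,
    which is not reproduced here.
    """
    return bool(LABEL_PATTERN.match(gloss))


# Print the regex-only verdict for the glosses discussed in the issue.
for gloss in ['ERG', '1SG', '1sg', '1', '1>3']:
    print(gloss, matches_label_pattern(gloss))
```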

Depends on your perspective. For me, as one who wants to pull out a Swadesh list of the Qiang data, I count them as non-grammatical.

…right, I hadn't considered that perspective :)

What counts as grammatical and what does not depends on a regex, which can be configured

So do we modify label_pattern in the igt.py file? Or where in the workflow would that happen?

Aaah, so I would pass a custom CorpusSpec instance like so?

@attr.s
class MyCorpusSpec(object):
    …
    label_pattern = attr.ib(default=re.compile('^([A-Z]+|([1-3](DL|PL|SG)))$'))
    …

text = Corpus.from_cldf(ds.cldf_reader(), spec=MyCorpusSpec())
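
If person markers such as 1 or 1>3 should count as grammatical for a given corpus, one option is a wider pattern. The sketch below is only an illustration: `EXTENDED_PATTERN` and `is_grammatical` are hypothetical names, not something shipped with pyigt. The commented-out lines assume that `CorpusSpec` accepts `label_pattern` as a keyword argument and that `Corpus.from_cldf` takes a spec instance, as the snippet in the question suggests.

```python
import re

# Hypothetical wider pattern: upper-case labels, a bare person number,
# person plus number, and person>person combinations such as 1>3.
EXTENDED_PATTERN = re.compile(
    r'^([A-Z]+|[1-3](DL|PL|SG)?(>[1-3](DL|PL|SG)?)?)$')


def is_grammatical(gloss, pattern=EXTENDED_PATTERN):
    """Regex-only check; the ABBRS lookup used by pyigt is left out."""
    return bool(pattern.match(gloss))


# Print the verdict of the wider pattern for a few sample glosses.
for gloss in ['1', '1SG', '1>3', '1SG>3PL', 'ERG', 'dog', '12']:
    print(gloss, is_grammatical(gloss))

# Assumed usage with pyigt (not verified here):
# spec = CorpusSpec(label_pattern=EXTENDED_PATTERN)
# corpus = Corpus.from_cldf(ds.cldf_reader(), spec=spec)
```

A bare 1 can also gloss a numeral, which is the reason given in this thread for excluding it by default, so a pattern like this only makes sense for data where digits never gloss lexical items.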