cldf / pyigt

Handling Interlinear Glossed Text in Python

Write concordances as components of the CLDF dataset

xrotwang opened this issue

It would be nice if the output created by pyigt could be fed back directly into the CLDF dataset. It's unclear, though, which intermediate outputs would make sense, rather than going for a full CLDF Dictionary.

Yes, that is something we need to address in the future. Ideally, the dictionary would give IDs for word families, so they could be referenced consistently in IGT, right? But that won't happen unless somebody does it, so we may have to wait for the time being.

In principle, pyigt can create a list of morphemes, each linked to (multiple) lexical and (multiple) grammatical concepts as well as to the examples the morphemes occur in. It's unclear, though, how this would fit into what would presumably be a CLDF Wordlist. Do we want to put morphemes in a FormTable, or words? Lexical and grammatical concepts could probably go into a ParameterTable, but how would we link to this from FormTable? Have one row per morpheme occurrence in an example, linked to multiple parameters?
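To make the shape of such a morpheme list concrete, here is a minimal sketch in plain Python. It does not use pyigt's API; the example data is invented, and the rule "all-uppercase gloss means grammatical concept" is only an assumption borrowed from common glossing practice.

```python
from collections import defaultdict

# Invented toy examples: (example ID, segmented phrase, aligned glosses).
EXAMPLES = [
    ("ex1", "ba-na ko-ti", "dog-PL run-PST"),
    ("ex2", "ba ko-ti", "dog run-PST"),
    ("ex3", "mi-na ko", "cat-PL run"),
]


def build_concordance(examples):
    """Map each morpheme to its lexical/grammatical glosses and example IDs."""
    entries = defaultdict(
        lambda: {"lexical": set(), "grammatical": set(), "examples": set()}
    )
    for ex_id, phrase, glosses in examples:
        for word, word_gloss in zip(phrase.split(), glosses.split()):
            morphemes = word.split("-")
            morpheme_glosses = word_gloss.split("-")
            if len(morphemes) != len(morpheme_glosses):
                continue  # skip words whose segmentation and gloss do not align
            for morpheme, gloss in zip(morphemes, morpheme_glosses):
                # Assumption: uppercase glosses denote grammatical concepts.
                kind = "grammatical" if gloss.isupper() else "lexical"
                entries[morpheme][kind].add(gloss)
                entries[morpheme]["examples"].add(ex_id)
    return dict(entries)


concordance = build_concordance(EXAMPLES)
```

Each entry links one morpheme to several concepts and several examples at once, which is exactly the many-to-many shape that does not map obviously onto a single FormTable row.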

I think the answer to these questions depends on what the resulting dataset is going to be used for: in particular, whether it's supposed to be used as is or whether it's going to be the input for another data curation step, maybe on the way to a CLDF Dictionary.

So, rather than implement some sort of generic Corpus.enrich method, which adds FormTable and ParameterTable to the CLDF data a corpus was derived from, I'd see pyigt as a tool to be used in cldfbench makecldf code, where extra info about the example corpus at hand can inform the usage.
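As a rough illustration of what such dataset-specific code might do, the sketch below produces FormTable-like and ParameterTable-like rows, with one row per morpheme occurrence. It is plain Python and calls neither pyigt nor cldfbench; `make_rows`, the `is_grammatical` callback, and the `Example_ID` and `Type` columns are all hypothetical, and the list-valued `Parameter_ID` is just one possible answer to the linking question above.

```python
def make_rows(examples, is_grammatical, language_id):
    """Return (forms, parameters) as lists of dicts shaped like CLDF table rows."""
    parameters = {}  # concept -> ParameterTable-like row
    forms = []       # one FormTable-like row per morpheme occurrence
    for ex_id, phrase, glosses in examples:
        words = zip(phrase.split(), glosses.split())
        for w, (word, word_gloss) in enumerate(words, start=1):
            pairs = zip(word.split("-"), word_gloss.split("-"))
            for m, (morpheme, gloss) in enumerate(pairs, start=1):
                # Assumption: "." bundles several concepts in one gloss.
                concepts = gloss.split(".")
                for concept in concepts:
                    kind = "grammatical" if is_grammatical(concept) else "lexical"
                    parameters.setdefault(
                        concept, {"ID": concept, "Name": concept, "Type": kind}
                    )
                forms.append({
                    "ID": f"{ex_id}-{w}-{m}",
                    "Language_ID": language_id,
                    "Form": morpheme,
                    "Parameter_ID": concepts,  # links to multiple parameters
                    "Example_ID": ex_id,       # hypothetical extra column
                })
    return forms, list(parameters.values())


# Invented toy data; the callback is where corpus-specific knowledge goes.
forms, parameters = make_rows(
    [("ex1", "ba-na ko-ti", "dog-PL run-PST.PFV")],
    is_grammatical=str.isupper,
    language_id="toy",
)
```

Splitting on "." would wrongly break up a multi-word lexical gloss such as "come.out", which is the kind of decision that only knowledge about the corpus at hand can settle.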