Write concordances as components of the CLDF dataset

Question

xrotwang opened this issue 5 years ago

It would be nice if the output created by pyigt could be fed back directly into the CLDF dataset. It's unclear, though, which intermediate outputs would make sense, rather than going for a full CLDF Dictionary.

Johann-Mattis List · Answer 1 · Sat Jan 18 2020 17:04:14 GMT+0800 (China Standard Time)

Yes, that is something we need to address in the future. Ideally, the dictionary would give IDs for word families, so they could be referenced consistently in IGT, right? But that won't happen unless somebody does it, so we may have to wait for the time being.

Robert Forkel · Answer 2 · Wed Jul 13 2022 20:11:27 GMT+0800 (China Standard Time)

In principle, pyigt can create a list of morphemes, each linked to (multiple) lexical and (multiple) grammatical concepts as well as to the examples the morphemes occur in. It's unclear, though, how this would fit into what would presumably be a CLDF Wordlist. Do we want to put morphemes in a FormTable, or words? Lexical and grammatical concepts could probably go into a ParameterTable, but how would we link to this from FormTable? Have one row per morpheme occurrence in an example, linked to multiple parameters?
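To make the structure in question concrete, here is a minimal sketch in plain Python of such a morpheme list. It does not use pyigt's API; the toy examples, the field names, and the rule that upper-case glosses count as grammatical are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: morpheme-aligned words and glosses,
# with "-" as the morpheme separator.
EXAMPLES = [
    {"ID": "ex1", "Analyzed_Word": ["dog-s", "run"], "Gloss": ["dog-PL", "run"]},
    {"ID": "ex2", "Analyzed_Word": ["cat-s"], "Gloss": ["cat-PL"]},
]


def is_grammatical(gloss):
    # Assumption: grammatical category labels are written in upper case.
    return gloss.isupper()


def concordance(examples):
    """Map each morpheme to its lexical concepts, grammatical concepts and example IDs."""
    entries = defaultdict(
        lambda: {"lexical": set(), "grammatical": set(), "examples": set()}
    )
    for ex in examples:
        for word, gloss in zip(ex["Analyzed_Word"], ex["Gloss"]):
            # Pair each morpheme with the gloss element at the same position.
            for morpheme, concept in zip(word.split("-"), gloss.split("-")):
                kind = "grammatical" if is_grammatical(concept) else "lexical"
                entries[morpheme][kind].add(concept)
                entries[morpheme]["examples"].add(ex["ID"])
    return dict(entries)


conc = concordance(EXAMPLES)
```

Each entry is one morpheme type with sets of concepts and example IDs, which is exactly the many-to-many shape that does not map onto a single FormTable row in an obvious way.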

I think the answer to these questions depends on what the resulting dataset is going to be used for: in particular, whether it's supposed to be used as is or whether it's going to be the input for another data curation step, maybe on the way to a CLDF Dictionary.

So, rather than implementing some sort of generic Corpus.enrich method that adds FormTable and ParameterTable to the CLDF data a corpus was derived from, I'd see pyigt as a tool to be used in cldfbench makecldf code, where extra info about the example corpus at hand can inform the usage.
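As an illustration of the "one row per morpheme occurrence, linked to multiple parameters" option, the following sketch shows the kind of dataset-specific helper that makecldf code could call. It is plain Python and does not use the pyigt or cldfbench APIs. The function name, the Example_ID and Type columns, and the ID scheme are hypothetical, and the rule for telling grammatical from lexical glosses is passed in by the caller, because that is the part that depends on the corpus at hand.

```python
# Hypothetical toy example: "-" separates morphemes, "." separates
# several gloss elements attached to one morpheme.
EXAMPLES = [
    {"ID": "ex1", "Analyzed_Word": ["she", "walk-s"], "Gloss": ["3SG.F", "walk-3SG.PRS"]},
]


def make_rows(examples, is_grammatical):
    """Return (form_rows, parameter_rows) with one form row per morpheme occurrence."""
    forms, parameters = [], {}
    for ex in examples:
        pairs = zip(ex["Analyzed_Word"], ex["Gloss"])
        for w, (word, gloss) in enumerate(pairs, start=1):
            aligned = zip(word.split("-"), gloss.split("-"))
            for m, (morpheme, concept_str) in enumerate(aligned, start=1):
                concepts = concept_str.split(".")
                for concept in concepts:
                    # Register each concept once; the "Type" column is an assumption.
                    parameters.setdefault(concept, {
                        "ID": concept,
                        "Name": concept,
                        "Type": "grammatical" if is_grammatical(concept) else "lexical",
                    })
                forms.append({
                    "ID": f"{ex['ID']}-{w}-{m}",   # example, word and morpheme position
                    "Form": morpheme,
                    "Parameter_ID": concepts,      # list-valued link to parameters
                    "Example_ID": ex["ID"],        # hypothetical back-reference column
                })
    return forms, list(parameters.values())


# Corpus-specific knowledge enters here: upper-case glosses are taken as grammatical.
forms, parameters = make_rows(EXAMPLES, is_grammatical=str.isupper)
```

Whether rows like these belong in a FormTable as is, or should first be aggregated per morpheme type, is the design decision that depends on the intended use of the dataset.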