CUNY-CL / wikipron

Massively multilingual pronunciation mining

JSON UniMorph morphology: make one-to-many

kylebgorman opened this issue

Once #372 by @reubenraff is submitted, we should expand the JSON file and associated logic so that the mapping from WikiPron language codes to UniMorph URLs is one-to-many. This will let us handle languages such as fin (Finnish), whose UniMorph data is split across two files.

  1. Instead of Dict[str, str], make the UniMorph JSON a Dict[str, List[str]]. Most languages will have only one entry in the list.
  2. In the grab_unimorph_data.py script, loop over the list of URLs for each language, writing all of them into the WikiPron language code + .tsv. That'll give us a single fin.tsv file (for instance).
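The two steps above could be sketched roughly as follows. This is only an illustration, not the actual contents of grab_unimorph_data.py: the function name `write_unimorph_files`, the injected `fetch` callable, and the example.org URLs are all hypothetical placeholders. In the real script, `fetch` would presumably be an HTTP download of each UniMorph URL.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def write_unimorph_files(
    languages: Dict[str, List[str]],
    fetch: Callable[[str], str],
    out_dir: str,
) -> None:
    """Writes one <code>.tsv per language, concatenating all of its URLs.

    `fetch` is a hypothetical hook that returns the text at a URL; it is
    injected so the logic can be exercised without network access.
    """
    for code, urls in languages.items():
        path = os.path.join(out_dir, f"{code}.tsv")
        with open(path, "w", encoding="utf-8") as sink:
            for url in urls:
                text = fetch(url)
                # Terminate each source with a newline so the last row of
                # one file does not fuse with the first row of the next.
                if text and not text.endswith("\n"):
                    text += "\n"
                sink.write(text)


# One-to-many JSON: every value is a list, even for single-file languages.
# The URLs are placeholders, not real UniMorph locations.
config = json.loads(
    """
    {
        "fin": ["https://example.org/fin.1", "https://example.org/fin.2"],
        "spa": ["https://example.org/spa"]
    }
    """
)

# Stand-in for a real download, mapping each placeholder URL to fake rows.
fake_pages = {
    "https://example.org/fin.1": "talo\ttalo\tN;NOM;SG",
    "https://example.org/fin.2": "talo\ttalot\tN;NOM;PL\n",
    "https://example.org/spa": "casa\tcasas\tN;PL\n",
}

with tempfile.TemporaryDirectory() as tmp:
    write_unimorph_files(config, fake_pages.__getitem__, tmp)
    with open(os.path.join(tmp, "fin.tsv"), encoding="utf-8") as source:
        print(source.read(), end="")
```

Keeping every value a list means the loop needs no special case for languages with a single file.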

We could make the dictionary values polymorphic (Union[str, List[str]]), but I think that'd just make things more confusing.