CUNY-CL / wikipron

Massively multilingual pronunciation mining

Mismatch warnings from codes.py

kylebgorman opened this issue

The instructions for the big scrape tell the user to first run codes.py. This script does language code lookup and logs (as warnings) any inconsistencies between a language's ISO 639 name and the name Wiktionary uses for it. Some of these are clearly harmless; others are just wrong:

codes.py WARNING: WikiPron resolves the key 'ain' to 'Ainu (Japan)' listed as 'Ainu' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'rup' to 'Macedo-Romanian' listed as 'Aromanian' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'bjb' to 'Banggarla' listed as 'Barngarla' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'tzm' to 'Central Atlas Tamazight' listed as 'Central Franconian' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'nya' to 'Nyanja' listed as 'Chichewa' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'kat' to 'Georgian' listed as 'German Low German' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'grn' to 'Guarani' listed as 'Guaraní' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'hat' to 'Haitian' listed as 'Haitian Creole' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'jje' to 'Jeju' listed as 'Jersey Dutch' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'pam' to 'Pampanga' listed as 'Kapampangan' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'kok' to 'Konkani (macrolanguage)' listed as 'Konkani' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'lou' to 'Louisiana Creole' listed as 'Louisiana Creole French' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'msa' to 'Malay (macrolanguage)' listed as 'Malay' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'nep' to 'Nepali (macrolanguage)' listed as 'Nepali' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'nup' to 'Nupe-Nupe-Tako' listed as 'Nupe' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'orv' to 'Old Russian' listed as 'Old East Slavic' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'non' to 'Old Norse' listed as 'Old Portuguese' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'ori' to 'Oriya (macrolanguage)' listed as 'Oriya' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'por' to 'Portuguese' listed as 'Proto-Austronesian' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'por' to 'Portuguese' listed as 'Proto-Brythonic' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'por' to 'Portuguese' listed as 'Proto-Germanic' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'por' to 'Portuguese' listed as 'Proto-Malayic' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'por' to 'Portuguese' listed as 'Proto-Malayo-Polynesian' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'por' to 'Portuguese' listed as 'Proto-Ryukyuan' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'rap' to 'Rapanui' listed as 'Rapa Nui' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'rom' to 'Romany' listed as 'Romani' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'sdc' to 'Sassarese Sardinian' listed as 'Sassarese' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'stq' to 'Saterland Frisian' listed as 'Scanian' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'sin' to 'Sinhala' listed as 'Sinhalese' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'swa' to 'Swahili (macrolanguage)' listed as 'Swahili' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'tkl' to 'Tokelau' listed as 'Tokelauan' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'lcp' to 'Western Lawa' listed as 'Westrobothnian' on Wiktionary
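
One quick way to spot the "just wrong" cases in a log like the one above is to look for keys that were paired with more than one Wiktionary name, as 'por' is. The following is a minimal sketch, not part of codes.py: the regular expression is inferred from the warning format shown here, and the function names are made up for illustration.

```python
import re
from collections import defaultdict

# Pattern inferred from the warning lines in the log; an assumption, not
# taken from codes.py itself. Both quote styles are accepted because
# Python prints names that contain an apostrophe in double quotes.
MISMATCH = re.compile(
    r"WikiPron resolves the key '(?P<key>[^']+)' to "
    r"(?P<q1>['\"])(?P<iso>.+?)(?P=q1) listed as "
    r"(?P<q2>['\"])(?P<wikt>.+?)(?P=q2) on Wiktionary"
)


def wiktionary_names_by_key(lines):
    """Maps each key to the set of Wiktionary names it was paired with."""
    names = defaultdict(set)
    for line in lines:
        match = MISMATCH.search(line)
        if match:
            names[match["key"]].add(match["wikt"])
    return names


def ambiguous_keys(lines):
    """Returns, in sorted order, keys paired with several Wiktionary names."""
    names = wiktionary_names_by_key(lines)
    return sorted(key for key, wikt in names.items() if len(wikt) > 1)


# Three lines copied from the log above.
log = [
    "codes.py WARNING: WikiPron resolves the key 'ain' to 'Ainu (Japan)' listed as 'Ainu' on Wiktionary",
    "codes.py WARNING: WikiPron resolves the key 'por' to 'Portuguese' listed as 'Proto-Austronesian' on Wiktionary",
    "codes.py WARNING: WikiPron resolves the key 'por' to 'Portuguese' listed as 'Proto-Brythonic' on Wiktionary",
]
print(ambiguous_keys(log))
```

A key that shows up here cannot be right for every pairing, so it is a good place to start looking. Keys that are wrong only once, such as 'non', will not be caught this way.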

Neither the README nor the warning tells you what to do about these mismatches. I don't want Old Norse and Old Portuguese to be confused: what am I supposed to do about this entry that codes.py generates in languages.json?

    "non": {
        "iso639_name": "Old Norse",
        "wiktionary_name": "Old Portuguese",
        "wiktionary_code": "roa-opt",
        "script": {
            "latn": "Latin"
        }
    },
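
Until there is guidance, one way to triage entries like this is to compare the two names after light normalization and flag pairs where neither name contains the other. This is only a rough heuristic sketched under my own assumptions (the helper names are hypothetical and nothing like this exists in codes.py as far as I know). It also flags legitimate alternative names such as Nyanja/Chichewa, so the output is a list to review by hand, not a list of errors.

```python
import re
import unicodedata


def normalize(name):
    """Drops parenthetical qualifiers and diacritics, then lowercases."""
    name = re.sub(r"\s*\([^)]*\)", "", name)
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold().strip()


def needs_review(entry):
    """True if neither normalized name contains the other."""
    iso = normalize(entry["iso639_name"])
    wikt = normalize(entry["wiktionary_name"])
    return iso not in wikt and wikt not in iso


# Name pairs taken from the warnings above; the other fields of each
# languages.json entry are omitted because they are not used here.
languages = {
    "non": {"iso639_name": "Old Norse", "wiktionary_name": "Old Portuguese"},
    "msa": {"iso639_name": "Malay (macrolanguage)", "wiktionary_name": "Malay"},
    "grn": {"iso639_name": "Guarani", "wiktionary_name": "Guaraní"},
    "nya": {"iso639_name": "Nyanja", "wiktionary_name": "Chichewa"},
}
flagged = [key for key, entry in languages.items() if needs_review(entry)]
print(flagged)
```

With these four pairs the sketch flags 'non' and 'nya': the first is a real mix-up, the second a harmless alternative name.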

Hi @jacksonllee do you have any suggestions here? I ran into this the other day and am not sure what the answer is.

I wasn't the main author of codes.py, but if I'm reading it right, a likely source of the problem is either the way we scrape the Wiktionary code or the Wiktionary code itself. FWIW, I vaguely recall that Wiktionary has a funny way of assigning codes, particularly to "Old X" languages and proto-languages (which would seem to explain a good portion of what you've observed here). More digging needed...

Separate from dealing with the Wiktionary code scraping, I think what's missing here is instructions to the user. What is the user supposed to do about that entry? Is it a problem? It looks to me like it is but I'm not sure.

I re-ran codes.py and got the following output:

codes.py WARNING: WikiPron resolves the key 'ain' to 'Ainu (Japan)' listed as 'Ainu' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'rup' to 'Macedo-Romanian' listed as 'Aromanian' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'bjb' to 'Banggarla' listed as 'Barngarla' on Wiktionary
codes.py WARNING: Could not find language with code gmw-cfr
codes.py WARNING: WikiPron resolves the key 'nya' to 'Nyanja' listed as 'Chichewa' on Wiktionary
codes.py WARNING: Could not find language with code nds-de
codes.py WARNING: WikiPron resolves the key 'grn' to 'Guarani' listed as 'Guaraní' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'hat' to 'Haitian' listed as 'Haitian Creole' on Wiktionary
codes.py WARNING: Could not find language with code gmw-jdt
codes.py WARNING: WikiPron resolves the key 'ktz' to 'Juǀʼhoan' listed as "Juǀ'hoan" on Wiktionary
codes.py WARNING: WikiPron resolves the key 'pam' to 'Pampanga' listed as 'Kapampangan' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'kok' to 'Konkani (macrolanguage)' listed as 'Konkani' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'kwk' to 'Kwakiutl' listed as "Kwak'wala" on Wiktionary
codes.py WARNING: WikiPron resolves the key 'msa' to 'Malay (macrolanguage)' listed as 'Malay' on Wiktionary
codes.py WARNING: Could not find language with code grk-mar
codes.py WARNING: WikiPron resolves the key 'nep' to 'Nepali (macrolanguage)' listed as 'Nepali' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'nup' to 'Nupe-Nupe-Tako' listed as 'Nupe' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'orv' to 'Old Russian' listed as 'Old East Slavic' on Wiktionary
codes.py WARNING: Could not find language with code roa-opt
codes.py WARNING: WikiPron resolves the key 'kaw' to 'Kawi' listed as 'Old Javanese' on Wiktionary
codes.py WARNING: Could not find language with code zlw-opl
codes.py WARNING: WikiPron resolves the key 'ori' to 'Oriya (macrolanguage)' listed as 'Oriya' on Wiktionary
codes.py WARNING: Could not find language with code map-pro
codes.py WARNING: Could not find language with code cel-bry-pro
codes.py WARNING: Could not find language with code gem-pro
codes.py WARNING: Could not find language with code jpx-pro
codes.py WARNING: Could not find language with code poz-mly-pro
codes.py WARNING: Could not find language with code poz-pro
codes.py WARNING: Could not find language with code jpx-ryu-pro
codes.py WARNING: WikiPron resolves the key 'rap' to 'Rapanui' listed as 'Rapa Nui' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'rom' to 'Romany' listed as 'Romani' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'sdc' to 'Sassarese Sardinian' listed as 'Sassarese' on Wiktionary
codes.py WARNING: Could not find language with code gmq-scy
codes.py WARNING: WikiPron resolves the key 'sin' to 'Sinhala' listed as 'Sinhalese' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'swa' to 'Swahili (macrolanguage)' listed as 'Swahili' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'tkl' to 'Tokelau' listed as 'Tokelauan' on Wiktionary
codes.py WARNING: WikiPron resolves the key 'srs' to 'Sarsi' listed as "Tsuut'ina" on Wiktionary
codes.py WARNING: WikiPron resolves the key 'yua' to 'Yucateco' listed as 'Yucatec Maya' on Wiktionary

AFAIK these are all correct resolutions. It seems like something changed on Wiktionary's end: instead of 'non' referring to Old Portuguese (as it used to), it now correctly refers to Old Norse, while 'roa-opt' refers to Old Portuguese.

Even more confusingly, I can't find an older revision of the list of language codes on Wiktionary where 'non' was listed as the code for Old Portuguese, so I really have no idea why the API resolved the code that way.

It seems like this isn't an issue any more, at least for now. I suppose Wiktionary's version of the ISO-639 language codes could in theory get messed up again in the future. We could still give the user some way to override WikiPron's resolution of a language when they don't think it is correct, but I'm not sure this is necessary.
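
If we did want such an override, a minimal sketch could look like the following. To be clear, nothing like this exists in WikiPron today: `apply_overrides` and the `overrides` mapping are hypothetical names, and the "resolved" entry is the one quoted earlier in this thread.

```python
import copy


def apply_overrides(resolved, overrides):
    """Returns a copy of `resolved` with per-key field overrides applied."""
    merged = copy.deepcopy(resolved)
    for key, fields in overrides.items():
        # Refuse to silently create entries that were never resolved.
        if key not in merged:
            raise KeyError(f"No resolved entry for key {key!r}")
        merged[key].update(fields)
    return merged


# The entry as it was originally generated in languages.json.
resolved = {
    "non": {
        "iso639_name": "Old Norse",
        "wiktionary_name": "Old Portuguese",
        "wiktionary_code": "roa-opt",
        "script": {"latn": "Latin"},
    }
}
# Hypothetical user-supplied corrections, e.g. loaded from a separate file.
overrides = {"non": {"wiktionary_name": "Old Norse", "wiktionary_code": "non"}}

fixed = apply_overrides(resolved, overrides)
print(fixed["non"]["wiktionary_name"], fixed["non"]["wiktionary_code"])
```

Keeping the corrections in a separate mapping rather than hand-editing languages.json would let them survive the next run of codes.py.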

Must have been a transient issue. Let's close it for now. I'll remember it in case we have to return to it.