CUNY-CL / wikipron

Massively multilingual pronunciation mining

[pam] Can't parse both types of transcriptions from the same line?

agutkin opened this issue

For Kapampangan (pam), all pronunciation entries are formatted as follows:

Hyphenation: ba‧tia‧uan
IPA(key): /bəˈtjawən/, [bəˈtjäː.wən]

I suspect we can't parse this when both transcriptions are under the same heading. May be a duplicate.

After configuring and adding the language, the respective scrape comes out as empty.

I don't see the problem from your description. This is exactly what, say, Spanish looks like: <span class="IPA"> with [.*] or /.*/.

Here we have <span class="IPA"> with [.*] and /.*/.

Then I don't know why it's not working.

This issue actually has nothing to do with there being two IPAs on the same line. It's the result of the XPATH template being fed the incorrect parameter for language.

The HTML of the page for "batiauan" contains the following:

<li>
    <a href="/wiki/Wiktionary:International_Phonetic_Alphabet" title="Wiktionary:International Phonetic Alphabet">IPA</a>
    <sup>
         (<a href="https://en.wikipedia.org/wiki/Kapampangan_phonology"
             class="extiw"
             title="wikipedia:Kapampangan phonology">key</a>)
    </sup>
    : 
    <span class="IPA">/bəˈtjawən/</span>
    , 
    <span class="IPA">[bəˈtjäː.wən]</span>
</li>

The XPATH trying to match it is:

(//li|//p)[
  (.|span)[sup[a[
    @title = "Appendix:Pampanga pronunciation"
    or
    @title = "wikipedia:Pampanga phonology"
  ]]]
  and
  span[@class = "IPA"]
  
]

The problem is that the XPATH is trying to find an element with "wikipedia:Pampanga phonology" as its title when the element it's looking for has "wikipedia:Kapampangan phonology" as its title.
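To make the mismatch concrete, here is a minimal sketch that approximates the XPATH above with the standard library's xml.etree.ElementTree. This is not how wikipron itself evaluates the query, and the helper name ipa_spans is hypothetical. It runs the check against the HTML from the "batiauan" page with each language name in turn.

```python
import xml.etree.ElementTree as ET

# The <li> from the "batiauan" page, as quoted above.
HTML = """<li>
    <a href="/wiki/Wiktionary:International_Phonetic_Alphabet" title="Wiktionary:International Phonetic Alphabet">IPA</a>
    <sup>(<a href="https://en.wikipedia.org/wiki/Kapampangan_phonology" class="extiw" title="wikipedia:Kapampangan phonology">key</a>)</sup>:
    <span class="IPA">/bəˈtjawən/</span>,
    <span class="IPA">[bəˈtjäː.wən]</span>
</li>"""


def ipa_spans(li, language):
    """Approximate the XPATH template for one <li> (hypothetical helper).

    Returns the text of every IPA span, but only if the element (or a child
    span) carries a "key" link whose title names the given language.
    """
    titles = (
        f"Appendix:{language} pronunciation",
        f"wikipedia:{language} phonology",
    )
    has_key_link = any(
        li.find(f"{prefix}sup/a[@title='{title}']") is not None
        for prefix in ("", "span/")  # mirrors (.|span) in the XPATH
        for title in titles
    )
    if not has_key_link:
        return []
    return [span.text for span in li.findall("span[@class='IPA']")]


li = ET.fromstring(HTML)
print(ipa_spans(li, "Pampanga"))     # []
print(ipa_spans(li, "Kapampangan"))  # ['/bəˈtjawən/', '[bəˈtjäː.wən]']
```

With the right language name, both the phonemic and the phonetic transcription come back from the same line, which is consistent with the two-IPAs-per-line layout not being the cause.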

The reason for this is that Wikipron incorrectly thinks that the correct name of the language pam is "Pampanga". When you run the command wikipron pam, the function _get_language() tries to find the name of the language associated with the code pam. Because pam isn't listed in languagecodes.py, the function uses the iso639 library to try to find the name associated with pam and comes up with "Pampanga" instead of "Kapampangan". This name then gets passed into the XPATH template in config.py, which in turn is used to parse the HTML.

I think the best solution would be to fix _get_language() so that it gets the correct language name (the one Wiktionary actually uses). We already have a mapping of every language code to its name on Wikipron in languages.json -- why not just use that instead of querying languagecodes.py and the iso639 library?

Good question. languages.json has metadata specific to the big scrape; it's not used by the command-line tool, nor is it shipped in the wikipron package.

It seems like we just need to update languagecodes.py to include the mapping "pam": "Kapampangan", right?
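As a rough illustration of why that one mapping would be enough, here is a sketch of the lookup order described above: overrides first, then the ISO 639 name. Everything in it is a stand-in. ISO639_NAMES is a hypothetical dict replacing the iso639 library, the overrides argument replaces the contents of languagecodes.py, and resolve_language_name is not wikipron's actual _get_language().

```python
# Hypothetical stand-in for the names the iso639 library would report.
ISO639_NAMES = {"pam": "Pampanga", "spa": "Spanish"}


def resolve_language_name(code, overrides):
    """Prefer an explicit override; otherwise fall back to the ISO 639 name."""
    if code in overrides:
        return overrides[code]
    return ISO639_NAMES[code]


def phonology_title(code, overrides):
    """Build the title attribute that the XPATH template would look for."""
    return f"wikipedia:{resolve_language_name(code, overrides)} phonology"


# Without an entry for pam, the fallback name leaks into the XPATH.
before = phonology_title("pam", {})
# With the proposed mapping, the title matches what Wiktionary uses.
after = phonology_title("pam", {"pam": "Kapampangan"})

print(before)  # wikipedia:Pampanga phonology
print(after)   # wikipedia:Kapampangan phonology
```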

Okay, that seems reasonable. I also found 76 examples other than pam that have the same discrepancy between iso639 name and Wiktionary name (I sourced this list using the languages.json file). They have the same problem with their XPATH having the wrong title attribute. I'll update those as well if that makes sense to you.
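For reference, a list like this could be derived along the following lines. This is only a sketch: the field names iso639_name and wiktionary_name are an assumption about how languages.json is laid out, and the inline SAMPLE data stands in for the real file.

```python
import json

# Inline sample standing in for languages.json; the key names are assumed.
SAMPLE = json.loads("""
{
    "pam": {"iso639_name": "Pampanga", "wiktionary_name": "Kapampangan"},
    "spa": {"iso639_name": "Spanish", "wiktionary_name": "Spanish"}
}
""")


def mismatched_names(languages):
    """Map each code whose two names differ to the name Wiktionary uses."""
    return {
        code: entry["wiktionary_name"]
        for code, entry in sorted(languages.items())
        if entry["iso639_name"] != entry["wiktionary_name"]
    }


print(mismatched_names(SAMPLE))  # {'pam': 'Kapampangan'}
```

The resulting dict has the same shape as the "pam": "Kapampangan" mapping discussed above, so entries that are not already present could be carried over directly.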

Nice, yes that sounds good. At least some of the examples there are already in the file, FYI.

Made a PR