[pam] Can't parse both types of transcriptions from the same line?

agutkin opened this issue 2 years ago

For Kapampangan (pam), all pronunciation entries have the following format:

Hyphenation: ba‧tia‧uan
IPA(key): /bəˈtjawən/, [bəˈtjäː.wən]

I suspect we can't parse this when both transcriptions are under the same heading. May be a duplicate.

Gabriel Thompson commented 10 months ago

Made a PR

Alexander Gutkin · Answer 1 · Mon Nov 28 2022 00:27:22 GMT+0800 (China Standard Time)

After configuring and adding the language, the respective scrape comes out as empty.

Kyle Gorman · Answer 2 · Mon Nov 28 2022 00:28:22 GMT+0800 (China Standard Time)

I don't see the problem from your description. This is exactly what, say, Spanish looks like: <span class="IPA"> with [.*] or /.*/.

Alexander Gutkin · Answer 3 · Mon Nov 28 2022 04:00:33 GMT+0800 (China Standard Time)

Here we have <span class="IPA"> with [.*] and /.*/.

Kyle Gorman · Answer 4 · Mon Nov 28 2022 04:15:32 GMT+0800 (China Standard Time)

Then I don't know why it's not working.

Gabriel Thompson · Answer 5 · Sun Aug 20 2023 14:02:37 GMT+0800 (China Standard Time)

This issue actually has nothing to do with there being two IPA transcriptions on the same line. It's the result of the XPath template being fed the wrong language name as its parameter.

The HTML of the page for "batiauan" contains the following:

<li>
    <a href="/wiki/Wiktionary:International_Phonetic_Alphabet" title="Wiktionary:International Phonetic Alphabet">IPA</a>
    <sup>
         (<a href="https://en.wikipedia.org/wiki/Kapampangan_phonology"
             class="extiw"
             title="wikipedia:Kapampangan phonology">key</a>)
    </sup>
    : 
    <span class="IPA">/bəˈtjawən/</span>
    , 
    <span class="IPA">[bəˈtjäː.wən]</span>
</li>

The XPath trying to match it is:

(//li|//p)[
  (.|span)[sup[a[
    @title = "Appendix:Pampanga pronunciation"
    or
    @title = "wikipedia:Pampanga phonology"
  ]]]
  and
  span[@class = "IPA"]
  
]

The problem is that the XPath looks for an element whose title is "wikipedia:Pampanga phonology", whereas the element on the page has the title "wikipedia:Kapampangan phonology".
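The mismatch can be reproduced outside WikiPron. The following is a minimal sketch, not WikiPron's actual code: it uses the standard library's ElementTree instead of a full XPath engine, checks only the direct `sup/a` branch of the expression, and the helper name `ipa_spans` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The <li> fragment from the "batiauan" page, as quoted above.
PAGE_FRAGMENT = """<li>
    <a href="/wiki/Wiktionary:International_Phonetic_Alphabet" title="Wiktionary:International Phonetic Alphabet">IPA</a>
    <sup>(<a href="https://en.wikipedia.org/wiki/Kapampangan_phonology" class="extiw" title="wikipedia:Kapampangan phonology">key</a>)</sup>:
    <span class="IPA">/bəˈtjawən/</span>,
    <span class="IPA">[bəˈtjäː.wən]</span>
</li>"""

ROOT = ET.fromstring(PAGE_FRAGMENT)


def ipa_spans(root, language):
    """Return IPA span texts from <li> items whose key link names `language`.

    Hypothetical helper: it mimics the two title checks in the XPath above.
    """
    wanted_titles = {
        f"Appendix:{language} pronunciation",
        f"wikipedia:{language} phonology",
    }
    found = []
    for li in root.iter("li"):  # iter() also yields the root element itself
        links = li.findall("./sup/a")
        if any(link.get("title") in wanted_titles for link in links):
            found.extend(span.text for span in li.findall("./span[@class='IPA']"))
    return found


print(ipa_spans(ROOT, "Pampanga"))     # [] : the title does not match
print(ipa_spans(ROOT, "Kapampangan"))  # ['/bəˈtjawən/', '[bəˈtjäː.wən]']
```

With the correct language name, both transcriptions on the line are returned, which is consistent with the two-transcription layout not being the cause.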

The reason for this is that WikiPron thinks the name of the language pam is "Pampanga". When you run the command wikipron pam, the function _get_language() tries to find the name of the language associated with the code pam. Because pam isn't listed in languagecodes.py, the function falls back to the iso639 library, which returns Pampanga instead of Kapampangan. This name then gets passed into the XPath template in config.py, which is used to parse the HTML.
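The lookup order described here can be sketched as follows. This is an illustration under assumptions, not the body of _get_language(): the two dictionaries are stand-ins for languagecodes.py and for whatever the iso639 library returns, and `resolve_language_name` is a hypothetical name.

```python
# Stand-in for the ISO 639 lookup; only the entry discussed in this thread.
ISO_NAMES = {"pam": "Pampanga"}


def resolve_language_name(code, overrides, iso_names=ISO_NAMES):
    """Prefer a project-specific override, else fall back to the ISO name."""
    if code in overrides:
        return overrides[code]
    return iso_names[code]


# No override for "pam": the ISO name is used and the XPath title is wrong.
print(resolve_language_name("pam", overrides={}))  # Pampanga

# With an override, the name that Wiktionary uses wins.
print(resolve_language_name("pam", overrides={"pam": "Kapampangan"}))  # Kapampangan
```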

I think the best solution would be to fix _get_language() so that it returns the correct language name (the one Wiktionary actually uses). We already have a mapping of every language code to its Wiktionary name in languages.json -- why not just use that instead of querying languagecodes.py and the iso639 library?

Kyle Gorman · Answer 6 · Mon Aug 21 2023 23:26:22 GMT+0800 (China Standard Time)

Good question. languages.json has metadata specific to the big scrape; it isn't used by the command-line tool, nor is it shipped in the wikipron package.

It seems like we need to just update languagecodes.py to include the mapping "pam": "Kapampangan", right?

Gabriel Thompson · Answer 7 · Mon Aug 21 2023 23:58:43 GMT+0800 (China Standard Time)

It seems like we need to just update languagecodes.py to include the mapping "pam": "Kapampangan", right?

Okay, that seems reasonable. I also found 76 languages other than pam that have the same discrepancy between the iso639 name and the Wiktionary name (I sourced this list using the languages.json file). They have the same problem: their XPath ends up with the wrong title attribute. I'll update those as well if that makes sense to you.
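A comparison of this kind could look roughly like the sketch below. It assumes both sources have already been reduced to flat code-to-name dictionaries; the actual layout of languages.json and the iso639 API are not shown in this thread, so the sample data and the function name `name_discrepancies` are purely illustrative.

```python
def name_discrepancies(iso_names, wiktionary_names):
    """Map each shared code to (iso_name, wiktionary_name) where they differ."""
    shared_codes = sorted(iso_names.keys() & wiktionary_names.keys())
    return {
        code: (iso_names[code], wiktionary_names[code])
        for code in shared_codes
        if iso_names[code] != wiktionary_names[code]
    }


# Toy data only: one matching pair and the mismatch discussed in this issue.
iso_names = {"spa": "Spanish", "pam": "Pampanga"}
wiktionary_names = {"spa": "Spanish", "pam": "Kapampangan"}

print(name_discrepancies(iso_names, wiktionary_names))
# {'pam': ('Pampanga', 'Kapampangan')}
```

Each code in the result is a candidate for an explicit entry in languagecodes.py.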

Kyle Gorman · Answer 8 · Tue Aug 22 2023 00:02:17 GMT+0800 (China Standard Time)

Nice, yes that sounds good. At least some of the examples there are already in the file, FYI.