CUNY-CL / wikipron

Massively multilingual pronunciation mining

[arm] finding IPA transcriptions outside of the Pronunciation block

jhdeov opened this issue

For the word կարկանդակ, WikiPron finds the correct pronunciation, [kɑɾkɑndɑk], but it also picks up IPA transcriptions of other words from the Usage notes section, such as [pɛrɑʃˈki]. I'm not sure whether this is an unavoidable glitch on WikiPron's side or something that could be fixed on the Wiktionary side.

It seems that WikiPron is just finding any IPA transcription inside the Armenian entry, even if it's not associated with a dialect. E.g., if you run wikipron arm --dialect='ladygaga' --no-skip-parens --narrow > randos.tsv you get a handful of IPA transcriptions that aren't associated with any of the pre-defined dialects. These are either a) IPA transcriptions in the usage notes or etymology, or b) IPA transcriptions for non-standard dialects. This isn't a problem when using WikiPron on a specific language (because the person can just filter those out manually). But I wonder if this glitch causes any other funny business for the other languages.
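The manual filtering mentioned above could be sketched roughly as follows. This is only an illustration, not part of WikiPron: it assumes both files are two-column TSVs (word, then pronunciation), and the helper names load_pairs and drop_strays are hypothetical. The in-memory strings stand in for a regular scrape and for randos.tsv from the fake-dialect run.

```python
import csv
import io


def load_pairs(handle):
    """Read (word, pronunciation) pairs from a two-column TSV stream."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def drop_strays(all_pairs, stray_pairs):
    """Remove every pair that also appears in the fake-dialect output."""
    strays = set(stray_pairs)
    return [pair for pair in all_pairs if pair not in strays]


# Toy data standing in for real files (assumption: word<TAB>pronunciation).
scraped = io.StringIO("կարկանդակ\tkɑɾkɑndɑk\nկարկանդակ\tpɛrɑʃˈki\n")
randos = io.StringIO("կարկանդակ\tpɛrɑʃˈki\n")

cleaned = drop_strays(load_pairs(scraped), load_pairs(randos))
for word, pron in cleaned:
    print(f"{word}\t{pron}")
```

With real files, the io.StringIO objects would be replaced by open(..., encoding="utf-8", newline="") handles. Note that this would also drop transcriptions of non-standard dialects (case b), which may or may not be what you want.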

Side note: I wonder if there have been enough situations where people had to fix Wiktionary entries in order to help WikiPron's scraper (as in the various closed issues). If so, perhaps a tips and tricks page would be helpful down the line?

It basically finds anything in the pronunciation section in // or []. TBF it is bizarre to be giving the pronunciation of an unrelated Russian word here. I'm going to edit the entry.

The Wiktionary people have taken absolutely zero interest in our project so I don't think there's a demand outside of WikiPron developers for this information.

Heh, the admin ended up agreeing.

It basically finds anything in the pronunciation section in // or [].

But then this is a glitch, because the Russian word was not under the pronunciation section but under a separate heading. The original example is gone now, but another example is գրաբար. Its usage notes explain a pronunciation tidbit. That's in a separate section, but it's getting scraped too.

WikiPron also found IPA transcriptions that were in the etymology section, before the pronunciation section. This word had a transcription there until I found and removed it (via the above 'fake dialect' trick).

This makes me think that WikiPron is looking for IPA anywhere in the entry, and not just in the pronunciation box. I'm not sure if that's an error (because the code isn't designed to go outside the pronunciation box) or a missing feature (because the code is designed to go outside the pronunciation box).
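To make the two hypotheses concrete, here is a toy sketch of the difference between scanning the whole entry and scanning only the pronunciation section. It is not WikiPron's actual code: the entry is modelled as a plain list of (heading, text) pairs, the section wording is invented, and find_transcriptions and scrape are hypothetical names.

```python
import re

# Anything between slashes or square brackets counts as a transcription here.
IPA_SPAN = re.compile(r"/([^/\[\]\s]+)/|\[([^/\[\]\s]+)\]")


def find_transcriptions(text):
    """Return the contents of every /.../ or [...] span in text."""
    return [m.group(1) or m.group(2) for m in IPA_SPAN.finditer(text)]


def scrape(sections, only=None):
    """Collect transcriptions from (heading, text) pairs.

    If `only` is given, sections under any other heading are skipped.
    """
    found = []
    for heading, text in sections:
        if only is None or heading == only:
            found.extend(find_transcriptions(text))
    return found


# Toy stand-in for an entry; the headings and wording are invented.
entry = [
    ("Pronunciation", "IPA: [kɑɾkɑndɑk]"),
    ("Usage notes", "Not to be confused with [pɛrɑʃˈki]."),
]

whole_entry = scrape(entry)                            # both transcriptions
pronunciation_only = scrape(entry, only="Pronunciation")  # just the first
print(whole_entry)
print(pronunciation_only)
```

The behaviour reported in this issue looks like the first call; restricting extraction to one heading, as in the second call, is what would keep the usage-note transcription out.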