[arm] finding IPA transcriptions outside of the Pronunciation block

jhdeov opened this issue 2 years ago

For the word կարկանդակ, WikiPron finds the correct pronunciation, [kɑɾkɑndɑk], but it also picks up IPA transcriptions of other words from the Usage notes section, such as [pɛrɑʃˈki]. I'm not sure whether this is an unavoidable glitch on WikiPron's side or something that could be fixed on the Wiktionary side.

It seems that WikiPron is finding any IPA transcription inside the Armenian entry, even if it's not associated with a dialect. E.g., if you run wikipron arm --dialect='ladygaga' --no-skip-parens --narrow > randos.tsv you get a handful of IPA transcriptions that aren't associated with the pre-defined dialects. These are either a) IPA transcriptions in the Usage notes or Etymology sections, or b) IPA transcriptions for non-standard dialects. This isn't a problem when using WikiPron on a specific language (because the user can just filter those out manually). But I wonder if this glitch causes any funny business for other languages.
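The manual filtering mentioned above could look roughly like the sketch below. It assumes two TSV files with one word and one pronunciation per row: a normal scrape, and the randos.tsv produced by the fake-dialect run. It also assumes the stray rows appear verbatim in both. The function names and the toy rows are made up for illustration; they are not real WikiPron output.

```python
import csv
import io


def read_pairs(tsv_text):
    """Parse word<TAB>pronunciation rows into a set of (word, pron) pairs."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return {(row[0], row[1]) for row in reader if len(row) >= 2}


def drop_strays(full_tsv, strays_tsv):
    """Return rows of the full scrape that do not occur in the stray list."""
    strays = read_pairs(strays_tsv)
    reader = csv.reader(io.StringIO(full_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
    kept = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        pair = (row[0], row[1])
        if pair not in strays:
            kept.append(pair)
    return kept


# Toy data (hypothetical rows, using the transcriptions quoted in this issue).
FULL = "կարկանդակ\tkɑɾkɑndɑk\nկարկանդակ\tpɛrɑʃˈki\n"
STRAYS = "կարկանդակ\tpɛrɑʃˈki\n"

CLEANED = drop_strays(FULL, STRAYS)
```

With real files, the two strings would be replaced by the contents of the scrape and of randos.tsv. Note that this would also drop the non-standard-dialect transcriptions from case b), which may or may not be what you want.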

Side note: I wonder if there have been enough situations where people had to fix Wiktionary entries in order to optimize WikiPron's scraper (like in the various closed issues). If so, perhaps a tips and tricks page would be helpful down the line?

Kyle Gorman · Answer 1 · Mon Nov 07 2022 22:19:46 GMT+0800 (China Standard Time)

It basically finds anything in the pronunciation section in // or []. TBF it is bizarre to be giving the pronunciation of an unrelated Russian word here. I'm going to edit the entry.

The Wiktionary people have taken absolutely zero interest in our project so I don't think there's a demand outside of WikiPron developers for this information.

Hossep Dolatian · Answer 2 · Tue Nov 08 2022 03:31:47 GMT+0800 (China Standard Time)

Heh, the admin ended up agreeing

Hossep Dolatian · Answer 3 · Tue Nov 08 2022 03:34:38 GMT+0800 (China Standard Time)

> It basically finds anything in the pronunciation section in // or [].

But then this is a glitch, because the Russian word was not under the Pronunciation section but under a separate heading. The original example is gone now, but another example is գրաբար. The usage notes explain a pronunciation tidbit. It's in a separate section, but it's getting scraped too.

Kyle Gorman · Answer 4 · Tue Nov 08 2022 03:41:52 GMT+0800 (China Standard Time)

Yes, it was a surprise to me that it did that all the same.

Hossep Dolatian · Answer 5 · Tue Nov 08 2022 03:50:51 GMT+0800 (China Standard Time)

WikiPron also found IPA transcriptions that were in the Etymology section, before the Pronunciation section. This word had a transcription there until I found and removed it (via the above 'fake dialect' trick).

This makes me think that WikiPron is looking for IPA anywhere in the entry, and not just in the pronunciation box. I'm not sure if that's an error (because the code isn't designed to go outside the pronunciation box) or a missing feature (because the code is designed to go outside the pronunciation box).
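To make the two possibilities concrete, here is a minimal sketch on a toy, hand-written HTML fragment. This is not WikiPron's code and not real Wiktionary markup: the flat heading layout, the class="IPA" spans, the placeholder usage note, and the function names ipa_anywhere and ipa_in_section are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Toy entry: section headings are direct children, each followed by its content.
ENTRY = """
<div>
  <h3>Pronunciation</h3>
  <ul><li><span class="IPA">[kɑɾkɑndɑk]</span></li></ul>
  <h3>Usage notes</h3>
  <p>Placeholder note mentioning <span class="IPA">[pɛrɑʃˈki]</span>.</p>
</div>
"""

HEADINGS = {"h2", "h3", "h4", "h5"}


def ipa_anywhere(root):
    """Collect every IPA span in the entry, regardless of section."""
    return [s.text for s in root.iter("span") if s.get("class") == "IPA"]


def ipa_in_section(root, title="Pronunciation"):
    """Collect IPA spans only between the given heading and the next heading."""
    found = []
    inside = False
    for child in root:  # direct children, in document order
        if child.tag in HEADINGS:
            # A heading either opens the target section or closes it.
            inside = (child.text or "").strip() == title
        elif inside:
            found.extend(s.text for s in child.iter("span") if s.get("class") == "IPA")
    return found


ROOT = ET.fromstring(ENTRY)
```

On this fragment, ipa_anywhere(ROOT) returns both transcriptions, which matches the behaviour reported here, while ipa_in_section(ROOT) returns only the one under the Pronunciation heading. Real entries nest sections differently, so an actual fix would need to follow the page structure Wiktionary really serves.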