CUNY-CL / wikipron

Massively multilingual pronunciation mining

scraping audio files?

jhdeov opened this issue

Do you think there's a reasonable way to make an enhancement that will extract audio file URLs for Wiktionary words? At least for Armenian, the audio files are linked in the Pronunciation section.

A possible shortcut is to extract only the URL of the audio file (which I think is hosted on some Wiki domain). The README could then suggest some type of script where a) the user provides a WikiPron-made text file of word+transcription+URL, and b) the script bulk-downloads the audio files from those URLs.
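A minimal sketch of what such a bulk-download script might look like. Everything here is an assumption for illustration: WikiPron does not currently emit a URL column, so the three-column TSV layout (word, transcription, audio URL) and the helper names `read_rows`, `local_name`, and `download_all` are hypothetical.

```python
import csv
import os
import posixpath
import urllib.parse
import urllib.request


def read_rows(lines):
    """Yield (word, transcription, url) from tab-separated lines.

    Assumes a hypothetical three-column layout; rows with any other
    number of columns are skipped.
    """
    # QUOTE_NONE keeps quote characters in transcriptions literal.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) == 3:
            yield tuple(row)


def local_name(url):
    """Derive a local file name from the last path segment of the URL."""
    path = urllib.parse.urlparse(url).path
    return posixpath.basename(urllib.parse.unquote(path))


def download_all(tsv_path, out_dir, fetch=urllib.request.urlretrieve):
    """Download every audio URL listed in the TSV into out_dir.

    Returns a list of (word, transcription, local_path) triples.
    `fetch` is injectable so the function can be tried without a network.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    with open(tsv_path, encoding="utf-8", newline="") as f:
        for word, pron, url in read_rows(f):
            dest = os.path.join(out_dir, local_name(url))
            if not os.path.exists(dest):  # skip files already downloaded
                fetch(url, dest)
            saved.append((word, pron, dest))
    return saved
```

Usage would be something like `download_all("hye_audio.tsv", "audio/")`, where the file name is a placeholder. A real script would probably also want to set a descriptive User-Agent header and rate-limit its requests, which this sketch leaves out.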

For example, the Wiktionary page of this word has a link to an audio file.

PS: this could be useful for someone trying out ASR using Wiktionary :D

at least one person has suggested it would be useful to them

This would be very useful to me as well (the "word+transcription+URL" combo). Anything I could help with here?

There's a paper at LREC that seems to do exactly this: http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.140.pdf

If it meets the stated need, I'd say that WikiPron doesn't have to do it; you can just merge whatever you want from WikiPron with that source.

I think I am going to close this as wontfix because I don't see us doing this in the near future.