scraping audio files?
jhdeov opened this issue · comments
Do you think there's a reasonable way to make an enhancement that will extract audio file URLs for Wiktionary words? At least for Armenian, the audio files are linked in the Pronunciation section.
A possible shortcut is to extract only the URL of the audio file (which I think is hosted on some Wiki domain). The README could then suggest some type of script where a) the user provides a WikiPron-made text file of word+transcription+URL, and b) the script bulk-downloads the audio files from those URLs.
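To make the idea concrete, here is a rough sketch of what such a bulk-download script might look like. It assumes a three-column TSV (word, transcription, URL), which is a hypothetical format: WikiPron does not emit a URL column today. The function names `read_rows`, `target_path`, and `download_all` are made up for illustration, and the `fetch` parameter exists only so the downloader can be swapped out.

```python
import csv
import os
import urllib.parse
import urllib.request


def read_rows(tsv_path):
    """Yield (word, transcription, url) from a three-column TSV (hypothetical format)."""
    with open(tsv_path, encoding="utf-8", newline="") as source:
        for row in csv.reader(source, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) == 3:  # silently skip malformed lines
                yield tuple(row)


def target_path(out_dir, index, url):
    """Build a local path from the row index and the URL's file extension."""
    url_path = urllib.parse.urlparse(url).path
    extension = os.path.splitext(url_path)[1] or ".bin"
    return os.path.join(out_dir, f"{index:06d}{extension}")


def download_all(tsv_path, out_dir, fetch=urllib.request.urlretrieve):
    """Download every URL in the TSV; return (word, transcription, local_path) rows."""
    os.makedirs(out_dir, exist_ok=True)
    manifest = []
    for index, (word, transcription, url) in enumerate(read_rows(tsv_path)):
        path = target_path(out_dir, index, url)
        if not os.path.exists(path):  # skip files fetched on an earlier run
            fetch(url, path)
        manifest.append((word, transcription, path))
    return manifest
```

Files are named by row index rather than by word so that non-ASCII headwords and duplicate words cannot cause filename clashes. A real run against a Wiki host would probably also need a descriptive User-Agent header and some rate limiting, which this sketch leaves out.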
For example, the Wiktionary page of this word has a link to an audio file.
PS: this could be useful for someone trying out ASR using Wiktionary :D
At least one person has suggested it would be useful to them.
This would be very useful to me as well (the "word+transcription+URL" combo). Anything I could help with here?
There's a paper at LREC that seems to do exactly this: http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.140.pdf
If it meets the stated need, I'd say that WikiPron doesn't have to do it; you can just merge whatever you want from WikiPron with that source.
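For what it's worth, that merge could be as simple as a join on the headword. The sketch below assumes a two-column WikiPron TSV (word, pronunciation) and a second two-column TSV of word and audio URL; the second file's layout is purely an assumption, since the format of the other source is not described here, and `load_audio_urls` and `merge` are hypothetical helper names.

```python
import csv


def load_audio_urls(audio_tsv):
    """Map word -> audio URL from a two-column TSV (assumed layout)."""
    urls = {}
    with open(audio_tsv, encoding="utf-8", newline="") as source:
        for row in csv.reader(source, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) == 2:
                urls[row[0]] = row[1]
    return urls


def merge(wikipron_tsv, audio_tsv):
    """Return (word, pronunciation, url) for words present in both files."""
    urls = load_audio_urls(audio_tsv)
    merged = []
    with open(wikipron_tsv, encoding="utf-8", newline="") as source:
        for row in csv.reader(source, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) == 2 and row[0] in urls:
                merged.append((row[0], row[1], urls[row[0]]))
    return merged
```

Words with several pronunciations in the WikiPron file each get the same URL here, so anyone using this for ASR would still need to check which pronunciation the recording actually matches.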
I think I am going to close this as wontfix because I don't see us doing this in the near future.