scraping audio files?

Question

jhdeov opened this issue 2 years ago

Do you think there's a reasonable way to make an enhancement that will extract audio file URLs for Wiktionary words? At least for Armenian, the audio files are linked in the Pronunciation section.

Kyle Gorman · Answer 1 · Thu Jun 23 2022 07:08:06 GMT+0800 (China Standard Time)

Probably, and at least one person has suggested it would be useful to them. (I myself don't have a use yet but I like the idea.) I wonder if this would exceed what we can store on GitHub directly (just in terms of overall repo size, I think the limit is 5 GB), though, and if so we would have to do something like make a local download then upload to, IDK, S3 or something like that and generate a link.

On Mon, Jun 20, 2022 at 7:47 PM Hossep Dolatian wrote:

Do you think there's a reasonable way to make an enhancement that will extract audio file URLs for Wiktionary words? At least for Armenian, the audio files are linked in the Pronunciation <https://en.wiktionary.org/wiki/%D5%A3%D6%80%D5%A5%D5%AC> section.

Hossep Dolatian · Answer 2 · Thu Jun 23 2022 07:16:19 GMT+0800 (China Standard Time)

A possible cheat is that you extract only the URL of the audio file (that I think is hosted on some Wiki domain). Then you can suggest in the README some type of script where a) the user provides a Wikipron-made text file of word+transcription+URL, and b) the script bulk-downloads the audio files of the URL.

For example, the Wiktionary page of this word has a link to an audio file.

PS: this could be useful for someone trying out ASR using Wiktionary :D
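
A minimal sketch of what the bulk-download script in (b) might look like, assuming a three-column TSV of word, transcription, and URL. That layout is hypothetical (WikiPron itself currently emits only word and pronunciation), and the function names and User-Agent string are made up for illustration. It also assumes each URL points directly at the audio file; a Wiktionary `File:` page URL is a description page, so it would have to be resolved to the underlying media URL first.

```python
import csv
import os
import urllib.parse
import urllib.request


def read_rows(tsv_path):
    """Yields (word, pron, url) triples from a three-column TSV file."""
    with open(tsv_path, encoding="utf-8", newline="") as source:
        for row in csv.reader(source, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) == 3:  # Silently skips malformed rows.
                yield tuple(row)


def local_name(url):
    """Derives a local filename from the last path component of the URL."""
    path = urllib.parse.urlparse(url).path
    return urllib.parse.unquote(os.path.basename(path))


def fetch(url):
    """Downloads one URL and returns its bytes."""
    # The User-Agent value is a placeholder, not a registered client name.
    request = urllib.request.Request(
        url, headers={"User-Agent": "audio-fetch-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def download_all(tsv_path, out_dir, fetch_fn=fetch):
    """Saves the audio for every row; returns the local paths, one per row."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for _word, _pron, url in read_rows(tsv_path):
        target = os.path.join(out_dir, local_name(url))
        if not os.path.exists(target):  # Skips files fetched on an earlier run.
            with open(target, "wb") as sink:
                sink.write(fetch_fn(url))
        paths.append(target)
    return paths
```

Something like `download_all("hye.tsv", "audio/")` would then fill `audio/` with one file per row (both paths are placeholders). The `fetch_fn` parameter is only there so the download step can be swapped out, for example to add rate limiting or to test without network access.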

Kyle Gorman · Answer 3 · Thu Jun 23 2022 07:17:52 GMT+0800 (China Standard Time)

That's a good idea. The person who is probably most in the market for this is Alan Black at CMU.

On Wed, Jun 22, 2022 at 4:16 PM Hossep Dolatian wrote:

A possible cheat is that you extract only the URL of the audio file (that I think is hosted on some Wiki domain). Then you can suggest in the README some type of script where a) the user provides a Wikipron-made text file of word+transcription+URL, and b) the script bulk-downloads the audio files of the URL. For example, the Wiktionary page of this word <https://en.m.wiktionary.org/wiki/%D5%A3%D6%80%D5%A5%D5%AC> has a link <https://en.m.wiktionary.org/wiki/File:Hy-%D5%A3%D6%80%D5%A5%D5%AC.ogg> to an audio file.

I know that the way Wiktionary works is that users independently upload their audio recordings to some Wiki site. Wiktionary then links a Wiktionary entry (if it exists) to the audio file (if it exists). That way, a user can make an audio file, let's say, today, and make the Wiktionary entry next week, and Wiktionary will then link the entry with the audio file.
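
For concreteness, here is a rough sketch of pulling audio URLs out of an entry's rendered HTML. It assumes the page embeds each recording as a `<source src="...">` tag inside an `<audio>` tag, possibly with a protocol-relative URL; that is an assumption about the markup, not something WikiPron does or guarantees. `AudioSourceParser` and `audio_urls` are hypothetical names.

```python
from html.parser import HTMLParser


class AudioSourceParser(HTMLParser):
    """Collects the src attribute of <source> tags nested in <audio> tags."""

    def __init__(self):
        super().__init__()
        self.audio_depth = 0
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "audio":
            self.audio_depth += 1
        elif tag == "source" and self.audio_depth > 0:
            src = dict(attrs).get("src")
            if src:
                # Protocol-relative URLs ("//host/...") get an explicit scheme.
                self.urls.append("https:" + src if src.startswith("//") else src)

    def handle_endtag(self, tag):
        if tag == "audio" and self.audio_depth > 0:
            self.audio_depth -= 1


def audio_urls(html):
    """Returns the audio source URLs found in an HTML string, in page order."""
    parser = AudioSourceParser()
    parser.feed(html)
    parser.close()
    return parser.urls
```

This only looks at tags, so it would not tell you which pronunciation or dialect a recording belongs to; that pairing would need the surrounding list item as well.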

rovr · Answer 4 · Wed Jun 29 2022 01:57:20 GMT+0800 (China Standard Time)

Quoting Kyle Gorman: "at least one person has suggested it would be useful to them"

This would be very useful to me as well (the "word+transcription+URL" combo). Anything I could help with here?

Kyle Gorman · Answer 5 · Wed Jun 29 2022 02:21:33 GMT+0800 (China Standard Time)

There's a paper at LREC that seems to do exactly this: http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.140.pdf

If it meets the stated need, I'd say that WikiPron doesn't have to do it; you can just merge whatever you want from WikiPron with that source.
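
A merge along those lines could be sketched as follows, assuming WikiPron's two-column TSV (word, pronunciation) and a second two-column TSV of word and audio URL. The second file's layout is purely hypothetical; the actual format of the other resource may differ and would need its own reader. `read_pairs` and `merge` are illustrative names.

```python
import csv
import io


def read_pairs(handle):
    """Reads two-column TSV rows into a list of (key, value) pairs."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def merge(pron_pairs, audio_pairs):
    """Joins pronunciations with audio URLs on the word.

    Words without an audio URL are dropped.
    """
    url_by_word = {}
    for word, url in audio_pairs:
        url_by_word.setdefault(word, url)  # Keeps the first URL seen per word.
    return [
        (word, pron, url_by_word[word])
        for word, pron in pron_pairs
        if word in url_by_word
    ]
```

Joining on the spelling alone is a simplification: a word with several pronunciations would get the same URL on every row, so a real merge would probably need some extra key or manual checking.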

Kyle Gorman · Answer 6 · Fri Mar 31 2023 02:20:51 GMT+0800 (China Standard Time)

I think I am going to close this as wontfix because I don't see us doing this in the near future.