CUNY-CL / wikipron

Massively multilingual pronunciation mining

Hungarian and Russian scrapes hang

erip opened this issue · comments

Using wikipron with the keys hun and rus seems to hang indefinitely. Both have been running for an hour without any progress; see below:

$ wikipron hun > hun.tsv
INFO: Language: 'Hungarian'
INFO: No cut-off date specified

Unfortunately I have very little diagnostic info to offer -- many other languages (including more complicated scrapes like zho) have completed successfully.

Sometimes the server does that. We don't know why. The big scrape (see data/scrape) has logic to resume from hangups, which may be of help to you.
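For one-off scrapes, a generic timeout-and-restart wrapper is one way to cope with a stalled run. The sketch below is not the resume logic in data/scrape; it restarts the command from scratch rather than resuming. The helper name retry_with_timeout and its arguments are hypothetical, and it assumes the timeout utility from GNU coreutils is available.

```shell
# Hypothetical helper, not part of wikipron: run a command under a
# wall-clock limit and restart it from scratch if it times out or fails.
# Usage: retry_with_timeout SECONDS ATTEMPTS OUTFILE COMMAND [ARGS...]
retry_with_timeout() {
    limit=$1
    attempts=$2
    outfile=$3
    shift 3
    i=1
    while [ "$i" -le "$attempts" ]; do
        # ">" truncates OUTFILE on every attempt, so partial output
        # from a failed attempt is discarded rather than appended.
        if timeout "$limit" "$@" > "$outfile"; then
            return 0
        fi
        echo "attempt $i of $attempts failed or timed out" >&2
        i=$((i + 1))
    done
    return 1
}
```

For example, `retry_with_timeout 3600 3 hun.tsv wikipron hun` would give each attempt an hour and try at most three times; both numbers are arbitrary. A run that is merely slow, rather than stalled, would also be cut off at the limit.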

Those are, I think, the two largest languages in the entire collection, though. If it hangs it'll be on one of the two of them.

I'll try replicating your Hungarian example and report back.

Thanks, @kylebgorman! As a workaround I can use the pre-scraped transcriptions from the data/ dir. I'm mostly filing this as documentation for future issue-havers, though, as you say, it's likely a problem with the server rather than a bug in the client.

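I just remembered something. Both of those languages have narrow (square-bracket) pronunciations, almost exclusively. By default wikipron looks for phonemic (slash-delimited) transcriptions, so it finds almost nothing to output for them. You'll want to add --phonetic to your flags. (Note that this flag is renamed to --narrow in the next release; see #402.)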
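A minimal sketch of rerunning the scrape with that flag, using the --phonetic spelling from the current release. The wrapper name scrape_phonetic is hypothetical and only saves retyping the redirect.

```shell
# Hypothetical wrapper, not part of wikipron: scrape one language key,
# requesting phonetic (square-bracket) transcriptions, into KEY.tsv.
scrape_phonetic() {
    wikipron --phonetic "$1" > "$1.tsv"
}
```

Calling `scrape_phonetic hun` is then equivalent to `wikipron --phonetic hun > hun.tsv`, and `scrape_phonetic rus` does the same for Russian.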

Closing this since I believe my comment a month ago is the explanation for the issue. This is not exactly a bug.