CUNY-CL / wikipron

Massively multilingual pronunciation mining

Resuming scrape from hang-up point

lmit1 opened this issue

I'm trying to scrape Russian narrow transcriptions, and the code hangs about 200k pronunciations in with this error:

Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/socket.py", line 962, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

I assume this is caused by a dropped connection on my end. Is there a way to resume scraping from the point where the code left off? I'm piping the output to a txt file. Or is there perhaps a way to begin scraping from the end of the alphabet, so that I can cross-reference the two files to get the complete scrape?

Thank you!
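
For the cross-referencing step mentioned above, here is a minimal sketch, assuming each partial scrape is a UTF-8 file with one tab-separated word/pronunciation pair per line. The helper names `merge_scrapes` and `write_rows` are hypothetical and not part of wikipron. It covers only the merge; it does not make the scraper start from a different point.

```python
def merge_scrapes(paths):
    """Return the sorted union of (word, pronunciation) rows from several files."""
    rows = set()
    for path in paths:
        with open(path, encoding="utf-8") as source:
            for line in source:
                parts = line.rstrip("\n").split("\t")
                # Skip blank lines and rows that lack exactly two non-empty
                # fields; an interrupted run may leave a cut-off final line.
                if len(parts) != 2 or not all(parts):
                    continue
                # Deduplicate on the whole row, not the word alone, because
                # one word can have several pronunciations.
                rows.add((parts[0], parts[1]))
    return sorted(rows)


def write_rows(rows, out_path):
    """Write (word, pronunciation) rows back out as tab-separated lines."""
    with open(out_path, "w", encoding="utf-8") as sink:
        for word, pron in rows:
            sink.write(f"{word}\t{pron}\n")
```

A final line that was cut off after the tab would still pass the field check, so it is worth inspecting the last row of each interrupted file by hand.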

Thanks for the quick response. This is my third time trying this scrape; all attempts have resulted in this issue at one point or another. I'll cross my fingers and hope that another try will do it.

The scrape under data/ doesn't appear to have stress markings, which I need for my purposes. Is there a version of it with stress marked that I'm not seeing?
Thanks!

If you like, I can also run this locally on my own computer since I've had good luck in the past. You are doing something like:

```
wikipron --narrow --stress rus > rus-narrow-stressed.tsv
```

I presume?

If you could run it that'd be great. Yes, that's essentially what I was doing.

Thank you for the help!

Thanks @kylebgorman, that would be really helpful! Luca is working with me (just finished). Thanks for the pointer to Yulia's Master's thesis. I assume you're referring to Ruslex; I'll put the link here in case anyone else is looking for a good Russian lexical resource:
https://github.com/undrits/ruslex/

Thanks for the link, @msonderegger. I couldn't remember her GitHub handle.

@lmit1 I am running this now. I'll be traveling for the weekend, but if it succeeds I will have it by early next week.

This succeeded on the first try for me. I think you just have an unreliable connection. (I did it on wired ethernet from my lab... I don't think we've had any non-scheduled downtime in years, so mine is an unusually reliable one.) If you write to me offline with your email address I can send you the file I generated.
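
For anyone stuck with a flaky connection, one workaround is to re-run the whole command automatically when it fails. The following is a rough sketch, not a wikipron feature: `run_with_retries` is a hypothetical helper, it restarts the scrape from scratch each time rather than resuming, and it assumes the command exits with a non-zero status when it crashes with a traceback like the one above.

```python
import os
import subprocess
import time


def run_with_retries(cmd, out_path, max_attempts=3, delay_seconds=60.0):
    """Run cmd, saving stdout to out_path; retry from scratch on failure."""
    tmp_path = out_path + ".partial"
    for attempt in range(1, max_attempts + 1):
        # Each attempt overwrites the temporary file, so a failed run never
        # leaves a truncated scrape under the final name.
        with open(tmp_path, "w", encoding="utf-8") as sink:
            result = subprocess.run(cmd, stdout=sink)
        if result.returncode == 0:
            os.replace(tmp_path, out_path)
            return True
        if attempt < max_attempts:
            # Wait before retrying in case the network outage is brief.
            time.sleep(delay_seconds)
    return False
```

With the command quoted earlier in this thread, a call might look like `run_with_retries(["wikipron", "--narrow", "--stress", "rus"], "rus-narrow-stressed.tsv")`.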

Closing issue.