saulpw / readysetdata

Scripts to make specific datasets cleaner and more convenient

Error logs

anjakefala opened this issue

Fixed

title.principals.tsv.gz seems to have been momentarily corrupted. I made a PR that adds a try/except, so that at least the other tables get built: #10
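
The idea of wrapping each table in its own try/except can be sketched as below. This is a minimal illustration, not the contents of the PR: `build_tables` and `build_one` are hypothetical names, with `build_one` standing in for a per-table function such as the `output_imdb` call seen in the traceback.

```python
import sys
import traceback


def build_tables(tables, build_one):
    """Build each (tblname, filename) pair independently.

    `build_one` is a hypothetical per-table callable. Returns the list of
    table names that failed, so the caller can still exit non-zero.
    """
    failed = []
    for tblname, filename in tables:
        try:
            build_one(tblname, filename)
        except Exception:
            # Report the failure but keep going so later tables still get built.
            print(f"failed to build {tblname} from {filename}", file=sys.stderr)
            traceback.print_exc()
            failed.append(tblname)
    return failed
```

Returning the failed names rather than swallowing the errors lets the script finish the remaining tables and still signal the problem, for example with `sys.exit(1 if failed else 0)`.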

anja@allura:git/readysetdata ‹dougb_wpsummaries*›$ time make imdb
scripts/imdb.py -o output
4106s  76.09/408.01MB  (0.02 MB/s)  title.principals.tsv.gz
Traceback (most recent call last):
  File "/home/anja/git/readysetdata/scripts/imdb.py", line 15, in <module>
    output_imdb('principals', 'title.principals.tsv.gz')
  File "/home/anja/git/readysetdata/scripts/imdb.py", line 9, in output_imdb
    rsd.output('imdb', tblname, rsd.parse_tsv(rsd.gunzip(fp)))
  File "/home/anja/git/readysetdata/readysetdata/output.py", line 20, in output
    with OutputTable(dbname, tblname) as out:
  File "/home/anja/git/readysetdata/readysetdata/utils.py", line 131, in parse_asv
    for line in Progress(it):
  File "/home/anja/git/readysetdata/readysetdata/utils.py", line 71, in __iter__
    for i, x in enumerate(self.iterator):
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/gzip.py", line 313, in read1
    return self._buffer.read1(size)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/gzip.py", line 506, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
make: *** [Makefile:35: imdb] Error 1
make imdb  1959.46s user 551.53s system 19% cpu 3:37:23.65 total

Edit: title.principals.tsv.gz unzipped fine with gzip.
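
The same check can be done from Python by reading the stream to the end, similar in spirit to `gzip -t`. A minimal sketch, assuming the file has been saved locally; `gzip_is_complete` is a hypothetical helper, not part of readysetdata:

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream at `path` decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as fp:
            # Read and discard; only the absence of errors matters here.
            while fp.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: stream ended before the end-of-stream marker (truncated).
        # OSError: includes gzip.BadGzipFile (bad header or CRC mismatch).
        # zlib.error: corrupt compressed data.
        return False
    return True
```

If a saved copy passes this check while the streamed run fails with `EOFError`, the download itself was probably cut short rather than the published file being bad.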

anja@allura:git/readysetdata ‹dougb_wpsummaries*›$ time make wikidata      
OUTDIR=output/wikidata scripts/wikidata.sh
[6041.3s] 1180688KilledMB  (0.18 MB/s)  latest-all.json.bz2
Traceback (most recent call last):
  File "/home/anja/git/readysetdata/scripts/download.py", line 11, in <module>
    sys.stdout.buffer.write(r)
BrokenPipeError: [Errno 32] Broken pipe
make: *** [Makefile:26: wikidata] Error 137
make wikidata  2253.43s user 305.53s system 42% cpu 1:40:47.28 total
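
Exit status 137 is 128 + 9, meaning a process in the pipeline was terminated by SIGKILL, which matches the `Killed` printed in the middle of the progress line. The `BrokenPipeError` in `scripts/download.py` is then most likely a side effect: the downloader kept writing to a pipe whose reader was gone. A minimal sketch, assuming the downloader writes chunks to stdout in a loop (`copy_chunks` is a hypothetical helper, not a function in the repo), that stops quietly instead of printing a second traceback:

```python
import sys


def copy_chunks(chunks, out):
    """Write byte chunks to `out`; return False if the reader went away."""
    try:
        for chunk in chunks:
            out.write(chunk)
        out.flush()
    except BrokenPipeError:
        # The downstream process exited or was killed; nothing left to do.
        return False
    return True


if __name__ == "__main__":
    # Illustrative use: pass stdin through to stdout in 64 KiB chunks.
    source = iter(lambda: sys.stdin.buffer.read(1 << 16), b"")
    if not copy_chunks(source, sys.stdout.buffer):
        sys.exit(1)
```

This only tidies up the secondary error. It does not address why the downstream process was killed.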

New URL: https://geonames.nga.mil/geonames/GNSData/fc_files/Whole_World.7z

The URL and the structure of the zip have changed.

anja@allura:git/readysetdata ‹dougb_wpsummaries*›$ scripts/geonames-nonus.py -o output
Traceback (most recent call last):
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1040, in _validate_conn
    conn.connect()
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connection.py", line 414, in connect
    self.sock = ssl_wrap_socket(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/anja/git/readysetdata/scripts/geonames-nonus.py", line 31, in <module>
    } for r in parse_asv(unzip_url(URL).open_text('Countries.txt'))))
  File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 101, in open_text
    return io.TextIOWrapper(io.BufferedReader(self.open(fn)))
  File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 81, in open
    f = list(self.matching_files(fn))
  File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 75, in matching_files
    for f in self.files.values():
  File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 41, in files
    return {r.filename:r for r in self.infolist()}
  File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 41, in <dictcomp>
    return {r.filename:r for r in self.infolist()}
  File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 44, in infolist
    resp = self.http.request('HEAD', self.url)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/request.py", line 74, in request
    return self.request_encode_url(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/request.py", line 96, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/poolmanager.py", line 376, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 813, in urlopen
    return self.urlopen(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 813, in urlopen
    return self.urlopen(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 813, in urlopen
    return self.urlopen(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='geonames.nga.mil', port=443): Max retries exceeded with url: /gns/html/cntyfile/geonames_20220606.zip (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))
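
`unable to get local issuer certificate` generally means that the local trust store is missing a CA from the server's certificate chain. A minimal sketch of keeping verification on while trusting an additional CA bundle, using only the standard library. `make_ssl_context` and the bundle path in the comment are hypothetical, and whether an extra bundle is what this host needs is an assumption:

```python
import ssl


def make_ssl_context(extra_ca_file=None):
    """Default verifying context, optionally trusting one more CA bundle (PEM)."""
    ctx = ssl.create_default_context()
    if extra_ca_file is not None:
        # Adds to, rather than replaces, the system's default CA certificates.
        ctx.load_verify_locations(cafile=extra_ca_file)
    return ctx


# Possible use with urllib3 (not executed here; the path is hypothetical):
#   http = urllib3.PoolManager(ssl_context=make_ssl_context("/path/to/extra-ca.pem"))
```

This keeps hostname checking and certificate verification enabled, unlike turning verification off.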

make movielens

(It successfully completes, but has this one exception near the end)

453s  6.77/125.89MB  (0.01 MB/s)  movie_dataset_public_final/raw/ratings.json

Traceback (most recent call last):
  File "/home/anja/git/readysetdata/readysetdata/output.py", line 24, in output
    r = next(it)
  File "/home/anja/git/readysetdata/scripts/movielens.py", line 48, in <genexpr>
    output('movielens', 'ratings', ({
  File "/home/anja/git/readysetdata/readysetdata/utils.py", line 147, in __iter__
    yield AttrDict(json.loads(line))
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 28 (char 27)
None

0s  0.00/0.36MB  (0.00 MB/s)  movie_dataset_public_final/raw/survey_answers.json
[12.0s] 42100
12s  0.26/0.36MB  (0.02 MB/s)  movie_dataset_public_final/raw/survey_answers.json
[16.5s] 58500
17s  0.36/0.36MB  (0.02 MB/s)  movie_dataset_public_final/raw/survey_answers.json
[16.6s] 58900
17s  0.36/0.36MB  (0.02 MB/s)  movie_dataset_public_final/raw/survey_answers.json


17s  0.36/0.36MB  (0.02 MB/s)  movie_dataset_public_final/raw/survey_answers.json
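
The `JSONDecodeError` above comes from a single malformed line in `ratings.json`. One way to handle such lines explicitly is to skip them and log the line number. A minimal sketch, assuming one JSON object per line; `parse_jsonl` here is a hypothetical stand-in, not the function in `readysetdata/utils.py`:

```python
import json
import sys


def parse_jsonl(lines):
    """Yield one parsed object per non-empty line, skipping malformed lines."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as e:
            # Note the bad line on stderr and move on to the next one.
            print(f"line {lineno}: skipping malformed JSON ({e})", file=sys.stderr)
```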

Fixed

make wikipedia

  File "/home/anja/git/readysetdata/scripts/parse-wikipedia.py", line 15, in <module>
  File "/home/anja/git/readysetdata/readysetdata/output.py", line 16, in outputSingle
  File "/home/anja/git/readysetdata/readysetdata/output.py", line 98, in output
  File "/home/anja/git/readysetdata/readysetdata/output.py", line 99, in <listcomp>
  File "/home/anja/git/readysetdata/readysetdata/jsonl.py", line 29, in output_jsonl
  File "/home/anja/git/readysetdata/readysetdata/jsonl.py", line 9, in __init__
OSError: [Errno 24] Too many open files: 'output/wikipedia_infoboxes/hot_spring.jsonl'
Traceback (most recent call last):
  File "/home/anja/git/readysetdata/scripts/xml2jsonl.py", line 58, in <module>
    main()
  File "/home/anja/git/readysetdata/scripts/xml2jsonl.py", line 55, in main
    rdr.parse(sys.stdin)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 111, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/xmlreader.py", line 125, in parse
    self.feed(buffer)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 217, in feed
    self._parser.Parse(data, isFinal)
  File "/opt/conda/conda-bld/python-split_1654083059479/work/Modules/pyexpat.c", line 461, in EndElement
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 336, in end_element
    self._cont_handler.endElement(name)
  File "/home/anja/git/readysetdata/scripts/xml2jsonl.py", line 44, in endElement
    print(json.dumps(simplify(contents)), file=self.fp)
BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/home/anja/git/readysetdata/scripts/download.py", line 11, in <module>
    sys.stdout.buffer.write(r)
BrokenPipeError: [Errno 32] Broken pipe
make: *** [Makefile:21: wikipedia] Error 1
make wikipedia  3230.65s user 17.81s system 106% cpu 50:46.76 total
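
`Too many open files` (errno 24) means the process hit its file descriptor limit, here while opening one `.jsonl` file per infobox type. Besides raising the limit with `ulimit -n`, the writer can cap how many files it keeps open. A minimal sketch, assuming each output is a text file that can be reopened in append mode; `AppendFilePool` is a hypothetical class, not part of readysetdata:

```python
from collections import OrderedDict


class AppendFilePool:
    """Keep at most `max_open` files open, closing the least recently used."""

    def __init__(self, max_open=64):
        self.max_open = max_open
        self._open = OrderedDict()  # path -> file object, least recent first

    def write(self, path, text):
        fp = self._open.pop(path, None)
        if fp is None:
            if len(self._open) >= self.max_open:
                # Evict the least recently used handle to stay under the cap.
                _, oldest = self._open.popitem(last=False)
                oldest.close()
            # Append mode, so a reopened file keeps its earlier lines.
            fp = open(path, "a", encoding="utf-8")
        self._open[path] = fp  # (re)insert as most recently used
        fp.write(text)

    def close(self):
        for fp in self._open.values():
            fp.close()
        self._open.clear()
```

Because files are opened in append mode, outputs left over from an earlier run would need to be removed first, otherwise new rows are appended to the old ones.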

make wikipedia

3393s  482.54/21132.09MB  (0.14 MB/s)  enwiki-latest-pages-articles-multistream.xml.bz2
bunzip2: Compressed file ends unexpectedly;
        perhaps it is corrupted?  *Possible* reason follows.
bunzip2: Inappropriate ioctl for device
        Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

[3392.4s] 66704
Traceback (most recent call last):
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 217, in feed
    self._parser.Parse(data, isFinal)
xml.parsers.expat.ExpatError: no element found: line 13647185, column 1107

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/anja/git/readysetdata/scripts/xml2jsonl.py", line 58, in <module>
    main()
  File "/home/anja/git/readysetdata/scripts/xml2jsonl.py", line 55, in main
    rdr.parse(sys.stdin)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 111, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/xmlreader.py", line 127, in parse
    self.close()
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 240, in close
    self.feed(b"", isFinal=True)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 221, in feed
    self._err_handler.fatalError(exc)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <stdin>:13647185:1107: no element found
cd output/wikipedia-infoboxes && zip -n .arrow ../wikipedia-infoboxes.zip *.jsonl
/bin/sh: 1: cd: can't cd to output/wikipedia-infoboxes
make: *** [Makefile:22: wikipedia] Error 2