warc-extractor.py fails -- "list index out of range"
catharsis71 opened this issue · comments
$ warc-extractor.py
parsing 195.242.99.71-8181-2016-03-23-3324e7c6-00000.warc
Traceback (most recent call last):
  File "/home/username/bin/warc-extractor.py", line 200, in __getitem__
    return super().__getitem__(name)
  File "/home/username/bin/warc-extractor.py", line 83, in __getitem__
    return self._d[name.lower()]
KeyError: 'content_type'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/username/bin/warc-extractor.py", line 828, in <module>
    parse(args)
  File "/home/username/bin/warc-extractor.py", line 713, in parse
    inc(record.http, "content_type", "http-content")
  File "/home/username/bin/warc-extractor.py", line 654, in inc
    obj = obj[header]
  File "/home/username/bin/warc-extractor.py", line 204, in __getitem__
    return self.content.type
  File "/home/username/bin/warc-extractor.py", line 230, in content
    self._content = ContentType(string)
  File "/home/username/bin/warc-extractor.py", line 267, in __init__
    data[test[0]] = test[1]
IndexError: list index out of range
The WARC file is from https://archive.org/details/warc-195-242-99-71-8181
Other utilities are able to extract at least some data from it.
If there's a bad spot in the file (I'm not sure whether there is), could there be an option to skip over it and continue processing?
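For illustration only: the last frame of the traceback, `data[test[0]] = test[1]`, looks like a Content-Type parameter being split into a key and a value. One guess (not confirmed anywhere in this thread) is that some record carries a parameter with no `=` in it, so the split yields a single element. The sketch below shows a tolerant way to parse such a header. The function name `parse_content_type` is hypothetical and is not part of warc-extractor.py.

```python
def parse_content_type(value):
    """Split a Content-Type header value into (media_type, params).

    Hypothetical helper: tolerates parameters that lack "=" instead of
    raising IndexError on them.
    """
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue  # trailing or doubled ";"
        # partition never raises: sep is "" when there is no "=" in part
        key, sep, val = part.partition("=")
        key = key.strip().lower()
        if not sep:
            params[key] = ""  # malformed parameter: keep the key, empty value
        else:
            params[key] = val.strip().strip('"')
    return media_type, params


print(parse_content_type("text/html; charset=UTF-8"))
print(parse_content_type("text/html; charset"))
```

Using `str.partition` rather than indexing the result of `split("=")` means a malformed parameter degrades to an empty value rather than aborting the whole run.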
I just tried this script myself and was able to ignore and log the same error by using the -error flag, which is described in the -h help and in README.md. It required creating the subdirectory /data.
Sorry it's taken this long to respond. I'm doing other things these days, but I am actually fairly interested in what is going on in that warc file and do plan on looking at it.
I fixed the issue. The linked warc file should extract cleanly now. There is still one problem entry, but that is because its file name is too long, and there is not really much I can do about that. Instead, I added a check so that OS-specific save failures only produce a warning and no longer break the entire data dump.
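A minimal sketch of what such a warn-and-continue check could look like, assuming the save step is a plain file write. The helper name `save_payload` and the warning text are made up for this example and are not taken from warc-extractor.py.

```python
import sys


def save_payload(path, data):
    """Write data to path; warn and return False on OS-level failures.

    Hypothetical helper: a file name that is too long raises OSError,
    which is reported here instead of propagating.
    """
    try:
        with open(path, "wb") as fh:
            fh.write(data)
    except OSError as exc:
        print(f"warning: could not save {path!r}: {exc}", file=sys.stderr)
        return False
    return True
```

A caller can loop over all records, call `save_payload` for each one, and count the `False` results, so a single unsaveable entry does not stop the rest of the dump.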