recrm / ArchiveTools

A collection of tools for archiving and analysing the internet.

warc-extractor.py fails -- "list index out of range"

catharsis71 opened this issue

$ warc-extractor.py
parsing 195.242.99.71-8181-2016-03-23-3324e7c6-00000.warc
Traceback (most recent call last):
  File "/home/username/bin/warc-extractor.py", line 200, in __getitem__
    return super().__getitem__(name)
  File "/home/username/bin/warc-extractor.py", line 83, in __getitem__
    return self._d[name.lower()]
KeyError: 'content_type'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/username/bin/warc-extractor.py", line 828, in <module>
    parse(args)
  File "/home/username/bin/warc-extractor.py", line 713, in parse
    inc(record.http, "content_type", "http-content")
  File "/home/username/bin/warc-extractor.py", line 654, in inc
    obj = obj[header]
  File "/home/username/bin/warc-extractor.py", line 204, in __getitem__
    return self.content.type
  File "/home/username/bin/warc-extractor.py", line 230, in content
    self._content = ContentType(string)
  File "/home/username/bin/warc-extractor.py", line 267, in __init__
    data[test[0]] = test[1]
IndexError: list index out of range
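
For what it's worth, the traceback suggests ContentType.__init__ splits each Content-Type parameter on "=" and indexes both halves, so a record whose header carries a parameter with no "=value" part would raise the IndexError. A minimal sketch of that failure mode and a tolerant variant (parse_content_type is my illustration, not the script's actual code):

def parse_content_type(string):
    # Split "type/subtype; key=value; ..." into a dict.
    parts = string.split(";")
    data = {"type": parts[0].strip()}
    for param in parts[1:]:
        test = param.split("=", 1)
        if len(test) == 2:
            data[test[0].strip()] = test[1].strip()
        # The original `data[test[0]] = test[1]` raises IndexError
        # when a parameter has no "=", e.g. "text/html; charset".
    return data

print(parse_content_type("text/html; charset=utf-8"))  # {'type': 'text/html', 'charset': 'utf-8'}
print(parse_content_type("text/html; charset"))        # bare token skipped, no crash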

The WARC file is from https://archive.org/details/warc-195-242-99-71-8181

Other utilities are able to extract at least some data from it.

If there's a bad spot in the file (I'm not sure whether there is), could there be an option to skip over it and continue processing?

commented

I was just trying this script out myself and was able to ignore/log the same error by using the -error flag, which is described in the -h help and in the README.md. This required first creating the subdirectory /data.
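
For reference, the workaround was roughly this (I'm assuming -error takes no argument; check the -h output for the exact syntax and where the subdirectory needs to live):

$ mkdir data
$ warc-extractor.py -error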

Sorry it's taken this long to respond. I'm doing other things these days, but I am actually fairly interested in what is going on in that warc file and do plan on looking at it.

#9

I fixed the issue. The linked warc file should extract cleanly now. There is still one problem entry; however, that is because its file name is too long, and there is not much I can do about that. Instead, I added a check that should let OS-specific save failures produce only a warning rather than break the entire data dump.
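
Presumably the check is along these lines (a sketch only; save_record and the warning format are illustrative, not the actual diff):

import os

def save_record(path, payload):
    # Write one extracted record to disk; on OS-level failures
    # (e.g. ENAMETOOLONG for an over-long file name), warn and
    # keep going instead of aborting the whole dump.
    try:
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as handle:
            handle.write(payload)
    except OSError as err:
        print(f"warning: could not save {path}: {err}")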