oduwsdl / hypercane

A toolkit for developing algorithms that sample mementos from a web archive collection.

Home Page: https://oduwsdl.github.io/hypercane

Synthesize warc using regular vs raw stream

lesleyodu opened this issue

The synthesize warc command unintentionally switches back to the original stream instead of staying on the raw stream. Making deep copies of all variables taken from the original stream seems to resolve the bug.

Affected lines in hypercane/hypercane/synthesize/warcs.py:
76 - headers_list = copy.deepcopy(resp.raw.headers.items())
81 - warc_target_uri = str(resp.links[link]['url'])
88 - mdt = str(resp.headers['memento-datetime'])
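
For anyone following along, here is a minimal sketch of the copy-before-use idea behind those three edits. It is not the hypercane code: the helper name `snapshot_response_fields`, the `"original"` link relation, and the `SimpleNamespace` stand-in for a `requests` response are all assumptions made for illustration. The headers are wrapped in `list()` before `copy.deepcopy` because a bare items view may not be copyable.

```python
import copy
from types import SimpleNamespace


def snapshot_response_fields(resp, link="original"):
    """Copy the values needed later so they no longer reference the response.

    `resp` is assumed to look like a requests.Response; `link` is a
    hypothetical link relation name chosen for this sketch.
    """
    return {
        # Materialize the items as a list first, then deep-copy the list.
        "headers_list": copy.deepcopy(list(resp.raw.headers.items())),
        # str() builds independent string values for the URI and datetime.
        "warc_target_uri": str(resp.links[link]["url"]),
        "mdt": str(resp.headers["memento-datetime"]),
    }


# Stand-in object (assumption) so the sketch runs without network access.
fake_resp = SimpleNamespace(
    raw=SimpleNamespace(headers={"Content-Type": "text/html"}),
    links={"original": {"url": "http://example.com/"}},
    headers={"memento-datetime": "Tue, 01 Jan 2019 00:00:00 GMT"},
)

snapshot = snapshot_response_fields(fake_resp)

# Changing the response afterwards does not affect the snapshot.
fake_resp.raw.headers["Content-Type"] = "text/plain"
```

The snapshot keeps the values as they were when it was taken, which is the behaviour the deep copies are meant to give.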

Thank you for this. I'll fix it soon.

Update: unfortunately, hypercane still switches streams for some WARCs even with these changes. I will let you know if I find more code edits that fix this issue.

Add after line 60 in synthesize/warcs.py:

if 'rel' in link.attrs and 'stylesheet' in link['rel']: