Get a WARC archive with all files from a domain

Question

dportabella opened this issue 8 years ago · comments

I'd like to download all pages from the www.ipc.com domain into a WARC archive file (or several files), so I do the following:

$ ./cdx-index-client.py -c CC-MAIN-2015-06 http://www.ipc.com/
$ cat www.ipc.com-0
com,ipc)/ 20150127054500 {"url": "http://www.ipc.com/", "digest": "2WIVV4MGIEL27MAOOREEEKCIATEK43GM", "length": "9953", "offset": "768421563", "filename": "crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz"}
[...]

$ wget https://commoncrawl.s3.amazonaws.com:/crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ gunzip -k CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ cat CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc | tail -c +768421563 | head -c 9953 >segment1.warc

Here, I would expect to get some WARC entries for www.ipc.com, but instead I get a "random" chunk of the input file.

Greg Lindahl · Answer 1 · Thu Sep 08 2016 06:00:10 GMT+0800 (China Standard Time)

I'd recommend also bringing up issues like this on the Common Crawl mailing list, where they'll be seen by a lot more people. In this case, I can answer your question: the offset is a byte offset into the compressed WARC (the .warc.gz file), not into the gunzipped one, and the length is likewise the compressed length. This is so you don't have to download the whole WARC to access just the one page.
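
To make the answer above concrete, here is a minimal Python sketch. It assumes that each record in the .warc.gz is stored as its own gzip member, so the bytes from offset to offset + length - 1 can be decompressed on their own. The helper names (parse_cdx_line, range_header, extract_record) are made up for illustration and are not part of cdx-index-client.py, and the two-record in-memory "archive" is only a stand-in for a real WARC file.

```python
import gzip
import json

# Index line copied from the question (cdx-index-client.py output).
CDX_LINE = (
    'com,ipc)/ 20150127054500 {"url": "http://www.ipc.com/", '
    '"digest": "2WIVV4MGIEL27MAOOREEEKCIATEK43GM", "length": "9953", '
    '"offset": "768421563", "filename": "crawl-data/CC-MAIN-2015-06/segments/'
    '1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252'
    '.ec2.internal.warc.gz"}'
)


def parse_cdx_line(line):
    """Split 'key timestamp {json}' and return (filename, offset, length)."""
    _key, _timestamp, payload = line.split(" ", 2)
    fields = json.loads(payload)
    return fields["filename"], int(fields["offset"]), int(fields["length"])


def range_header(offset, length):
    """HTTP Range header value for one record (the end byte is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"


def extract_record(compressed, offset, length):
    """Slice one gzip member out of the COMPRESSED bytes and decompress it."""
    return gzip.decompress(compressed[offset:offset + length])


filename, offset, length = parse_cdx_line(CDX_LINE)
header = range_header(offset, length)

# Stand-in for a .warc.gz: every record is compressed as a separate member.
records = [b"WARC/1.0\r\nrecord A\r\n\r\n", b"WARC/1.0\r\nrecord B\r\n\r\n"]
members = [gzip.compress(r) for r in records]
archive = b"".join(members)

# Offsets count bytes of the compressed file, starting at 0.
second = extract_record(archive, len(members[0]), len(members[1]))
print(header)
print(second.decode())
```

With the real file, the same Range value could be sent in an HTTP request for the .warc.gz so that only those 9953 bytes are downloaded. If you already have the full download, note that tail -c +N is 1-based, so the equivalent pipeline would run on the .warc.gz (not the gunzipped file), start at offset + 1, and pipe the result through gunzip.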

David Portabella · Answer 2 · Thu Sep 08 2016 17:48:10 GMT+0800 (China Standard Time)

Thanks, I'll continue the discussion here:
https://groups.google.com/forum/#!topic/common-crawl/0fYTJtFD6Fs