Get a WARC archive with all files from a domain
dportabella opened this issue
I'd like to download all pages from the www.ipc.com domain as a WARC archive file (or several files), so I do the following:
$ ./cdx-index-client.py -c CC-MAIN-2015-06 http://www.ipc.com/
$ cat www.ipc.com-0
com,ipc)/ 20150127054500 {"url": "http://www.ipc.com/", "digest": "2WIVV4MGIEL27MAOOREEEKCIATEK43GM", "length": "9953", "offset": "768421563", "filename": "crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz"}
[...]
$ wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ gunzip -k CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ cat CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc | tail -c +768421563 | head -c 9953 >segment1.warc
Here I would expect to get some WARC entries for www.ipc.com, but instead I get a "random" chunk of the input file.
I'd recommend also bringing up issues like this on the Common Crawl mailing list; it'll be seen by a lot more people. In this case, I can answer your question: the offset is an offset into the compressed WARC file, not the decompressed one. This is so you don't have to download the whole WARC to access just the one page.
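In practice that means you can fetch just the bytes of the one record with an HTTP Range request against the compressed file and decompress them directly. A minimal sketch, assuming the commoncrawl.s3.amazonaws.com endpoint honours Range headers (the function names here are mine, not part of any tool):

```python
import gzip
import urllib.request

def byte_range(offset, length):
    # HTTP Range is inclusive on both ends, hence the -1.
    return "bytes={}-{}".format(offset, offset + length - 1)

def fetch_warc_record(warc_path, offset, length):
    """Download a single record from a Common Crawl WARC file via an
    HTTP Range request and return the decompressed bytes."""
    url = "https://commoncrawl.s3.amazonaws.com/" + warc_path
    req = urllib.request.Request(
        url, headers={"Range": byte_range(offset, length)})
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
    # Each record in the file is its own gzip member, so the fetched
    # slice decompresses independently of the rest of the archive.
    return gzip.decompress(data)

# With the offset/length from the CDX line above:
# record = fetch_warc_record(
#     "crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/"
#     "CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz",
#     768421563, 9953)
```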
Thanks, I'll continue the discussion here:
https://groups.google.com/forum/#!topic/common-crawl/0fYTJtFD6Fs
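For anyone landing here later: the reason the offsets only make sense against the compressed file is that these WARC files are written as concatenated gzip members, one per record, so the byte slice at (offset, length) of the .gz is itself a complete gzip stream. A small self-contained sketch with made-up record contents:

```python
import gzip

# Mimic the WARC layout: each record compressed as its own gzip member,
# members concatenated into one .gz file.
rec1 = gzip.compress(b"WARC/1.0 record one\r\n\r\n")
rec2 = gzip.compress(b"WARC/1.0 record two\r\n\r\n")
warc_gz = rec1 + rec2

# The CDX-style (offset, length) of the second record in the .gz file:
offset, length = len(rec1), len(rec2)

# Slicing the *compressed* bytes yields a complete gzip member that
# decompresses on its own; slicing the decompressed file at the same
# offset lands somewhere arbitrary, as in the question above.
print(gzip.decompress(warc_gz[offset:offset + length]))
# b'WARC/1.0 record two\r\n\r\n'
```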