ikreymer / cdx-index-client

A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/

Get a WARC archive with all files from a domain

dportabella opened this issue

I'd like to download all pages from the www.ipc.com domain into a WARC archive file (or several files). So I do the following:

$ ./cdx-index-client.py -c CC-MAIN-2015-06 http://www.ipc.com/
$ cat www.ipc.com-0
com,ipc)/ 20150127054500 {"url": "http://www.ipc.com/", "digest": "2WIVV4MGIEL27MAOOREEEKCIATEK43GM", "length": "9953", "offset": "768421563", "filename": "crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz"}
[...]

$ wget https://commoncrawl.s3.amazonaws.com:/crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ gunzip -k CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ cat CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc | tail -c +768421563 | head -c 9953 >segment1.warc

Here I would expect to get some WARC entries for www.ipc.com, but instead I get a "random" chunk of the input file.

I'd also recommend bringing up issues like this on the Common Crawl mailing list, where they'll be seen by a lot more people. In this case, I can answer your question: the offset is a byte offset into the compressed WARC (the .warc.gz file), not into the gunzipped file. This is so you don't have to download the whole WARC to access just the one page.
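
As a rough sketch of what that means in practice: slice the bytes out of the .warc.gz first, then gunzip only that slice. This assumes the "length" field is also measured in compressed bytes and that each record is stored as its own gzip member, which is my understanding of the Common Crawl layout but is not stated in this thread. The function names extract_record and fetch_record are made up for illustration, and fetch_record additionally assumes the server honors HTTP Range requests.

```shell
# Hypothetical helper: print one WARC record from a local .warc.gz.
# $1 = path to the compressed file, $2 = offset, $3 = length (both in compressed bytes)
extract_record() {
    # tail -c +N starts at byte N counting from 1, so add 1 to the 0-based offset
    tail -c +"$(($2 + 1))" "$1" | head -c "$3" | gunzip -c
}

# Hypothetical helper: fetch only the record's bytes with an HTTP Range request.
# $1 = URL of the .warc.gz, $2 = offset, $3 = length
fetch_record() {
    # the byte range is inclusive at both ends, hence the -1
    curl -s -r "$2-$(($2 + $3 - 1))" "$1" | gunzip -c
}

# Usage with the index entry from the question (not run here):
#   extract_record CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz 768421563 9953 > segment1.warc
```

Note that the original command should not be pointed at the gunzipped file at all, and that `tail -c +768421563` starts one byte too early for a 0-based offset, which is why the sketch adds 1.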