cldellow / cdx

Scala code to interact with the Common Crawl CDX index

cdx

Implements a subset of the functionality of https://github.com/ikreymer/cdx-index-client

Designed to make it easier to create subsets of the Common Crawl, for manipulation in other programs.

Usage

# Print one capture of the URL that was fetched with a 200 OK status
./fetch CC-MAIN-2018-51 https://kwknittersguild.ca/fair/
# Print one 200 OK capture of the URL, plus its first 10 internal links
./one-hop CC-MAIN-2018-51 https://kwknittersguild.ca/fair/
# Filter the entries in the provided file by language (here: eng).
# Assumes the file was previously created via warc-service.
./filter-language eng <filename.zst>
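
For orientation, the kind of CDX lookup that `./fetch` performs can be sketched against the public Common Crawl index server. This is only an illustration, not necessarily how the script is implemented: the helper name `cdx_query_url` is hypothetical, and the sketch assumes the target URL contains no `?` or `&` characters, since it does no percent-encoding.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): build a query URL for the
# Common Crawl CDX index server.
#   $1 = crawl ID, e.g. CC-MAIN-2018-51
#   $2 = URL to look up
# The query asks for JSON output, restricted to captures whose status is
# exactly 200, and at most one result.
# Assumption: $2 has no '?' or '&', because no percent-encoding is done here.
cdx_query_url() {
  local crawl="$1" url="$2"
  printf 'https://index.commoncrawl.org/%s-index?url=%s&output=json&filter==status:200&limit=1\n' \
    "$crawl" "$url"
}

# Usage (requires network access):
#   curl -s "$(cdx_query_url CC-MAIN-2018-51 https://kwknittersguild.ca/fair/)"
```

Each line of the response is a JSON record describing one capture, including the WARC filename, offset, and length needed to retrieve it.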

Cleanup

Files are stored in ./cache/{cdx,warc,misc} by default.

To store files somewhere other than ./cache, set the CDX_ROOT environment variable.
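
A minimal cleanup sketch follows. The helper names `cache_root` and `clean_cache` are hypothetical and not part of this repo, and the sketch assumes that CDX_ROOT, when set, fully replaces ./cache as the directory containing the cdx, warc, and misc subdirectories.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of this repo).
# Assumption: CDX_ROOT, when set and non-empty, replaces ./cache as the
# directory that holds the cdx/, warc/ and misc/ subdirectories.

# Print the cache root currently in effect.
cache_root() {
  printf '%s\n' "${CDX_ROOT:-./cache}"
}

# Delete the three cache subdirectories and leave everything else in the
# cache root untouched. "${root:?}" aborts if root is somehow empty, so the
# rm can never expand to a path like /cdx.
clean_cache() {
  local root sub
  root="$(cache_root)"
  for sub in cdx warc misc; do
    rm -rf -- "${root:?}/${sub}"
  done
}

# Usage:
#   clean_cache                          # clears ./cache/{cdx,warc,misc}
#   CDX_ROOT=/tmp/cdx-cache clean_cache  # clears /tmp/cdx-cache/{cdx,warc,misc}
```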

Languages

Shell 63.7%, Scala 36.3%