thunderpoot / scdx

A simple tool for querying the Common Crawl CDX

scdx

Simple Columnar inDeX

A tool for querying the Common Crawl CDX. Versions in both Python and Rust are included in this repository. The command-line syntax is identical in both versions.

Installation:

  1. Clone this repository
  2. To use the Rust version, compile it and make the binary executable via:
$ cargo build --release
$ cd target/release
$ chmod +x scdx

Usage:

$ scdx --sleep 2 --domain commoncrawl.org --crawls CC-MAIN-2021-04 CC-MAIN-2024-10
$ scdx -s 10 -d '*.wikipedia.org' -c CC-MAIN-2023-50
$ scdx -l -d apple.com

The program displays a progress bar and writes its results to a timestamped file (e.g. 2024-02-27_18-34-50_output.jsonl) in the working directory, unless the -o or --output option is used.
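As an illustration of the default file name, here is a minimal Python sketch. The function name `default_output_name` is hypothetical, and the format string is inferred from the example file name above rather than taken from the scdx source.

```python
from datetime import datetime


def default_output_name(now: datetime) -> str:
    """Build a timestamped .jsonl file name (hypothetical helper).

    The pattern YYYY-MM-DD_HH-MM-SS_output.jsonl is an assumption
    inferred from the example name shown in this README.
    """
    return now.strftime("%Y-%m-%d_%H-%M-%S") + "_output.jsonl"


if __name__ == "__main__":
    # Uses the current local time, as a command-line tool typically would.
    print(default_output_name(datetime.now()))
```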

The default sleep time is 2 seconds. Please be polite! Polling multiple times a second will make the index server sad. See the CCF system status here.
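The idea of pausing between requests can be sketched in Python as follows. This is not the scdx implementation: `poll_politely` and the `fetch` callable are hypothetical names, and the only thing taken from this README is the 2-second default.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def poll_politely(
    crawls: Iterable[str],
    fetch: Callable[[str], object],
    delay: float = 2.0,  # default sleep time stated in this README
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, object]]:
    """Call fetch(crawl) for each crawl, sleeping between requests.

    `fetch` is a caller-supplied function (an assumption for this sketch)
    that performs one request against the index server.
    """
    for i, crawl in enumerate(crawls):
        if i > 0:
            # Pause before every request except the first one.
            sleep(delay)
        yield crawl, fetch(crawl)
```

Passing `sleep` as a parameter keeps the loop easy to test without actually waiting.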

If no crawls are specified, all crawls will be queried. Use the -l or --latest flag to only query the latest crawl.
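The selection rule described above might look like this in Python. The function `select_crawls` is hypothetical. It assumes the list of available crawls is ordered newest first, and that the latest flag takes precedence if both options are given; this README does not specify either point.

```python
from typing import List, Optional, Sequence


def select_crawls(
    available: Sequence[str],
    requested: Optional[Sequence[str]] = None,
    latest: bool = False,
) -> List[str]:
    """Decide which crawls to query (hypothetical helper).

    Assumes `available` is ordered newest first.
    """
    if latest:
        # Only the most recent crawl (assumed to be the first entry).
        return list(available[:1])
    if requested:
        # Crawls named explicitly on the command line.
        return list(requested)
    # Nothing specified: query every available crawl.
    return list(available)
```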

The API used supports two methods of wildcarding, like the (more advanced and mature) cdx-toolkit by Greg Lindahl.

  • Prefixed asterisk

    The query *.example.com, in CDX jargon, sets matchType='domain' and will return captures for blog.example.com, support.example.com, etc.

  • Appended asterisk

    The query example.com/* will return captures for any page on example.com.
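To make the two wildcard forms concrete, the following Python sketch translates a query pattern into CDX query parameters. The function names are hypothetical, and the translation (a leading `*.` becomes matchType=domain, a trailing `/*` becomes matchType=prefix) is an assumption based on common CDX server conventions, not a description of what scdx does internally. The endpoint form `https://index.commoncrawl.org/<crawl>-index` is likewise assumed here.

```python
from urllib.parse import urlencode


def pattern_to_params(pattern: str) -> dict:
    """Map a wildcard pattern to CDX query parameters (hypothetical helper)."""
    if pattern.startswith("*."):
        # Prefixed asterisk: match the domain and all of its subdomains.
        return {"url": pattern[2:], "matchType": "domain"}
    if pattern.endswith("/*"):
        # Appended asterisk: match every URL under this path prefix.
        return {"url": pattern[:-1], "matchType": "prefix"}
    # No wildcard: match this URL only.
    return {"url": pattern, "matchType": "exact"}


def build_query_url(crawl: str, pattern: str) -> str:
    """Build a query URL for one crawl (endpoint form is an assumption)."""
    params = pattern_to_params(pattern)
    params["output"] = "json"
    return f"https://index.commoncrawl.org/{crawl}-index?{urlencode(params)}"


if __name__ == "__main__":
    print(build_query_url("CC-MAIN-2023-50", "*.wikipedia.org"))
```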

The Python version uses tqdm to display a progress bar, and the Rust version uses indicatif.

Licence

MIT License

Thanks
