Common Crawl Foundation

commoncrawl

Organization data from GitHub: https://github.com/commoncrawl

Common Crawl provides an archive of webpages going back to 2007.

https://commoncrawl.org

Common Crawl Foundation's repositories

cc-pyspark

Process Common Crawl data with Python and Spark

Language: Python | License: MIT | 445 20 29

cc-crawl-statistics

Statistics of Common Crawl monthly archives mined from URL index files

Language: Python | License: Apache-2.0 | 197 18 10

cc-index-table

Index Common Crawl archives in tabular format

Language: Java | License: Apache-2.0 | 122 13 24

cc-webgraph

Tools to construct and process Common Crawl webgraphs

Language: Java | License: Apache-2.0 | 101 11 15

cc-index-server

Common Crawl Index Server

Language: HTML | 70 6 10

web-languages

Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.

cc-notebooks

Various Jupyter notebooks about Common Crawl data

Language: Jupyter Notebook | License: Apache-2.0 | 59 16 2

cc-downloader

A polite and user-friendly downloader for Common Crawl data

Language: Rust | License: Apache-2.0 | 57 9 6

nutch

Common Crawl fork of Apache Nutch

Language: Java | License: Apache-2.0 | 38 9 29

whirlwind-python

A whirlwind tour of Common Crawl's data using Python

Language: Python | License: Apache-2.0 | 28 10 2

cc-citations

Scientific articles using or citing Common Crawl data

Language: Jupyter Notebook | 27 110

language-detection-cld2

Natural language detection, Java bindings for CLD2

Language: Java | License: Apache-2.0 | 14 14 4

cc-host-index

Tools for working with the host index

Language: Python | 1100

ia-web-commons

Web archiving utility library

Language: Java | License: Apache-2.0 | 11 6 34

presentations

A collection of public presentations from the Common Crawl Foundation

900

cc-webgraph-statistics

Statistics of Common Crawl monthly Web Graphs

Language: Python | License: Apache-2.0 | 5 6 2

ia-hadoop-tools

Web archiving tools on Hadoop

Language: Java | 4 5 6

cc-nutch-example

Apache Nutch example project to archive content in WARC files

Language: Shell | License: Apache-2.0 | 3 50

wac2025-webgraph-workshop

Introduction to WebGraphs - Workshop at the IIPC Web Archiving Conference 2025

Language: Shell | License: MIT | 3 2 1

cc-monitoring

Code that monitors Common Crawl infrastructure

Language: Python | 2 80

crawler-commons

A set of reusable Java components that implement functionality common to any web crawler

Language: Java | License: Apache-2.0 | 200

web-languages-code

The code used to generate templates for the web-languages repo: https://github.com/commoncrawl/web-languages

Language: Python | License: Apache-2.0 | 2 60

cc-index-annotations

Example code to join an annotation to a host or URL index

Language: Python | 100

wac2025-cc-annotator-poster

A proof of concept pipeline for WARC annotation

Language: Rust | License: Apache-2.0 | 1 30

whirlwind-python-notebook

A Jupyter notebook illustrating the basics of Common Crawl's datasets.

Language: Jupyter Notebook | License: Apache-2.0 | 100

arc2warc-conversion

Experiences converting Common Crawl's ARC files from the 2008-2012 crawls to the WARC format

050

cc-host-index-media

Media files used in the README.md of cc-host-index

Language: HTML | 000

cc-warcinfo-index-builder

Code to build an index that maps warcinfo-id to crawl / warc

Language: Python | 000

robotstxt-experiments

How is the Robots Exclusion Protocol (robots.txt) used on the WWW? This project tries to gain some insights by mining Common Crawl's robots.txt captures from the years 2016 to 2024.

Language: Jupyter Notebook | License: MIT | 050

warcio-s3

Streaming WARC/ARC library for fast web archive IO

Language: Python | License: Apache-2.0 | 000