Common Crawl Foundation (commoncrawl)

Organization data from GitHub: https://github.com/commoncrawl

Common Crawl provides an archive of webpages going back to 2007.

Home Page: https://commoncrawl.org

GitHub: @commoncrawl

Twitter: @commoncrawl

Common Crawl Foundation's repositories

cc-pyspark

Process Common Crawl data with Python and Spark

Language: Python | License: MIT | Stargazers: 445 | Issues: 20 | Issues: 29

cc-crawl-statistics

Statistics of Common Crawl monthly archives mined from URL index files

Language: Python | License: Apache-2.0 | Stargazers: 197 | Issues: 18 | Issues: 10

cc-index-table

Index Common Crawl archives in tabular format

Language: Java | License: Apache-2.0 | Stargazers: 122 | Issues: 13 | Issues: 24

cc-webgraph

Tools to construct and process Common Crawl webgraphs

Language: Java | License: Apache-2.0 | Stargazers: 101 | Issues: 11 | Issues: 15

cc-index-server

Common Crawl Index Server

web-languages

Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.

cc-notebooks

Various Jupyter notebooks about Common Crawl data

Language: Jupyter Notebook | License: Apache-2.0 | Stargazers: 59 | Issues: 16 | Issues: 2

cc-downloader

A polite and user-friendly downloader for Common Crawl data

Language: Rust | License: Apache-2.0 | Stargazers: 57 | Issues: 9 | Issues: 6

nutch

Common Crawl fork of Apache Nutch

Language: Java | License: Apache-2.0 | Stargazers: 38 | Issues: 9 | Issues: 29

whirlwind-python

A whirlwind tour of Common Crawl's data using Python

Language: Python | License: Apache-2.0 | Stargazers: 28 | Issues: 10 | Issues: 2

cc-citations

Scientific articles using or citing Common Crawl data

Language: Jupyter Notebook | Stargazers: 27 | Issues: 11 | Issues: 0

language-detection-cld2

Natural language detection, Java bindings for CLD2

Language: Java | License: Apache-2.0 | Stargazers: 14 | Issues: 14 | Issues: 4

cc-host-index

Tools for working with the host index

Language: Python | Stargazers: 11 | Issues: 0 | Issues: 0

ia-web-commons

Web archiving utility library

Language: Java | License: Apache-2.0 | Stargazers: 11 | Issues: 6 | Issues: 34

presentations

A collection of public presentations from the Common Crawl Foundation

Stargazers: 9 | Issues: 0 | Issues: 0

cc-webgraph-statistics

Statistics of Common Crawl monthly Web Graphs

Language: Python | License: Apache-2.0 | Stargazers: 5 | Issues: 6 | Issues: 2

ia-hadoop-tools

Web archiving tools on Hadoop

cc-nutch-example

Apache Nutch example project to archive content in WARC files

Language: Shell | License: Apache-2.0 | Stargazers: 3 | Issues: 5 | Issues: 0

wac2025-webgraph-workshop

Introduction to WebGraphs - Workshop at the IIPC Web Archiving Conference 2025

Language: Shell | License: MIT | Stargazers: 3 | Issues: 2 | Issues: 1

cc-monitoring

Code that monitors Common Crawl infrastructure

Language: Python | Stargazers: 2 | Issues: 8 | Issues: 0

crawler-commons

A set of reusable Java components that implement functionality common to any web crawler

Language: Java | License: Apache-2.0 | Stargazers: 2 | Issues: 0 | Issues: 0

web-languages-code

The code used to generate templates for the web-languages repo https://github.com/commoncrawl/web-languages

Language: Python | License: Apache-2.0 | Stargazers: 2 | Issues: 6 | Issues: 0

cc-index-annotations

Example code to join an annotation to a host or URL index

Language: Python | Stargazers: 1 | Issues: 0 | Issues: 0

wac2025-cc-annotator-poster

A proof of concept pipeline for WARC annotation

Language: Rust | License: Apache-2.0 | Stargazers: 1 | Issues: 3 | Issues: 0

whirlwind-python-notebook

A Jupyter notebook illustrating the basics of Common Crawl's datasets.

Language: Jupyter Notebook | License: Apache-2.0 | Stargazers: 1 | Issues: 0 | Issues: 0

arc2warc-conversion

Experiences converting Common Crawl's ARC files from the 2008 to 2012 crawls to the WARC format

Stargazers: 0 | Issues: 5 | Issues: 0

cc-host-index-media

Media files used in the README.md of cc-host-index

Language: HTML | Stargazers: 0 | Issues: 0 | Issues: 0

cc-warcinfo-index-builder

Code to build an index that maps warcinfo-id to crawl / warc

Language: Python | Stargazers: 0 | Issues: 0 | Issues: 0

robotstxt-experiments

How is the Robots Exclusion Protocol (robots.txt) used on the WWW? This project tries to gain some insights by mining Common Crawl's robots.txt captures from 2016 to 2024.

Language: Jupyter Notebook | License: MIT | Stargazers: 0 | Issues: 5 | Issues: 0

warcio-s3

Streaming WARC/ARC library for fast web archive IO

Language: Python | License: Apache-2.0 | Stargazers: 0 | Issues: 0 | Issues: 0