Repositories under the commoncrawl topic.
news-please - an integrated web crawler and information extractor for news that just works
Process Common Crawl data with Python and Spark
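As a rough illustration of this kind of Spark processing (a minimal sketch, not the repository's code), the snippet below distributes the paths from a crawl's warc.paths listing across executors and streams each WARC file over HTTP with warcio; the file name warc.paths, the record counting, and the installed packages (pyspark, requests, warcio) are assumptions.

```python
# Minimal sketch, not the repository's code: count HTTP 'response' records per
# WARC file by distributing paths from a crawl's warc.paths listing over Spark.
import requests
from pyspark.sql import SparkSession
from warcio.archiveiterator import ArchiveIterator

DATA_PREFIX = "https://data.commoncrawl.org/"

def count_responses(path):
    # Stream one WARC file over HTTP and count its HTTP response records.
    resp = requests.get(DATA_PREFIX + path, stream=True)
    n = sum(1 for rec in ArchiveIterator(resp.raw) if rec.rec_type == "response")
    return path, n

spark = SparkSession.builder.appName("cc-warc-count").getOrCreate()
with open("warc.paths") as f:  # plain-text WARC path listing published per crawl
    paths = [line.strip() for line in f if line.strip()]
counts = spark.sparkContext.parallelize(paths, len(paths)).map(count_responses)
for path, n in counts.take(5):
    print(path, n)
```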
News crawling with StormCrawler - stores content as WARC
A Python utility for downloading Common Crawl data
Price Crawler - Tracking Price Inflation
Demonstration of using Python to process the Common Crawl dataset with the mrjob framework
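To show the mrjob style such a demo relies on (illustrative only, not the repository's code), here is a minimal job that counts hostnames in a plain-text file of URLs; the input format is an assumption.

```python
# Minimal mrjob sketch (illustrative): count hostnames in a plain-text list of
# URLs, e.g. one extracted from a Common Crawl URL index dump.
from urllib.parse import urlparse
from mrjob.job import MRJob

class HostCount(MRJob):
    def mapper(self, _, line):
        # Each input line is assumed to be a URL; emit its hostname.
        host = urlparse(line.strip()).netloc
        if host:
            yield host, 1

    def reducer(self, host, counts):
        yield host, sum(counts)

if __name__ == "__main__":
    HostCount.run()
```

Run locally with `python host_count.py urls.txt`, or point it at a cluster with mrjob's `-r hadoop` / `-r emr` runners.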
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
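For context, the Common Crawl CDX index can also be queried directly over HTTP; the sketch below uses plain requests rather than any particular toolkit, and the crawl ID CC-MAIN-2024-10 is only an example.

```python
# Minimal sketch: query a Common Crawl CDX index over HTTP. Every published
# crawl exposes an endpoint under index.commoncrawl.org; the ID is an example.
import json
import requests

API = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"
params = {"url": "commoncrawl.org/*", "output": "json", "limit": "5"}

resp = requests.get(API, params=params, timeout=30)
for line in resp.text.splitlines():
    rec = json.loads(line)
    # Each line is a JSON record with the WARC filename and byte offset/length.
    print(rec["url"], rec["filename"], rec["offset"], rec["length"])
```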
Paskto - Passive Web Scanner
:spider: The pipeline for the OSCAR corpus
Extract web archive data using Wayback Machine and Common Crawl
Statistics of Common Crawl monthly archives mined from URL index files
Inspired by Google's C4, a series of colossal clean-data cleaning scripts focused on Common Crawl processing, including Chinese data processing and the cleaning methods from MassiveText.
Index Common Crawl archives in tabular format
Tools to construct and process webgraphs from Common Crawl data
A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-testing of frameworks like Apache POI and Apache Tika
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
General-purpose crawler, site-mirroring tool, and full-site downloader (also on [Gitee](https://gitee.com/generals-space/site-mirror-py))
Various Jupyter notebooks about Common Crawl data
CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
Simple multi-threaded tool to extract domain-related data from commoncrawl.org
From [Gitee](https://gitee.com/generals-space/site-mirror-go): a general-purpose crawler, site-mirroring tool, and full-site downloader
A News Article Collection Library
Python tools to retrieve text from Common Crawl WARC files based on a CDX index.
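A minimal sketch of that retrieval pattern (assuming warcio and requests are installed, and that the filename, offset, and length come from a CDX lookup such as the query shown earlier):

```python
# Minimal sketch: fetch a single capture from a Common Crawl WARC file with an
# HTTP Range request and return its payload. filename/offset/length come from
# a CDX index record.
import io
import requests
from warcio.archiveiterator import ArchiveIterator

def fetch_payload(filename, offset, length):
    url = "https://data.commoncrawl.org/" + filename
    end = int(offset) + int(length) - 1
    resp = requests.get(url, headers={"Range": f"bytes={offset}-{end}"})
    for record in ArchiveIterator(io.BytesIO(resp.content)):
        if record.rec_type == "response":
            return record.content_stream().read()
    return None

# Example call, using fields from a CDX record as returned by the index query:
# payload = fetch_payload(rec["filename"], rec["offset"], rec["length"])
```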
super-Django-CC is a simple web interface for commoncrawl.org
Builds a Tantivy index from Common Crawl WET (warc.wet) files
Common Crawl's processing tools
Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.
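In the same spirit, a minimal Python sketch (not the repository's code) that streams one WARC file and prints the target URIs whose payload matches a regular expression; the WARC path is read from the command line, and requests plus warcio are assumed.

```python
# Minimal sketch: "grep" one Common Crawl WARC file. Pass a path from the
# crawl's warc.paths listing on the command line.
import re
import sys
import requests
from warcio.archiveiterator import ArchiveIterator

PATTERN = re.compile(rb"common\s*crawl", re.IGNORECASE)  # example pattern

warc_path = sys.argv[1]
resp = requests.get("https://data.commoncrawl.org/" + warc_path, stream=True)
for record in ArchiveIterator(resp.raw):
    if record.rec_type == "response":
        body = record.content_stream().read()
        if PATTERN.search(body):
            print(record.rec_headers.get_header("WARC-Target-URI"))
```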
Crawls the web to generate a huge dataset for training
This project provides the dataset and model checkpoints for the paper "Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora".
Collected data on the same topic or key phrase, from similar time periods, from three sources: opinion-based social media (Twitter), New York Times articles, and Common Crawl. Processed each of the three datasets with classical big-data methods such as MapReduce on Google Dataproc clusters, then compared the outcomes using visualizations in Tableau.