Repositories under the commoncrawl topic.
news-please - an integrated web crawler and information extractor for news that just works
Process Common Crawl data with Python and Spark
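As a rough illustration of this kind of Spark processing (a minimal sketch, not the repository's code), the snippet below distributes the paths from a crawl's warc.paths listing across executors and streams each WARC file over HTTP with warcio; the file name warc.paths, the record counting, and the installed packages (pyspark, requests, warcio) are assumptions.

```python
# Minimal sketch, not the repository's code: count HTTP 'response' records per
# WARC file by distributing paths from a crawl's warc.paths listing over Spark.
import requests
from pyspark.sql import SparkSession
from warcio.archiveiterator import ArchiveIterator

DATA_PREFIX = "https://data.commoncrawl.org/"

def count_responses(path):
    # Stream one WARC file over HTTP and count its HTTP response records.
    resp = requests.get(DATA_PREFIX + path, stream=True)
    n = sum(1 for rec in ArchiveIterator(resp.raw) if rec.rec_type == "response")
    return path, n

spark = SparkSession.builder.appName("cc-warc-count").getOrCreate()
with open("warc.paths") as f:  # plain-text WARC path listing published per crawl
    paths = [line.strip() for line in f if line.strip()]
counts = spark.sparkContext.parallelize(paths, len(paths)).map(count_responses)
for path, n in counts.take(5):
    print(path, n)
```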
News crawling with StormCrawler - stores content as WARC
A Python utility for downloading Common Crawl data
Price Crawler - Tracking Price Inflation
Demonstration of using Python to process the Common Crawl dataset with the mrjob framework
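To show the mrjob style such a demo relies on (illustrative only, not the repository's code), here is a minimal job that counts hostnames in a plain-text file of URLs; the input format is an assumption.

```python
# Minimal mrjob sketch (illustrative): count hostnames in a plain-text list of
# URLs, e.g. one extracted from a Common Crawl URL index dump.
from urllib.parse import urlparse
from mrjob.job import MRJob

class HostCount(MRJob):
    def mapper(self, _, line):
        # Each input line is assumed to be a URL; emit its hostname.
        host = urlparse(line.strip()).netloc
        if host:
            yield host, 1

    def reducer(self, host, counts):
        yield host, sum(counts)

if __name__ == "__main__":
    HostCount.run()
```

Run locally with `python host_count.py urls.txt`, or point it at a cluster with mrjob's `-r hadoop` / `-r emr` runners.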
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
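For context, the Common Crawl CDX index can also be queried directly over HTTP; the sketch below uses plain requests rather than any particular toolkit, and the crawl ID CC-MAIN-2024-10 is only an example.

```python
# Minimal sketch: query a Common Crawl CDX index over HTTP. Every published
# crawl exposes an endpoint under index.commoncrawl.org; the ID is an example.
import json
import requests

API = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"
params = {"url": "commoncrawl.org/*", "output": "json", "limit": "5"}

resp = requests.get(API, params=params, timeout=30)
for line in resp.text.splitlines():
    rec = json.loads(line)
    # Each line is a JSON record with the WARC filename and byte offset/length.
    print(rec["url"], rec["filename"], rec["offset"], rec["length"])
```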
Paskto - Passive Web Scanner
:spider: The pipeline for the OSCAR corpus
Extract web archive data using Wayback Machine and Common Crawl
Statistics of Common Crawl monthly archives mined from URL index files
Inspired by Google's C4, a series of colossal clean-data cleaning scripts focused on Common Crawl processing, including Chinese data processing and the cleaning methods from MassiveText.
Index Common Crawl archives in tabular format
Tools to construct and process webgraphs from Common Crawl data
A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-testing of frameworks like Apache POI and Apache Tika
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
General-purpose crawler, site-mirroring tool, and full-site downloader (also on [Gitee](https://gitee.com/generals-space/site-mirror-py))
Various Jupyter notebooks about Common Crawl data
CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
Simple multi-threaded tool to extract domain-related data from commoncrawl.org
From [Gitee](https://gitee.com/generals-space/site-mirror-go): a general-purpose crawler, site-mirroring tool, and full-site downloader
A News Article Collection Library
Python tools to retrieve text from Common Crawl WARC files based on a CDX index.
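A minimal sketch of that retrieval pattern (assuming warcio and requests are installed, and that the filename, offset, and length come from a CDX lookup such as the query shown earlier):

```python
# Minimal sketch: fetch a single capture from a Common Crawl WARC file with an
# HTTP Range request and return its payload. filename/offset/length come from
# a CDX index record.
import io
import requests
from warcio.archiveiterator import ArchiveIterator

def fetch_payload(filename, offset, length):
    url = "https://data.commoncrawl.org/" + filename
    end = int(offset) + int(length) - 1
    resp = requests.get(url, headers={"Range": f"bytes={offset}-{end}"})
    for record in ArchiveIterator(io.BytesIO(resp.content)):
        if record.rec_type == "response":
            return record.content_stream().read()
    return None

# Example call, using fields from a CDX record as returned by the index query:
# payload = fetch_payload(rec["filename"], rec["offset"], rec["length"])
```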
super-Django-CC is a simple web interface for commoncrawl.org
Builds a Tantivy index from Common Crawl WET (warc.wet) files
Common Crawl's processing tools
Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.
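In the same spirit, a minimal Python sketch (not the repository's code) that streams one WARC file and prints the target URIs whose payload matches a regular expression; the WARC path is read from the command line, and requests plus warcio are assumed.

```python
# Minimal sketch: "grep" one Common Crawl WARC file. Pass a path from the
# crawl's warc.paths listing on the command line.
import re
import sys
import requests
from warcio.archiveiterator import ArchiveIterator

PATTERN = re.compile(rb"common\s*crawl", re.IGNORECASE)  # example pattern

warc_path = sys.argv[1]
resp = requests.get("https://data.commoncrawl.org/" + warc_path, stream=True)
for record in ArchiveIterator(resp.raw):
    if record.rec_type == "response":
        body = record.content_stream().read()
        if PATTERN.search(body):
            print(record.rec_headers.get_header("WARC-Target-URI"))
```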
Crawls the web to generate a huge dataset for training
This project provides the dataset and model checkpoints for the paper "Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora".
Collected data on the same topic or key phrase, from similar time periods, from three sources: opinion-based social media (Twitter), New York Times articles, and Common Crawl. Processed each of the three datasets with classical big-data methods such as MapReduce on Google Dataproc clusters, then compared the outcomes using visualizations in Tableau.