There are 32 repositories under the common-crawl topic.
Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2/AVX-512-capable chips to accelerate search, sorting, edit distances, alignment scores, and more 🦖
Process Common Crawl data with Python and Spark
News crawling with StormCrawler - stores content as WARC
A Python utility for downloading Common Crawl data
:spider: The pipeline for the OSCAR corpus
Drill into WARC web archives
Statistics of Common Crawl monthly archives mined from URL index files
An asynchronous, concurrent pipeline for classifying Common Crawl data, based on fastText's pipeline.
Tools to construct and process webgraphs from Common Crawl data
Various Jupyter notebooks about Common Crawl data
Small and large German versions of GPT-2.
The website of the OSCAR project
We explore data using Big Data analysis and visualization. The work has three main stages: (i) data aggregation from different sources, (ii) Big Data analysis using MapReduce, and (iii) visualization through Tableau. Data analysis is critical to understanding a dataset and what can be done with it; small datasets are easy to process, but large organizations need Big Data techniques to track trends and decide what changes to make. In this lab we collect close to 20,000 tweets, 500 New York Times articles, and 500 articles from Common Crawl data about Entertainment, our main topic of discussion. After preprocessing, the data is fed to MapReduce jobs that compute Word Count and Word Co-Occurrence, from which we identify trends in the collected data. Data analysis was performed in Python.
Parses huge Web Archive (WARC) files from the Common Crawl data index to fetch any required domain's data concurrently with Python and Scrapy.
Various Common Crawl utilities in Clojure.
Distributed download scripts for Common Crawl data
Common Crawl's processing tools
Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.
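As a rough illustration of what "grepping" WARC files means, here is a minimal Python sketch (not code from any of the repositories above) that scans a tiny hand-made WARC-style byte stream for a pattern; the records are illustrative, not real crawl data:

```python
import re

# Hand-made, illustrative WARC-style bytes (not real crawl data): each record
# starts with a "WARC/1.0" version line, followed by headers and a body.
warc_bytes = (
    b"WARC/1.0\r\nWARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/a\r\n\r\n"
    b"<html>common crawl</html>\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/b\r\n\r\n"
    b"<html>nothing here</html>\r\n\r\n"
)

def grep_records(stream: bytes, pattern: bytes):
    """Yield the Target-URI of each record that matches `pattern`.

    Splitting on the version line is a naive record boundary -- fine for a
    sketch, but real readers honor the Content-Length header instead.
    """
    for chunk in stream.split(b"WARC/1.0"):
        uri = re.search(rb"WARC-Target-URI: (\S+)", chunk)
        if uri and re.search(pattern, chunk):
            yield uri.group(1)

hits = list(grep_records(warc_bytes, rb"common crawl"))
```

Only the first record matches, so `hits` holds a single URI. Production code should use a proper WARC parser rather than this split-based heuristic.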
This library gets URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl. Inspired by Corben Leo's gau.
Perform big data analysis on the New York Times, Twitter, and Common Crawl APIs
A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/
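The index API above is queried per crawl collection and returns JSON lines. A minimal sketch of building such a query URL in Python (the collection id "CC-MAIN-2023-50" and the URL pattern are illustrative assumptions, not taken from the tool above):

```python
from urllib.parse import urlencode

# Base endpoint of the CDX index API (from the description above).
INDEX_HOST = "http://index.commoncrawl.org"

def build_index_query(collection: str, url_pattern: str) -> str:
    """Build an index query URL for one crawl collection.

    `collection` is a crawl id such as "CC-MAIN-2023-50" (assumed here
    purely for illustration); output=json requests JSON-lines results.
    """
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{collection}-index?{params}"

query = build_index_query("CC-MAIN-2023-50", "example.com/*")
print(query)  # prints the query URL with percent-encoded parameters
```

Fetching `query` with any HTTP client then yields one JSON object per matching capture, which is what the command-line tool wraps.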
A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types for mass-testing frameworks like Apache POI and Apache Tika
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
A Common Crawl client example for scraping specific websites.
This library is a very lightweight client to Common Crawl's WARC files.
An application that crawls the Common Crawl corpus for URLs with the specified file extensions.
Discourse marker identification in the French language
Eventually a search engine, but currently a filtering pipeline for HTML and soon WARC files.
This repository contains MapReduce extractors to preprocess and extract websites from the Common Crawl corpus.
Parsing the Common Crawl database using Scala and Spark
Analyzing Common Crawl data (specifically) to classify fake/real based on trained deep learning models (LSTM, CNN)
An ES6 class to read a .warc or .warc.gz file member by member in Node.js
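"Member by member" reading works because .warc.gz files are conventionally written as a concatenation of independent gzip members, one per WARC record. A minimal stdlib Python sketch of that idea (not the repo's Node.js API; the record payloads are made up for illustration):

```python
import gzip
import zlib

def iter_gzip_members(raw: bytes):
    """Yield each gzip member's decompressed payload from a concatenated stream."""
    while raw:
        d = zlib.decompressobj(wbits=31)  # 31 selects the gzip container format
        payload = d.decompress(raw)
        if not d.eof:
            raise ValueError("truncated gzip member")
        yield payload
        raw = d.unused_data  # whatever follows this member's trailer

# Two independent members, concatenated -- the shape of a record-per-member .warc.gz
stream = gzip.compress(b"WARC/1.0 record one\r\n") + gzip.compress(b"WARC/1.0 record two\r\n")
members = list(iter_gzip_members(stream))
```

Because each member is a complete gzip stream, a reader can decompress one record at a time (or seek to an offset from the index and decompress a single record) without touching the rest of the archive.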