There are 32 repositories under the common-crawl topic.
Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2/AVX-512-capable chips to accelerate search, sorting, edit distances, alignment scores, and more 🦖
Process Common Crawl data with Python and Spark
News crawling with StormCrawler - stores content as WARC
A Python utility for downloading Common Crawl data
:spider: The pipeline for the OSCAR corpus
Drill into WARC web archives
Statistics of Common Crawl monthly archives mined from URL index files
An asynchronous, concurrent pipeline for classifying Common Crawl data, based on fastText's pipeline.
Tools to construct and process webgraphs from Common Crawl data
Various Jupyter notebooks about Common Crawl data
Small and large German versions of GPT-2.
The website of the OSCAR project
We explore data using Big Data analysis and visualization. The work has three main stages: (i) data aggregation from different sources, (ii) Big Data analysis using MapReduce, and (iii) visualization through Tableau. Data analysis is critical to understanding a dataset and what can be done with it; small datasets are easy to process, but large organizations need Big Data techniques to track trends and decide what changes to make. In this lab we collect close to 20,000 tweets, 500 New York Times articles, and 500 articles from Common Crawl data about Entertainment, our main topic of discussion. After preprocessing, the data is fed to MapReduce jobs that compute Word Count and Word Co-Occurrence, from which we identify trends in the collected data. Data analysis was performed in Python.
Parses huge Web Archive (WARC) files from the Common Crawl data index to fetch any required domain's data concurrently with Python and Scrapy.
Various Common Crawl utilities in Clojure.
Distributed download scripts for Common Crawl data
Common Crawl's processing tools
Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.
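As a rough illustration of what "grepping" WARC files means, here is a minimal Python sketch (not code from any of the repositories above) that scans a tiny hand-made WARC-style byte stream for a pattern; the records are illustrative, not real crawl data:

```python
import re

# Hand-made, illustrative WARC-style bytes (not real crawl data): each record
# starts with a "WARC/1.0" version line, followed by headers and a body.
warc_bytes = (
    b"WARC/1.0\r\nWARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/a\r\n\r\n"
    b"<html>common crawl</html>\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/b\r\n\r\n"
    b"<html>nothing here</html>\r\n\r\n"
)

def grep_records(stream: bytes, pattern: bytes):
    """Yield the Target-URI of each record that matches `pattern`.

    Splitting on the version line is a naive record boundary -- fine for a
    sketch, but real readers honor the Content-Length header instead.
    """
    for chunk in stream.split(b"WARC/1.0"):
        uri = re.search(rb"WARC-Target-URI: (\S+)", chunk)
        if uri and re.search(pattern, chunk):
            yield uri.group(1)

hits = list(grep_records(warc_bytes, rb"common crawl"))
```

Only the first record matches, so `hits` holds a single URI. Production code should use a proper WARC parser rather than this split-based heuristic.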
This library gets URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl. Inspired by Corben Leo's gau.
Perform big data analysis on the New York Times, Twitter, and Common Crawl APIs
A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/
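The index API above is queried per crawl collection and returns JSON lines. A minimal sketch of building such a query URL in Python (the collection id "CC-MAIN-2023-50" and the URL pattern are illustrative assumptions, not taken from the tool above):

```python
from urllib.parse import urlencode

# Base endpoint of the CDX index API (from the description above).
INDEX_HOST = "http://index.commoncrawl.org"

def build_index_query(collection: str, url_pattern: str) -> str:
    """Build an index query URL for one crawl collection.

    `collection` is a crawl id such as "CC-MAIN-2023-50" (assumed here
    purely for illustration); output=json requests JSON-lines results.
    """
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{collection}-index?{params}"

query = build_index_query("CC-MAIN-2023-50", "example.com/*")
print(query)  # prints the query URL with percent-encoded parameters
```

Fetching `query` with any HTTP client then yields one JSON object per matching capture, which is what the command-line tool wraps.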
A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types for mass-testing frameworks like Apache POI and Apache Tika
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
A Common Crawl client example for scraping specific websites.
This library is a very lightweight client to Common Crawl's WARC files.
An application that crawls the Common Crawl corpus for URLs with the specified file extensions.
Discourse marker identification in the French language
Eventually a search engine, but currently a filtering pipeline for HTML and soon WARC files.
This repository contains MapReduce extractors to preprocess and extract websites from the Common Crawl corpus.
Parsing the Common Crawl database using Scala and Spark
Analyzing Common Crawl data (specifically) to classify fake/real based on trained deep learning models (LSTM, CNN)
An ES6 class to read a .warc or .warc.gz file member by member in Node.js
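"Member by member" reading works because .warc.gz files are conventionally written as a concatenation of independent gzip members, one per WARC record. A minimal stdlib Python sketch of that idea (not the repo's Node.js API; the record payloads are made up for illustration):

```python
import gzip
import zlib

def iter_gzip_members(raw: bytes):
    """Yield each gzip member's decompressed payload from a concatenated stream."""
    while raw:
        d = zlib.decompressobj(wbits=31)  # 31 selects the gzip container format
        payload = d.decompress(raw)
        if not d.eof:
            raise ValueError("truncated gzip member")
        yield payload
        raw = d.unused_data  # whatever follows this member's trailer

# Two independent members, concatenated -- the shape of a record-per-member .warc.gz
stream = gzip.compress(b"WARC/1.0 record one\r\n") + gzip.compress(b"WARC/1.0 record two\r\n")
members = list(iter_gzip_members(stream))
```

Because each member is a complete gzip stream, a reader can decompress one record at a time (or seek to an offset from the index and decompress a single record) without touching the rest of the archive.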