Sebastian Nagel's repositories
warc-crawler
Process web archives (WARC format) with StormCrawler and index content into Elasticsearch or Solr
docker-hadoop
Apache Hadoop Docker image
browsertrix-crawler
Run a high-fidelity browser-based crawler in a single Docker container
news-please
An integrated web crawler and information extractor for news that just works.
storm-crawler
Web crawler SDK based on Apache Storm
cc2dataset
Easily convert Common Crawl to a dataset of captions and documents: image/text, audio/text, video/text, ...
crawler-commons
A set of reusable Java components that implement functionality common to any web crawler
data_tooling
Tools for managing datasets for governance and training.
duckdb-web
DuckDB-Web - Source code of duckdb.org
impf-botpy
Impf Bot.py 🐍⚡ – Automation for the Corona ImpfterminService bot
ossym2022-robotstxt-experiments
Experiments and metrics about robots.txt captures, presentation at #ossym2022
sfm-docker
Docker support for Social Feed Manager.
sfm-twitter-harvester
A harvester for Twitter content as part of Social Feed Manager.
twarc-csv
A plugin for twarc2 for converting tweet JSON into DataFrames and exporting to CSV.
wdc-page
This repository contains the source files of the Web Data Commons website and is used to maintain the site. The Web Data Commons project extracts structured data from the Common Crawl.