sebastian-nagel

@commoncrawl

Konstanz, Germany

https://de.linkedin.com/pub/sebastian-nagel/35/320/8b4

Sebastian Nagel's repositories

warc-crawler

Process web archives (WARC format) with StormCrawler and index content into Elasticsearch or Solr

Language: FLUX; 8 40

nutch-test-single-node-cluster

Language: Shell; License: Apache-2.0; 3 20

docker-hadoop

Apache Hadoop docker image

Language: Shell; 2 10

introduction-to-python

Language: Jupyter Notebook; License: Apache-2.0; 2 20

nutch

Mirror of Apache Nutch

Language: Java; License: Apache-2.0; 2 30

browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container

Language: JavaScript; License: AGPL-3.0; 1 10

news-please

news-please - an integrated web crawler and information extractor for news that just works.

Language: Python; License: Apache-2.0; 1 10

pywb

Python WayBack for web archive replay and url-rewriting HTTP/S web proxy

Language: Python; License: GPL-3.0; 1 20

storm-crawler

Web crawler SDK based on Apache Storm

Language: HTML; License: Apache-2.0; 1 20

cc2dataset

Easily convert Common Crawl to a dataset of captions and documents. Image/text, audio/text, video/text, ...

Language: Python; License: MIT; 000

crawler-commons

A set of reusable Java components that implement functionality common to any web crawler

Language: Java; License: Apache-2.0; 010

data_tooling

Tools for managing datasets for governance and training.

Language: HTML; License: Apache-2.0; 000

duckdb-web

DuckDB-Web - Source code of duckdb.org

Language: JavaScript; 000

impf-botpy

Impf Bot.py 🐍⚡: automation for the Corona ImpfterminService bot

Language: Python; 010

jwarc

Java library for reading and writing WARC files with a typed API

Language: Java; License: Apache-2.0; 010

ossym2022-robotstxt-experiments

Experiments and metrics about robots.txt captures, presentation at #ossym2022

Language: Jupyter Notebook; License: MIT; 010

sfm-docker

Docker support for Social Feed Manager.

Language: Shell; License: MIT; 010

sfm-facebook-harvester

Language: Python; 010

sfm-instagram-harvester

Language: Python; 010

sfm-twitter-harvester

A harvester for Twitter content as part of Social Feed Manager.

Language: Python; License: MIT; 010

sfm-ui

Social Feed Manager user interface application.

Language: Python; License: MIT; 010

sfm-utils

Utilities to support Social Feed Manager

Language: Python; License: MIT; 010

sfm-web-harvester-browsertrix

Language: Python; 020

sitemap-performance-test

Language: Java; License: Apache-2.0; 020

suffix_cat

Language: Python; 000

tika

Mirror of Apache Tika

Language: Java; License: Apache-2.0; 020

twarc-csv

A plugin for twarc2 for converting tweet JSON into DataFrames and exporting to CSV.

Language: Python; License: MIT; 000

uap-core

The regex file necessary to build language ports of Browserscope's user agent parser.

Language: JavaScript; License: NOASSERTION; 020

warcio

Streaming WARC/ARC library for fast web archive IO

Language: Python; License: Apache-2.0; 020

wdc-page

This repository contains the source files of the Web Data Commons website and is used to maintain the site. The Web Data Commons project extracts structured data from the Common Crawl.

Language: HTML; 000