Sebastian Nagel (sebastian-nagel)

Company: @commoncrawl

Location: Konstanz, Germany

Home Page: https://de.linkedin.com/pub/sebastian-nagel/35/320/8b4

Sebastian Nagel's repositories

warc-crawler

Process web archives (WARC format) with StormCrawler and index content into Elasticsearch or Solr

Language: FLUX | Stargazers: 8 | Issues: 4 | Issues: 0
Language: Shell | License: Apache-2.0 | Stargazers: 3 | Issues: 2 | Issues: 0

docker-hadoop

Apache Hadoop docker image

Language: Shell | Stargazers: 2 | Issues: 1 | Issues: 0
Language: Jupyter Notebook | License: Apache-2.0 | Stargazers: 2 | Issues: 2 | Issues: 0

nutch

Mirror of Apache Nutch

Language: Java | License: Apache-2.0 | Stargazers: 2 | Issues: 3 | Issues: 0

browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container

Language: JavaScript | License: AGPL-3.0 | Stargazers: 1 | Issues: 1 | Issues: 0

news-please

news-please - an integrated web crawler and information extractor for news that just works.

Language: Python | License: Apache-2.0 | Stargazers: 1 | Issues: 1 | Issues: 0

pywb

Python WayBack for web archive replay and url-rewriting HTTP/S web proxy

Language: Python | License: GPL-3.0 | Stargazers: 1 | Issues: 2 | Issues: 0

storm-crawler

Web crawler SDK based on Apache Storm

Language: HTML | License: Apache-2.0 | Stargazers: 1 | Issues: 2 | Issues: 0

cc2dataset

Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...

Language: Python | License: MIT | Stargazers: 0 | Issues: 0 | Issues: 0

crawler-commons

A set of reusable Java components that implement functionality common to any web crawler

Language: Java | License: Apache-2.0 | Stargazers: 0 | Issues: 1 | Issues: 0

data_tooling

Tools for managing datasets for governance and training.

Language: HTML | License: Apache-2.0 | Stargazers: 0 | Issues: 0 | Issues: 0

duckdb-web

DuckDB-Web - Source code of duckdb.org

Language: JavaScript | Stargazers: 0 | Issues: 0 | Issues: 0

impf-botpy

Impf Bot.py 🐍⚡: automation for the Corona ImpfterminService (vaccination appointment service) bot

Language: Python | Stargazers: 0 | Issues: 1 | Issues: 0

jwarc

Java library for reading and writing WARC files with a typed API

Language: Java | License: Apache-2.0 | Stargazers: 0 | Issues: 1 | Issues: 0

ossym2022-robotstxt-experiments

Experiments and metrics about robots.txt captures, presentation at #ossym2022

Language: Jupyter Notebook | License: MIT | Stargazers: 0 | Issues: 1 | Issues: 0

sfm-docker

Docker support for Social Feed Manager.

Language: Shell | License: MIT | Stargazers: 0 | Issues: 1 | Issues: 0
Language: Python | Stargazers: 0 | Issues: 1 | Issues: 0
Language: Python | Stargazers: 0 | Issues: 1 | Issues: 0

sfm-twitter-harvester

A harvester for twitter content as part of Social Feed Manager.

Language: Python | License: MIT | Stargazers: 0 | Issues: 1 | Issues: 0

sfm-ui

Social Feed Manager user interface application.

Language: Python | License: MIT | Stargazers: 0 | Issues: 1 | Issues: 0

sfm-utils

Utilities to support Social Feed Manager

Language: Python | License: MIT | Stargazers: 0 | Issues: 1 | Issues: 0
Language: Python | Stargazers: 0 | Issues: 2 | Issues: 0
Language: Java | License: Apache-2.0 | Stargazers: 0 | Issues: 2 | Issues: 0
Language: Python | Stargazers: 0 | Issues: 0 | Issues: 0

tika

Mirror of Apache Tika

Language: Java | License: Apache-2.0 | Stargazers: 0 | Issues: 2 | Issues: 0

twarc-csv

A plugin for twarc2 for converting tweet JSON into DataFrames and exporting to CSV.

Language: Python | License: MIT | Stargazers: 0 | Issues: 0 | Issues: 0

uap-core

The regex file necessary to build language ports of Browserscope's user agent parser.

Language: JavaScript | License: NOASSERTION | Stargazers: 0 | Issues: 2 | Issues: 0

warcio

Streaming WARC/ARC library for fast web archive IO

Language: Python | License: Apache-2.0 | Stargazers: 0 | Issues: 2 | Issues: 0

wdc-page

This repository contains the source files of the Web Data Commons website and is used to maintain the site. The Web Data Commons project extracts structured data from the Common Crawl

Language: HTML | Stargazers: 0 | Issues: 0 | Issues: 0