adbar

Adrien Barbaresi's repositories

trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Language: Python | License: Apache-2.0 | 3761 30 391

German-NLP

Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German

454 45 5

simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Language: Python | License: MIT | 147 7 69

courlan

Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters

Language: Python | License: Apache-2.0 | 127 3 32

htmldate

Fast and robust date extraction from web pages, with Python or on the command-line

Language: Python | License: Apache-2.0 | 122 5 58

py3langid

Faster, modernized fork of the language identification tool langid.py

Language: Python | License: NOASSERTION | 49 2 4

geokelone

Integrates spatial and textual data processing tools into a modular software package featuring preprocessing, geocoding, disambiguation and visualization

Language: Python | License: GPL-3.0 | 5 40

german-reddit

Extraction of a German Reddit Corpus

Language: Python | License: MIT | 4 2 1

awesome-crawler

A collection of awesome web crawlers and spiders in different languages

License: MIT | 2 20

flux-toolchain

Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain

Language: Perl | 2 30

tweets-tools

Diverse tools used with Twitter data

Language: Python | License: MIT | 2 30

coronakorpus

Material for building a German-language COVID-19 web corpus

License: NOASSERTION | 1 40

jlcl-style

Experiments to modernize the LaTeX class of the JLCL

Language: TeX | 1 3 1

microblog-explorer

Perform crawls of social networks (identi.ca, reddit, friendfeed) to gather internal and external links and identify their language

Language: Python | 1 30

toponyms

Old prototype for toponym extraction in historical texts written in German

License: GPL-3.0 | 1 30

url-compressor

Fast pattern-based URL compression for lists of links

Language: Pascal | 1 20

vardial-experiments

Experiments conducted on the occasion of the VarDial shared tasks

Language: Python | License: GPL-3.0 | 1 20

zeitcrawler

Automatically exported from code.google.com/p/zeitcrawler

Language: Java | License: GPL-3.0 | 1 20

adbar

030

awesome-digital-humanities

Software for humanities scholars using quantitative or computational methods.

Language: HTML | License: CC0-1.0 | 000

awesome-web-scraping

List of libraries, tools and APIs for web scraping and data processing.

Language: Makefile | License: NOASSERTION | 000

btw21

Visualization of the most frequent words in the German federal election in 2021

Language: Jupyter Notebook | License: MIT | 010

corpus-visualizer

Explore, visualize and publish corpora as CSS/XHTML documents

Language: CSS | 020

equipe-crawler

Automatically exported from code.google.com/p/equipe-crawler

Language: Perl | 020

gps-corpus-builder

Automatically exported from code.google.com/p/gps-corpus-builder

Language: Perl | 020

jparser

A readability parser that extracts the title, content and images from HTML pages

Language: Python | License: MIT | 020

laclos

LAnguage-CLassified OpenSubtitles

Language: Python | License: LGPL-3.0 | 020

valency-oriented-chunker

A one-pass FSA valency-oriented chunker for German (proof of concept)

Language: Perl | License: LGPL-3.0 | 020