Adrien Barbaresi's repositories
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
German-NLP
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
german-reddit
Extraction of a German Reddit Corpus
awesome-crawler
A collection of awesome web crawlers and spiders in different languages
awesome-web-scraper
A collection of awesome web scrapers and crawlers
flux-toolchain
Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain
tweets-tools
Diverse tools used with Twitter data
coronakorpus
Material for building a German-language COVID-19 web corpus
jlcl-style
Experiments to modernize the LaTeX class of the JLCL
vardial-experiments
Experiments conducted on the occasion of the VarDial shared tasks
datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
dateparser
Python parser for human-readable dates
python-readability
Fast Python port of arc90's readability tool, updated to match the latest readability.js
valency-oriented-chunker
A one-pass FSA valency-oriented chunker for German (proof of concept)