Adrien Barbaresi's repositories
trafilatura
Python and command-line tool to gather text and metadata on the Web: crawling, scraping, extraction, and output as CSV, JSON, HTML, MD, TXT, XML
German-NLP
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
german-reddit
Extraction of a German Reddit Corpus
awesome-crawler
A collection of awesome web crawlers and spiders in different languages
flux-toolchain
Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain
tweets-tools
Diverse tools used with Twitter data
coronakorpus
Material for building a German-language web corpus dedicated to COVID-19
jlcl-style
Experiments to modernize the LaTeX class of the JLCL
microblog-explorer
Crawl social networks (identi.ca, Reddit, FriendFeed) to gather internal and external links and identify their language
url-compressor
Fast pattern-based URL compression for lists of links
vardial-experiments
Experiments conducted on the occasion of the VarDial shared tasks
zeitcrawler
Automatically exported from code.google.com/p/zeitcrawler
awesome-digital-humanities
Software for humanities scholars using quantitative or computational methods.
awesome-web-scraping
List of libraries, tools and APIs for web scraping and data processing.
corpus-visualizer
Explore, visualize and publish corpora as CSS/XHTML documents
equipe-crawler
Automatically exported from code.google.com/p/equipe-crawler
gps-corpus-builder
Automatically exported from code.google.com/p/gps-corpus-builder
valency-oriented-chunker
A one-pass FSA valency-oriented chunker for German (proof of concept)