Adrien Barbaresi (adbar)

Company: Berlin-Brandenburg Academy of Sciences (BBAW)

Location: Berlin

Home Page: adrien.barbaresi.eu

Twitter: @adbarbaresi

Organizations
deutschestextarchiv
zentrum-lexikographie

Adrien Barbaresi's repositories

trafilatura

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Language: Python | License: Apache-2.0 | Stargazers: 2776 | Issues: 28 | Issues: 307

German-NLP

Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German

simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Language: Python | License: MIT | Stargazers: 125 | Issues: 5 | Issues: 56

htmldate

Fast and robust date extraction from web pages, with Python or on the command-line

Language: Python | License: Apache-2.0 | Stargazers: 107 | Issues: 5 | Issues: 49

courlan

Clean, filter and sample URLs to optimize data collection – includes spam, content type and language filters

Language: Python | License: Apache-2.0 | Stargazers: 65 | Issues: 3 | Issues: 24

py3langid

Faster, modernized fork of the language identification tool langid.py

Language: Python | License: NOASSERTION | Stargazers: 36 | Issues: 1 | Issues: 1

geokelone

Integrates spatial and textual data processing tools into a modular software package featuring preprocessing, geocoding, disambiguation and visualization

Language: Python | License: GPL-3.0 | Stargazers: 5 | Issues: 4 | Issues: 0

german-reddit

Extraction of a German Reddit Corpus

Language: Python | License: MIT | Stargazers: 3 | Issues: 2 | Issues: 1

awesome-crawler

A collection of awesome web crawlers and spiders in different languages

License: MIT | Stargazers: 2 | Issues: 2 | Issues: 0

awesome-web-scraper

A collection of awesome web scrapers and crawlers.

License: MIT | Stargazers: 2 | Issues: 2 | Issues: 0

flux-toolchain

Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain

Language: Perl | Stargazers: 2 | Issues: 3 | Issues: 0

tweets-tools

Diverse tools used with Twitter data

Language: Python | License: MIT | Stargazers: 2 | Issues: 3 | Issues: 0

coronakorpus

Material for building a German-language COVID-19 web corpus

License: NOASSERTION | Stargazers: 1 | Issues: 4 | Issues: 0

jlcl-style

Experiments to modernize the LaTeX class of the JLCL

toponyms

Old prototype for toponym extraction in historical texts written in German

License: GPL-3.0 | Stargazers: 1 | Issues: 3 | Issues: 0
Language: Python | License: GPL-3.0 | Stargazers: 1 | Issues: 2 | Issues: 0

vardial-experiments

Experiments conducted on the occasion of the VarDial shared tasks

Language: Python | License: GPL-3.0 | Stargazers: 1 | Issues: 2 | Issues: 0
Stargazers: 0 | Issues: 3 | Issues: 0

archiveis

A simple Python wrapper for the archive.is capturing service

Language: Python | License: MIT | Stargazers: 0 | Issues: 2 | Issues: 0

btw21

Visualization of the most frequent words in the German federal election in 2021

Language: Jupyter Notebook | License: MIT | Stargazers: 0 | Issues: 1 | Issues: 0

cChardet

Universal character encoding detector

Language: Python | License: NOASSERTION | Stargazers: 0 | Issues: 2 | Issues: 0

datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Language: Python | License: Apache-2.0 | Stargazers: 0 | Issues: 0 | Issues: 0

dateparser

Python parser for human-readable dates

Language: Python | License: BSD-3-Clause | Stargazers: 0 | Issues: 3 | Issues: 0

dwdsmor

SFST/SMOR/DWDS-based German Morphology

Language: XSLT | License: LGPL-3.0 | Stargazers: 0 | Issues: 1 | Issues: 0

jparser

A readability parser that can extract the title, content and images from HTML pages

Language: Python | License: MIT | Stargazers: 0 | Issues: 2 | Issues: 0

jusText

Heuristic-based boilerplate removal tool

Language: Python | License: BSD-2-Clause | Stargazers: 0 | Issues: 1 | Issues: 0

python-readability

Fast Python port of arc90's readability tool, updated to match the latest readability.js

Language: HTML | Stargazers: 0 | Issues: 2 | Issues: 0
Language: Python | License: GPL-3.0 | Stargazers: 0 | Issues: 2 | Issues: 1

valency-oriented-chunker

A one-pass FSA valency-oriented chunker for German (proof of concept)

Language: Perl | License: LGPL-3.0 | Stargazers: 0 | Issues: 2 | Issues: 0
Language: Python | License: MIT | Stargazers: 0 | Issues: 1 | Issues: 0