Adrien Barbaresi (adbar)

Company:Berlin-Brandenburg Academy of Sciences (BBAW)

Location:Berlin

Home Page:adrien.barbaresi.eu

Twitter:@adbarbaresi


Organizations
deutschestextarchiv
zentrum-lexikographie

Adrien Barbaresi's repositories

trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Language:Python, License:Apache-2.0, Stargazers:3761, Issues:30, Issues:391

German-NLP

Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German

simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Language:Python, License:MIT, Stargazers:147, Issues:7, Issues:69

courlan

Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters

Language:Python, License:Apache-2.0, Stargazers:127, Issues:3, Issues:32

htmldate

Fast and robust date extraction from web pages, with Python or on the command-line

Language:Python, License:Apache-2.0, Stargazers:122, Issues:5, Issues:58

py3langid

Faster, modernized fork of the language identification tool langid.py

Language:Python, License:NOASSERTION, Stargazers:49, Issues:2, Issues:4

geokelone

Integrates spatial and textual data processing tools into a modular software package featuring preprocessing, geocoding, disambiguation and visualization

Language:Python, License:GPL-3.0, Stargazers:5, Issues:4, Issues:0

german-reddit

Extraction of a German Reddit Corpus

Language:Python, License:MIT, Stargazers:4, Issues:2, Issues:1

awesome-crawler

A collection of awesome web crawlers and spiders in different languages

License:MIT, Stargazers:2, Issues:2, Issues:0

flux-toolchain

Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain

Language:Perl, Stargazers:2, Issues:3, Issues:0

tweets-tools

Diverse tools used with Twitter data

Language:Python, License:MIT, Stargazers:2, Issues:3, Issues:0

coronakorpus

Material for building a German-language COVID-19 web corpus

License:NOASSERTION, Stargazers:1, Issues:4, Issues:0

jlcl-style

Experiments to modernize the LaTeX class of the JLCL

microblog-explorer

Perform crawls of social networks (identi.ca, reddit, friendfeed) to gather internal and external links and identify their language

Language:Python, Stargazers:1, Issues:3, Issues:0

toponyms

Old prototype for toponym extraction in historical texts written in German

License:GPL-3.0, Stargazers:1, Issues:3, Issues:0

url-compressor

Fast pattern-based URL compression for lists of links

Language:Pascal, Stargazers:1, Issues:2, Issues:0

vardial-experiments

Experiments conducted on the occasion of the VarDial shared tasks

Language:Python, License:GPL-3.0, Stargazers:1, Issues:2, Issues:0

zeitcrawler

Automatically exported from code.google.com/p/zeitcrawler

Language:Java, License:GPL-3.0, Stargazers:1, Issues:2, Issues:0
Stargazers:0, Issues:3, Issues:0

awesome-digital-humanities

Software for humanities scholars using quantitative or computational methods.

Language:HTML, License:CC0-1.0, Stargazers:0, Issues:0, Issues:0

awesome-web-scraping

List of libraries, tools and APIs for web scraping and data processing.

Language:Makefile, License:NOASSERTION, Stargazers:0, Issues:0, Issues:0

btw21

Visualization of the most frequent words in the German federal election in 2021

Language:Jupyter Notebook, License:MIT, Stargazers:0, Issues:1, Issues:0

corpus-visualizer

Explore, visualize and publish corpora as CSS/XHTML documents

Language:CSS, Stargazers:0, Issues:2, Issues:0

equipe-crawler

Automatically exported from code.google.com/p/equipe-crawler

Language:Perl, Stargazers:0, Issues:2, Issues:0

gps-corpus-builder

Automatically exported from code.google.com/p/gps-corpus-builder

Language:Perl, Stargazers:0, Issues:2, Issues:0

jparser

A readability parser that can extract the title, content and images from HTML pages

Language:Python, License:MIT, Stargazers:0, Issues:2, Issues:0

laclos

LAnguage-CLassified OpenSubtitles

Language:Python, License:LGPL-3.0, Stargazers:0, Issues:2, Issues:0

valency-oriented-chunker

A one-pass FSA valency-oriented chunker for German (proof of concept)

Language:Perl, License:LGPL-3.0, Stargazers:0, Issues:2, Issues:0