There are 19 repositories under the corpus-linguistics topic.
A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
A list of Indonesian NLP resources.
A web-based engine for creating and annotating textual corpora
Data resources for Indonesian NLP
A curated list of NLP resources for Hungarian
Crawler for linguistic corpora
The pipeline for the OSCAR corpus
Kanji usage frequency data collected from various sources
Data for the quantitative study of (Vedic) Sanskrit
An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.
A textual corpus database for the digital humanities.
Quran, Hadith, Translations, Tafaseer, Corpus Linguistics. Everything for NLP
A set of workflows for corpus building through OCR, post-correction and normalisation
My solutions to selected exercises from "Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit" by Steven Bird, Ewan Klein, and Edward Loper.
Amharic-English machine translation corpus prepared through website crawling and custom preprocessing.
Rezonator: Dynamics of human engagement
A large, high-quality corpus of Chinese synonyms.
CoNLL-U to pandas DataFrame
The Data Format for Digital Linguistics (DaFoDiL)
Corpus linguistics has never been so easy...
Table compiling the list of biomedically-related corpora available for named entity recognition (and some also suitable for association detection). The first version was published as part of the paper: Dieter Galea, Ivan Laponogov, Kirill Veselkov; Exploiting and assessing multi-source data for supervised biomedical named entity recognition, Bioinformatics, bty152, https://doi.org/10.1093/bioinformatics/bty152 . If you would like to add other (or your) corpora, please submit a pull request and I'll happily approve it.
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts
A corpus of the Christmas speeches delivered by the head of state of Spain from 1937 to 2021
An Interactive Tool for Annotating Discourse Structure and Text Improvement