There are 3 repositories under the corpus-tools topic.
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
OpusFilter - Parallel corpus processing toolkit
Utilities for Processing the Switchboard Dialogue Act Corpus
Multi-Language Dataset Cleaner/Creator for Mozilla's DeepSpeech Framework
A set of workflows for corpus building through OCR, post-correction and normalisation
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
Python library for extracting quantitative, reproducible metrics of multi-level alignment between two speakers in naturalistic language corpora.
Rezonator: Dynamics of human engagement
Utilities for Processing the Meeting Recorder Dialogue Act Corpus
Collector and speech cutter for librivox audiobooks
A library of functions enabling complex corpus search in context (KWIC), search aggregation, bag-of-words building & keyphrase extraction.
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and Billboard charts
Searching in-memory corpus with Corpus Query Language (CQL)
An Interactive Tool for Annotating Discourse Structure and Text Improvement
Script that sets up and configures an entire CQPweb server installation
Measure the similarity of text corpora for 74 languages
Python library for using the Korp API
Scripts for building a geo-located web corpus using Common Crawl data
Utilities for Processing the HCRC Map Task Corpus