There is 1 repository under the corpus-processing topic.
Python scripts for preprocessing the Penn Treebank and Chinese Treebank
OpusFilter - Parallel corpus processing toolkit
Utilities for Processing the Switchboard Dialogue Act Corpus
Utilities for Processing the Meeting Recorder Dialogue Act Corpus
Simple collocation-driven rhyme recognition. Contains pre-trained models for Czech, Dutch, English, French, German, Russian, and Spanish poetry
A library of functions enabling complex corpus search in context (KWIC), search aggregation, bag-of-words building & keyphrase extraction.
Corpus linguistics has never been so easy...
Hard-Forked from JuliaText/TextAnalysis.jl
Measure the similarity of text corpora for 74 languages
A set of corpus-based sampling & analysis M4L devices
Scripts for building a geo-located web corpus using Common Crawl data
Plotly-Dash NLP project. Measures document similarity using Latent Dirichlet Allocation and principal component analysis, followed by KMeans clustering. The project includes dynamic visual interaction.
Script that sets up and configures an entire CQPweb server installation
A processor for KyotoCorpus, KWDLC, and AnnotatedFKCCorpus
Utilities for Processing the HCRC Map Task Corpus
Corpus processing library
General Missives in Text-Fabric
uniblock, scoring and filtering corpus with Unicode block information (and more).
Paper that Giuseppe Samo and I are working on as part of my SNSF-funded 'Focus in diachrony' research project at the University of Cambridge, UK.
Minimal HTK for supporting HTK in Vietnamese.
N-gram language model that learns n-gram probabilities from a given corpus and generates new sentences based on the conditional probabilities of the previously generated words and phrases.
Corpus processing library
Repository providing preprocessed Wikipedia and Simple Wikipedia datasets, along with Python scripts for preprocessing and dataset generation.
Frequency List Wizard is a command-line program that does various useful things with... frequency lists.
Corpus processing library
Mozilla Firefox places.sqlite tables exported to XML files. A Bash script.
A basic search engine that indexes a corpus for searching and ranks the documents in the data set.
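To illustrate the kind of indexing and ranking the last entry describes, here is a minimal sketch of an inverted index with TF-IDF scoring. It is not taken from any listed repository: the function names `build_index` and `search`, the whitespace tokenization, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Return matching doc ids ranked by summed TF-IDF score, best first."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the corpus
        # A term found in every document gets idf = 0 and adds nothing.
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Ties are broken by doc id so the ordering is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "corpus processing with python",
]
index = build_index(docs)
results = search(index, len(docs), "cat mat")
```

Here `results` lists the first document ahead of the second, since it matches both query terms and "mat" is the rarer term. A real engine would typically add proper tokenization, length normalization, and persistent storage.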