text-processing

There are 27 repositories under the text-processing topic.

learnbyexample / Command-line-text-processing
⚡ From finding text to search and replace, from sorting to beautifying text and more 🎨
awk command-line ebook grep linux perl regex ruby sed text-processing
Language:Shell 10198
pymupdf / PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
mupdf xps pdf-documents epub ocr pdf font python data-science extract-data table-extraction pymupdf tesseract text-processing text-shaping
Language:Python 8040
google / diff-match-patch
Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.
difference diff match patch text-processing
Language:Python 7903
chmln / sd
Intuitive find & replace CLI (sed alternative)
command-line rust terminal text-processing regex cli
Language:Rust 6581
fastnlp / fastNLP
fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.
chinese-nlp deep-learning natural-language-processing nlp-library nlp-parsing text-classification text-processing
Language:Python 3138
pyparsing / pyparsing
Python library for creating PEG parsers
python python2 python3 python-2 python-3 parser-combinators parsing-expression-grammar parsing parsing-library text-processing peg-parsers
Language:Python 2385
kk7nc / Text_Classification
Text Classification Algorithms: A Survey
text-classification nlp-machine-learning document-classification text-processing dimensionality-reduction rocchio-algorithm boosting-algorithms logistic-regression naive-bayes-classifier k-nearest-neighbours support-vector-machines decision-trees random-forest conditional-random-fields deep-learning deep-neural-network recurrent-neural-networks convolutional-neural-networks deep-belief-network hierarchical-attention-networks
Language:Python 1820
pemistahl / lingua-go
The most accurate natural language detection library for Go, suitable for short text and mixed-language text
natural-language-processing language-detection language-recognition language-classification language-identification language-processing nlp nlp-machine-learning golang-library go language-modeling text-processing
Language:Go 1278
roshan-research / hazm
Persian NLP Toolkit
dependency-parser embeddings farsi lemmatization natural-language-processing nlp normalization persian persian-nlp pos-tagging python text-processing tokenizer
Language:Python 1270
birchb1024 / frangipanni
Program to convert lines of text into a tree structure.
go golang text-processing tree-structure
Language:Go 1199
BurntSushi / aho-corasick
A fast implementation of Aho-Corasick in Rust.
aho-corasick finite-state-machine search substring-matching text-processing
Language:Rust 1137
PyThaiNLP / pythainlp
Thai natural language processing in Python
python thai-nlp nlp-library thai-language natural-language-processing thai-nlp-library thai-soundex soundex word-segmentation thai hacktoberfest computational-linguistics text-processing
Language:Python 1070
helix-editor / nucleo
A fast and convenient fuzzy matcher library for Rust
fuzzy-matching fuzzy-search performance rust text-processing
Language:Rust 1034
ChenghaoMou / text-dedup
All-in-one text de-duplication
text-processing de-duplication nlp data-processing
Language:Python 714
sstadick / hck
A sharp cut(1) clone.
command-line rust text-processing
Language:Rust 711
derek73 / python-nameparser
A simple Python module for parsing human names into their individual components
python text-processing text-parser python-module
Language:Python 686
cbaziotis / ekphrasis
Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia, Twitter - 330mil English tweets).
nlp text-processing nlp-library spelling-correction tokenizer tokenization word-segmentation word-normalization spell-corrector text-segmentation semeval
Language:Python 671
abadojack / whatlanggo
Natural language detection library for Go
go language nlp text-processing
Language:Go 654
wenet-e2e / WeTextProcessing
Text Normalization & Inverse Text Normalization
normalization text-processing production-ready
Language:Python 652
open-korean-text / open-korean-text
Open Korean Text Processor - An Open-source Korean Text Processor
korean korean-text-processing natural-language-processing text-processing tokenizer korean-tokenizer
Language:Scala 624
lukaszliniewicz / Pandrator
Turn PDFs and EPUBs into audiobooks, subtitles or videos into dubbed videos (including translation), and more. For free. Pandrator uses local models, notably XTTS, including voice-cloning (instant, RVC-enhanced, XTTS fine-tuning) and LLM processing. It aspires to be a user-friendly app with a GUI, an installer and all-in-one packages.
audiobook audiobook-creator audiobook-maker audiobooks text-processing text-to-speech customtkinterprojects llm rvc tkinter-gui xtts xttsv2 silero voice-cloning voicecraft dubbing pdf-to-audio subtitle-to-speech subtitle-to-voice voice-clone
Language:Python 493
Puchaczov / Musoq
SQL Syntax without any database
ai-assisted-queries cross-platform csharp csv data-analysis-sql data-exploration data-processing dotnet dotnet-core dotnetcore file-system plugin-architecture query-language sql text-processing
Language:C# 488
proycon / pynlpl
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
nlp python computational-linguistics linguistics library folia machine-learning language-modelling search-algorithms evaluation-metrics text-processing nlp-library natural-language-processing
Language:Python 477
linuxscout / pyarabic
pyarabic
nlp-library arabic-language text-processing
Language:Python 465
haven-jeon / PyKoSpacing
Automatic Korean word spacing with Python
korean-nlp nlp spacing text-processing
Language:Python 413
andrewbihl / bsed
Simple SQL-like syntax on top of Perl text processing.
sed perl grep awk python text-processing csv domain-specific-language
Language:Python 411
airbnb / artificial-adversary
🗣️ Tool to generate adversarial text examples and test machine learning models against them
machine-learning classification python python3 python2 text text-mining adversarial-examples spam spam-filtering spam-detection spam-classification text-classification text-analysis data-science data-mining text-processing black-box-benchmarking black-box-attacks metrics
Language:Python 402
BurntSushi / regex-automata
A low level regular expression library that uses deterministic finite automata.
automata automaton dfa nfa regex regex-engine regexp rust text-processing
Language:Rust 349
ikegami-yukino / jaconv
Pure-Python Japanese character interconverter for Hiragana, Katakana, Hankaku, and Zenkaku
japanese-language preprocessing text-processing japanese-kana pure-python character-converter julius transliteration
Language:Python 333
gagolews / stringi
Fast and portable character string processing in R (with the Unicode ICU)
stringi icu icu4c r regex regexp string-manipulation unicode natural-language-processing text-processing text stringr nlp tidy-data
Language:C++ 309
textpipe / textpipe
Textpipe: clean and extract metadata from text
nlp named-entities named-entity-recognition text-processing text-analysis language-identification
Language:Python 302
RandyPen / TextCluster
Short-text clustering preprocessing module
cluster nlp text-cluster text-clustering text-mining text-processing
Language:Python 275
open-i18n / rust-unic
UNIC: Unicode and Internationalization Crates for Rust
unicode internationalization text-processing crates rust cldr locale-data unic unicode-characters unicode-algorithms
Language:Rust 242
himkt / konoha
🌿 An easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code.
janome japanese kytea mecab natural-language-processing nlp sentencepiece sudachi text-processing
Language:Python 241
larrykollar / Unix-Text-Processing
Recreated sources for the book "UNIX Text Processing," published in 1987.
gnu-troff unix utp utp-revival groff text-processing publishing formatting
Language:Roff 222
catatsuy / purl
Streamlining Text Processing
grep-like regexp sed text-processing
Language:Go 221