There are 7 repositories under the text-segmentation topic.
Accelerated deep learning R&D
SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
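Text mining and preprocessing tools (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.), using unsupervised or weakly supervised methods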
Desktop app for automatically translating comics (BDs, Manga, Manhwa, Fumetti and more) in a variety of formats (image, PDF, EPUB, CBR, CBZ, etc.) and in multiple languages.
Python port of SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia and Twitter, 330 million English tweets).
A collection of resources (including the papers and datasets) of OCR (Optical Character Recognition).
A sentence segmenter that actually works!
Implementation of the paper: Text Segmentation as a Supervised Learning Task
(yet another not really) awesome topic/text segmentation list
Fast Word Segmentation with Triangular Matrix
Automatic Manga Translator
Fast SymSpell written in C++ and exposed to Python via pybind11
Printed and handwritten text segmentation using fully convolutional networks and CRF post-processing
Uses GloVe embeddings and greedy sequence segmentation to semantically segment a text document into k segments, for any chosen k.
Mandarin Chinese text segmentation (中文分词, Chinese word segmentation) and mobile dictionary Android app
Word Segmentation with Dynamic Programming
A summary of papers on text segmentation
Cutting-edge tool that unlocks the full potential of semantic chunking
Text segmentation into separate words using a simple unigram model and the Viterbi algorithm
Neural and non-neural text segmentation methods.
Automate video chaptering with LLMs and TF-IDF: Transform raw transcripts into well-structured documents
Image Analysis Toolkit for text document Binarization & Segmentation written in TypeScript.
Data for the ACL 2020 paper - Improving Segmentation for Technical Support Problems
This project aimed to perform text segmentation in images using AutoEncoders.
How to add user dictionary to MeCab
"WBSUBNdb_text: Bangla handwritten text document dataset" is a Bangla text dataset containing 1383 offline handwritten text documents contributed by 190 writers. The dataset is composed of both simple and compound characters.
Demonstration of dynamic programming for segmenting strings into words. Just for fun!
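
Several entries above describe word segmentation with a unigram model and dynamic programming (the Viterbi algorithm). The following is a minimal sketch of that general technique, not the code of any listed repository. The word counts, the unknown-word penalty, and the names `UNIGRAM_COUNTS`, `word_logprob` and `segment` are all hypothetical, chosen only for illustration.

```python
import math

# Hypothetical toy counts; a real model would be estimated from a large corpus.
UNIGRAM_COUNTS = {
    "text": 50, "segmentation": 20, "segment": 15, "is": 200,
    "fun": 30, "a": 300, "at": 100, "ion": 1,
}
TOTAL = sum(UNIGRAM_COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in UNIGRAM_COUNTS)


def word_logprob(word):
    """Log-probability of a word under the toy unigram model."""
    count = UNIGRAM_COUNTS.get(word)
    if count is None:
        # Assumed smoothing: unknown words get a penalty that grows with length.
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def segment(text):
    """Split text into the word sequence with the highest total log-probability."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word in text[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start] + word_logprob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("textsegmentationisfun"))
```

With these toy counts, the single word "segmentation" scores higher than the split "segment" + "at" + "ion", because every extra word multiplies in another probability below 1.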