The following repositories are listed under the word-segmentation topic.
Unsupervised text tokenizer for Neural Network-based text generation.
Unsupervised text tokenizer focused on computational efficiency
Python port of SymSpell: 1 million times faster spelling correction & fuzzy search through the Symmetric Delete spelling correction algorithm
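The core trick behind SymSpell is that deletions alone can stand in for all edit operations: precompute the delete-variants of every dictionary word, then match them against the delete-variants of the query. Below is a toy, self-contained sketch of that idea (function names are my own, not the port's API; the real library adds frequency ranking, verified edit distances, and prefix indexing):

```python
# Toy sketch of the Symmetric Delete idea (not the SymSpell API).

def deletes(word, max_distance=1):
    """All strings reachable from `word` by deleting up to max_distance characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results

def build_index(dictionary, max_distance=1):
    """Map each delete-variant to the dictionary words that produce it."""
    index = {}
    for word in dictionary:
        for variant in deletes(word, max_distance):
            index.setdefault(variant, set()).add(word)
    return index

def lookup(query, index, max_distance=1):
    """Candidate corrections: dictionary words sharing a delete-variant with the query."""
    candidates = set()
    for variant in deletes(query, max_distance):
        candidates |= index.get(variant, set())
    return candidates

index = build_index(["hello", "world", "help"])
print(lookup("helo", index))  # {'hello', 'help'}
```

Because deletions from both the query and the dictionary word meet in the middle, this single index covers insertions, deletions, and substitutions; the real implementation then verifies candidates with an exact edit-distance check.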
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two big corpora: English Wikipedia and Twitter (330 million English tweets).
CKIP Transformers
AdaSeq: An All-in-One Library for Developing State-of-the-Art Sequence Understanding Models
Cantonese Linguistics and NLP
Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of Chinese long and short texts, and sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, word segmentation, and extractive text summarization.
MONPA (罔拍) is a multi-task model providing word segmentation, part-of-speech tagging, and named entity recognition for Traditional Chinese.
This repository is for building Windows 64-bit MeCab binary and improving MeCab Python binding.
A PyTorch implementation of the BI-LSTM-CRF model.
Converts from Chinese characters to pinyin, between simplified and traditional, and does word segmentation.
Source codes for paper "Neural Networks Incorporating Dictionaries for Chinese Word Segmentation", AAAI 2018
🗺️ A learning roadmap for natural language processing.
A Fast and Accurate Vietnamese Word Segmenter (LREC 2018)
Fast Word Segmentation with Triangular Matrix
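Several of these projects (the Triangular Matrix segmenter above and the hashtag segmenters) rest on the same dynamic-programming idea: split a string into known words by reusing the best segmentation of each prefix. A minimal sketch, assuming a plain vocabulary set rather than the frequency-ranked dictionaries the real tools use:

```python
# Toy word segmentation by dynamic programming over prefixes
# (a simplified sketch, not the Triangular Matrix implementation,
# which also handles misspellings and bounds memory usage).

def segment(text, vocabulary):
    """Return one segmentation of `text` into vocabulary words, or None."""
    n = len(text)
    best = [None] * (n + 1)  # best[i] = a segmentation of text[:i], if any
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            # text[:i] is segmentable if text[:j] is and text[j:i] is a word
            if best[j] is not None and text[j:i] in vocabulary:
                best[i] = best[j] + [text[j:i]]
                break
    return best[n]

vocab = {"this", "is", "a", "test"}
print(segment("thisisatest", vocab))  # ['this', 'is', 'a', 'test']
```

Production segmenters replace the membership test with corpus word probabilities and pick the highest-scoring split instead of the first one found, which is what makes hashtag splitting like "smallandinsignificant" come out correctly.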
A toolkit for Vietnamese word segmentation
Hashformers is a framework for hashtag segmentation with Transformers and Large Language Models (LLMs).
Java port of SymSpell: 1 million times faster spelling correction through the Symmetric Delete spelling correction algorithm
A Python wrapper for VnCoreNLP using a bidirectional communication channel.
Syllable segmentation tool for Myanmar language (Burmese) by Ye.
Vietnamese Word Tokenize
A toolkit for pre-processing large source code corpora