DataLama / chinese-tokenizer

Chinese word segmentation comparison test

Chinese Tokenizer

A simple comparison of Chinese word segmentation tools and POS taggers for text mining.

1) Chinese Word Segmentation Comparison

The most basic feature of a Chinese tokenizer is word segmentation. The following tools are compared:

  • PKUSEG
  • LAC
  • thulac
  • monpa
  • jiagu
  • hanlp
  • [etc] deep-learning-based SoTA models

Of these, hanlp and LAC are good choices.
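One way to make such a comparison concrete is to measure how often two segmenters place word boundaries at the same character positions. The sketch below is not taken from the repository's notebooks. It is a minimal illustration in which every name (`Segmenter`, `make_max_match_segmenter`, `boundary_f1`, `compare`) is hypothetical, and the two built-in segmenters are toy stand-ins so the code runs without any of the libraries above installed. A real run would pass a thin wrapper around each installed tool instead.

```python
from typing import Callable, Dict, List, Set, Tuple

# A segmenter is any callable mapping a sentence to a list of word strings.
Segmenter = Callable[[str], List[str]]


def char_segmenter(text: str) -> List[str]:
    """Toy baseline: every character is its own token."""
    return list(text)


def make_max_match_segmenter(vocab: Set[str], max_len: int = 4) -> Segmenter:
    """Toy forward maximum matching over a hand-written vocabulary."""

    def segment(text: str) -> List[str]:
        tokens: List[str] = []
        i = 0
        while i < len(text):
            # Try the longest window first; fall back to a single character.
            for size in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in vocab:
                    tokens.append(piece)
                    i += size
                    break
        return tokens

    return segment


def boundaries(tokens: List[str]) -> Set[int]:
    """Character offsets at which a token ends."""
    cuts: Set[int] = set()
    pos = 0
    for tok in tokens:
        pos += len(tok)
        cuts.add(pos)
    return cuts


def boundary_f1(a: List[str], b: List[str]) -> float:
    """F1 overlap between the boundary sets of two segmentations."""
    cuts_a, cuts_b = boundaries(a), boundaries(b)
    if not cuts_a or not cuts_b:
        return 0.0
    overlap = len(cuts_a & cuts_b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cuts_a)
    recall = overlap / len(cuts_b)
    return 2 * precision * recall / (precision + recall)


def compare(text: str, segmenters: Dict[str, Segmenter]):
    """Run every segmenter on text and score each pair of outputs."""
    outputs = {name: seg(text) for name, seg in segmenters.items()}
    names = list(outputs)
    scores: Dict[Tuple[str, str], float] = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            scores[(first, second)] = boundary_f1(outputs[first], outputs[second])
    return outputs, scores


if __name__ == "__main__":
    toy = make_max_match_segmenter({"北京", "天安门"})
    outputs, scores = compare("我爱北京天安门", {"chars": char_segmenter, "max_match": toy})
    for name, tokens in outputs.items():
        print(name, tokens)
    print(scores)
```

Pairwise agreement says nothing about which tool is correct. It only shows where the tools disagree, which is usually where manual inspection is most worthwhile.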

2) Chinese KeyPhrase Extractor

  • hanlp + pos
  • lac + pos
  • BERT-JointKPE
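The "segmenter + pos" entries suggest extracting keyphrases from POS-tagged tokens. The README does not say how this is done, so the following is only a sketch of one common approach: merge consecutive noun-like tokens into candidate phrases and rank them by frequency. The function names, the `NOUN_TAGS` set, and the hand-tagged sample sentence are all assumptions for illustration. The tag set must be adapted to whatever tagset the chosen tool actually emits, and BERT-JointKPE is a separate neural approach that this sketch does not cover.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List, Tuple

# Assumed noun-like tags; replace with the tags your tagger produces.
NOUN_TAGS: FrozenSet[str] = frozenset({"n", "nz", "nw", "vn", "an"})


def candidate_phrases(
    tagged: Iterable[Tuple[str, str]],
    keep_tags: FrozenSet[str] = NOUN_TAGS,
    min_tokens: int = 1,
) -> List[str]:
    """Join runs of consecutive tokens whose tag is in keep_tags."""
    phrases: List[str] = []
    current: List[str] = []
    for word, tag in tagged:
        if tag in keep_tags:
            current.append(word)
            continue
        # A non-matching tag closes the current run.
        if len(current) >= min_tokens:
            phrases.append("".join(current))
        current = []
    # Flush a run that reaches the end of the input.
    if len(current) >= min_tokens:
        phrases.append("".join(current))
    return phrases


def top_keyphrases(
    tagged: Iterable[Tuple[str, str]],
    k: int = 5,
    keep_tags: FrozenSet[str] = NOUN_TAGS,
    min_tokens: int = 1,
) -> List[Tuple[str, int]]:
    """Rank candidate phrases by raw frequency."""
    counts = Counter(candidate_phrases(tagged, keep_tags, min_tokens))
    return counts.most_common(k)


if __name__ == "__main__":
    # Hand-tagged sample, not the output of any real tagger.
    sample = [
        ("中文", "nz"), ("分词", "vn"), ("是", "v"),
        ("文本", "n"), ("挖掘", "vn"), ("的", "u"), ("基础", "n"),
    ]
    print(top_keyphrases(sample))
```

Setting `min_tokens=2` keeps only multi-token phrases, which tends to filter out generic single nouns at the cost of missing short but important terms.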

Languages

Jupyter Notebook 97.0%, Python 3.0%