PyThaiNLP / Han-solo

🪿 Han-solo: Thai syllable segmenter

🪿 Han-solo

This project aims to build a Thai syllable segmenter that works on Thai social media text.

Dataset: Han-solo: Thai syllable segmenter

Google Colab: Demo

Dataset

This work uses two datasets:

  1. Nutcha Dataset (Thai news domain). See data_nutcha/ for details.
  2. Han-solo: Thai syllable segmenter dataset (Thai social media domain). See Han-solo: Thai syllable segmenter for details.

Model

This work trains a CRF (conditional random field) model using the same features as ssg.

The training notebook is train.ipynb.

The model file: han_solo.crfsuite
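
To make the idea of per-character CRF features concrete, here is a minimal sketch of a character-window featurizer. It is illustrative only: the feature names, the `radius` parameter, and the `<PAD>` token are assumptions made for this example and are not the exact ssg feature set. See train.ipynb for the features the model actually uses.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical padding token for positions outside the text


def char_at(text: str, j: int) -> str:
    """Return the character at index j, or PAD when j is out of range."""
    return text[j] if 0 <= j < len(text) else PAD


def char_window_features(text: str, i: int, radius: int = 2) -> Dict[str, str]:
    """Build unigram and bigram features around position i (illustrative only)."""
    feats = {"bias": "1"}
    # Single characters at each offset in the window.
    for offset in range(-radius, radius + 1):
        feats[f"char[{offset}]"] = char_at(text, i + offset)
    # Adjacent character pairs inside the same window.
    for offset in range(-radius, radius):
        feats[f"bigram[{offset}]"] = char_at(text, i + offset) + char_at(text, i + offset + 1)
    return feats


def featurize(text: str, radius: int = 2) -> List[Dict[str, str]]:
    """One feature dict per character, the usual input shape for a CRF tagger."""
    return [char_window_features(text, i, radius) for i in range(len(text))]


features = featurize("ไทย")
```

A CRF tagger then predicts one label per feature dict, so the output sequence has the same length as the input text.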

F1-score

The labels are 1 (split) and 0 (no split).

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     61078
           1       1.00      0.99      0.99     29468

    accuracy                           1.00     90546
   macro avg       1.00      1.00      1.00     90546
weighted avg       1.00      1.00      1.00     90546
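
The 0/1 labels in the report above can be read as one split decision per character. The sketch below converts between a syllable list and such a label sequence. It assumes that labels are assigned per character and that 1 marks the last character of a syllable; the README does not say which side of the boundary carries the 1, so treat that as an assumption. The function names are hypothetical.

```python
from typing import List


def syllables_to_labels(syllables: List[str]) -> List[int]:
    """One label per character: 1 on the final character of each syllable (assumed), else 0."""
    labels: List[int] = []
    for syllable in syllables:
        if not syllable:
            continue  # skip empty strings so they do not produce a stray label
        labels.extend([0] * (len(syllable) - 1) + [1])
    return labels


def labels_to_syllables(text: str, labels: List[int]) -> List[str]:
    """Rebuild syllables by cutting the text after every character labelled 1."""
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    syllables: List[str] = []
    current: List[str] = []
    for char, label in zip(text, labels):
        current.append(char)
        if label == 1:
            syllables.append("".join(current))
            current = []
    if current:  # trailing characters with no closing 1
        syllables.append("".join(current))
    return syllables


gold = ["ประ", "เทศ", "ไทย"]
labels = syllables_to_labels(gold)           # [0, 0, 1, 0, 0, 1, 0, 0, 1]
restored = labels_to_syllables("".join(gold), labels)
```

Under this reading, the support column counts characters: 29468 split positions and 61078 non-split positions in the evaluation data.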

How to use?

  • See using.ipynb for a usage notebook.
  • Use PyThaiNLP v4.1 or later.
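
For the PyThaiNLP route, a minimal sketch might look like the following. It assumes the model is exposed through `subword_tokenize` with `engine="han_solo"` and that PyThaiNLP and its CRF dependency are installed; the wrapper name `segment_syllables` is hypothetical. Check using.ipynb and the PyThaiNLP documentation for the supported call.

```python
from typing import List


def segment_syllables(text: str) -> List[str]:
    """Split Thai text into syllables with the Han-solo engine (assumed engine name)."""
    # Imported lazily so this module still loads where PyThaiNLP is not installed.
    from pythainlp.tokenize import subword_tokenize

    return subword_tokenize(text, engine="han_solo")
```

Calling `segment_syllables("ประเทศไทย")` should return a list of syllable strings.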

License

  • CC-BY 4.0 license (for Dataset)
  • Apache License Version 2.0 (for Source code and model)

Cite as

Wannaphong Phatthiyaphaibun. (2023). Han-solo: Thai syllable segmenter (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8196608

or BibTeX entry:

@dataset{wannaphong_phatthiyaphaibun_2023_8196608,
  author       = {Wannaphong Phatthiyaphaibun},
  title        = {Han-solo: Thai syllable segmenter},
  month        = jul,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {1.0},
  doi          = {10.5281/zenodo.8196608},
  url          = {https://doi.org/10.5281/zenodo.8196608}
}
