Traditional Chinese Handwriting Dataset

繁體中文手寫資料集

Preface

Anyone who has spent time in data science has probably heard of the MNIST dataset of handwritten digits, and may have played with Fashion MNIST as well. As Traditional Chinese users, we couldn't help but wonder: can machine learning and neural networks also learn to recognize handwritten Traditional Chinese characters? Let's take up the challenge!

Description

The original dataset was produced with Tegaki, an open-source package. It contains 13,065 distinct Chinese characters, with an average of about 50 samples per character.

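Updates

2021.04.17 Derived application: web-based model training and handwriting recognition
2021.04.14 (Not directly related) Trend Micro T-brain E.SUN AI Challenge, Summer 2021: Traditional Chinese scene text recognition competition
2020.09.03 Released the whole dataset (13,065 characters; image size: 300x300 pixels; 684,677 images in total)
2020.04.29 Shared a convolutional neural network handwriting recognition implementation built on the Traditional Chinese handwriting dataset (thanks to Dr. Yen-Lin for the generous contribution)
2020.04.21 Added a dataset deployment example (thanks to Dr. Yen-Lin for the generous contribution)
2020.04.20 Uploaded the first dataset (4,803 common characters; image size: 50x50 pixels; 250,712 images in total) (Ministry of Education list: 4,808 common characters)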

Data samples

Whole dataset - sample folders
Handwritten samples of "自由" (freedom)

Usage

1. Whole dataset (13,065 characters)

git clone https://github.com/chenkenanalytic/handwritting_data_all.git

cat (file_path)/all_data.zip* > (file_path)/all_data.zip

unzip -O big5 (file_path)/all_data.zip -d (output_path)

※ Replace (file_path) and (output_path) with your actual paths. After extraction, the folder is named cleaned_data and contains 684,677 images.
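A quick way to check that the extraction worked is to count the files it produced. The sketch below is only an illustration and assumes that cleaned_data holds one subfolder per character with the image files directly inside it; this layout is not spelled out in this README. The function names `count_samples` and `summarize` are hypothetical helpers, not part of the dataset.

```python
from pathlib import Path


def count_samples(root):
    """Return {subfolder name: number of files directly inside it}.

    Assumes one subfolder per character (an assumption about the layout).
    Files sitting directly in `root` are ignored.
    """
    counts = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            counts[folder.name] = sum(1 for f in folder.iterdir() if f.is_file())
    return counts


def summarize(counts):
    """Return (number of folders, total files, mean files per folder)."""
    n_folders = len(counts)
    total = sum(counts.values())
    mean = total / n_folders if n_folders else 0.0
    return n_folders, total, mean
```

If the layout assumption holds, `summarize(count_samples("(output_path)/cleaned_data"))` should report 684,677 files in total. Spread over 13,065 characters, that is roughly 52 images per character.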

Whole dataset - deployment

Colab example code

2. Common-character dataset (4,803 characters)

git clone https://github.com/AI-FREE-Team/Traditional-Chinese-Handwriting-Dataset.git

※ After downloading the common-character dataset, extract the four files in the data folder. The extracted folder is named cleaned_data(50_50) and contains 250,712 images.
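The extraction step can also be scripted. The following is a minimal sketch that assumes the four files in the data folder are ordinary .zip archives, which this README does not state; `extract_archives` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_archives(data_dir, output_dir):
    """Extract every .zip file found directly in data_dir into output_dir.

    Returns the names of the archives that were extracted, in sorted order.
    Assumes the archives are plain .zip files (an assumption, see above).
    """
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(data_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(output)
        extracted.append(archive.name)
    return extracted
```

For example, `extract_archives("Traditional-Chinese-Handwriting-Dataset/data", "(output_path)")` would unpack each archive it finds and return their names. If the extracted file names look garbled, the archives may need the same Big5 handling that the whole-dataset command applies with `unzip -O big5`.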

Common-character dataset - deployment (thanks to Dr. Yen-Lin for the generous contribution)

Colab example code

Local example code

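Issues

Because the common-character dataset was downscaled to 50x50 pixels, some images have unclear or overlapping strokes. (The whole dataset, at 300x300 pixels, is largely free of this problem.)
~~After the whole dataset is extracted in the Colab deployment example, the Chinese file names appear garbled.~~ (Issue solved; see issue #1. Credit to ling199104.)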

Handwriting Chinese Characters Recognition

Repo Introduction

This project applies the Traditional-Chinese-Handwriting-Dataset to handwriting recognition with a convolutional neural network (CNN) model.

If you are interested in the details of this implementation, see this article.
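Before a classifier such as a CNN can be trained, each image file has to be paired with an integer class label. The sketch below shows one way to do that with the standard library only. It is not the implementation from the article; it assumes that each subfolder of the extracted dataset is named after the character it contains, and `build_index` and `split_samples` are hypothetical helper names.

```python
import random
from pathlib import Path


def build_index(root):
    """Return (samples, classes).

    classes: sorted subfolder names, assumed to be the character labels.
    samples: list of (file path, class id), where class id indexes classes.
    """
    classes = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
    class_to_id = {name: i for i, name in enumerate(classes)}
    samples = []
    for name in classes:
        for f in sorted((Path(root) / name).iterdir()):
            if f.is_file():
                samples.append((str(f), class_to_id[name]))
    return samples, classes


def split_samples(samples, val_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and return (train, validation) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]
```

The fixed seed keeps the train/validation split reproducible between runs. The resulting path and label pairs can then be fed to whatever image loader the chosen framework provides.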

Project Application - web-based model training and handwriting recognition

This application is based on the week 2 assignment of Browser-based Models with TensorFlow.js, the first course of the DeepLearning.AI TensorFlow: Data and Deployment Specialization on Coursera.

If you are interested in this project, see this article.

License

(CC BY-NC-SA 4.0)
This dataset is released under the Attribution-NonCommercial-ShareAlike 4.0 International license.

※ When using, adapting, or sharing the dataset, please include the following information:

This dataset was adapted and developed by AI . FREE Team from [STUST EECS_Chinese MNIST(總集)]. If you use, adapt, or share it, please cite the source and include this message.
(source: https://github.com/AI-FREE-Team/Traditional-Chinese-Handwriting-Dataset )

Citing

@misc{AI.FREE2020,
  author = {Po-Chuan Chen},
  title = {Traditional Chinese Handwriting Dataset},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/AI-FREE-Team/Traditional-Chinese-Handwriting-Dataset}},
}

Source

Original dataset: https://scidm.nchc.org.tw/dataset/stusteecs_chinese_mnist

Introduction video: https://www.youtube.com/watch?v=eJy1BtkqHX4

Description: This dataset was developed from the Chinese handwriting dataset provided by the Dept. of EECS, Southern Taiwan University of Science and Technology.
