trinhtuanvubk / handwritten-ocr

My personal implementation of SVTR model for handwritten OCR

Vietnamese Handwritten OCR (Top5 Kalapa Challenge 2023)

Problem Statements

  • Problem: Building a lightweight model suitable for mobile devices to perform Vietnamese Handwritten OCR in the context of Vietnamese addresses

  • Input: a raw image containing one line of text

  • Output: the text in the input image

  • Metric: a custom edit-distance score between the model output and the ground-truth label

  • Requirements:

    • Model size <= 50 MB
    • Inference time <= 2s
    • No pretrained model for OCR task or handwritten dataset
  • Some issues with data:

    • White space at the end of the image.
    • Short text lacking linguistic context.
    • Excessive use of colors.
    • Two lines of text.
    • Text not fully visible.
    • Empty images.
  • Ideas:

    • Choose a very lightweight OCR model: SVTR
    • Train a pretrained model with generated data
    • Finetune on the real dataset
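
The custom metric above is based on edit distance; a minimal sketch of the underlying Levenshtein distance (the exact Kalapa scoring formula on top of it is not reproduced here) looks like:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```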

Prepare data:

|___data
|    |___train
|    |    |___images
|    |    |    |___0.jpg
|    |    |    |___...
|    |    |___labels
|    |    |    |___0.txt
|    |    |    |___...
|    |___val
|    |    |___images
|    |    |    |___0.jpg
|    |    |    |___...
|    |    |___labels
|    |    |    |___0.txt
|    |    |    |___...
Pretrained
  • Collect address text:

    • Extract data from an Excel file provided by the government.
    • Get text label from other OCR datasets
    • Crawl information on villages from Google.
  • To generate data, render the text corpus with several handwritten fonts using my repo OCR-Handwritten-Text-Generator


  • Then, apply the augmentations from the repository above

  • Total: 250k - 350k images

Finetuned
  • Manually check the data to crop two-line images and correct their labels

  • Crop images to remove the white space at the end, which also helps detect empty images:

python3 main.py --scenario preprocess \
--raw_data_path "./path/to/raw/data/"

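The cropping idea can be sketched in a few lines of NumPy (a hypothetical helper, assuming grayscale line images with dark ink on a light background):

```python
from typing import Optional
import numpy as np

def crop_trailing_space(img: np.ndarray, white_thresh: int = 230) -> Optional[np.ndarray]:
    """Crop the white space to the right of the last inked column.

    Returns None when no column contains ink, i.e. the image is empty."""
    ink_cols = np.where((img < white_thresh).any(axis=0))[0]
    if ink_cols.size == 0:
        return None                      # empty image: nothing to recognize
    return img[:, : ink_cols[-1] + 1]    # keep everything up to the last ink
```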

  • Then, create lmdb data from raw data:
python3 main.py --scenario create_lmdb_data \
--raw_data_path "./data/OCR/training_data" \
--raw_data_type "folder" \
--data_mode "train" \
--lmdb_data_path "./data/kalapa_lmdb/"
  • Flags:
    • raw_data_path: path to the raw data
    • raw_data_type: one of 3 values:
      • json: a directory containing the images plus a JSON file in which each line holds an image path and its text label.
      • folder: a directory of image subdirectories and a directory of .txt label files.
      • other: the second gen type from my repo.
    • data_mode: train or eval
    • lmdb_data_path: path for the output lmdb data

Training

  • To run training:
python3 main.py --scenario train \
--model SVTR \
--lmdb_data_path "./data/kalapa_lmdb/" \
--batch_size 16 \
--num_epoch 1000
  • To run inference test:
python3 main.py --scenario infer --image_test_path "path/to/image.jpg"

Postprocess

  • To handle cases where characters are not fully visible, the handwriting is very messy, or the model simply errs -> decode using beam search with an n-gram language model
  • To build the n-gram model from the text file generated in the preprocessing step, see https://github.com/kmario23/KenLM-training
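
The idea of mixing model probabilities with an n-gram language model can be sketched as below. This is a simplified character-level beam search (real CTC decoding also handles blanks and repeated characters, and a KenLM model would replace `lm_score`):

```python
import math

def beam_search(step_probs, lm_score, beam_width: int = 5, lm_weight: float = 0.5) -> str:
    """step_probs: list of {char: prob} dicts, one per time step.
    lm_score(text) -> log-probability of text under the language model."""
    beams = [("", 0.0)]                       # (prefix, accumulated log score)
    for dist in step_probs:
        candidates = []
        for prefix, logp in beams:
            for ch, p in dist.items():
                text = prefix + ch
                # Acoustic log-prob plus the LM's incremental contribution.
                score = (logp + math.log(p)
                         + lm_weight * (lm_score(text) - lm_score(prefix)))
                candidates.append((text, score))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```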

Export ONNX

  • To export model to onnx (optional):
python3 export_onnx.py

Submission

  • To run infer with a folder:
    • run in batch:
      python3 submission.py
    • run each image:
      python3 torch_submission.py
    • run each image with onnx:
      python3 onnx_submission.py

