ThomasLech / CROHME_extractor

CROHME dataset extractor for OFFLINE-text-recognition task.

Home Page: http://blog.mathocr.com

Abstract

The CROHME datasets were originally designed for the online handwriting recognition task.
Besides the encoded pen traces, inkml files also record when each trace point was drawn. For offline recognition we need a different feature map, namely matrices of pixel intensities.

The following scripts will get you started with the offline math symbol recognition task.
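As a rough illustration of the first step described above, the sketch below reads the pen traces from an InkML document and keeps only the x and y coordinates. It is not taken from extract.py: the function name parse_traces and the inline sample are hypothetical, and treating a third value per point as a timestamp is an assumption based on the description above.

```python
import xml.etree.ElementTree as ET

# Namespace used by InkML documents.
INKML_NS = "{http://www.w3.org/2003/InkML}"

# Tiny hand-written sample (hypothetical, not taken from the dataset).
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace id="0">10 20 0, 12 24 15, 15 30 31</trace>
  <trace id="1">40 20, 40 35</trace>
</ink>"""


def parse_traces(inkml_text):
    """Return {trace_id: [(x, y), ...]}, ignoring any extra channel such as time."""
    root = ET.fromstring(inkml_text)
    traces = {}
    for trace in root.iter(INKML_NS + "trace"):
        points = []
        # Points are comma-separated; channel values are space-separated.
        for chunk in trace.text.strip().split(","):
            values = chunk.split()
            if len(values) < 2:
                continue  # skip malformed or empty chunks
            points.append((float(values[0]), float(values[1])))
        traces[trace.get("id")] = points
    return traces


traces = parse_traces(SAMPLE)
```

Only the coordinates are kept because an offline recognizer sees the finished drawing, not the order or timing of the strokes.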

Setup

All code is compatible with Python 3.5.x.

  1. Extract the contents of CROHME_full_v2.zip (found inside the data directory) before running any of the scripts below.

  2. Install the specified dependencies with pip (the Python package installer) using the following shell command:

pip install -U -r requirements.txt

Scripts info

  1. extract.py

    • Extracts trace groups from inkml files.
    • Converts extracted trace groups into images. Images are square bitmaps with only black (value 0) and white (value 1) pixels. Black pixels mark the pattern (region of interest, ROI).
    • Labels those images (according to inkml files).
    • Flattens images to one-dimensional vectors.
    • Converts labels to one-hot format.
    • Dumps the training and testing sets separately into the outputs folder.
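The image, flattening and one-hot steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in extract.py: the helper names (rasterize, flatten, one_hot) are hypothetical, lines are drawn one pixel thick by simple interpolation, and the scaling that preserves the aspect ratio is an assumption.

```python
def rasterize(traces, box_size):
    """Draw traces into a box_size x box_size bitmap: 0 = ink, 1 = background."""
    xs = [x for trace in traces for x, _ in trace]
    ys = [y for trace in traces for _, y in trace]
    min_x, min_y = min(xs), min(ys)
    # Scale by the larger extent so the aspect ratio is preserved (assumption).
    span = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    scale = (box_size - 1) / span
    image = [[1] * box_size for _ in range(box_size)]
    for trace in traces:
        pixels = [(round((x - min_x) * scale), round((y - min_y) * scale))
                  for x, y in trace]
        if len(pixels) == 1:
            pixels = pixels * 2  # a single point becomes a dot
        for (c0, r0), (c1, r1) in zip(pixels, pixels[1:]):
            # Walk along the segment, one pixel step at a time.
            steps = max(abs(c1 - c0), abs(r1 - r0), 1)
            for i in range(steps + 1):
                col = c0 + round((c1 - c0) * i / steps)
                row = r0 + round((r1 - r0) * i / steps)
                image[row][col] = 0
    return image


def flatten(image):
    """Turn a 2-D bitmap into a one-dimensional vector, row by row."""
    return [pixel for row in image for pixel in row]


def one_hot(label, classes):
    """Encode a label as a vector with a single 1 at the label's index."""
    return [1 if cls == label else 0 for cls in classes]


# Two crossing strokes, drawn into a 5x5 box.
cross = [[(0, 0), (10, 10)], [(0, 10), (10, 0)]]
flat = flatten(rasterize(cross, 5))
encoded = one_hot("+", ["+", "-", "="])
```

A real extractor would also apply the stroke thickness given by -t; that step is left out here to keep the sketch short.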

    Command line arguments: -b [BOX_SIZE] -d [DATASET_VERSION] -c [CATEGORY] -t [THICKNESS]

    Example usage: python extract.py -b 50 -d 2011 2012 2013 -c digits lowercase_letters operators -t 5
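For readers unfamiliar with flags that take several values, the sketch below shows one way such a command line could be parsed with argparse. It is an assumption about the interface, not the parser used in extract.py: the dest names and default values are hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags listed above."""
    parser = argparse.ArgumentParser(description="CROHME extractor (sketch)")
    parser.add_argument("-b", dest="box_size", type=int, default=50,
                        help="side length of the square output images")
    # nargs="+" lets a flag collect one or more values, e.g. -d 2011 2012 2013
    parser.add_argument("-d", dest="dataset_version", nargs="+", default=["2013"],
                        help="dataset versions to extract")
    parser.add_argument("-c", dest="category", nargs="+", default=["digits"],
                        help="label categories to keep")
    parser.add_argument("-t", dest="thickness", type=int, default=1,
                        help="stroke thickness in pixels")
    return parser


example = "-b 50 -d 2011 2012 2013 -c digits lowercase_letters operators -t 5"
args = build_parser().parse_args(example.split())
```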

    Caution: The script does not work properly for images larger than 200x200 (the cause is not yet known).

  2. The balance.py script balances the overall distribution of classes.

    Command line arguments: -b [BOX_SIZE] -ub [UPPER_BOUND] (optional)

    Example usage: python balance.py -b 50 -ub 6000
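One plausible reading of the upper bound is that no class may keep more than that many samples. The sketch below caps each class by random subsampling. This is an assumption about what balance.py does, not a description of it; the function name cap_classes and the (features, label) sample layout are hypothetical.

```python
import random
from collections import Counter, defaultdict


def cap_classes(samples, upper_bound, seed=0):
    """Keep at most upper_bound samples per label, chosen at random."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append((features, label))
    balanced = []
    for group in by_label.values():
        if len(group) > upper_bound:
            group = rng.sample(group, upper_bound)
        balanced.extend(group)
    rng.shuffle(balanced)  # avoid long runs of a single class
    return balanced


# Ten samples of class "1" and three of class "+", capped at four per class.
samples = [([i], "1") for i in range(10)] + [([i], "+") for i in range(3)]
balanced = cap_classes(samples, upper_bound=4)
counts = Counter(label for _, label in balanced)
```

Capping only trims over-represented classes; rare classes stay rare unless they are also oversampled or augmented.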

  3. The visualize.py script plots a single figure showing a random batch of extracted data.

    Command line arguments: -b [BOX_SIZE] -n [N_SAMPLES] -c [COLUMNS]

    Example usage: python visualize.py -b 50 -n 40 -c 8

    Sample Plot: crohme_extractor_plot
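Because the images are stored as flattened vectors, they have to be reshaped to BOX_SIZE x BOX_SIZE before they can be displayed. The sketch below shows that reshaping, plus how the sample count and column count determine the grid. It prints ASCII instead of plotting, and the helper names grid_shape and to_ascii are hypothetical rather than part of visualize.py.

```python
import math


def grid_shape(n_samples, columns):
    """Rows and columns needed to lay out n_samples images."""
    return math.ceil(n_samples / columns), columns


def to_ascii(flat, box_size):
    """Reshape a flattened bitmap and render ink (0) as '#', background as '.'."""
    rows = [flat[r * box_size:(r + 1) * box_size] for r in range(box_size)]
    return "\n".join("".join("#" if p == 0 else "." for p in row) for row in rows)


shape = grid_shape(40, 8)              # matches the example: -n 40 -c 8
preview = to_ascii([0, 1, 1, 0], 2)    # a 2x2 diagonal
print(preview)
```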

  4. The extract_hog.py script extracts HoG (Histogram of Oriented Gradients) features using skimage.feature.hog.
    It accepts one command line argument, hog_cell_size, which corresponds to the pixels_per_cell parameter of skimage.feature.hog.
    Example usage: python extract_hog.py 5 (sets pixels_per_cell=(5, 5))
    The script loads the data previously dumped by extract.py and dumps its own outputs (train and test) separately.
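The cell size directly controls how long each HoG feature vector is. The sketch below reproduces the length arithmetic that skimage.feature.hog applies to a square image, without calling the library. The cells_per_block and orientations values used by extract_hog.py are not stated here, so the values in the example are assumptions; hog_feature_length is a hypothetical helper.

```python
def hog_feature_length(box_size, hog_cell_size, cells_per_block, orientations=9):
    """Length of the flattened HoG vector for a square box_size x box_size image."""
    n_cells = box_size // hog_cell_size        # cells along one side
    n_blocks = n_cells - cells_per_block + 1   # blocks slide one cell at a time
    if n_blocks < 1:
        raise ValueError("block is larger than the image's cell grid")
    return n_blocks * n_blocks * cells_per_block * cells_per_block * orientations


# Assumed settings: 50x50 images, 1x1 cells per block, 9 orientation bins.
lengths = {size: hog_feature_length(50, size, cells_per_block=1)
           for size in (5, 10, 20)}
```

Note that with 3x3 cells per block, a 50x50 image split into 20x20 cells has only a 2x2 cell grid, so no complete block fits. Larger cell sizes therefore need a correspondingly small block size.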

  5. The extract_phog.py script extracts PHoG features.
    PHoG features are built by concatenating HoG feature maps computed with different cell sizes into a single feature vector.
    The script therefore takes an arbitrary number of hog_cell_size values (the corresponding HoG features must already have been extracted with extract_hog.py).
    Example usage: python extract_phog.py 5 10 20 (loads HoGs with 5x5, 10x10 and 20x20 cell sizes, respectively)
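The concatenation described above amounts to joining, for each sample, its HoG vectors from every cell size end to end. A minimal sketch follows, assuming each HoG set is a list of per-sample vectors in the same sample order. The name concatenate_hogs and the toy values are hypothetical, and the way extract_phog.py loads its inputs is not shown.

```python
def concatenate_hogs(hog_sets):
    """Join per-sample HoG vectors from several cell sizes into PHoG vectors."""
    n_samples = len(hog_sets[0])
    if any(len(hog_set) != n_samples for hog_set in hog_sets):
        raise ValueError("all HoG sets must describe the same samples")
    # zip(*hog_sets) groups the vectors that belong to the same sample.
    return [[value for vector in per_sample for value in vector]
            for per_sample in zip(*hog_sets)]


# Toy data: two samples, a 3-value HoG and a 1-value HoG for each.
hog_fine = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
hog_coarse = [[1.0], [2.0]]
phog = concatenate_hogs([hog_fine, hog_coarse])
```

The resulting vector length is the sum of the individual HoG lengths, so every sample ends up with the same number of features.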

  6. The histograms folder contains histograms showing the distribution of labels for the different label categories. These diagrams help you better understand the extracted data.

Distribution of classes

Figure: all_labels_distribution. Labels were combined from the train and test sets.
