eighttails / ProgramListOCR

OCR suite specialized for printed program listing (BASIC and HEX)

Program List OCR

1. What is this?

Program List OCR is a piece of OCR (Optical Character Recognition) software specialized for computer program listings published in the 1980s.

It converts scanned program listing images into plain text. You can then convert this text into an input file for an emulator, e.g. a cassette tape image.

Program List OCR bundles the following open-source software:

  • Tesseract (OCR engine)

  • gImageReader (GUI frontend)

It also contains special OCR language model files:

  • BASIC (generic BASIC language)

  • N6X-BASIC (BASIC for NEC PC-6001 (Japanese))

  • Hexadecimal machine language

The BASIC (bas) model is for generic BASIC language listings. It recognizes printable ASCII characters.

The N6X-BASIC (n6x) model is dedicated to the NEC PC-6001. It recognizes ASCII as well as the PC-6001's Japanese and graphical characters.

The hexadecimal machine language (hex) model recognizes only hexadecimal digits and a few extra characters (0-9, A-F, Sum). Because of this restricted character set, it achieves better accuracy.
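
Because the hex model outputs such a small alphabet, recognized text can be sanity-checked afterwards. The following Python sketch is not part of Program List OCR. It assumes the expected characters are 0-9, A-F, the letters of "Sum", plus spaces and colons as separators (the separators are a guess and can be changed through the hypothetical `extra_allowed` parameter), and it reports every character outside that set.

```python
# Characters the hex model is documented to recognize.
HEX_DIGITS = set("0123456789ABCDEF")
SUM_LETTERS = set("Sum")


def find_unexpected_chars(text, extra_allowed=" :"):
    """Return (line_no, column, char) for each character outside the expected set.

    extra_allowed is an assumption: adjust it to the separators that
    actually appear in your listing.
    """
    allowed = HEX_DIGITS | SUM_LETTERS | set(extra_allowed)
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for column, char in enumerate(line, start=1):
            if char not in allowed:
                problems.append((line_no, column, char))
    return problems
```

For example, `find_unexpected_chars("C0O0 3E")` flags the letter "O" in column 3, a typical place where a zero was intended.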

2. Disclaimer

OCR accuracy depends on the quality of the printing and scanning, the printer model used, and the fonts.

3. How to use

3.1. Install

  1. Double-click ProgramListOCRSetup….exe

  2. Follow the instructions of the installer.

Warning
Installation fails if the %TEMP% directory is located on a RAM disk.
Please detach the RAM disk before installing.

3.2. Start

Launch "Program List OCR" → "gImageReader" from the Start Menu.

3.3. Operating instructions

3.3.1. Scan images and preprocessing

Scan the program listings with a document scanner. (Taking pictures with a camera is not recommended.)
The preferred image format is:

  • 600dpi

  • grayscale

  • TIFF or high quality JPEG

You should deskew and normalize your images.
Scan Tailor is recommended for preprocessing.

For better accuracy, you can thicken the printed characters with GIMP.
Open the image in GIMP and apply "Filters" → "Generic" → "Erode".

Figure 1. Before "Erode"
Figure 2. After "Erode"

After that, it is recommended to convert the image to 1-bit (black and white), e.g. with GIMP's "Colors" → "Threshold".
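
To illustrate what these two steps do to the pixels, here is a minimal pure-Python sketch. It is not GIMP's implementation: it assumes a grayscale image stored as a list of rows with values from 0 (black) to 255 (white), a 3x3 neighborhood for the erosion, and a cutoff of 128 for the threshold. Taking the darkest value in each neighborhood makes dark strokes grow, which is why the characters look thicker.

```python
def erode(pixels):
    """3x3 minimum filter: each pixel takes the darkest value around it."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighborhood, clipped at the image borders.
            neighbors = [
                pixels[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(min(neighbors))
        result.append(row)
    return result


def threshold(pixels, cutoff=128):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[0 if value < cutoff else 255 for value in row] for row in pixels]


# A white 5x5 page with one dark dot in the middle.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 40
thick = erode(page)      # the dot grows into a 3x3 dark block
bw = threshold(thick)    # the block becomes pure black on pure white
```

For real scans, GIMP or another image tool remains the practical choice; the sketch only shows the idea.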

3.3.2. Open images

Click the folder button on the left pane to open images.

Select the image(s) to recognize in the file selection dialog.

3.3.3. Select region and recognize

Warning

Do the following steps page by page.
Note: if you change pages before recognition, the selected regions will be cleared.

Drag the mouse to select the region to recognize.
You can add more regions with Ctrl + mouse drag.

When you have finished selecting your regions, click "Recognize Selection" to run the recognition.
"Recognize Selection" is a pull-down button, and you can select the language there.
To recognize a BASIC program listing, choose "bas". To recognize a hexadecimal program listing, e.g. in MLX format, choose "hex".
In the settings, make sure "Language data locations" is set to "System-wide paths".

Recognition takes a very long time.

3.3.4. Reformat text

When recognition is finished, the recognized text appears in the right pane.

Copy and paste the text to your favorite text editor.

Line wrapping is not recognized at this point.
You have to join wrapped lines manually.
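
For BASIC listings, part of this work can be scripted. The sketch below is not part of Program List OCR and rests on two assumptions: every BASIC statement starts with a line number followed by whitespace, and any other line is the continuation of the previous one. A continuation that happens to start with digits and a space would be mistaken for a new statement, and a space lost exactly at the wrap point is not restored, so the result still needs proofreading.

```python
import re

# Assumption: a new BASIC statement begins with a line number and whitespace.
LINE_NUMBER = re.compile(r"^\d+(\s|$)")


def join_wrapped_lines(text):
    """Merge continuation lines into the numbered line that precedes them."""
    joined = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines produced by the OCR
        if LINE_NUMBER.match(line) or not joined:
            joined.append(line)
        else:
            # No separator is inserted: the printout wrapped mid-stream.
            joined[-1] += line
    return joined
```

Given the two OCR lines `10 PRINT "HELLO, WOR` and `LD"`, the function returns the single statement `10 PRINT "HELLO, WORLD"`.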

3.3.5. Finish

The reformatted text can be used as input for your emulator, e.g. as a cassette tape image file.
Enjoy!

4. Developer information

4.1. License

The licenses of the bundled software are as follows.

Tesseract
gImageReader

The scripts in this repository are modified versions of Tesseract's and are licensed under the Apache License 2.0, the same as Tesseract.
