Program List OCR
1. What is this?
Program List OCR is a piece of OCR (Optical Character Recognition) software specialized for computer program listings published in the 1980s.
It converts scanned program listing images into plain text. You can then convert this text into an emulator’s input file, e.g. a cassette tape image.
Program List OCR bundles the following open source software:
- Tesseract (OCR engine)
- gImageReader (GUI frontend)
It also contains special OCR language model files:
- BASIC (generic BASIC language)
- N6X-BASIC (BASIC for NEC PC-6001 (Japanese))
- Hexadecimal machine language
The BASIC (bas) model is for generic BASIC language listings. It recognizes printable ASCII characters.
The N6X-BASIC (n6x) model is dedicated to the NEC PC-6001. It recognizes ASCII as well as the PC-6001’s Japanese and graphical characters.
The hexadecimal machine language (hex) model recognizes only hexadecimal digits and a few extra characters (0-9, A-F, Sum). Because its character set is so small, it achieves better accuracy.
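Because the hex model's character set is so small, recognized text can be sanity-checked mechanically. The following is a minimal sketch, not part of Program List OCR: the function name `suspicious_lines` is hypothetical, and treating spaces and colons as allowed separators is an assumption about how hex listings are laid out.

```python
import re

# Characters named in this document: 0-9, A-F and the word "Sum".
# Assumption: spaces and colons may also appear as separators.
ALLOWED = re.compile(r"(?:[0-9A-F]|Sum|[ :])*")


def suspicious_lines(text):
    """Return (line_number, line) pairs containing unexpected characters."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not ALLOWED.fullmatch(line):
            bad.append((number, line))
    return bad


sample = "C3 00 1A\nSum 4F\nC3 O0 1A"
flagged = suspicious_lines(sample)
print(flagged)
```

A flagged line only shows that something outside the expected character set slipped in (here, the letter O in place of a zero); it does not prove that the other lines are correct.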
2. Disclaimer
OCR accuracy depends on the print quality, the scan quality, the printer model used, and the fonts.
3. How to use
3.1. Install
- Double-click ProgramListOCRSetup….exe
- Follow the instructions of the installer.
Warning: Installation fails if the %TEMP% directory is located on a RAM disk. Please detach the RAM disk before installing.
3.3. Operating instructions
3.3.1. Scan images and preprocessing
Scan program listings with your document scanner. (Taking pictures with a camera is not recommended.)
The preferred image format is:
- 600dpi
- grayscale
- TIFF or high quality JPEG
You should deskew and normalize your images.
Scantailor is recommended for preprocessing.
For better accuracy, you can thicken the printed characters with GIMP.
Open the image in GIMP and apply "Filters" → "Generic" → "Erode".
After that, it is recommended to convert the image to 1-bit (black and white), e.g. with GIMP’s "Colors" → "Threshold".
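To illustrate why these two steps help, here is a small pure-Python sketch of a 3x3 erosion followed by a threshold on a grayscale image stored as a list of rows (0 = black, 255 = white). It only approximates the idea behind the GIMP filters and is not GIMP's actual implementation; the function names and the cutoff of 128 are assumptions chosen for the example.

```python
def erode_dark(pixels):
    """3x3 minimum filter: every pixel takes the darkest value in its
    neighbourhood, so dark strokes grow thicker."""
    height, width = len(pixels), len(pixels[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbours = [
                pixels[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(min(neighbours))
        out.append(row)
    return out


def threshold(pixels, cutoff=128):
    """Map every pixel to 0 (black) or 255 (white)."""
    return [[0 if value < cutoff else 255 for value in row] for row in pixels]


# A white 5x5 page with a single dark pixel in the centre.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 40

result = threshold(erode_dark(page))
black_pixels = sum(row.count(0) for row in result)
print(black_pixels)
```

The single dark pixel grows into a 3x3 black block, which is the same thickening effect that makes thin dot-matrix strokes easier to recognize.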
3.3.2. Open images
Click the folder button in the left pane to open images.
Select the image(s) to recognize in the file selection dialog.
3.3.3. Select region and recognize
Warning: Do the following steps page by page.
Drag the mouse to select the region to recognize.
You can add another region with Ctrl + mouse drag.
When you have finished selecting regions, click "Recognize Selection" to run recognition.
"Recognize Selection" is a pull-down button, and you can select the language there.
To recognize a BASIC program listing, choose "bas". To recognize a hexadecimal program listing, e.g. MLX format, choose "hex".
Make sure to set "Language data locations" to "System-wide paths" in the settings.
Recognition can take a very long time.
3.3.4. Reformat text
When recognition finishes, the recognized text appears in the right pane.
Copy and paste the text into your favorite text editor.
At this point, line wrapping is not detected: a program line that wrapped in the printed listing still appears as several lines.
You have to concatenate wrapped lines manually.
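For BASIC listings, part of this joining can be scripted. The sketch below is not part of Program List OCR; it assumes that every real program line starts with a line number and that any other line is the continuation of the previous one. The function name `join_wrapped_lines` is hypothetical.

```python
import re

# Assumption: a genuine BASIC program line begins with a line number.
LINE_NUMBER = re.compile(r"\s*\d+")


def join_wrapped_lines(text):
    """Append lines that do not start with a line number to the previous line."""
    joined = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines left over from recognition
        if LINE_NUMBER.match(line) or not joined:
            joined.append(line.strip())
        else:
            joined[-1] += line.strip()
    return "\n".join(joined)


recognized = '10 PRINT "HELLO, WOR\nLD"\n20 GOTO 10'
print(join_wrapped_lines(recognized))
```

This heuristic fails when a continuation happens to begin with a digit, and it drops any space that fell exactly on the wrap boundary, so the result still needs to be checked against the printed listing.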
3.3.5. Finish
The reformatted text can then be used as input for your emulator, e.g. after conversion to a cassette tape image file.
Enjoy!
4. Developer information
4.1. License
The licenses of the bundled software are as follows.
- Tesseract: Apache License 2.0
- gImageReader: GNU General Public License v3.0
The scripts in this repository are modified from Tesseract and are licensed under the Apache License 2.0, the same as Tesseract.