HNYC City Directory OCR Process using Tesseract v5

From image preprocessing to generating a text file for each scanned city directory image from Manhattan and Brooklyn

This work uses Tesseract to perform Optical Character Recognition (OCR) on historical City Directories (CD) from Manhattan and Brooklyn.

Initial Requirements

This project requires the links to the Manhattan CDs scraped from the NYPL website's API, as well as the individual pages from the Brooklyn CD PDFs.

How to get this project working on your system

git clone https://github.com/CenterForSpatialResearch/hnyc_cd_ocr/
cd hnyc_cd_ocr/
npm install -g nypl-spacetime/hocr-detect-columns

For this project to work on your system, the procedure for detecting columns (and wrapping lines), installed by the third command above, must work on your system. For detecting columns and wrapping lines into valid entries, we have borrowed the process built and graciously open-sourced by the New York Public Library. More details and the code are available from the NYPL project listed under Links.

Workflow of the Scripts

The '.ipynb' file for the corresponding borough and year (e.g. MN_1850-51.ipynb) is present in this repository. This file performs the OCR process on the city directory and writes the result to the year-named folder inside the 'tess_output' folder.
The 'extracting_entries.ipynb' script converts Tesseract's hOCR output to text.
The '.ipynb' file named CRF (e.g. CRF_1850.ipynb) produces the CRF output, saved as a result JSON file inside the CRF output folder (e.g. MN_1850_CRF_output.json).
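
As an illustration of the hOCR-to-text step listed above, here is a minimal sketch that reads the words inside each 'ocr_line' span of an hOCR document and joins them into one line of text. It is not the code in 'extracting_entries.ipynb', and it does no column detection or line wrapping (that is what the NYPL tool handles). The names HocrLineExtractor and hocr_to_text, and the sample fragment, are hypothetical and made up for this example.

```python
from html.parser import HTMLParser


class HocrLineExtractor(HTMLParser):
    """Collects the words of every hOCR 'ocr_line' span as one text line."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._span_stack = []  # class attribute of each currently open <span>
        self._words = None     # words of the line being read, or None

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        css_class = dict(attrs).get("class") or ""
        self._span_stack.append(css_class)
        if "ocr_line" in css_class.split():
            self._words = []

    def handle_data(self, data):
        # Keep text only while inside a line; skip whitespace between words.
        if self._words is not None and data.strip():
            self._words.append(data.strip())

    def handle_endtag(self, tag):
        if tag != "span" or not self._span_stack:
            return
        css_class = self._span_stack.pop()
        if "ocr_line" in css_class.split() and self._words is not None:
            self.lines.append(" ".join(self._words))
            self._words = None


def hocr_to_text(hocr):
    """Return the hOCR content as plain text, one OCR line per text line."""
    parser = HocrLineExtractor()
    parser.feed(hocr)
    parser.close()
    return "\n".join(parser.lines)


if __name__ == "__main__":
    # Invented fragment in the general shape of Tesseract's hOCR output.
    sample = (
        "<span class='ocr_line' title='bbox 10 10 300 40'>"
        "<span class='ocrx_word'>Abbott</span> "
        "<span class='ocrx_word'>John,</span> "
        "<span class='ocrx_word'>grocer</span></span>"
    )
    print(hocr_to_text(sample))
```

Only the standard library is used, so the sketch runs without extra installs; a real directory page would still need the column handling described in the setup section.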

Workflow of the OCR

Cropping the image: This essential step narrows the image down to the necessary input. With less noisy input, Tesseract gives cleaner output. The process involves finding the 4 locations in the image that form the corners of the cropped image (calculated as a percentage of the current image size; pixel indexing starts from the top-left corner at (0,0)). We used the Pillow library's crop method: image.crop((left, top, right, bottom))
Correcting the orientation (rotating): Depending on which side of the book the page appears on (right or left), the orientation had to be corrected. This is probably because of the cameras that were set up for either side of the CD book while capturing the pages. The orientation was the same for all pages on a particular side of each book (usually either 90 or 270 degrees). Command used: image.transpose(Image.ROTATE_90).
Binary thresholding of the image: After trying both global thresholding and local thresholding (particularly Sauvola thresholding), local thresholding seemed to give better results. It is said to work better for historical documents because of issues such as ink leakage, erosion, etc.
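
The three preprocessing steps above can be sketched as follows. This is a hedged illustration, not the code from the notebooks: the crop and rotate calls are the Pillow calls named above, but the helper names (crop_by_fraction, window_sums, sauvola_binarize, preprocess), the use of NumPy, and the parameter values (window, k, r) are assumptions chosen for the example. The thresholding follows Sauvola's rule T = m * (1 + k * (s / r - 1)), where m and s are the local mean and standard deviation.

```python
import numpy as np
from PIL import Image


def crop_by_fraction(image, left, top, right, bottom):
    """Crop to a box whose edges are fractions (0 to 1) of the image size."""
    width, height = image.size
    # Pillow's origin (0, 0) is the top-left corner of the image.
    box = (round(left * width), round(top * height),
           round(right * width), round(bottom * height))
    return image.crop(box)


def window_sums(values, window):
    """Sum of every window x window neighbourhood, via an integral image."""
    integral = np.pad(values, ((1, 0), (1, 0)), mode="constant")
    integral = integral.cumsum(axis=0).cumsum(axis=1)
    return (integral[window:, window:] - integral[:-window, window:]
            - integral[window:, :-window] + integral[:-window, :-window])


def sauvola_binarize(image, window=25, k=0.2, r=128.0):
    """Local thresholding with Sauvola's rule T = m * (1 + k * (s / r - 1)).

    window (odd), k and r are illustrative defaults, not tuned values.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    gray = np.asarray(image.convert("L"), dtype=np.float64)
    padded = np.pad(gray, window // 2, mode="reflect")
    count = window * window
    mean = window_sums(padded, window) / count
    variance = window_sums(padded ** 2, window) / count - mean ** 2
    std = np.sqrt(np.clip(variance, 0.0, None))
    threshold = mean * (1.0 + k * (std / r - 1.0))
    # Pixels brighter than their local threshold become white background.
    binary = np.where(gray > threshold, 255, 0).astype(np.uint8)
    return Image.fromarray(binary)


def preprocess(image, crop_fractions, rotation=Image.ROTATE_90):
    """Crop, rotate, then binarize one page image."""
    cropped = crop_by_fraction(image, *crop_fractions)
    rotated = cropped.transpose(rotation)
    return sauvola_binarize(rotated)
```

For a page from the other side of the book, preprocess(page, fractions, rotation=Image.ROTATE_270) would apply the opposite rotation. The crop fractions have to be chosen per directory, since margins differ between scans.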

Links

Project homepage: https://github.com/CenterForSpatialResearch
Repository: https://github.com/CenterForSpatialResearch/hnyc_cd_ocr/
Related projects:
- NYPL project: https://github.com/nypl-spacetime/city-directories

Folders in this repository:

The Scraping folder contains JavaScript code for scraping TIF image links from NYPL for all years surrounding 1850, 1880 and 1910.
An Excel sheet lists the names of the 10 years whose data is being taken from digitized New York City directories.

Latest Runs:

City Directory	OCR	Data Cleaning	CRF
MN, 1850-51	yes	yes	yes
MN, 1880-81	yes	yes	yes
BK, 1850-51	yes
