CenterForSpatialResearch / hnyc_cd_ocr

Mapping Historical New York: Text Recognition Process for NYC City Directories

HNYC City Directory OCR Process using Tesseract v5

From image preprocessing to generating a text file for each scanned city directory page from Manhattan and Brooklyn

This work uses Tesseract to perform Optical Character Recognition (OCR) on historical City Directories (CDs) from Manhattan and Brooklyn.

Initial Requirements

This project requires the links to the Manhattan CDs, scraped from the NYPL website's API, as well as the individual pages extracted from the Brooklyn CD PDFs.

How to get this project working on your system

git clone https://github.com/CenterForSpatialResearch/hnyc_cd_ocr/
cd hnyc_cd_ocr/
npm install -g nypl-spacetime/hocr-detect-columns

For this project to work on your system, the column-detection and line-wrapping tool installed by the third command above must run successfully. To detect columns and wrap lines into valid entries, we borrow the process built and graciously open sourced by the New York Public Library as hocr-detect-columns (nypl-spacetime/hocr-detect-columns).

Workflow of the Scripts

  • Each borough and year has its own '.ipynb' notebook in this repository (e.g. MN_1850-51.ipynb). The notebook runs the OCR process on that city directory and writes the result to the year-named folder inside the 'tess_output' folder.
  • The 'extracting_entries.ipynb' notebook converts Tesseract's hOCR output to plain text.
  • The CRF notebook (e.g. CRF_1850.ipynb) produces the CRF output and saves it as a JSON file inside the CRF output folder (e.g. MN_1850_CRF_output.json).
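
To make the hOCR-to-text step concrete, here is a minimal sketch using only the Python standard library. It is not the code in 'extracting_entries.ipynb', which relies on NYPL's column detection. It only illustrates the general idea of reading Tesseract's hOCR markup, where each text line is an element with class ocr_line and each word an element with class ocrx_word. The names HocrLineExtractor and hocr_to_text and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class HocrLineExtractor(HTMLParser):
    """Collect the words of every ocr_line element in an hOCR document."""

    def __init__(self):
        super().__init__()
        self.lines = []        # one list of words per detected line
        self._in_word = False  # True while inside an ocrx_word span
        self._buffer = []      # text fragments of the current word

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_line" in classes:
            self.lines.append([])
        elif "ocrx_word" in classes:
            self._in_word = True
            self._buffer = []

    def handle_data(self, data):
        # Text outside word spans is only layout whitespace, so skip it.
        if self._in_word:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Words may wrap their text in <strong>/<em>; only </span> closes a word.
        if tag == "span" and self._in_word:
            word = "".join(self._buffer).strip()
            if word:
                if not self.lines:  # word found before any line element
                    self.lines.append([])
                self.lines[-1].append(word)
            self._in_word = False


def hocr_to_text(hocr):
    """Return the recognized text, one output line per hOCR line."""
    parser = HocrLineExtractor()
    parser.feed(hocr)
    parser.close()
    return "\n".join(" ".join(words) for words in parser.lines if words)


# Made-up hOCR fragment, for illustration only.
SAMPLE = """
<span class='ocr_line' title='bbox 10 10 400 40'>
  <span class='ocrx_word' title='bbox 10 10 90 40; x_wconf 91'>Abbott</span>
  <span class='ocrx_word' title='bbox 100 10 160 40; x_wconf 88'>John,</span>
</span>
<span class='ocr_line' title='bbox 10 50 400 80'>
  <span class='ocrx_word' title='bbox 10 50 80 80; x_wconf 90'><em>grocer</em></span>
</span>
"""

if __name__ == "__main__":
    print(hocr_to_text(SAMPLE))
```

This sketch discards the bounding boxes stored in each title attribute. Column detection and line wrapping need those coordinates, which is why the repository delegates that part to hocr-detect-columns.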

Workflow of the OCR

  • Cropping the image: This essential step narrows the image down to the input that is actually needed. With less noisy input, Tesseract gives cleaner output. The process involves finding the 4 locations in the image that form the corners of the cropped image. These are calculated as percentages of the current image size, with pixel indexing starting at (0,0) in the top left corner. We used the Pillow library's crop command: image.crop((left, top, right, bottom))
  • Correcting the orientation (rotating): Depending on which side of the book the page appears (left or right), the orientation had to be corrected. This is probably because of how the cameras were set up for each side of the CD book while capturing the pages. The orientation was the same for all pages on a particular side of each book (usually either 90 or 270 degrees). Command used: image.transpose(Image.ROTATE_90).
  • Binary thresholding of the image: We tried both global thresholding and local thresholding (specifically Sauvola thresholding), and local thresholding seemed to give better results. It is said to work better for historical documents because of issues such as ink leakage, erosion, etc.
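
The three preprocessing steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebooks' code. The function names crop_box, sauvola_binarize and preprocess are hypothetical, the margins are expressed as fractions of the image size, and k=0.2 and r=128 are commonly used Sauvola parameters rather than values taken from this repository. The thresholding is written in plain Python so the formula is visible. For full-size scans a vectorized implementation such as scikit-image's threshold_sauvola would be the practical choice.

```python
from statistics import fmean, pstdev


def crop_box(width, height, left, top, right, bottom):
    """Turn fractional positions (0.0 to 1.0, measured from the top-left
    origin) into the (left, top, right, bottom) pixel box used by
    Pillow's image.crop."""
    box = (round(width * left), round(height * top),
           round(width * right), round(height * bottom))
    if not (0 <= box[0] < box[2] <= width and 0 <= box[1] < box[3] <= height):
        raise ValueError("crop box must lie inside the image")
    return box


def sauvola_binarize(gray, window=15, k=0.2, r=128.0):
    """Binarize a grayscale image given as a list of rows of 0-255 values.

    Each pixel is compared with its own local threshold
    T = m * (1 + k * (s / r - 1)), where m and s are the mean and
    standard deviation of the pixels in the surrounding window.
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Gather the window around (x, y), clipped at the image border.
            neighbours = [
                gray[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            m = fmean(neighbours)
            s = pstdev(neighbours)
            threshold = m * (1 + k * (s / r - 1))
            row.append(255 if gray[y][x] > threshold else 0)
        result.append(row)
    return result


def preprocess(path, margins, rotation):
    """Crop and rotate one page scan with Pillow.

    Pillow is imported here so the helpers above work without it.
    `margins` is a (left, top, right, bottom) tuple of fractions and
    `rotation` a transpose constant such as Image.ROTATE_90.
    """
    from PIL import Image  # assumption: Pillow is installed

    with Image.open(path) as image:
        box = crop_box(image.width, image.height, *margins)
        return image.crop(box).transpose(rotation)
```

For example, crop_box(1000, 2000, 0.1, 0.05, 0.9, 0.95) gives (100, 100, 900, 1900). Because the threshold follows the local mean, a page with uneven lighting or stains is split into ink and background region by region, which a single global threshold cannot do.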

Links

Folders in this repository:

  • Scraping folder contains JavaScript code for scraping TIF image links from NYPL for all years surrounding 1850, 1880 and 1910.
  • Excel sheet with the names of the 10 years whose data is being taken from digitized New York City directories.

Latest Runs:

City Directory   OCR   Data Cleaning   CRF
MN, 1850-51      yes   yes             yes
MN, 1880-81      yes   yes             yes
BK, 1850-51      yes
