From image preprocessing to generating a corresponding text file for each scanned city directory image from Manhattan and Brooklyn
This work covers the use of Tesseract to run Optical Character Recognition (OCR) on historical City Directories (CDs) from Manhattan and Brooklyn.
This project requires the links to the Manhattan CDs, scraped through the NYPL website's API, as well as the individual pages from the Brooklyn CD PDFs.
git clone https://github.com/CenterForSpatialResearch/hnyc_cd_ocr/
cd hnyc_cd_ocr/
npm install -g nypl-spacetime/hocr-detect-columns
For this project to work on your system, the tool installed by the third command above, which detects columns (and wraps lines), must run correctly. For detecting columns and wrapping lines into valid entries, we have borrowed the process built and graciously open-sourced by the New York Public Library. More details and the code are available in the nypl-spacetime/hocr-detect-columns package installed above.
- The '.ipynb' file for the corresponding borough and year (e.g., MN_1850-51.ipynb) is present in this repository. This file runs OCR on the city directory and writes the result to the year-named folder inside the 'tess_output' folder.
- The 'extracting_entries.ipynb' notebook converts Tesseract's hOCR output to text.
- The CRF '.ipynb' file (e.g., CRF_1850.ipynb) produces the CRF output, saved as a result JSON file inside the CRF output folder (e.g., MN_1850_CRF_output.json).
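To illustrate the hOCR-to-text step in the list above, here is a minimal sketch that collects the words of each `ocr_line` span in an hOCR document and joins them into plain text lines. It is not the code from 'extracting_entries.ipynb'; the names `HocrLineExtractor` and `hocr_to_text` are hypothetical, and the sketch ignores column detection and line wrapping, which this project delegates to the NYPL tool.

```python
from html.parser import HTMLParser


class HocrLineExtractor(HTMLParser):
    """Collects the text of every span with class 'ocr_line' (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.lines = []      # finished text lines, in document order
        self._words = None   # words of the line being read, or None outside a line
        self._depth = 0      # span nesting depth inside the current line

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._words is None:
            if "ocr_line" in classes:
                self._words = []
                self._depth = 1
        else:
            # A nested span (typically an 'ocrx_word') inside the current line.
            self._depth += 1

    def handle_endtag(self, tag):
        if tag != "span" or self._words is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # The line's own closing tag: store the joined words.
            self.lines.append(" ".join(self._words))
            self._words = None

    def handle_data(self, data):
        if self._words is not None and data.strip():
            self._words.append(data.strip())


def hocr_to_text(hocr):
    """Return the recognized text of an hOCR string, one OCR line per text line."""
    parser = HocrLineExtractor()
    parser.feed(hocr)
    parser.close()
    return "\n".join(parser.lines)
```

Tesseract can also emit other line-level classes (for example headers or captions); this sketch only looks at `ocr_line`, so adjust the class check if those lines matter for your directory pages.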
- Cropping the image: This essential step narrows the image down to the necessary input. With less noisy input, Tesseract gives cleaner output. The process involves finding the 4 locations in the image that form the corners of the cropped image (calculated as a percentage of the current image size; pixel indexing starts at the top-left corner, (0,0)). We used the Pillow library's image.crop command: image.crop((left, top, right, bottom))
- Correcting the orientation (rotating): Depending on which side of the book the page appears on (right or left), the orientation had to be corrected. This is probably because of how the cameras were set up for each side of the CD book while capturing the pages. The orientation was the same for all pages on a particular side of each book (usually either 90 or 270 degrees). Command used: image.transpose(Image.ROTATE_90).
- Binary thresholding of the image: After trying both global thresholding and local thresholding (particularly Sauvola thresholding), local thresholding seemed to give better results. It is said to work better for historical documents because of issues such as ink leakage, erosion, etc.
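The three preprocessing steps above can be sketched as follows. This is a minimal illustration and not the notebooks' actual code: the function names, the percentage values, and the `window_size` and `k` parameters are assumptions, and scikit-image's `threshold_sauvola` is used here as one available Sauvola implementation.

```python
import numpy as np
from PIL import Image
from skimage.filters import threshold_sauvola


def crop_box(size, left_pct, top_pct, right_pct, bottom_pct):
    """Turn fractions of the image size into a (left, top, right, bottom) pixel box.

    Pixel indexing starts at the top-left corner, (0, 0).
    """
    width, height = size
    return (
        round(width * left_pct),
        round(height * top_pct),
        round(width * right_pct),
        round(height * bottom_pct),
    )


def preprocess(image, box_pct, rotation=Image.ROTATE_90, window_size=25, k=0.2):
    """Crop, rotate, and binarize one scanned page (hypothetical helper)."""
    # 1. Crop to the region of interest, given as fractions of the page size.
    cropped = image.crop(crop_box(image.size, *box_pct))

    # 2. Correct the orientation; use Image.ROTATE_270 for the opposite page side.
    rotated = cropped.transpose(rotation)

    # 3. Local (Sauvola) thresholding on the grayscale image.
    #    window_size must be odd; both values here are illustrative defaults.
    gray = np.asarray(rotated.convert("L"))
    threshold = threshold_sauvola(gray, window_size=window_size, k=k)
    binary = gray > threshold  # True = background (light), False = ink (dark)

    return Image.fromarray((binary * 255).astype(np.uint8))
```

A call such as `preprocess(Image.open(path), (0.1, 0.1, 0.9, 0.9))` would keep the central 80% of the page in each direction; the right fractions and rotation depend on the book and the page side, so they need to be checked per directory.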
- Project homepage: https://github.com/CenterForSpatialResearch
- Repository: https://github.com/CenterForSpatialResearch/hnyc_cd_ocr/
- Related projects:
  - NYPL project: https://github.com/nypl-spacetime/city-directories
Folders in this repository:
- The Scraping folder contains JavaScript code for scraping TIF image links from NYPL for all years surrounding 1850, 1880 and 1910.
- An Excel sheet lists the 10 years whose data is being taken from the digitized New York City directories.
Latest Runs:
City Directory | OCR | Data Cleaning | CRF |
---|---|---|---|
MN, 1850-51 | yes | yes | yes |
MN, 1880-81 | yes | yes | yes |
BK, 1850-51 | yes | | |