Open Source OCR for Large Collections of Scanned Documents - Art Rhyno

Question

kba opened this issue 8 years ago

Philipp Zumstein · Answer 1 · Sun Nov 06 2016 18:09:05 GMT+0800 (China Standard Time)

Some points from @artunit's talk:

concentrates on OCR of newspapers on microfilms/microfiches
concentrates on ABBYY vs. Tesseract
- comparison slide: https://youtu.be/gcjCiS9pJ3A?t=1439
mentions the Line Segment Detector: http://www.ipol.im/pub/art/2012/gjmr-lsd/
mentions the Olena project: https://www.lrde.epita.fr/wiki/Olena
distributed OCR with Hadoop
mentions his repo https://github.com/artunit/ossocr
discussion at the end
- backup, storage on hard drives
- OCRopus (based on Tesseract at this time): Python-based, effective for book pages but not newspapers
- What is Google's role in Tesseract?
- How should it all be presented in the end? As annotations for the users?
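
To make the "distributed OCR with Hadoop" point a bit more concrete, here is a rough sketch of what a Hadoop Streaming mapper around the Tesseract command line could look like. This is not taken from the talk or from the ossocr repo. It assumes each input line is a path to a page image that the worker node can read locally, that `tesseract` is on the PATH, and that one tab-separated `path<TAB>text` record per image is a useful output. The function names are made up for illustration.

```python
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build a Tesseract CLI call that writes the recognized text to stdout."""
    # "stdout" as the output base makes Tesseract print instead of writing a file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_tesseract(image_path):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def map_lines(lines, ocr=run_tesseract):
    """Yield one 'path<TAB>text' record per non-empty input line."""
    for line in lines:
        path = line.strip()
        if not path:
            continue
        # Collapse all whitespace so each record stays on a single line,
        # which is what Hadoop Streaming expects for key/value output.
        text = " ".join(ocr(path).split())
        yield f"{path}\t{text}"
```

Used as a streaming mapper, the script would loop over `map_lines(sys.stdin)` and print each record. The `ocr` parameter is only there so the mapper logic can be tried without Tesseract installed, or so another engine can be swapped in for comparison.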