Open Source OCR for Large Collections of Scanned Documents - Art Rhyno
kba opened this issue
Some points from @artunit's talk:
- concentrates on OCR of newspapers on microfilm/microfiche
- concentrates on ABBYY vs. Tesseract
- comparison slide: https://youtu.be/gcjCiS9pJ3A?t=1439
- mentions the Line Segment Detector: http://www.ipol.im/pub/art/2012/gjmr-lsd/
- mentions the Olena project: https://www.lrde.epita.fr/wiki/Olena
- distributed OCR with Hadoop
- mentions his repo https://github.com/artunit/ossocr
- discussion at the end
  - backup, storage on hard drives
  - OCRopus (based on Tesseract at this time): Python-based, effective for book pages, not newspapers
  - What is Google's role in Tesseract?
  - How to present all of that in the end? Annotations for the users?
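
To make the "distributed OCR with Hadoop" point a bit more concrete, here is a minimal sketch of what a Hadoop Streaming mapper around the `tesseract` command line could look like. This is not taken from the talk or from the ossocr repo: the function names (`run_tesseract`, `map_lines`, `main`), the one-image-path-per-input-line layout, and the `eng` language default are all assumptions for illustration. It only relies on Hadoop Streaming accepting any program that reads records from stdin and writes tab-separated key/value lines to stdout.

```python
import subprocess
import sys


def run_tesseract(image_path, lang="eng"):
    """OCR one image by shelling out to the tesseract CLI.

    Passing 'stdout' as the output base makes tesseract print the
    recognized text instead of writing a .txt file.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def map_lines(lines, ocr=run_tesseract):
    """Yield one 'path<TAB>text' record per non-empty input line.

    Assumption: each input line holds the path of a single page image
    that is readable from the worker node.
    """
    for line in lines:
        path = line.strip()
        if not path:
            continue
        try:
            text = ocr(path)
        except (subprocess.CalledProcessError, OSError) as err:
            # Emit a marker and keep going so one bad scan does not
            # fail the whole map task.
            yield f"{path}\tERROR {type(err).__name__}"
            continue
        # Tabs and newlines delimit streaming records, so collapse all
        # whitespace in the OCR text to single spaces.
        yield f"{path}\t{' '.join(text.split())}"


def main():
    # Hypothetical entry point: call this at the bottom of the mapper script.
    for record in map_lines(sys.stdin):
        print(record)
```

The `ocr` parameter is only there so the mapper logic can be tried without Tesseract installed, e.g. `list(map_lines(["page1.tif"], ocr=lambda p: "some text"))`. A real setup would also need to decide how the images reach the worker nodes, which this sketch leaves open.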