norvig / paip-lisp

Lisp code for the textbook "Paradigms of Artificial Intelligence Programming"


Upcoming: a better scan?

pronoiac opened this issue · comments

I'm thinking about getting or making a better scan.

It feels like the ebook copies I've seen come from an earlier draft than what made it into print, and that the print copy benefited from more proofreading. We've fixed mistakes that were present in the ebooks but not in the print copy, and that weren't artifacts of the ebook-to-markdown conversion. A better scan could result in better OCR and fix a bunch of mistakes. (Though it might not! OCR is hard.)

For scanning:

- I'm building the TIFLIC scanner from the DIY Book Scanner forum.

For OCR, I have thoughts and notes, but it feels like a different discussion / issue, and I don't want to get too far ahead of myself.

Any thoughts or suggestions?

> A better scan could result in better OCR and fix a bunch of mistakes. (Though it might not! OCR is hard.)

Thoughts below...

Do you imagine doing a diff between the current Markdown files and the new OCR scan? A diff would probably be more efficient than manually reading every word in, say, a print version and comparing to the current Markdown version on GitHub.

However, there may be enough errors in the new OCR to make doing a diff significantly frustrating and time-consuming. A few years ago, I did a destructive scan of several books and ran OCR on the PDFs; ligatures (e.g., "fi") tended not to be OCR'd correctly. Perhaps OCR tech has improved; I don't know.
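One mitigation, if the engine at least emits the Unicode ligature codepoints rather than misreading them entirely: a post-processing pass can expand them. A minimal sketch with sed; the file names are made up:

```sh
# Expand common Unicode ligatures that OCR sometimes emits verbatim.
# ocr-output.txt / ocr-clean.txt are hypothetical file names.
sed -e 's/ﬁ/fi/g' \
    -e 's/ﬂ/fl/g' \
    -e 's/ﬀ/ff/g' \
    -e 's/ﬃ/ffi/g' \
    -e 's/ﬄ/ffl/g' \
    ocr-output.txt > ocr-clean.txt
```

This only helps when the ligature survives as a single codepoint; it can't recover cases where the engine misread the glyph outright.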

The PAIP printed version is <1000 pages, so a destructive scan would cost $10 through 1DollarScan. The expensive part would be the time to compare/update the Markdown files.

> I'm building the TIFLIC scanner from the DIY Book Scanner forum.

Nice. :)

> Do you imagine doing a diff between the current Markdown files and the new OCR scan? A diff would probably be more efficient than manually reading every word in, say, a print version and comparing to the current Markdown version on GitHub.

Oh, yes, diff is the way to go. (The thought of manual comparison makes me shudder a bit. It's almost 1k pages!) I use `git diff --word-diff` and `git add --patch` for interactivity, and sometimes Sublime Merge, which presents diffs something like GitHub does and allows staging hunks at a more granular level.
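Roughly the mechanics I have in mind, as a sketch: commit the OCR text on its own branch at the same paths as the hand-edited Markdown, then word-diff against it. Branch and file names here are made up.

```sh
# Hypothetical layout: OCR output committed on a branch named "ocr",
# at the same paths as the hand-edited Markdown.
git checkout -b ocr
cp ~/scans/chapter10-ocr.md docs/chapter10.md   # made-up paths
git commit -am "OCR of chapter 10"

# Word-level diff against the hand-edited version on the main branch.
git diff --word-diff master..ocr -- docs/chapter10.md

# Interactively pick only the corrections worth keeping.
git checkout master
git checkout --patch ocr -- docs/chapter10.md
```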

San Francisco Public Library has document scanners with feeders, and I've heard FedEx can cut off book spines; it would take some time, but it could allow for retries.

> Oh, yes, diff is the way to go.

It will be interesting to see whether newer OCR techniques beat the hand-edited older version. I suspect the hand-edited version will be better, so the diff will be noisy with bad OCR... but I am not sure.

Have you tried Magit (https://magit.vc/) as a Git interface?

I have pretty good 600 dpi greyscale scans, made at the local library. At 3 GB, the file is too large for a GitHub Release as a single file; the limit is 2 GB. At 300 dpi and/or in black and white it would fit, and it would be nicer to read, but the downsampling might hurt OCR.
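Producing the smaller version is straightforward; a sketch with ImageMagick, with made-up file names:

```sh
# Downsample a 600 dpi greyscale scan to 300 dpi.
# scan-600dpi.tif / scan-300dpi.tif are made-up file names.
magick scan-600dpi.tif -resample 300 scan-300dpi.tif

# Or go further and threshold to black & white, which shrinks the file
# more but may cost OCR accuracy.
magick scan-600dpi.tif -resample 300 -threshold 50% scan-bw.tif
```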

I found a comparison of OCR engines that recommended Google, Amazon, and ABBYY, though Tesseract is close, and I can run it locally. I'm considering throwing chapter 10, specifically, at each of these; it's the first chapter that hasn't gotten scrutiny.
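For the local Tesseract run, something like this minimal sketch, assuming one image per page (file names made up):

```sh
# Run Tesseract locally over per-page images of chapter 10.
# page-*.tif are made-up file names; the "txt" config writes page-NNN.txt.
for page in page-*.tif; do
    tesseract "$page" "${page%.tif}" --dpi 300 -l eng txt
done

# Stitch the pages into one file for diffing.
cat page-*.txt > chapter10-ocr.txt
```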

Re: Magit: not yet. I tried Emacs a while ago, but didn't stick with it long enough to get comfortable.

I posted the 300 dpi scan as a release.

Running diff against the OCR of this yielded a lot of corrections: #146 .

Using `delta` or another OCR engine might be useful later, but this is a good stopping point for now.