norvig / paip-lisp

Lisp code for the textbook "Paradigms of Artificial Intelligence Programming"


Upcoming: a better scan?

pronoiac opened this issue · comments

I'm thinking about getting or making a better scan.

It feels like the ebook copies I've seen come from an earlier draft than what made it into print, and that the print copy benefited from more proofreading. We've fixed mistakes that were present in the ebooks but not in the print copy, and that weren't artifacts of the ebook-to-markdown conversion. A better scan could result in better OCR and fix a bunch of mistakes. (Though it might not! OCR is hard.)

For scanning:

- I'm building the TIFLIC scanner from the DIY Book Scanner forum.

For OCR, I have thoughts and notes, but it feels like a different discussion / issue, and I don't want to get too far ahead of myself.

Any thoughts or suggestions?

> A better scan could result in better OCR and fix a bunch of mistakes. (Though it might not! OCR is hard.)

Thoughts below...

Do you imagine doing a diff between the current Markdown files and the new OCR scan? A diff would probably be more efficient than manually reading every word in, say, a print version and comparing to the current Markdown version on GitHub.

However, there may be enough errors in the new OCR to make doing a diff significantly frustrating and time-consuming. A few years ago, I did a destructive scan of several books and ran OCR on the PDFs; ligatures (e.g., "fi") tended not to be OCR'd correctly. Perhaps OCR tech has improved; I don't know.
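One mitigation, if the engine at least emits the Unicode ligature codepoints rather than misreading them entirely: a post-processing pass can expand them. A minimal sketch with sed; the file names are made up:

```sh
# Expand common Unicode ligatures that OCR sometimes emits verbatim.
# ocr-output.txt / ocr-clean.txt are hypothetical file names.
sed -e 's/ﬁ/fi/g' \
    -e 's/ﬂ/fl/g' \
    -e 's/ﬀ/ff/g' \
    -e 's/ﬃ/ffi/g' \
    -e 's/ﬄ/ffl/g' \
    ocr-output.txt > ocr-clean.txt
```

This only helps when the ligature survives as a single codepoint; it can't recover cases where the engine misread the glyph outright.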

The PAIP printed version is <1000 pages, so a destructive scan would cost $10 through 1DollarScan. The expensive part would be the time to compare/update the Markdown files.

> I'm building the TIFLIC scanner from the DIY Book Scanner forum.

Nice. :)

> Do you imagine doing a diff between the current Markdown files and the new OCR scan? A diff would probably be more efficient than manually reading every word in, say, a print version and comparing to the current Markdown version on GitHub.

Oh, yes, diff is the way to go. (The thought of manual comparison makes me shudder a bit. It's almost 1k pages!) I use `git diff --word-diff` and `git add --patch` for interactivity, and sometimes Sublime Merge, which presents diffs something like GitHub does and allows staging hunks at a more granular level.
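Roughly the mechanics I have in mind, as a sketch: commit the OCR text on its own branch at the same paths as the hand-edited Markdown, then word-diff against it. Branch and file names here are made up.

```sh
# Hypothetical layout: OCR output committed on a branch named "ocr",
# at the same paths as the hand-edited Markdown.
git checkout -b ocr
cp ~/scans/chapter10-ocr.md docs/chapter10.md   # made-up paths
git commit -am "OCR of chapter 10"

# Word-level diff against the hand-edited version on the main branch.
git diff --word-diff master..ocr -- docs/chapter10.md

# Interactively pick only the corrections worth keeping.
git checkout master
git checkout --patch ocr -- docs/chapter10.md
```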

San Francisco Public Library has document scanners with feeders, and I've heard FedEx can cut off book spines; it would take some time, but it could allow for retries.

> Oh, yes, diff is the way to go.

It will be interesting to see whether newer OCR techniques beat the hand-edited older version. I suspect the hand-edited version will be better, so the diff will be noisy with bad OCR... but I am not sure.

Have you tried Magit (https://magit.vc/) as a Git interface?

I have pretty good 600 dpi greyscale scans, made at the local library. At 3 GB, the file is too large for a GitHub Release as a single file; the limit is 2 GB. At 300 dpi and/or in black and white it would fit, and it would be nicer to read, but the downsampling might hurt OCR.
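Producing the smaller version is straightforward; a sketch with ImageMagick, with made-up file names:

```sh
# Downsample a 600 dpi greyscale scan to 300 dpi.
# scan-600dpi.tif / scan-300dpi.tif are made-up file names.
magick scan-600dpi.tif -resample 300 scan-300dpi.tif

# Or go further and threshold to black & white, which shrinks the file
# more but may cost OCR accuracy.
magick scan-600dpi.tif -resample 300 -threshold 50% scan-bw.tif
```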

I found a comparison of OCR engines that recommended Google, Amazon, and ABBYY, though Tesseract is close, and I can run it locally. I'm considering throwing chapter 10, specifically, at each of these; it's the first chapter that hasn't gotten scrutiny.
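For the local Tesseract run, something like this minimal sketch, assuming one image per page (file names made up):

```sh
# Run Tesseract locally over per-page images of chapter 10.
# page-*.tif are made-up file names; the "txt" config writes page-NNN.txt.
for page in page-*.tif; do
    tesseract "$page" "${page%.tif}" --dpi 300 -l eng txt
done

# Stitch the pages into one file for diffing.
cat page-*.txt > chapter10-ocr.txt
```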

Re: Magit: not yet. I tried Emacs a while ago, but didn't stick with it long enough to get comfortable.

I posted the 300 dpi scan as a release.

Running diff against the OCR of this yielded a lot of corrections: #146 .

Using `delta` or another OCR engine might be useful later, but this is a good stopping point for now.