norvig / paip-lisp

Lisp code for the textbook "Paradigms of Artificial Intelligence Programming"

PDF version contains mangled glyphs

atomontage opened this issue

Both in code snippets (e.g. page 433) and in the text itself (e.g. page 105).

The version available here is better in that regard:
http://gen.lib.rus.ec/book/index.php?md5=E64BEAAE4866C4F055A051D40C76BD5A

commented

@jawbroken The explanation is OCR: they have opted to "re-set" the text from the recognised results (probably using some automatic font parameter recognition). This is never the right approach and, as you point out, carries a very high risk of the dangerous similar-glyph replacements that occur with djvu-like schemes. They should have left the original bitmaps. It's doubly ironic that a Norvig text should fall victim to such automated mangling.

Actual scans of course do not have the errors, and the typesetting quality is invariably better than the "automatic" re-setting the OCR software does. (I've also encountered both this mangled PDF and the better scanned version in the past.)

I'd be willing to help transcribe (I have a good amount of experience in TeX and some with LaTeX, former professional typesetter and proofreader). Maybe a group of volunteers could divide up the pages for transcription... and proofreading...

commented

Yes, it does happen in lossy compression of bitmaps (the JBIG/djvu case), but in this case we know these are OCR errors, because the type has been re-set by the OCR software (zoom in to confirm). You can even see the words where the OCR has given up and just dropped in bitmaps. This really is a pointless feature and should never be used, especially not on technical work. (Unfortunately I've seen plenty of this, including even more egregious examples: https://twitter.com/Symbo1ics/status/934889147826884611 via http://infolab.stanford.edu/pub/cstr/reports/cs/tr/66/43/CS-TR-66-43.pdf and sadly there are plenty more ruined documents on that site.)

For those interested, the fonts used in PAIP are:

  • Optima on title page and major headings, headers and footers
  • Palatino for body text and other headings
  • Letter Gothic for code

It has to be said that the original design and typesetting are really quite good, and appear to be non-PostScript phototypesetting or digital setting, with commensurately high-quality fonts.

Thanks @atomontage, that is indeed a much better version; replacing the old files now ... oops, GitHub has a 25 MB limit and this file is 42 MB. I'll try to condense it.

Great!