Clean up .txt version; port to md or some other format

Question

norvig opened this issue 6 years ago · comments

The .txt version has a lot of errors; I got it from the default Save as other / ...Text menu item in Acrobat. An automated tool could rejoin the lines that end in hyphens, and perhaps find missing spaces, as in programmingpractices and anunfortunate. Other errors would require significant human labor to clean up.
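The two automated fixes mentioned above could be sketched roughly as follows. This is only an illustration, not a tool from this thread: the function names and the tiny `vocab` set are hypothetical, and a real run would need a proper word list and human review of every change.

```python
import re


def rejoin_hyphenated(text):
    """Join a word split across lines, e.g. 'prac-\\ntices' -> 'practices'."""
    # Lowercase letter, hyphen at end of line, lowercase letter on the next line.
    # Genuine compounds broken at the hyphen ('well-\nknown') are joined too,
    # so the output still needs a human check.
    return re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)


def split_run_together(word, vocab):
    """Return (left, right) if an unknown word is two known words fused together."""
    if word in vocab:
        return None
    for i in range(1, len(word)):
        if word[:i] in vocab and word[i:] in vocab:
            return word[:i], word[i:]
    return None


# Hypothetical miniature word list, for illustration only.
vocab = {"programming", "practices", "an", "unfortunate"}
print(rejoin_hyphenated("programming prac-\ntices"))
print(split_run_together("programmingpractices", vocab))
print(split_run_together("anunfortunate", vocab))
```

The split check only proposes candidates; it cannot tell a fused pair from a legitimate word that happens to be missing from the word list.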

Roland Meertens · Answer 1 · Tue Feb 27 2018 07:58:29 GMT+0800 (China Standard Time)

In what format was the book originally formatted? Do you still have those original files somewhere? Original source -> something great is probably easier than original source -> pdf -> conversion -> something great...

Peter Norvig · Answer 2 · Tue Feb 27 2018 12:18:49 GMT+0800 (China Standard Time)

It is a sad, sad story ... The original files were in LaTeX, but the publisher (Morgan Kaufmann) hired a design company to do final page layout/design/formatting, so I never had the final-final files. I got careless with my near-final files: during one transfer of machines the files were lost because I didn't realize that I needed to do tar --dereference to follow symbolic links. I wasn't too worried because I assumed I could always get the files from Morgan Kaufmann if I needed them; however, when Morgan Kaufmann was bought by Elsevier, they too managed to lose the files.

Roland Meertens · Answer 3 · Tue Feb 27 2018 16:42:28 GMT+0800 (China Standard Time)

That is indeed a pretty sad story... Someone on Hacker News mentioned that there was a clearer PDF version available for ACM members (https://news.ycombinator.com/item?id=16470513). Would it be possible to upload that version instead of the version that is online right now? Might make the conversion a bit easier...

Alex Schroeder · Answer 4 · Wed Feb 28 2018 01:17:00 GMT+0800 (China Standard Time)

Can we use this issue to coordinate assigning the chapters to people and porting to Markdown? I'll volunteer for a chapter.

Toby · Answer 5 · Wed Feb 28 2018 01:30:12 GMT+0800 (China Standard Time)

I'll volunteer for a chapter, but please establish the Markdown guidelines first (e.g. how heading levels will correspond to the typographic levels in the book).

Peter Norvig · Answer 6 · Wed Feb 28 2018 01:40:09 GMT+0800 (China Standard Time)

Thank you for volunteering! I think we need to:

1. Wait a few days to see if a better source pdf file emerges.
2. Establish guidelines as @qu1j0t3 says.
3. See if we can develop some automated tools to apply across all chapters.

Jean-Philippe Paradis · Answer 7 · Wed Feb 28 2018 03:23:56 GMT+0800 (China Standard Time)

Hello,

In the interest of potentially averting duplicate work, I thought I'd let you know that I am intent on making a "modern" mobile-friendly HTML version of PAIP. (I am not yet fully committed to this project, because it would indeed delay a very important project of mine again. I can't make any promises yet, I'm still in the exploration phase.)

I completed a very similar project a few months ago: starting from a much more primitive HTML version, I made a modern public domain version of chapters 5 and 6 of AMOP (the CLOS MOP specification), mostly by hand: https://clos-mop.hexstreamsoft.com/

So, given my current infrastructure, this new modern HTML conversion project should be relatively straightforward (hopefully!), although the volume is far, far more... voluminous. I'll go at this lone-wolf style, per my usual modus operandi.

I hope this information is deemed to be of some utility (and not too spammy).

edit: Oh geez, this book is around 1000 pages?? I didn't quite realize how massive this book is... I could eventually tackle such a project, but absolutely not anytime soon. I'm officially relegating my PAIP conversion project to the indeterminate future, then. Sorry for the noise.

Roland Meertens · Answer 8 · Wed Feb 28 2018 04:46:12 GMT+0800 (China Standard Time)

Just wanted to weigh in that I think @norvig is right, a better PDF version would make conversion easier... Also: three hours developing a semi-automatic tool will definitely outweigh three hours of manually correcting/hacking parts of the original together.

Also: if we end up with a good "ground-truth" version of the book I see a nice OCR formatting challenge for AI algorithms ;)

Mi Yu · Answer 9 · Wed Feb 28 2018 05:21:18 GMT+0800 (China Standard Time)

I hacked together a Python (3.6+) script to separate out different chapters from the text file. I wonder if this script can be used as the starting point for automated spell-checking, parsing for footnotes and code snippets, etc. in the future.

The script contains a dictionary that maps pdf page number to chapters. Excluding front-matter, preface, appendix, bibliography and index, there are 25 chapters in total. Considering the voluminous-ness of the book, I wonder if we could start another issue for people to claim their favorite chapters to work on.

Might I suggest we use GitHub-flavored Markdown for the purpose of this project? It seems to be one of the more popular flavors of Markdown, and we would have a variety of tools available to convert GFM to other document formats.

edit: I used the form feed character in the text file to delineate page boundaries. A recent pull request seems to have replaced them with whitespace. Hmm..
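For readers who have not seen the script, the approach described above (form feeds as page boundaries, a dictionary from PDF page number to chapter) might look something like this. It is a sketch under assumptions, not the actual script: the page numbers in `CHAPTER_START_PAGES` are made up and the function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical mapping: first PDF page of a chapter -> chapter number.
# The real page numbers would come from the book's table of contents.
CHAPTER_START_PAGES = {1: 1, 4: 2, 6: 3}


def split_pages(text):
    """Split extracted text into pages at form feed characters."""
    return text.split("\f")


def group_by_chapter(pages, start_pages):
    """Return {chapter_number: [page_text, ...]}; pages are numbered from 1."""
    starts = sorted(start_pages)
    chapters = {}
    for number, page in enumerate(pages, start=1):
        idx = bisect_right(starts, number) - 1
        if idx < 0:
            continue  # pages before the first mapped chapter (front matter)
        chapters.setdefault(start_pages[starts[idx]], []).append(page)
    return chapters


pages = split_pages("p1\fp2\fp3\fp4\fp5\fp6")
print(group_by_chapter(pages, CHAPTER_START_PAGES))
```

If the form feeds have been stripped from the text file, `split_pages` returns a single page and the page-to-chapter mapping no longer lines up.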

Peter Norvig · Answer 10 · Wed Feb 28 2018 05:46:33 GMT+0800 (China Standard Time)

@rmeertens I checked out the ACM pdf -- it is different than the ones we've seen before, and at 91MB almost twice as big. Not sure if it is better.

I did upload an alternative txt file, PAIP-alt.txt which we can investigate to see if it is better. If it is, we want to make sure to capture the work that was done on the previous PAIP.txt

Alex Schroeder · Answer 11 · Wed Feb 28 2018 05:46:35 GMT+0800 (China Standard Time)

What I saw in converting a few PDF files to text for a wiki a few years back makes me think that it will be hard to develop good tools to undo the non-systematic errors. But I'm willing to wait, no problem.

Peter Norvig · Answer 12 · Wed Feb 28 2018 09:19:31 GMT+0800 (China Standard Time)

Does someone want to take a more careful look at PAIP.txt and PAIP-alt.txt and weigh in on which is better? Or if they make different errors, any good ideas for a way to use both?

Mi Yu · Answer 13 · Wed Feb 28 2018 10:28:14 GMT+0800 (China Standard Time)

@norvig Here are the primary difficulties I see:

- PAIP.txt has more spelling errors / wrong characters.
- PAIP-alt.txt does not have empty newlines after paragraphs, which might be gnarly when converting to Markdown.

We could consider either of the solutions below:

1. Use PAIP.txt primarily, and run a script to detect spelling errors; when an error is detected, go to the corresponding location in PAIP-alt.txt, fetch the alternative word there, and prompt the editor for verification.
2. Use PAIP-alt.txt primarily, but iterate through PAIP.txt, looking for paragraph breaks. As far as I can see, the content of the lines in both files appears to match up, so we can insert paragraph breaks at appropriate locations in PAIP-alt.txt.

The first approach seems to be more labor intensive (we still need to manually confirm if the script finds the right word in the alternative file), so maybe we can use the second approach, and figure out how to work from there.
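A minimal sketch of the second approach, assuming the non-blank lines of the two files really do correspond one-to-one and in the same order (which would need to be verified on the real files). The function name is hypothetical.

```python
def insert_paragraph_breaks(primary_lines, alt_lines):
    """Copy the blank-line structure of primary_lines onto alt_lines.

    Assumes the non-blank lines of both inputs correspond one-to-one, in order.
    """
    alt_iter = iter(line for line in alt_lines if line.strip())
    merged = []
    for line in primary_lines:
        if line.strip():
            alt_line = next(alt_iter, None)
            if alt_line is None:
                raise ValueError("alt text has fewer non-blank lines than primary")
            merged.append(alt_line)
        elif merged and merged[-1] != "":
            merged.append("")  # keep a single blank line per paragraph break
    if next(alt_iter, None) is not None:
        raise ValueError("alt text has more non-blank lines than primary")
    return merged


primary = ["Tbe first llne", "wraps here", "", "Second para"]
alt = ["The first line", "wraps here", "Second para"]
print(insert_paragraph_breaks(primary, alt))
```

The `ValueError` cases are there because any line that is split or merged differently in the two files breaks the one-to-one assumption, and silently continuing would shift every later paragraph break.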

Alex Schroeder · Answer 14 · Wed Feb 28 2018 15:19:58 GMT+0800 (China Standard Time)

When I did some visual skimming, PAIP-alt.txt didn't seem to offer significant benefits. Paragraph breaks were mentioned. These also affect chapter breaks and page headers. Anything that is not English has to be fixed manually, I fear. "2 - f 2 = " or " 2 -f 2 =" are both bad. " 2 2 + " is slightly better than " 22+ " but both need manual intervention. Same goes for code blocks. Here's an example where somebody mistakenly removed the 4 from PAIP.txt:

> (+ 2 2)
4
>

vs.

> (+ 2 2)

>

But further down both use > (+ 2 2) ^ 4 instead of > (+ 2 2) ⇒ 4.

Another example where both variants need manual intervention:

> 'John =^ JOHN 
> '(John Q Public) =^ (JOHN Q PUBLIC) 
> '2 2 
> . => . 
> ' ( + 2 2) =.> (+ 2 2) 
> (+ 2 2) =^ 4 
> John ^ Error: ]OHN is not a bound variable 
> (John Q Public) ^ Error: JOHN is not a function

vs.

> 'John => JOHN 

> '(John Q Public) => (JOHN Q PUBLIC) 

> '2 2 

> . => . 

> '(+ 2 2) =.> (+ 2 2) 

> (+ 2 2) => 4 

> John ^ Error: ]OHN is not a bound variable 

> (John Q Public) ^ Error: JOHN is not a function

That's why I conclude that PAIP-alt.txt is not necessarily better. Furthermore, since a lot of manual intervention is required, simple tools might not suffice.

I think my process would consist of tons of interactive search and replace operations in Emacs, some keyboard macros, and a lot of actual reading and formatting and carefully comparing code with what the book says in addition to a lot of talking with others about what markup to use for quotes at chapter beginnings and stuff like that. Things that can't necessarily be automated and shared. Perhaps we can share some common search and replace operations once we have converted a chapter or two. I think we need this experience, first. Perhaps I should just pick a chapter and time myself, see how much time it takes per page?
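As an illustration of a shareable search-and-replace operation, here is a sketch that normalizes the garbled REPL arrows seen in the examples above (`=^`, `=.>`, and a bare `^`) to `=>`. The pattern list is an assumption based only on those samples, and the function name is hypothetical; every change would still need to be checked against the book.

```python
import re

# Arrow variants observed in the quoted samples, surrounded by whitespace.
ARROW_VARIANTS = re.compile(r"\s(?:=\^|=\.>|\^)\s")


def normalize_repl_arrows(line):
    """Replace garbled arrows with ' => ' on lines that look like REPL prompts."""
    if not line.startswith("> "):
        return line  # leave prose and plain code untouched
    return ARROW_VARIANTS.sub(" => ", line)


for sample in [
    "> 'John =^ JOHN",
    "> ' ( + 2 2) =.> (+ 2 2)",
    "> John ^ Error: ]OHN is not a bound variable",
]:
    print(normalize_repl_arrows(sample))
```

This deliberately does nothing about lines such as `> '2 2`, where the arrow is missing entirely, or about character errors like `]OHN`; those are the cases that need reading rather than replacing.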

GitHub flavoured Markdown, # for chapter headings like Chapter 1: Introduction to Lisp and ## for section headings like 1.1 Symbolic Computation. Create a new file per chapter, named PAIP-chapter01.md and the like.

I'm going to wait for the weekend before trying any of that but this would be my proposal.

James Young · Answer 15 · Wed Feb 28 2018 22:23:14 GMT+0800 (China Standard Time)

I couldn't get miyuchina's script to work, and I opted to instead add chapter headers, in a fork. It's still in one file, so ## seemed appropriate for chapters; I verified that string wasn't used elsewhere in the book. Apologies if I'm jumping the gun before there's a consensus or anything; I woke up and couldn't sleep, so I was fiddling around.

Mi Yu · Answer 16 · Wed Feb 28 2018 23:36:29 GMT+0800 (China Standard Time)

@pronoiac pull request #7 got rid of the form feed characters, which broke my script. I was using form feeds to identify page breaks, and using page numbers to separate chapters. Perhaps that was the issue you encountered?

Pradeep Bashyal · Answer 17 · Thu Mar 01 2018 00:19:29 GMT+0800 (China Standard Time)

I'd be interested in seeing PAIP converted to HTML5/epub format like that of SICP http://sarabander.github.io/sicp/ The github project is at https://github.com/sarabander/sicp . It seems to have been converted from texinfo format but I like the formatting/style of the result.

James Young · Answer 18 · Thu Mar 01 2018 09:52:26 GMT+0800 (China Standard Time)

@miyuchina: It wouldn't run; it was likely missing a module. I started working on line numbers when I realized it was simpler to annotate the text.

I then noticed Sublime Text was silently making changes - eating control characters or confused by an unusual encoding? I might try to straighten that up.
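One way to check whether an editor is eating control characters is to count them in the raw bytes before and after saving. A small sketch, with a hypothetical function name; it treats tab, line feed, and carriage return as ordinary whitespace and counts every other byte below 0x20.

```python
from collections import Counter


def control_char_counts(data):
    """Count control bytes in data, ignoring tab, line feed, and carriage return."""
    ordinary = {0x09, 0x0A, 0x0D}
    return Counter(b for b in data if b < 0x20 and b not in ordinary)


before = b"page one\x0cpage two\x0c"
after = b"page one page two "
print(control_char_counts(before))
print(control_char_counts(before) == control_char_counts(after))
```

On real files the bytes would come from `open(path, "rb").read()`; reading in binary mode sidesteps the encoding question entirely.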

Alex Schroeder · Answer 19 · Sat Mar 03 2018 02:12:54 GMT+0800 (China Standard Time)

I'm going to take a look at chapter 5 and model it on chapter 1 in the docs subdirectory.

Alex Schroeder · Answer 20 · Sat Mar 03 2018 04:50:17 GMT+0800 (China Standard Time)

It took me about 3h to do 24 pages.

Alex Schroeder · Answer 21 · Sat Mar 03 2018 04:57:21 GMT+0800 (China Standard Time)

I had some formatting questions as I went through the text.

- how should I format GPS, ELIZA, PARRY, etc?
- there was something that looked like a typo with an extra closing parenthesis: "See section 3.6)" Fix it or leave it?
- how do we treat references in general, things like "on page 127", or chapter references? I'd love to link them but I don't know how we'd want to do it
- sometimes code blocks are Lisp code and sometimes they are REPL interactions. I used code fences and highlighted as "emacs" because that's what chapter1.md uses but I'm not sure I like it; then again, I don't have a better suggestion
- in a few examples, code contained italics and I decided to not use HTML for these and just drop the italics, e.g. (variable . value) → (variable . value)

What do you think? Agree, disagree? We will have to put up a consolidated rule set eventually.

Nenad Ticaric · Answer 22 · Sat Mar 03 2018 05:53:59 GMT+0800 (China Standard Time)

@kensanata good work on chapter5! The code highlighting plugin has been changed so you should use lisp to highlight the code block instead of emacs

James Young · Answer 23 · Sat Mar 03 2018 12:44:22 GMT+0800 (China Standard Time)

I'm curious, @nticaric and @kensanata, how different is the text in the chapters from the text in PAIP.txt?

I can see how to reference pages and chapters within one Markdown file, and how to fix up the links for a file per chapter, generated from that source file. Note, this handles linking, but there could be more discussion about appearances: should the page numbers be visible?

I'd thought I'd read a good argument for having links for page numbers, maybe in @Hexstream 's comment, since edited? Looking at a Markdown cheatsheet, we could add anchors with something like:

156 ELIZA-DIALOG WITH A MACHINE <a id="page-156"></a>

And use it with See [page 156](#page-156) for more details.

If we do the same with chapters, then splitting up the files is straightforward, and it's easier to figure out which file a certain page number is in and reference the appropriate chapter Markdown file.

This takes some processing / automation to split the book into chapters, but it seems like the easiest links to write.
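The anchor and link forms above could be generated with something like the following. This is a sketch, not the scripts referred to in this thread: the function names are hypothetical, and the reference pattern only covers the literal form "page N".

```python
import re

# "page 156" not already inside a Markdown link label such as "[page 156]".
PAGE_REF = re.compile(r"(?<!\[)\bpage (\d+)\b")


def page_anchor(page):
    """Return the HTML anchor to place on the header line of a page."""
    return f'<a id="page-{page}"></a>'


def link_page_refs(text):
    """Turn 'page N' into '[page N](#page-N)'; safe to run more than once."""
    return PAGE_REF.sub(lambda m: f"[page {m.group(1)}](#page-{m.group(1)})", text)


print("156 ELIZA-DIALOG WITH A MACHINE " + page_anchor(156))
print(link_page_refs("See page 156 for more details."))
```

For a file per chapter, the link target would also need the chapter's file name in front of `#page-N`, which is where a page-to-chapter lookup comes in.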

Alex Schroeder · Answer 24 · Sat Mar 03 2018 15:38:59 GMT+0800 (China Standard Time)

For chapter 5, I removed the page headers and the page breaks. I believe we don't need the text to recreate the book, and I don't need a website that recreates the book. With the text, we can reflow it, create an ebook without page breaks, and many more interesting things besides, and in all those uses, the page breaks are an annoyance. That's why I think we should create anchors for the things the text links to and link to them, and get rid of cross-references by page number. That's what I think we should do.

Jean-Philippe Paradis · Answer 25 · Sun Mar 04 2018 02:21:39 GMT+0800 (China Standard Time)

@pronoiac My comment you're thinking of about page number links and stuff is this one. :)

James Young · Answer 26 · Sun Mar 04 2018 08:27:55 GMT+0800 (China Standard Time)

@kensanata: It's not hard to remove the page headers and breaks as we add some anchors.

I agree that it would be nice, down the line, if a reference to another page goes right to the appropriate paragraph. To me, it feels more like a "day 2" thing, after we have the book in readable Markdown, with less mangled text. I have misgivings about, in the first pass, manually adding anchors for referenced subject matter:

- I prefer automation to manual work
- I think my proposed solution is good enough for a first pass
- time: there are dozens of them
- if they're across chapters, then merge conflicts might result

Having a source file could make it easier to enforce consistency. Are we done with broad changes? Like, is exercise markup settled? What about footnotes?

For the purposes of tracking changes, I'm also not keen on removing trailing spaces (in a chapter, but not in the book), or making each paragraph one line. Seeing what's been changed can help other editors figure out what to do on other chapters.

Alex Schroeder · Answer 27 · Sun Mar 04 2018 17:39:09 GMT+0800 (China Standard Time)

I think the 3–4h I spent on this project are the limit of what I will do, unfortunately. I suspect that the easiest solution to the entire problem would be if we could find 20+ other volunteers to do something like it and one final editor to go through all the chapters and decide how cross-references should work. With all the talk of automation being the preferred solution I’m curious to see the tools people have been working on. Is there a place where I can look at recent developments and progress other than the commit history of this project?

James Young · Answer 28 · Tue Mar 06 2018 09:12:45 GMT+0800 (China Standard Time)

@kensanata: Apologies if I’ve steamrolled over you, or disregarded your opinion; it wasn’t my intent. You fixed up a chapter in great time.

Locally, I've added markers for chapters and pages, and written scripts to make a file per chapter.
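For anyone wanting to try the same thing, splitting a marked-up single file into chapter files might look roughly like this. It is not the local scripts mentioned above: it assumes chapters are marked by lines starting with `## ` (as in the fork described earlier) and borrows the `PAIP-chapter01.md` naming from the earlier proposal.

```python
def split_chapters(text):
    """Return {filename: chapter_text}, splitting at lines that start with '## '."""
    files = {}
    current = None
    number = 0
    for line in text.splitlines(keepends=True):
        if line.startswith("## "):
            number += 1
            current = f"PAIP-chapter{number:02d}.md"
            files[current] = []
        if current is not None:  # text before the first marker is skipped
            files[current].append(line)
    return {name: "".join(lines) for name, lines in files.items()}


book = "front matter\n## Chapter 1\nalpha\n## Chapter 2\nbeta\n"
for name, body in split_chapters(book).items():
    print(name, repr(body))
```

Returning a dictionary rather than writing files keeps the sketch easy to test; writing each entry to disk is a short loop on top of it.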

I haven't yet written scripts to merge an edited chapter back into the text, or handle footnotes.

I’ve done parts of chapters 10 and 11. Fixing up code - and some assembler - feels very slow. I think I'll kick the assembler down the road for now.

Alex Schroeder · Answer 29 · Tue Mar 06 2018 17:59:47 GMT+0800 (China Standard Time)

No worries! :) I’m just happy somebody is working on the issue. :)

Hourann · Answer 30 · Fri Mar 16 2018 20:24:43 GMT+0800 (China Standard Time)

Good news: I found a digital version of this book on Safari.

James Young · Answer 31 · Sat Mar 17 2018 23:28:52 GMT+0800 (China Standard Time)

@hourann: that's great news! It looks like the Safari version is (mostly) text, instead of scanned bitmaps.

I'm in favor of a new issue for ripping and converting the Safari version. I'd start work on it, but I don't have an active Safari account.

Deleted user · Answer 32 · Thu Apr 26 2018 22:25:33 GMT+0800 (China Standard Time)

@pronoiac There is a 10 day free trial option available here.

Hourann · Answer 33 · Fri Apr 27 2018 00:19:17 GMT+0800 (China Standard Time)

@hakano @pronoiac I can in fact access the book on Safari. I just wonder whether it is legal to crawl the book from Safari and post it here. I have crawled it and saved it in HTML format.

James Young · Answer 34 · Fri Apr 27 2018 02:34:54 GMT+0800 (China Standard Time)

The legal aspect makes me wary of talking about it here. Personally, I wouldn’t be comfortable openly sharing html or an epub I’d downloaded. Once I’d converted it into Markdown and checked for fingerprinting, I'd feel free to share.

@hourann, @hakano, wanna compare downloads?

James Young · Answer 35 · Wed Jul 04 2018 04:15:48 GMT+0800 (China Standard Time)

Just for visibility, there's a cleaner source available, and a separate issue to track work with it.

James Young · Answer 36 · Tue Feb 23 2021 08:23:30 GMT+0800 (China Standard Time)

#48 covered this; the Safari ebook, though imperfect, was a much better basis to work from.