kevinboone / epub2txt2

A simple command-line utility for Linux, for extracting text from EPUB documents.

Spine items with URL-encoded hrefs are not handled correctly

kevinboone opened this issue

Although unusual, it's legitimate for the XHTML documents in an EPUB to have filenames containing whitespace and punctuation characters. When these files are referenced in the manifest/spine in content.opf, the hrefs should be URL-encoded (percent-encoded). Often they aren't but, when they are, epub2txt fails because it doesn't decode the href before looking for the file. So if we have

<item href="foo%20bar.xhtml"/>

the program ends up looking for a file that is actually called "foo%20bar.xhtml" instead of decoding it to "foo bar.xhtml".
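
For illustration, here is a minimal sketch of the kind of decoding step that is missing. It is not the actual epub2txt code: the names `decode_href` and `hex_value` are hypothetical, and the choices to leave malformed escapes (and `%00`) untouched and to leave `+` alone are assumptions about what would be sensible for a filename lookup.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: numeric value of one hex digit, or -1 if not hex. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: return a newly allocated, percent-decoded copy of
 * href, or NULL if allocation fails. The caller frees the result.
 * Assumptions: a '%' not followed by two hex digits is copied through
 * unchanged, "%00" is not decoded (it would truncate the C string), and
 * '+' is left alone because it only means "space" in form encoding. */
char *decode_href(const char *href)
{
    size_t len = strlen(href);
    char *out = malloc(len + 1); /* decoded text is never longer than input */
    size_t i, j = 0;

    if (out == NULL) return NULL;

    for (i = 0; i < len; i++) {
        if (href[i] == '%' && i + 2 < len) {
            int hi = hex_value(href[i + 1]);
            int lo = hex_value(href[i + 2]);
            if (hi >= 0 && lo >= 0 && (hi != 0 || lo != 0)) {
                out[j++] = (char)((hi << 4) | lo);
                i += 2; /* skip the two hex digits just consumed */
                continue;
            }
        }
        out[j++] = href[i];
    }
    out[j] = '\0';
    return out;
}
```

With something like this applied to each href read from content.opf, `decode_href("foo%20bar.xhtml")` would yield `foo bar.xhtml`, which is the name the file actually has inside the archive.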