kevinboone / epub2txt2

A simple command-line utility for Linux, for extracting text from EPUB documents.

Are HTML entities being parsed correctly? Hex vs decimal

yrps opened this issue

Here is a segment of xhtml from an epub, extracted by epub2txt itself:

<p class="noindentz">&#8220;Hardware, says bunnie, is a world without secrets: if you go deep enough, even the most important key is expressed in silicon or fuses. bunnie&#8217;s is a world without mysteries, only unexplored spaces. This is a look inside a mind without peer.&#8221;</p>

Here is how it's rendered in a conventional epub reader, text copied and pasted:

“Hardware, says bunnie, is a world without secrets: if you go deep enough,
even the most important key is expressed in silicon or fuses. bunnie’s is a
world without mysteries, only unexplored spaces. This is a look inside a mind
without peer.”
—EDWARD SNOWDEN

And here is the output of epub2txt -w 100:

舠Hardware, says bunnie, is a world without secrets: if you go deep enough, even the most 
important key is expressed in silicon or fuses. bunnie舗s is a world without mysteries, only 
unexplored spaces. This is a look inside a mind without peer.舡 

舒EDWARD SNOWDEN 

For example, the curly right quote denoted by &#8221; is being rendered as the CJK character U+8221,
https://www.fileformat.info/info/unicode/char/8221/index.htm, when it should be U+201D, https://www.fileformat.info/info/unicode/char/201d/index.htm.

8221 (decimal) == 201D (hexadecimal)
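
The output above looks consistent with the decimal digits being read as hexadecimal. Here is a quick check in Python. This is only a sketch of the suspected cause, not taken from the epub2txt source, and the helper name `decode_both` is made up for illustration:

```python
def decode_both(digits: str) -> tuple[str, str]:
    """Return (correct, misread) characters for the digits of a '&#...;' reference.

    'correct' parses the digits as decimal, which is what '&#' calls for.
    'misread' parses the same digits as hexadecimal (the suspected bug).
    """
    return chr(int(digits, 10)), chr(int(digits, 16))


# Digits taken from the three references in the XHTML sample above.
for digits in ["8220", "8217", "8221"]:
    correct, misread = decode_both(digits)
    print(
        f"&#{digits}; -> U+{ord(correct):04X} {correct!r}, "
        f"misread as U+{ord(misread):04X} {misread!r}"
    )
```

The misread code points land in the CJK range, which appears to match the characters in the epub2txt output.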

For reference, numeric character references that begin with &#x give the code point in hexadecimal; those that begin with plain &# give it in decimal:
https://www.w3.org/TR/xhtml1/#h-4.12
https://www.w3.org/TR/html40/charset.html#h-5.3.1
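
To make the rule concrete, here is a minimal decoder sketch in Python. It is not how epub2txt does it (that is written in C), and `decode_numeric_refs` is a hypothetical name. It accepts both `x` and `X` for the hexadecimal form, which is more lenient than XML, where only lowercase `x` is allowed:

```python
import re

# Group 1 captures hexadecimal digits (after "&#x" or "&#X"),
# group 2 captures decimal digits (after plain "&#").
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_refs(text: str) -> str:
    """Replace numeric character references in text with their characters."""

    def _replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        if hex_digits is not None:
            code_point = int(hex_digits, 16)
        else:
            code_point = int(dec_digits, 10)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid Unicode code point: leave the reference as-is.
            return match.group(0)

    return _NUMERIC_REF.sub(_replace, text)


print(decode_numeric_refs("bunnie&#8217;s &#8220;world&#x201D;"))
```

Named entities such as `&amp;` are left alone by this sketch. Python's standard library function `html.unescape` handles both named and numeric references if a complete solution is wanted.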

I think this issue is now fixed. Please re-open if it is not.