Are HTML entities being parsed correctly? Hex vs decimal

Question

yrps opened this issue 4 years ago

Here is a segment of XHTML from an EPUB, extracted by epub2txt itself:

<p class="noindentz">&#8220;Hardware, says bunnie, is a world without secrets: if you go deep enough, even the most important key is expressed in silicon or fuses. bunnie&#8217;s is a world without mysteries, only unexplored spaces. This is a look inside a mind without peer.&#8221;</p>

Here is how it's rendered in a conventional epub reader, text copied and pasted:

“Hardware, says bunnie, is a world without secrets: if you go deep enough,
even the most important key is expressed in silicon or fuses. bunnie’s is a
world without mysteries, only unexplored spaces. This is a look inside a mind
without peer.”
—EDWARD SNOWDEN

And here is the output of epub2txt -w 100:

舠Hardware, says bunnie, is a world without secrets: if you go deep enough, even the most 
important key is expressed in silicon or fuses. bunnie舗s is a world without mysteries, only 
unexplored spaces. This is a look inside a mind without peer.舡 

舒EDWARD SNOWDEN

For example, the right curly quote, written as &#8221; in the source, is being rendered as a CJK ideograph, U+8221 (https://www.fileformat.info/info/unicode/char/8221/index.htm), when it should be U+201D (https://www.fileformat.info/info/unicode/char/201d/index.htm).

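Decimal 8221 is hexadecimal 201D, so it looks as though the decimal digits are being read as a hexadecimal code point:

(8221)_10 == (201D)_16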
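
The arithmetic can be checked with a few lines of Python. This is only a sketch of the suspected failure mode, not epub2txt's actual code. The helper names `decode_as_decimal` and `decode_as_hex` are made up for illustration, and the digit strings are the ones from the XHTML sample above.

```python
def decode_as_decimal(digits: str) -> str:
    """Interpret the digits of &#NNNN; as a decimal code point (correct)."""
    return chr(int(digits, 10))


def decode_as_hex(digits: str) -> str:
    """Interpret the same digits as hexadecimal (the suspected mistake)."""
    return chr(int(digits, 16))


# Digits taken from the references in the XHTML sample.
for digits in ["8220", "8217", "8221"]:
    right = decode_as_decimal(digits)
    wrong = decode_as_hex(digits)
    print(
        f"&#{digits}; should be U+{ord(right):04X} {right!r}, "
        f"misread as U+{ord(wrong):04X} {wrong!r}"
    )
```

If the misread characters printed here match the ones in the epub2txt output, that supports the decimal-read-as-hex explanation, though it does not prove where in the program the mistake happens.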

For reference, numeric character references that start with &#x give the code point in hexadecimal; those that start with plain &# give it in decimal:
https://www.w3.org/TR/xhtml1/#h-4.12
https://www.w3.org/TR/html40/charset.html#h-5.3.1
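
As a rough illustration of that rule, here is a minimal Python decoder for numeric character references. It is a sketch under stated assumptions, not how epub2txt handles them: the function name `decode_numeric_refs` is hypothetical, named entities such as &amp; are deliberately ignored, and out-of-range references are left untouched. The standard library's `html.unescape` is shown alongside for comparison.

```python
import html
import re

# Matches &#NNNN; (decimal) and &#xHHHH; or &#XHHHH; (hexadecimal).
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_refs(text: str) -> str:
    """Replace numeric character references with the characters they denote."""

    def _replace(match):
        hex_digits, dec_digits = match.groups()
        # The "x" prefix selects base 16; without it the digits are base 10.
        if hex_digits:
            code_point = int(hex_digits, 16)
        else:
            code_point = int(dec_digits, 10)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid Unicode code point: keep the reference as written.
            return match.group(0)

    return _NUMERIC_REF.sub(_replace, text)


sample = "&#8220;peer.&#8221; and &#x201C;peer.&#x201D;"
print(decode_numeric_refs(sample))
print(html.unescape(sample))  # standard-library equivalent for this input
```

Both forms should produce the same curly quotes, since &#8221; and &#x201D; name the same code point.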

Kevin Boone · Answer 1 · Mon Jan 24 2022 19:44:38 GMT+0800 (China Standard Time)

I think this issue is now fixed. Please re-open if it is not.