Are HTML entities being parsed correctly? Hex vs decimal
yrps opened this issue
Here is a segment of xhtml from an epub, extracted by epub2txt itself:
<p class="noindentz">&#8220;Hardware, says bunnie, is a world without secrets: if you go deep enough, even the most important key is expressed in silicon or fuses. bunnie&#8217;s is a world without mysteries, only unexplored spaces. This is a look inside a mind without peer.&#8221;</p>
Here is how it's rendered in a conventional epub reader, text copied and pasted:
“Hardware, says bunnie, is a world without secrets: if you go deep enough,
even the most important key is expressed in silicon or fuses. bunnie’s is a
world without mysteries, only unexplored spaces. This is a look inside a mind
without peer.”
—EDWARD SNOWDEN
And here is the output of epub2txt -w 100:
舠Hardware, says bunnie, is a world without secrets: if you go deep enough, even the most
important key is expressed in silicon or fuses. bunnie舗s is a world without mysteries, only
unexplored spaces. This is a look inside a mind without peer.舡
舒EDWARD SNOWDEN
For example, the curly right quote, written as &#8221; in the source, is being rendered as a hanzi character, https://www.fileformat.info/info/unicode/char/8221/index.htm, when it should be https://www.fileformat.info/info/unicode/char/201d/index.htm. The decimal digits 8221 appear to be read as hexadecimal:
(8221)_10 == (201D)_16
For reference, numeric character references that start with &#x give the code point in hexadecimal; those that start with just &# give it in decimal:
https://www.w3.org/TR/xhtml1/#h-4.12
https://www.w3.org/TR/html40/charset.html#h-5.3.1
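To make the rule concrete, here is a minimal Python sketch of decoding numeric character references. It is not epub2txt's actual code; the names NUMERIC_REF and decode_numeric_refs are made up for illustration. It only handles numeric references (no named entities like &amp;) and does no range checking on the code point.

```python
import re

# Hypothetical helper, for illustration only.
# Group 1 captures hex digits after "&#x"; group 2 captures decimal digits after "&#".
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_refs(text: str) -> str:
    """Replace &#NNNN; and &#xHHHH; references with the characters they name."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        if hex_digits is not None:
            code_point = int(hex_digits, 16)  # "&#x" prefix: base 16
        else:
            code_point = int(dec_digits, 10)  # plain "&#" prefix: base 10
        return chr(code_point)

    return NUMERIC_REF.sub(replace, text)


# Both spellings name the same character, U+201D (right double quotation mark).
assert decode_numeric_refs("&#8221;") == decode_numeric_refs("&#x201D;") == "\u201d"

# Reading the decimal digits as hex instead lands on U+8221, the hanzi seen in the output.
assert chr(int("8221", 16)) == "\u8221"
```

As a cross-check, Python's standard library html.unescape("&#8221;") also returns U+201D.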
I think this issue is now fixed. Please re-open if it is not.