The nodecontent function returns wide (multi-byte) characters in the wrong encoding, resulting in garbled output.
deahhh opened this issue
deahhh commented
using EzXML
doc = EzXML.parsehtml("<body><p>hello</p><p>**</p><p>深圳</p></body>")
primates = root(doc)
for p in eachelement(primates)
    println(nodecontent(p))
end
Running julia draft.jl gives this output:
helloä¸åæ·±å³
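The garbled output looks like what you get when the UTF-8 bytes of the input are read as Latin-1 (ISO-8859-1), one character per byte. That is only a guess at the mechanism, not something confirmed from the libxml2 source. A minimal sketch using only Base; the helper names mojibake and unmojibake are made up for illustration:

```julia
# Hypothetical helper: treat every UTF-8 byte of `s` as one Latin-1 character,
# which is what happens if UTF-8 input is assumed to be ISO-8859-1.
mojibake(s::AbstractString) = join(Char(b) for b in codeunits(s))

# Hypothetical inverse: map each character back to a single byte and
# reinterpret the bytes as UTF-8. Only valid if every code point is <= 0xff.
unmojibake(s::AbstractString) = String(UInt8[UInt8(c) for c in s])

garbled = mojibake("深圳")
println(garbled)              # six characters, one per original byte
println(unmojibake(garbled))  # round-trips back to the original string
```

The second helper is only a stopgap for strings that were garbled this way; it throws an InexactError on anything that contains a character above U+00FF.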
deahhh commented
The bug can be worked around by passing "utf-8" to htmlReadMemory instead of the encoding variable (which is C_NULL) in the Julia code of parsehtml:
function parsehtml(htmlstring::AbstractString)
    if isempty(htmlstring)
        throw(ArgumentError("empty HTML string"))
    end
    url = C_NULL
    encoding = C_NULL  # no longer passed to htmlReadMemory below
    options = 1
    doc_ptr = @check ccall(
        (:htmlReadMemory, libxml2),
        Ptr{_Node},
        (Cstring, Cint, Cstring, Cstring, Cint),
        # "utf-8" replaces the `encoding` argument so the input is read as UTF-8
        htmlstring, sizeof(htmlstring), url, "utf-8", options) != C_NULL
    show_warnings()
    return Document(doc_ptr)
end
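One detail worth noting in the call above: the size argument is sizeof(htmlstring), the number of bytes, not length(htmlstring), the number of characters. For non-ASCII input the two differ, as this small Base-only check shows (the variable names are just for illustration):

```julia
s = "深圳"

nchars = length(s)  # number of characters in the string
nbytes = sizeof(s)  # number of bytes in its UTF-8 encoding

println((nchars, nbytes))
```

Each of the two characters takes three bytes in UTF-8, so the byte count is what the C side needs in order to read the whole buffer.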
GregorE commented
We just had the same problem using Genie.jl and traced the root cause to the new version of XML2_jll.jl, v2.11.5. Pinning that package to the previously released version, v2.10.4, makes the problem disappear:
pkg> add XML2_jll@2.10.4
Note that versions 2.11.0 to 2.11.4 were not provided by XML2_jll.jl, so these cannot be tested directly.
Then:
julia> using EzXML
julia> doc = EzXML.parsehtml("<body><p>hello</p><p>**</p><p>深圳</p></body>")
EzXML.Document(EzXML.Node(<HTML_DOCUMENT_NODE@0x0000000001afee70>))
julia> primates = root(doc)
EzXML.Node(<ELEMENT_NODE[html]@0x0000000001c9f680>)
julia> for p in eachelement(primates)
           println(nodecontent(p))
       end
hello**深圳
Of course this is also a problem when using umlauts.
Not sure whether this is already (or should be) in scope of https://gitlab.gnome.org/GNOME/libxml2/-/issues