The nodecontent function changes the encoding format of wide characters when they are processed, resulting in a garbled display.

Question

When HTML containing wide (multi-byte) characters is parsed with EzXML.parsehtml, nodecontent returns the text in the wrong encoding, so it is displayed garbled.

deahhh opened this issue a year ago

using EzXML

doc = EzXML.parsehtml("<body><p>hello</p><p>**</p><p>深圳</p></body>")

primates = root(doc)

for p in eachelement(primates)
    println(nodecontent(p))
end

julia draft.jl

Output:
helloä¸åæ·±å³
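
The garbled output above looks like the usual result of reading UTF-8 bytes as if each byte were one Latin-1 (ISO-8859-1) character. The following is a minimal sketch that reproduces the effect in plain Julia, without EzXML or libxml2; the helper name latin1_misread is hypothetical and exists only for this illustration.

```julia
# Hypothetical helper: treat every UTF-8 byte of `s` as a separate
# Latin-1 character, which is what produces this kind of mojibake.
latin1_misread(s::AbstractString) = join(Char(b) for b in codeunits(s))

original = "深圳"
garbled = latin1_misread(original)

println(garbled)           # mojibake; some characters are invisible control codes
println(length(original))  # 2 characters
println(length(garbled))   # 6 characters, one per original UTF-8 byte

# The damage is reversible as long as no byte was lost:
# map each character back to its byte and decode the bytes as UTF-8.
recovered = String(UInt8.(collect(garbled)))
println(recovered == original)  # true
```

Because the bytes themselves are intact, this points at how the input encoding is interpreted during parsing rather than at corrupted data.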

deahhh · Answer 1 · Tue Oct 10 2023 13:27:59 GMT+0800 (China Standard Time)

As a rough fix, pass "utf-8" instead of the encoding variable (which is C_NULL) to htmlReadMemory in EzXML's Julia source for parsehtml:

function parsehtml(htmlstring::AbstractString)
    if isempty(htmlstring)
        throw(ArgumentError("empty HTML string"))
    end
    url = C_NULL
    encoding = C_NULL  # now unused: "utf-8" is passed to htmlReadMemory directly below
    options = 1
    doc_ptr = @check ccall(
        (:htmlReadMemory, libxml2),
        Ptr{_Node},
        (Cstring, Cint, Cstring, Cstring, Cint),
        htmlstring, sizeof(htmlstring), url, "utf-8", options) != C_NULL
    show_warnings()
    return Document(doc_ptr)
end

GregorE · Answer 2 · Tue Oct 31 2023 20:21:33 GMT+0800 (China Standard Time)

We just had the same problem using Genie.jl and traced the root cause to the new version of XML2_jll.jl, v2.11.5. Pinning that package to the previously released version, v2.10.4, makes the problem disappear:

pkg> add XML2_jll@2.10.4

Note that versions 2.11.0 to 2.11.4 were not provided by XML2_jll.jl, so they cannot readily be tested.
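
For scripted setups (for example a CI job), the same downgrade can be done through the Pkg API. This is only a sketch: the helper name pin_xml2 is hypothetical, and the explicit Pkg.pin call is an addition of this example, included on the assumption that a later update should not move the package forward again.

```julia
using Pkg

# Hypothetical helper: install the XML2_jll release reported as unaffected
# in this thread, then pin it so that later `Pkg.update()` calls keep it.
function pin_xml2(version::AbstractString = "2.10.4")
    Pkg.add(name = "XML2_jll", version = version)
    Pkg.pin("XML2_jll")
end
```

Calling pin_xml2() in the active project has the same effect as the pkg> command above followed by pkg> pin XML2_jll; pkg> free XML2_jll removes the pin once a fixed release is available.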

With v2.10.4 installed, the original example prints correctly:

julia> using EzXML

julia> doc = EzXML.parsehtml("<body><p>hello</p><p>**</p><p>深圳</p></body>")
EzXML.Document(EzXML.Node(<HTML_DOCUMENT_NODE@0x0000000001afee70>))

julia> primates = root(doc)
EzXML.Node(<ELEMENT_NODE[html]@0x0000000001c9f680>)

julia> for p in eachelement(primates)
           println(nodecontent(p))
       end
hello**深圳

Of course, the same problem also occurs with umlauts.

Not sure whether this is already (or should be) in scope of the libxml2 issue tracker at https://gitlab.gnome.org/GNOME/libxml2/-/issues