JuliaIO / EzXML.jl

XML/HTML handling tools for primates

The nodecontent function changes the encoding format of wide characters when they are processed, resulting in a garbled display.

deahhh opened this issue

deahhh commented
using EzXML

doc = EzXML.parsehtml("<body><p>hello</p><p>**</p><p>深圳</p></body>")

primates = root(doc)

for p in eachelement(primates)
    println(nodecontent(p))
end

$ julia draft.jl

Output:
hello中åæ·±å³
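
The garbled tail looks like what you get when each UTF-8 byte of 深圳 is read as a separate ISO-8859-1 (Latin-1) character. That is only a guess at the mechanism, not something confirmed in libxml2. The sketch below reproduces the pattern in plain Julia; `as_latin1` is a hypothetical helper, not part of EzXML.

```julia
# Sketch: treat every UTF-8 byte of a string as one Latin-1 character.
# `as_latin1` is a made-up name for illustration only.
as_latin1(s::String) = join(Char(b) for b in codeunits(s))

garbled = as_latin1("深圳")

# Compare this with the tail of the output above (one byte maps to a
# non-printing control character, so it may not show in a terminal).
println(garbled)

# Two characters become six, one per UTF-8 byte.
println(length("深圳"), " -> ", length(garbled))   # prints "2 -> 6"
```

Pure ASCII text such as "hello" passes through unchanged, which would explain why only the non-ASCII paragraphs are damaged.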

commented

The bug can be roughly fixed on the Julia side by passing "utf-8" to htmlReadMemory instead of the `encoding` variable (which is C_NULL) in parsehtml:

function parsehtml(htmlstring::AbstractString)
    if isempty(htmlstring)
        throw(ArgumentError("empty HTML string"))
    end
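    url = C_NULL
    encoding = C_NULL  # original default; no longer passed to libxml2 below
    options = 1
    doc_ptr = @check ccall(
        (:htmlReadMemory, libxml2),
        Ptr{_Node},
        (Cstring, Cint, Cstring, Cstring, Cint),
        # pass "utf-8" explicitly where `encoding` (C_NULL) used to be
        htmlstring, sizeof(htmlstring), url, "utf-8", options) != C_NULL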
    show_warnings()
    return Document(doc_ptr)
end

We just hit the same problem using Genie.jl and traced the root cause to the new version of XML2_jll.jl, v2.11.5. Pinning that package to the previously released version, v2.10.4, makes the problem disappear:

pkg> add XML2_jll@2.10.4

Note that versions 2.11.0 to 2.11.4 were never provided by XML2_jll.jl, so they cannot be tested directly.
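
For scripts or CI, the same downgrade can be done through the Pkg API instead of the REPL prompt. This is a minimal sketch: `pin_xml2` and `XML2_KNOWN_GOOD` are hypothetical names, and the extra `Pkg.pin` call is an assumption on my part to stop a later update from moving the package again.

```julia
using Pkg

# Version reported in this thread as unaffected.
const XML2_KNOWN_GOOD = "2.10.4"

# Hypothetical helper: install the known-good XML2_jll into the active
# environment, then pin it so that `Pkg.update()` leaves it alone.
function pin_xml2(version::String = XML2_KNOWN_GOOD)
    Pkg.add(name = "XML2_jll", version = version)
    Pkg.pin("XML2_jll")
end
```

Call `pin_xml2()` once in the affected project environment (it needs network access). `Pkg.free("XML2_jll")` removes the pin again once a fixed release is available.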

With XML2_jll at 2.10.4, the output is correct again:

julia> using EzXML

julia> doc = EzXML.parsehtml("<body><p>hello</p><p>**</p><p>深圳</p></body>")
EzXML.Document(EzXML.Node(<HTML_DOCUMENT_NODE@0x0000000001afee70>))

julia> primates = root(doc)
EzXML.Node(<ELEMENT_NODE[html]@0x0000000001c9f680>)

julia> for p in eachelement(primates)
           println(nodecontent(p))
       end
hello**深圳

Of course, the same problem occurs with umlauts.
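
If neither patching EzXML nor pinning XML2_jll is an option, already garbled strings can sometimes be repaired after the fact. The sketch below assumes the damage is the byte-per-character pattern seen in this thread (each UTF-8 byte turned into one character in the range U+0000 to U+00FF). `repair_latin1_mojibake` is a hypothetical helper and only a stopgap, not a fix.

```julia
# Sketch: map every character back to a single byte and re-read the
# bytes as UTF-8. Strings that do not fit the pattern are returned as-is.
function repair_latin1_mojibake(s::AbstractString)
    # A character above U+00FF cannot have come from a single byte.
    all(c -> UInt32(c) <= 0xff, s) || return String(s)
    bytes = UInt8[UInt8(c) for c in s]
    repaired = String(bytes)
    # Keep the original if the bytes are not valid UTF-8
    # (for example, text that really is Latin-1).
    return isvalid(repaired) ? repaired : String(s)
end

println(repair_latin1_mojibake("Ã¼"))   # prints "ü"
```

One caveat: text that legitimately contains a sequence such as "Ã¼" would be rewritten as well, so this should only be applied to strings known to come from the affected parser.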

I am not sure whether this is already covered by (or should be reported as) an upstream libxml2 issue: https://gitlab.gnome.org/GNOME/libxml2/-/issues