JuliaIO / EzXML.jl

XML/HTML handling tools for primates

Error when parsing tag attributes starting with @

jvkerckh opened this issue

Greetings,

for my current project I need to parse attributes whose names start with @. However, parsehtml emits a warning and drops the attribute entirely, while parsexml throws an error.

Example:

julia> htmlsnip = "<p @foo=\"bar\">content</p>"
"<p @foo=\"bar\">content</p>"

Using parsehtml:

julia> htmlsnip |> parsehtml
┌ Warning: XMLError: error parsing attribute name from HTML parser (code: 68, line: 1)
└ @ EzXML ~/.julia/packages/EzXML/ZNwhK/src/error.jl:95
EzXML.Document(EzXML.Node(<HTML_DOCUMENT_NODE@0x0000000005cb8ad0>))

Printing the result shows the attribute is not parsed:

julia> htmlsnip |> parsehtml |> prettyprint
┌ Warning: XMLError: error parsing attribute name from HTML parser (code: 68, line: 1)
└ @ EzXML ~/.julia/packages/EzXML/ZNwhK/src/error.jl:95
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <body>
    <p>content</p>
  </body>
</html>

Using parsexml:

julia> htmlsnip |> parsexml
┌ Warning: caught 4 errors; showing the first one
└ @ EzXML ~/.julia/packages/EzXML/ZNwhK/src/error.jl:79
ERROR: XMLError: error parsing attribute name from XML parser (code: 68, line: 1)
Stacktrace:
 [1] throw_xml_error()
   @ EzXML ~/.julia/packages/EzXML/ZNwhK/src/error.jl:87
 [2] macro expansion
   @ ~/.julia/packages/EzXML/ZNwhK/src/error.jl:52 [inlined]
 [3] parsexml(xmlstring::String)
   @ EzXML ~/.julia/packages/EzXML/ZNwhK/src/document.jl:80
 [4] |>(x::String, f::typeof(parsexml))
   @ Base ./operators.jl:911
 [5] top-level scope
   @ REPL[77]:1

I'm using Julia v1.8.0 and EzXML v1.1.0, with no other packages in the environment.
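
For context, the parsexml error is consistent with the XML 1.0 rule that a name must begin with a letter, '_' or ':' (plus certain non-ASCII characters), which excludes '@'. The parsehtml warning suggests the HTML parser applies a similar restriction. Below is a minimal sketch of that check, limited to the ASCII subset. The helper names are hypothetical, and this is not the code EzXML or libxml2 actually runs:

```julia
# ASCII subset of the XML 1.0 NameStartChar production: ':', '_', A-Z, a-z.
# (The full production also admits many non-ASCII ranges, omitted here.)
is_ascii_name_start(c::Char) =
    c == ':' || c == '_' || ('A' <= c <= 'Z') || ('a' <= c <= 'z')

# Hypothetical helper: true if `name` begins with an allowed ASCII start character.
starts_like_xml_name(name::AbstractString) =
    !isempty(name) && is_ascii_name_start(first(name))

# Compare a few candidate attribute names, including the one from the example.
for name in ("foo", "_foo", ":foo", "@foo")
    println(rpad(name, 6), starts_like_xml_name(name))
end
```

Only "@foo" fails the check, which matches the attribute that both parsers reject.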

I ran into the same problem today. Since I always traverse the whole document anyway, I could mask the '@' character before parsing and restore it afterwards. Depending on your goal, you could do something similar.

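using EzXML, Logging

function parse_vue_html(html)
  # Mask every '@' so the parser accepts the attribute names (@foo -> __vue-on__foo)
  doc_string = replace(html, "@" => "__vue-on__")
  # Clear errors left over from earlier parses
  empty!(EzXML.XML_GLOBAL_ERROR_STACK)
  # Log only Error level and above, which silences the parser warnings
  doc = Logging.with_logger(Logging.SimpleLogger(stdout, Logging.Error)) do
    EzXML.parsehtml(doc_string).root
  end
  # Skip the html -> body levels that parsehtml adds around the fragment
  elem = first(eachelement(first(eachelement(doc))))
  # parse_elem (not shown) is the custom parser mentioned below; unmask its output
  replace(parse_elem(elem), "__vue-on__" => "@")
end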

Note that the parser parse_elem() replaces the instances of __vue-on__ that occur as attribute names.
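
The masking idea can be tried in isolation with Base Julia only. The sketch below leaves EzXML out on purpose; parsing would happen between the two calls. The function names mask_at and unmask_at are hypothetical, and the guard is an addition that is not in the code above: it rejects input that already contains the placeholder, since such text would be turned into '@' on the way back. It is a sketch of the round trip, not a guarantee that every input survives unchanged.

```julia
# Placeholder taken from the workaround above; it starts with '_',
# so it can begin an attribute name.
const PLACEHOLDER = "__vue-on__"

# Hypothetical helper: replace every '@' with the placeholder before parsing.
function mask_at(html::AbstractString)
    # Guard (an addition): a pre-existing placeholder would be unmasked to '@'.
    occursin(PLACEHOLDER, html) && error("input already contains $PLACEHOLDER")
    return replace(html, "@" => PLACEHOLDER)
end

# Hypothetical helper: restore '@' in the serialized result after parsing.
unmask_at(text::AbstractString) = replace(text, PLACEHOLDER => "@")

htmlsnip = "<p @foo=\"bar\">content</p>"
masked = mask_at(htmlsnip)
println(masked)             # the string that would be handed to parsehtml
println(unmask_at(masked))  # round trip back to the original snippet
```

Like the function above, this masks every '@' in the document, not just those in attribute names, so '@' characters in text content also pass through the placeholder and back.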