JuliaIO / EzXML.jl

XML/HTML handling tools for primates

XPath not working as expected with findall

anandijain opened this issue

I cannot get findall to work as intended. According to https://www.w3schools.com/xml/xpath_syntax.asp, the XPath //ci should select the three ci elements (compartment, k1, S1), but findall returns an empty Node array.

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <apply>
    <times/>
    <ci> compartment 
    </ci>
    <ci> k1 
    </ci>
    <ci> S1 
    </ci>
  </apply>
</math>

julia> findall("//ci", xml)
EzXML.Node[]

Your document has a default namespace, so the unprefixed XPath //ci matches nothing. The EzXML documentation covers this.

In particular:

There is a caveat on the combination of XPath and namespaces: if a document contains elements with a default namespace, you need to specify its prefix to the find* function. In the following example, the root element and its descendants have the default namespace "http://www.foobar.org", which has no prefix of its own. In this case, you need to assign a prefix to the namespace when finding elements in it:

julia> doc = parsexml("""
       <parent xmlns="http://www.foobar.org">
           <child/>
       </parent>
       """)
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fdc67710030>))

julia> findall("/parent/child", doc.root)  # nothing will be found
0-element Array{EzXML.Node,1}

julia> namespaces(doc.root)  # the default namespace has an empty prefix
1-element Array{Pair{String,String},1}:
 "" => "http://www.foobar.org"

julia> ns = namespace(doc.root)  # get the namespace
"http://www.foobar.org"

julia> findall("/x:parent/x:child", doc.root, ["x"=>ns])  # specify its prefix as "x"
1-element Array{EzXML.Node,1}:
 EzXML.Node(<ELEMENT_NODE[child]@0x00007fdc6774c990>)

So for yours:

julia> str = """
              <math xmlns="http://www.w3.org/1998/Math/MathML">
                <apply>
                  <times/>
                  <ci> compartment
                  </ci>
                  <ci> k1
                  </ci>
                  <ci> S1
                  </ci>
                </apply>
              </math>
              """
"<math xmlns=\"http://www.w3.org/1998/Math/MathML\">\n  <apply>\n    <times/>\n    <ci> compartment\n    </ci>\n    <ci> k1\n    </ci>\n    <ci> S1\n    </ci>\n  </apply>\n</math>\n"

julia> xml = parsexml(str)
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x0000558771efacd0>))

julia> ns = namespace(xml.root)
"http://www.w3.org/1998/Math/MathML"

julia> findall("//x:ci", xml.root, ["x"=>ns])
3-element Vector{EzXML.Node}:
 EzXML.Node(<ELEMENT_NODE[ci]@0x00005587718755e0>)
 EzXML.Node(<ELEMENT_NODE[ci]@0x000055877175d460>)
 EzXML.Node(<ELEMENT_NODE[ci]@0x0000558772b529d0>)
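
If what you are after is the text inside each ci rather than the nodes themselves, something like the following sketch may help. It assumes EzXML's nodecontent for reading the text, and the second query uses the plain XPath 1.0 local-name() function as an alternative to the prefix mapping (that variant is my own suggestion, not something from the documentation quoted above).

```julia
using EzXML

# Same MathML fragment as in the question.
str = """
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <apply>
    <times/>
    <ci> compartment
    </ci>
    <ci> k1
    </ci>
    <ci> S1
    </ci>
  </apply>
</math>
"""

xml = parsexml(str)
ns = namespace(xml.root)

# Bind the default namespace to the (arbitrary) prefix "x".
nodes = findall("//x:ci", xml.root, ["x" => ns])

# nodecontent returns the raw text, including the surrounding whitespace,
# so strip it to get the bare identifiers.
names = [String(strip(nodecontent(n))) for n in nodes]

# Alternative: match on the local element name and skip the prefix mapping.
nodes_local = findall("//*[local-name()='ci']", xml.root)
```

The prefix mapping is stricter, since it only matches ci elements in the MathML namespace, while the local-name() form matches ci in any namespace.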

Thanks!

@kescobo I have another case here that I think would be better not lost to Slackhole.

julia> str = """<cn type="e-notation" cellml:units="molar_per_minute">5   <sep/>-2</cn>"""
"<cn type=\"e-notation\" cellml:units=\"molar_per_minute\">5   <sep/>-2</cn>"

julia> parsexml(str)
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x0000000002880610>))

julia> parsexml(str).root
ERROR: AssertionError: isempty(XML_GLOBAL_ERROR_STACK)

I'm wondering whether I can ignore the undefined namespace prefix (basically remove all cellml: attributes from a Document), or what the standard workaround is here.

I suppose I could do findall("//*[@cellml:*]", node) and then just delete! the attributes.
This is sort of an annoying hack, as other formats might have other namespaces.

If you have any thoughts, I'd really appreciate it!

Alas, I have no idea. The only reason I knew how to answer the previous question is because I'd run into the same issue before and someone else helped me. I'm no expert! Good luck :-/
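
For the undeclared-prefix case above, one possible workaround, offered as a rough sketch rather than a confirmed answer: the fragment fails because the cellml prefix is never declared, so declaring it before parsing should make the document well-formed with respect to namespaces. The namespace URI below is an assumption (substitute whatever the original CellML file declares), and patching the string with replace is only for illustration on this one fragment.

```julia
using EzXML

str = """<cn type="e-notation" cellml:units="molar_per_minute">5   <sep/>-2</cn>"""

# Assumption: this URI stands in for whatever the source CellML file declares.
cellml_ns = "http://www.cellml.org/cellml/1.0#"

# Declare the prefix on the root element by patching its opening tag.
# This naive replace only targets a fragment whose root is <cn ...>.
patched = replace(str, "<cn " => "<cn xmlns:cellml=\"$cellml_ns\" "; count = 1)

doc = parsexml(patched)
cn = doc.root

# With the prefix declared, the attributes can be inspected as usual.
for attr in eachattribute(cn)
    println(nodename(attr), " = ", nodecontent(attr))
end
```

This does not generalize to other formats on its own: each undeclared prefix would need its own declaration, which is the same drawback noted for the delete! idea.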