XPath not working as expected with findall

Question

anandijain opened this issue 3 years ago

I cannot seem to make findall work as intended. According to https://www.w3schools.com/xml/xpath_syntax.asp, the XPath //ci should match the three ci elements (compartment, k1, S1), but findall returns an empty array of nodes.

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <apply>
    <times/>
    <ci> compartment 
    </ci>
    <ci> k1 
    </ci>
    <ci> S1 
    </ci>
  </apply>
</math>

julia> findall("//ci", xml)
EzXML.Node[]

anand jain commented 3 years ago

Thanks!

Kevin Bonham · Answer 1 · Wed Mar 10 2021 02:53:32 GMT+0800 (China Standard Time)

Your document has a default namespace (the xmlns attribute on <math>), so the unprefixed XPath //ci matches nothing. See the EzXML manual.

In particular:

There is a caveat on the combination of XPath and namespaces: if a document contains elements with a default namespace, you need to specify its prefix to the find* function. In the following example, the root element and its descendants have a default namespace "http://www.foobar.org", but it does not have its own prefix. In this case, you need to assign a prefix to the namespace when finding elements in the namespace:
julia> doc = parsexml("""
       <parent xmlns="http://www.foobar.org">
           <child/>
       </parent>
       """)
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fdc67710030>))

julia> findall("/parent/child", doc.root)  # nothing will be found
0-element Array{EzXML.Node,1}

julia> namespaces(doc.root)  # the default namespace has an empty prefix
1-element Array{Pair{String,String},1}:
 "" => "http://www.foobar.org"

julia> ns = namespace(doc.root)  # get the namespace
"http://www.foobar.org"

julia> findall("/x:parent/x:child", doc.root, ["x"=>ns])  # specify its prefix as "x"
1-element Array{EzXML.Node,1}:
 EzXML.Node(<ELEMENT_NODE[child]@0x00007fdc6774c990>)

So for yours:

julia> str = """
              <math xmlns="http://www.w3.org/1998/Math/MathML">
                <apply>
                  <times/>
                  <ci> compartment
                  </ci>
                  <ci> k1
                  </ci>
                  <ci> S1
                  </ci>
                </apply>
              </math>
              """
"<math xmlns=\"http://www.w3.org/1998/Math/MathML\">\n  <apply>\n    <times/>\n    <ci> compartment\n    </ci>\n    <ci> k1\n    </ci>\n    <ci> S1\n    </ci>\n  </apply>\n</math>\n"

julia> xml = parsexml(str)
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x0000558771efacd0>))

julia> ns = namespace(xml.root)
"http://www.w3.org/1998/Math/MathML"

julia> findall("//x:ci", xml.root, ["x"=>ns])
3-element Vector{EzXML.Node}:
 EzXML.Node(<ELEMENT_NODE[ci]@0x00005587718755e0>)
 EzXML.Node(<ELEMENT_NODE[ci]@0x000055877175d460>)
 EzXML.Node(<ELEMENT_NODE[ci]@0x0000558772b529d0>)
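
If registering a prefix is inconvenient, a namespace-agnostic XPath is another option. This is a minimal sketch, assuming the same MathML snippet as in the question: local-name() is a standard XPath 1.0 function, and nodecontent is EzXML's accessor for an element's text. Note that it matches ci elements in any namespace, which may be broader than you want.

```julia
using EzXML

xml = parsexml("""
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <apply>
    <times/>
    <ci> compartment </ci>
    <ci> k1 </ci>
    <ci> S1 </ci>
  </apply>
</math>
""")

# Match on the local element name only, ignoring the namespace it lives in.
nodes = findall("//*[local-name()='ci']", xml.root)

# nodecontent returns the raw text, so strip the surrounding whitespace.
names = [strip(nodecontent(n)) for n in nodes]
```

The prefix-mapping form shown above is stricter, since it only matches elements that really are in the MathML namespace.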

anand jain · Answer 2 · Fri Mar 12 2021 02:55:13 GMT+0800 (China Standard Time)

@kescobo I have another case here that I think would be better not lost to Slackhole.

julia> str = """<cn type="e-notation" cellml:units="molar_per_minute">5   <sep/>-2</cn>"""
"<cn type=\"e-notation\" cellml:units=\"molar_per_minute\">5   <sep/>-2</cn>"

julia> parsexml(str)
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x0000000002880610>))

julia> parsexml(str).root
ERROR: AssertionError: isempty(XML_GLOBAL_ERROR_STACK)

I'm wondering whether I can ignore the undefined namespace prefix (basically, remove all cellml: attributes from a Document), or what the standard workaround is.

I suppose I could do findall("//*[@cellml:*]", node) and then just delete! the attributes.
This is sort of an annoying hack, as other formats might have other namespaces.

If you have any thoughts, I'd really appreciate it!

Kevin Bonham · Answer 3 · Fri Mar 12 2021 03:07:16 GMT+0800 (China Standard Time)

Alas, I have no idea. The only reason I knew how to answer the previous question is because I'd run into the same issue before and someone else helped me. I'm no expert! Good luck :-/
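
For the undefined cellml: prefix in the second question, one possible workaround is to wrap the fragment in a root element that declares the prefix before parsing, so the parser never sees an unbound prefix. This is only a sketch, not an established EzXML recipe: the helper name parse_with_prefix and the wrapper element are hypothetical, and the namespace URI is assumed to be the CellML 1.0 one. Substitute whatever URI your files actually use.

```julia
using EzXML

# Hypothetical helper: wrap `fragment` in a root element that binds `prefix`
# to `uri`, then parse the result.
function parse_with_prefix(fragment::AbstractString, prefix::AbstractString, uri::AbstractString)
    wrapped = string("<wrapper xmlns:", prefix, "=\"", uri, "\">", fragment, "</wrapper>")
    return parsexml(wrapped)
end

str = """<cn type="e-notation" cellml:units="molar_per_minute">5   <sep/>-2</cn>"""

# Assumed URI for the cellml prefix; adjust to match the source document.
doc = parse_with_prefix(str, "cellml", "http://www.cellml.org/cellml/1.0#")

# The original <cn> element is the first child element of the wrapper.
cn = firstelement(doc.root)
```

This does not scale automatically to other formats with other prefixes, since each prefix still has to be declared, but it leaves the attributes in place rather than deleting them.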