Generic XML adapter

blackerby opened this issue 8 months ago

William Blackerby commented 8 months ago

Is your feature request related to a problem? Please describe.
Currently, I am not able to query XML files using Shillelagh. I have conducted Google searches, searches of project documentation, and a search of the Apache Superset Slack but have not found an off-the-shelf solution.

Describe the solution you'd like
I would like to provide source data for an Apache Superset chart from an XML file not on a local file system but available from a URL.

Describe alternatives you've considered
The only other thing I can think of would be to read the XML file into a Pandas DataFrame and query that (cf. #388 and related Slack conversation), which I admit I have not yet tried.

Additional context
The issue and Slack conversation referenced above established a need for a custom adapter. I expect my issue could be solved by a custom adapter, too. I am happy to put in the learning and work to develop such an adapter, but I want to make sure I'm not missing something obvious before I head down that path.

Beto Dealmeida · Answer 1 · Thu Oct 19 2023 22:56:33 GMT+0800 (China Standard Time)

@blackerby do you have an example of the XML response you'd want to query?

The generic JSON adapter allows the user to define a JSONPath expression; it shouldn't be hard to duplicate it and have an XML adapter using XPath.

William Blackerby · Answer 2 · Fri Oct 20 2023 08:43:39 GMT+0800 (China Standard Time)

I suspected duplicating the generic JSON adapter for XML would be the way to go. Right now I'm working on a different custom adapter (your tutorial is awesome by the way, thanks so much for the good docs), but when that's in good shape I can try to get started on this or pitch in if someone else wants to start work on this.

I've pasted the response to a GET request to the following URL below:

https://api.congress.gov/v3/bill/118?format=xml&offset=0&limit=2&api_key=<API_KEY_HERE>

Note: this specific API offers a JSON response option, but I've got other use cases that only offer XML, so this is just offered for the sake of example.

<?xml version="1.0" encoding="utf-8"?>
<api-root>
   <bills>
      <bill>
         <congress>
            118
         </congress>
         <type>
            SRES
         </type>
         <originChamber>
            Senate
         </originChamber>
         <originChamberCode>
            S
         </originChamberCode>
         <number>
            416
         </number>
         <url>
            https://api.congress.gov/v3/bill/118/sres/416?format=xml
         </url>
         <title>
            A resolution to authorize testimony and representation in United States v. Sullivan.
         </title>
         <updateDateIncludingText>
            2023-10-19T12:43:41Z
         </updateDateIncludingText>
         <latestAction>
            <actionDate>
               2023-10-18
            </actionDate>
            <text>
               Submitted in the Senate, considered, and agreed to without amendment and with a preamble by Unanimous Consent. (consideration: CR S5082-5083; text: CR S5091)
            </text>
         </latestAction>
         <updateDate>
            2023-10-19
         </updateDate>
      </bill>
      <bill>
         <congress>
            118
         </congress>
         <type>
            SRES
         </type>
         <originChamber>
            Senate
         </originChamber>
         <originChamberCode>
            S
         </originChamberCode>
         <number>
            415
         </number>
         <url>
            https://api.congress.gov/v3/bill/118/sres/415?format=xml
         </url>
         <title>
            A resolution to authorize testimony and representation in United States v. Samsel.
         </title>
         <updateDateIncludingText>
            2023-10-19T12:43:40Z
         </updateDateIncludingText>
         <latestAction>
            <actionDate>
               2023-10-18
            </actionDate>
            <text>
               Submitted in the Senate, considered, and agreed to without amendment and with a preamble by Unanimous Consent. (consideration: CR S5082-5083; text: CR S5091)
            </text>
         </latestAction>
         <updateDate>
            2023-10-19
         </updateDate>
      </bill>
   </bills>
   <pagination>
      <count>
         10503
      </count>
      <next>
         https://api.congress.gov/v3/bill/118?offset=2&amp;limit=2&amp;format=xml
      </next>
   </pagination>
   <request>
      <congress>
         118
      </congress>
      <contentType>
         application/xml
      </contentType>
      <format>
         xml
      </format>
   </request>
</api-root>
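
The row-selection idea suggested earlier in the thread (an XPath expression playing the role the JSONPath expression plays in the JSON adapter) can be sketched against a trimmed copy of this response. This is only an illustration, assuming the standard library's xml.etree.ElementTree; the helper name `select_text` and the trimmed `SAMPLE` are hypothetical and are not taken from Shillelagh. Note that the element text in this response is surrounded by whitespace, so it has to be stripped.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response above (two fields per bill, no XML declaration).
SAMPLE = """<api-root>
   <bills>
      <bill>
         <type>
            SRES
         </type>
         <number>
            416
         </number>
      </bill>
      <bill>
         <type>
            SRES
         </type>
         <number>
            415
         </number>
      </bill>
   </bills>
   <pagination>
      <count>
         10503
      </count>
   </pagination>
</api-root>
"""


def select_text(document: str, row_path: str, field: str) -> list[str]:
    """Return the stripped text of `field` for every element matching `row_path`."""
    root = ET.fromstring(document)
    # findall() is evaluated relative to the root element (<api-root>),
    # so the path starts with "./" rather than "/api-root".
    return [
        (element.findtext(field) or "").strip()
        for element in root.findall(row_path)
    ]


numbers = select_text(SAMPLE, "./bills/bill", "number")
print(numbers)
```

ElementTree implements only a subset of XPath, so an adapter built on it would accept fewer expressions than a full XPath engine.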

Beto Dealmeida · Answer 3 · Fri Oct 20 2023 12:06:23 GMT+0800 (China Standard Time)

Thanks! I started working on the XML adapter today, and I'll test it against https://api.congress.gov/.

Beto Dealmeida · Answer 4 · Sat Oct 21 2023 00:53:23 GMT+0800 (China Standard Time)

@blackerby how do you see the format of the response? For the endpoint above, for example, should each row be a string containing the XML:

sql> SELECT * FROM "https://api.congress.gov/v3/bill/118#/api-root/bills/bill" LIMIT 1;
bill
----
<bill>
    <congress>118</congress>
    <type>SRES</type>
    ...
</bill>
(1 row in 0.00s)

Or should it return a JSON representation of the data, so it can be processed with the JSON functions in SQLite? Something like:

sql> SELECT * FROM "https://api.congress.gov/v3/bill/118#/api-root/bills/bill" LIMIT 1;
bill
----
{"congress": 118, "type": "SRES", ...}
(1 row in 0.00s)

Even better, we could explode the payload to columns and have:

sql> SELECT * FROM "https://api.congress.gov/v3/bill/118#/api-root/bills/bill" LIMIT 1;
  congress  type    ...  latestAction
----------  ------       --------------------------------------------------------
       118  SRES         {"actionDate": "2023-10-18", "text": "Submitted in ..."}
(1 row in 0.00s)
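
The third option (exploding the payload to columns) could look roughly like the sketch below. It assumes xml.etree.ElementTree, keeps every leaf value as a string (no type inference, so `congress` stays "118" rather than becoming a number), and serializes nested elements such as latestAction to JSON. The function names and the shortened sample text are hypothetical; this is not the code that ended up in Shillelagh.

```python
import json
import xml.etree.ElementTree as ET

# Shortened stand-in for one <bill> from the response earlier in the thread.
SAMPLE = """<api-root>
   <bills>
      <bill>
         <congress>
            118
         </congress>
         <type>
            SRES
         </type>
         <latestAction>
            <actionDate>
               2023-10-18
            </actionDate>
            <text>
               Submitted in the Senate.
            </text>
         </latestAction>
      </bill>
   </bills>
</api-root>
"""


def element_to_value(element):
    """Leaf element -> stripped text; element with children -> dict of child values."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    # Repeated child tags would overwrite each other here; a real adapter
    # would need a policy for that case.
    return {child.tag: element_to_value(child) for child in children}


def explode(document, row_path):
    """Turn each element matching `row_path` into one row (column name -> value)."""
    root = ET.fromstring(document)
    rows = []
    for element in root.findall(row_path):
        row = {}
        for child in element:
            value = element_to_value(child)
            # Nested structures become JSON strings so they can be queried
            # with SQLite's JSON functions.
            row[child.tag] = json.dumps(value) if isinstance(value, dict) else value
        rows.append(row)
    return rows


rows = explode(SAMPLE, "./bills/bill")
print(rows[0]["congress"], rows[0]["type"], rows[0]["latestAction"])
```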

Beto Dealmeida · Answer 5 · Sat Oct 21 2023 01:16:52 GMT+0800 (China Standard Time)

Are XML attributes important? Or do we care more about the text?

William Blackerby · Answer 6 · Sat Oct 21 2023 04:35:18 GMT+0800 (China Standard Time)

To your first question, I think the third option (exploding the payload to columns) is the way to go. Then columns with JSON in them (like the latestAction column) can be further processed with SQLite's JSON functions.

To your second question about XML attributes, I will have use cases in which attributes are important, but they may be specific enough that they require a custom adapter, e.g., for MODS. Is it easy enough to incorporate attribute processing syntax into the XPath URL fragment?

Beto Dealmeida · Answer 7 · Sat Oct 21 2023 05:09:27 GMT+0800 (China Standard Time)

> To your second question about XML attributes, I will have use cases in which attributes are important, but they may be specific enough that they require a custom adapter, e.g., for MODS. Is it easy enough to incorporate attribute processing syntax into the XPath URL fragment?

I'm using xml.etree.ElementTree from the Python standard library to support XPath, and it supports attributes, so it should be fine. I'm more concerned about the process of converting XML to JSON, eg:

<foo bar="baz">hi</foo>

What should we map that to?

{"foo": "hi"} is what I have currently working.
{"foo": {"text": "hi", "@bar": "baz"}} is a common format, but seems too verbose.
{"foo": "hi", "foo:bar": "baz"} is a more concise option.
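
For comparison, the three candidate mappings can be produced from the same element with a few lines each. This is a sketch assuming xml.etree.ElementTree; the function names `text_only`, `nested`, and `flat` are hypothetical labels for the three options and do not come from Shillelagh.

```python
import xml.etree.ElementTree as ET


def text_only(element):
    """Option 1: keep only the element text and ignore attributes."""
    return {element.tag: (element.text or "").strip()}


def nested(element):
    """Option 2: nest the text under "text" and each attribute under "@name"."""
    value = {"text": (element.text or "").strip()}
    value.update({f"@{name}": attr for name, attr in element.attrib.items()})
    return {element.tag: value}


def flat(element):
    """Option 3: keep the text at the top level and add "tag:attribute" keys."""
    result = {element.tag: (element.text or "").strip()}
    result.update(
        {f"{element.tag}:{name}": attr for name, attr in element.attrib.items()}
    )
    return result


element = ET.fromstring('<foo bar="baz">hi</foo>')
print(text_only(element))
print(nested(element))
print(flat(element))
```

One trade-off visible here: the first and third options keep the text directly under the "foo" key, while the second changes the type of "foo" from a string to an object.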

William Blackerby · Answer 8 · Sat Oct 21 2023 06:42:14 GMT+0800 (China Standard Time)

I hear you on the verbosity concern, but the second option ({"foo": {"text": "hi", "@bar": "baz"}}) makes more intuitive sense to me.

cwegener · Answer 9 · Sat Oct 21 2023 13:05:07 GMT+0800 (China Standard Time)

> I'm using xml.etree.ElementTree from the Python standard library to support XPath, and it supports attributes, so it should be fine. I'm more concerned about the process of converting XML to JSON, eg:

I think defusedxml is the sane drop-in replacement for the built-in xml.etree package.

https://github.com/tiran/defusedxml

Beto Dealmeida · Answer 10 · Sun Oct 22 2023 20:33:35 GMT+0800 (China Standard Time)

@cwegener thanks for the tip on defusedxml!

Beto Dealmeida · Answer 11 · Sun Oct 22 2023 20:34:41 GMT+0800 (China Standard Time)

> I hear you on the verbosity concern, but the second option ({"foo": {"text": "hi", "@bar": "baz"}}) makes more intuitive sense to me.

@blackerby, I released 1.2.8 with a simple generic XML adapter that only cares about text. If needed, we can later implement a different algorithm that exposes XML attributes, along with a way of specifying which one should be used.

William Blackerby · Answer 12 · Sun Oct 22 2023 20:39:04 GMT+0800 (China Standard Time)

That is great, thanks @betodealmeida. I'll play with the new release this week and open a new issue if/when access to XML attributes becomes a challenge. I'm also looking forward to digging into the commit that added the XML adapter -- seems like a great way to learn.