betodealmeida / shillelagh

Making it easy to query APIs via SQL

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Generic XML adapter

blackerby opened this issue · comments

Is your feature request related to a problem? Please describe.
Currently, I am not able to query XML files using Shillelagh. I have conducted Google searches, searches of project documentation, and a search of the Apache Superset Slack but have not found an off-the-shelf solution.

Describe the solution you'd like
I would like to provide source data for an Apache Superset chart from an XML file not on a local file system but available from a URL.

Describe alternatives you've considered
The only other thing I can think of would be to read the XML file into a Pandas DataFrame and query that (cf. #388 and related Slack conversation), which I admit I have not yet tried.

Additional context
The issue and Slack conversation referenced above established a need for a custom adapter. I expect my issue could be solved by a customer adapter, too. I am happy to put in the learning and work to develop such an adapter, but I want to make sure I'm not missing something obvious before I head down that path.

@blackerby do you have an example of the XML response you'd want to query?

The generic JSON adapter allows the user to define a JSONPath expression; it shouldn't be hard to duplicate it and have a XML adapter using XPath.

I suspected duplicating the generic JSON adapter for XML would be the way to go. Right now I'm working on a different custom adapter (your tutorial is awesome by the way, thanks so much for the good docs), but when that's in good shape I can try to get started on this or pitch in if someone else wants to start work on this.

I've pasted the response to a GET request to the following URL below:

https://api.congress.gov/v3/bill/118?format=xml&offset=0&limit=2&api_key=<API_KEY_HERE>

Note: this specific API offers a JSON response option, but I've got other use cases that only offer XML, so this is just offered for the sake of example.

<?xml version="1.0" encoding="utf-8"?>
<api-root>
   <bills>
      <bill>
         <congress>
            118
         </congress>
         <type>
            SRES
         </type>
         <originChamber>
            Senate
         </originChamber>
         <originChamberCode>
            S
         </originChamberCode>
         <number>
            416
         </number>
         <url>
            https://api.congress.gov/v3/bill/118/sres/416?format=xml
         </url>
         <title>
            A resolution to authorize testimony and representation in United States v. Sullivan.
         </title>
         <updateDateIncludingText>
            2023-10-19T12:43:41Z
         </updateDateIncludingText>
         <latestAction>
            <actionDate>
               2023-10-18
            </actionDate>
            <text>
               Submitted in the Senate, considered, and agreed to without amendment and with a preamble by Unanimous Consent. (consideration: CR S5082-5083; text: CR S5091)
            </text>
         </latestAction>
         <updateDate>
            2023-10-19
         </updateDate>
      </bill>
      <bill>
         <congress>
            118
         </congress>
         <type>
            SRES
         </type>
         <originChamber>
            Senate
         </originChamber>
         <originChamberCode>
            S
         </originChamberCode>
         <number>
            415
         </number>
         <url>
            https://api.congress.gov/v3/bill/118/sres/415?format=xml
         </url>
         <title>
            A resolution to authorize testimony and representation in United States v. Samsel.
         </title>
         <updateDateIncludingText>
            2023-10-19T12:43:40Z
         </updateDateIncludingText>
         <latestAction>
            <actionDate>
               2023-10-18
            </actionDate>
            <text>
               Submitted in the Senate, considered, and agreed to without amendment and with a preamble by Unanimous Consent. (consideration: CR S5082-5083; text: CR S5091)
            </text>
         </latestAction>
         <updateDate>
            2023-10-19
         </updateDate>
      </bill>
   </bills>
   <pagination>
      <count>
         10503
      </count>
      <next>
         https://api.congress.gov/v3/bill/118?offset=2&amp;limit=2&amp;format=xml
      </next>
   </pagination>
   <request>
      <congress>
         118
      </congress>
      <contentType>
         application/xml
      </contentType>
      <format>
         xml
      </format>
   </request>
</api-root>

Thanks! I started working on the XML adapter today, I'll test it against https://api.congress.gov/.

@blackerby how do you see the format of the response? For the endpoint above, eg, should each row be a string with the XML:

sql> SELECT * FROM "https://api.congress.gov/v3/bill/118#/api-root/bills/bill" LIMIT 1;
bill
----
<bill>
    <congress>118</congress>
    <type>SRES</type>
    ...
</bill>
(1 row in 0.00s) 

Or should it return a JSON representation of the data? (so it can be processed with the JSON functions in SQLite) Something like:

sql> SELECT * FROM "https://api.congress.gov/v3/bill/118#/api-root/bills/bill" LIMIT 1;
bill
----
{"congress": 118, "type": "SRES", ...}
(1 row in 0.00s) 

Even better, we could explode the payload to columns and have:

sql> SELECT * FROM "https://api.congress.gov/v3/bill/118#/api-root/bills/bill" LIMIT 1;
  congress  type    ...  latestAction
----------  ------       --------------------------------------------------------
       118  SRES         {"actionDate": "2023-10-18", "text": "Submitted in ..."}
(1 row in 0.00s) 

Are XML attributes important? Or do we care more about the text?

To your first question, I think the third option (exploding the payload to columns) is the way to go. Then columns with JSON in them (like the latestAction column) can be further processed with SQLite's JSON functions.

To your second question about XML attributes, I will have use cases in which attributes are important, but they may be specific enough that they require a custom adapter, e.g., for MODS. Is it easy enough to incorporate attribute processing syntax into the XPath URL fragment?

To your second question about XML attributes, I will have use cases in which attributes are important, but they may be specific enough that they require a custom adapter, e.g., for MODS. Is it easy enough to incorporate attribute processing syntax into the XPath URL fragment?

I'm using xml.etree.ElementTree from the Python standard library to support XPath, and it supports attributes, so it should be fine. I'm more concerned about the process of converting XML to JSON, eg:

<foo bar="baz">hi</foo>

What should we map that to?

  • {"foo": "hi"} is what I have currently working.
  • {"foo": {"text": "hi", "@bar": "baz"}} is a common format, but seems too verbose.
  • {"foo": "hi", "foo:bar": "baz"} is a more concise option.

I hear you on the verbosity concern, but the second option ({"foo": {"text": "hi", "@bar": "baz"}}) makes more intuitive sense to me.

I'm using xml.etree.ElementTree from the Python standard library to support XPath, and it supports attributes, so it should be fine. I'm more concerned about the process of converting XML to JSON, eg:

I think defusedxml is the sane drop-in replacement for the built-in xml.etree package.

https://github.com/tiran/defusedxml

@cwegener thanks for the tip on defusedxml!

I hear you on the verbosity concern, but the second option ({"foo": {"text": "hi", "@bar": "baz"}}) makes more intuitive sense to me.

@blackerby, I released 1.2.8 with a simple generic XML adapter that only cares about text. If we need we can later implement a different algorithm that exposes XML attributes, and have a way of specifying which one should be used.

That is great, thanks @betodealmeida. I'll play with the new release this week and open a new issue if/when access to XML attributes becomes a challenge. I'm also looking forward to digging into the commit that added the XML adapter -- seems like a great way to learn.