PRQL / pyprql

Python extensions for PRQL

Home Page:https://pyprql.readthedocs.io

Add support for python magic

rbpatt2019 opened this issue

Now that we have the rust backend, let's make magic happen for ipython/jupyter!

The challenge here will be handling connections to different database types. fugue is probably the best bet for operating on in memory data frames or csv files, as we can route out to dask or spark and get cracking fast. But it can't connect to existing database files (someone call me on this if I'm wrong). DuckDB is brilliant, but only natively supports duckdb instances. There are extensions to convert other types (sqlite, mysql, etc) to duckdb, but none are supported in the Python API. There is an sqlalchemy driver for duckdb though.

This is all a really long way of saying that I think we should probably check the connection input. If it's an in-memory dataframe or a csv file, use fugue[spark,sql] to run the compiled sql query. If it's anything else, including duckdb, just try connecting with sqlalchemy and using their methods. What are everyone's thoughts?
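To make the proposed check concrete, here is a minimal sketch of the routing decision. The function name `choose_backend` and the returned labels are hypothetical, and detecting a dataframe by class name is only an assumption made so the sketch runs without pandas installed; a real implementation would likely use `isinstance` against the actual dataframe type.

```python
from pathlib import Path


def choose_backend(source):
    """Return "fugue" for in-memory frames and CSV files, else "sqlalchemy"."""
    # Assumption: identify a dataframe by its class name to avoid importing pandas here.
    if type(source).__name__ == "DataFrame":
        return "fugue"
    # A path or string ending in .csv is treated as a CSV file.
    if isinstance(source, (str, Path)) and str(source).lower().endswith(".csv"):
        return "fugue"
    # Everything else, including duckdb URLs, is handed to sqlalchemy.
    return "sqlalchemy"
```

With this sketch, `choose_backend("data.csv")` routes to fugue, while `choose_backend("duckdb:///:memory:")` routes to sqlalchemy.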

Great idea!

sqlalchemy sounds like a good plan to allow flexibility.

How do we envisage passing things like connection parameters?

Are there any popular magics out there for using SQL? Potentially we could piggyback on those. IIRC bigquery had a bigquery-specific one (but their connection model is a bit different from traditional DBs).

What would you see the return type being? A pandas dataframe?

With the ipython/jupyter cell magic, everything following %%prql but on the same line as it is passed as a string. Conceivably, additional arguments beyond connection strings could be parsed out that way. There is a whole setup that basically mimics argparse for use with magics, but I'd lean away from that where possible. These magics work best the simpler they are.

There are definitely several existing magics. There's an %%sql magic from the ipython-sql package. If we were to build on one, that's probably the best. You pass a connection string, then an sql query, and hey presto, results. It already has expansive functionality and, since it uses sqlalchemy under the hood, it supports any format supported by them, including duckdb. They natively support everything we'd want, except reading from csv's. Would just need to see about adding our parsing into it...

fugue also have a magic %%fsql which is brilliant for handling sql with in memory pandas data frames. What makes this particularly challenging is namespace limitations. The fsql query expects the table name to be the name of the in-memory dataframe. But it actually takes a lot of work to safely move that between scopes in python.

My inclination is to prefer pandas data frames as the output (supported by both plugins above). Then, people can either panda's their data, or do more queries.

Those look great! Having the equivalent of %%sql as %%prql would be really good — definitely the best way for people to use prql so far.

And possibly this isn't too difficult™ if we take ipython-sql as a dependency and pass most args along, though it's rarely that easy :)

That's what I'm investigating today. The first two arguments of every python cell magic are line and cell. The first is a string of everything on the line, the second is a string of everything else. If we can write a decorator that just wraps their function and compiles prql to sql where appropriate, it'd be rather straightforward. (Famous last words, I know!)
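A minimal sketch of that decorator idea, assuming the wrapped magic is a plain function taking `(line, cell)`. The names `prql_wrapper`, `sql_magic`, and `compile_prql` are hypothetical stand-ins: the real SQL magic and the real PRQL compiler would be passed in, and neither is imported here.

```python
import functools


def prql_wrapper(sql_magic, compile_prql):
    """Wrap an existing (line, cell) SQL magic so that it accepts PRQL."""

    @functools.wraps(sql_magic)
    def prql_magic(line, cell=None):
        # Line-only usage (connection strings, options) passes through untouched.
        if cell is None or not cell.strip():
            return sql_magic(line, cell)
        # Otherwise compile the cell body from PRQL to SQL before delegating.
        return sql_magic(line, compile_prql(cell))

    return prql_magic
```

The line is forwarded unchanged, so whatever the underlying magic accepts there (connection strings, flags) keeps working; only the cell body is translated.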

It was, dare I say it, more or less that simple. Branch fix043 should have a minimal working magic that you can activate by calling import pyprql.magic in either IPython or Jupyter. It's still super rough - currently, additional args can't be passed as line magic, which breaks some of the functionality (example below)

Just a word on features as well - their set is all we need, I think. If you have a dataframe in memory and an established connection to a database, then you can add that dataframe to the database with %prql --persist df. Since they wrap sqlalchemy, if we ship with the sqlalchemy duckdb extension, then behold in memory querying of data frames! ...Once I get arguments on line magics sorted :p

That sounds great!

How are you getting it working? I tried in Jupyter & IPython but it didn't immediately work. I can look more if helpful but figure I'm just doing something wrong:

In [1]: import pyprql.magic

In [2]: %prql
UsageError: Line magic function `%prql` not found.

In [3]: %%prql
   ...: derive x: 3
   ...:
   ...:
UsageError: Cell magic `%%prql` not found.

Oops sorry, I made some major refactors and created a new local branch without pushing it. Just deleted the remote and pushed the new, so try now!

No prob at all. I just pulled the latest.

I'm confident it's me doing something wrong — I seem to recall issues installing jupyter extensions in the past.

What are your install steps? I'm doing:

 ~/w/PyPrq
❯ git rev-parse HEAD
2401f0190b136b6ddb77c48159c4064b3edbe377

 ~/w/PyPrql
❯ pip install -e .

and then

ipython

In [1]: %%prql
   ...: derive x = 3
   ...:
UsageError: Cell magic `%%prql` not found.

In [2]: %prql derive x = 3
UsageError: Line magic function `%prql` not found.

Sorry for the absence! Back at work on this now. It should now be (roughly) functional. A pair of caveats: 1) the line magic %prql can currently only be used for arguments and connection strings. 2) The cell magic only returns results if you store them in a local variable. So, assuming you have an in-memory dataframe named data, the following will let you use the magic:

import pyprql.magic

%prql duckdb:///:memory:

%prql --persist data

%%prql results <<
from data
filter freg > 100
select [ food_name ]

print(results)

A few more comments on in-progress bits. We shouldn't need to pass the cell magic to a local variable to see the results (that's what results << is doing in the cell magic). But for reasons I haven't figured out, leaving the local variable out returns nothing, regardless of the query. Also, the original %sql line magic supports single-line queries. However, the parsing to determine whether the line is a connection string, options, or a query is extensive, and we'd need to implement most of it, as we'd need to identify where the prql query is in order to compile it to SQL.
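To illustrate the kind of decision the line parsing has to make, here is a deliberately simplistic sketch. The function `classify_line` and its heuristics are assumptions for illustration only, not how the %sql magic actually parses its input, and it ignores lines that mix a connection string or options with a query.

```python
def classify_line(line):
    """Guess whether a magic line holds options, a connection string, or a query."""
    text = line.strip()
    if not text:
        return "empty"
    first = text.split()[0]
    # Assumption: options always start with a dash, e.g. --persist.
    if first.startswith("-"):
        return "options"
    # Assumption: connection strings look like URLs, e.g. duckdb:///:memory:.
    if "://" in first:
        return "connection"
    # Anything else is assumed to be a query that needs compiling.
    return "query"
```

Only lines classified as "query" would need to go through the PRQL compiler; the other kinds could be forwarded as they are.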

What's confusing me about the cell magic is that manually running the query with %%sql returns results without local assignment, just as I'd expect. Not sure why that disappears in our wrapper...

There's a pull request for this now. See #44