PRQL / pyprql

Python extensions for PRQL

Home Page:https://pyprql.readthedocs.io


python syntax for remote sources?

cboettig opened this issue

It appears the current interface provides a pure-Python syntax only for pandas DataFrames. It would be great to have a more general syntax for remote connections (including remote Parquet, as in #150).

Following up from #150 (comment): maybe a syntax similar to how existing SQL-based modules work (i.e. establish some kind of connection object and then pass the PRQL to it as a string) could be used here?

Very open to this, AFAIU. Would it be possible to describe it in a bit more detail?

Are you thinking of something like duckdb.prql as a parallel to duckdb.sql?

Yup, I think something like duckdb.prql would be a great option, though I realize PRQL is intended to be backend agnostic so maybe that seems less than ideal.

Maybe the most obvious approach is simply to provide bindings to the SQL translation (maybe that's already possible and I've just overlooked it)? This is the approach taken in @eitsupi's very nice prqlr package, e.g. https://eitsupi.github.io/prqlr/articles/knitr.html. It supports the R equivalent of Jupyter magics with various backends, including in-memory data.frames, while for scripts it just generates SQL strings that a user can pass to duckdb.sql() or whichever other backend DB engine they prefer, e.g. something like

import duckdb, pyprql
con = duckdb.connect()

query = pyprql.compile('''
from invoices
filter invoice_date >= @1970-01-16
derive [
  transaction_fees = 0.8,
  income = total - transaction_fees
]
filter income > 1
group customer_id (
  aggregate [
    average total,
    sum_income = sum income,
    ct = count,
  ]
)
''')

con.execute(query).df()

If we simply want to compile, pyprql is not necessary.
The naming is a bit confusing, but you can just use prql-python (prqlr is an R package that implements almost the same functionality as prql-python).

Since this is just a string transformation, it can easily be used in combination with duckdb.

import duckdb
import polars as pl
import prql_python as prql

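# DuckDB can query the in-scope Polars DataFrame `df` by name
df = pl.DataFrame({'a': 42})
# Compile PRQL to SQL in the DuckDB dialect
opts = prql.CompileOptions(target="sql.duckdb")

# prql.compile returns a SQL string; .pl() converts the result back to Polars
duckdb.sql(prql.compile("from df", options=opts)).pl()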

(Example of Polars integration borrowed from https://duckdb.org/2023/02/13/announcing-duckdb-070.html)
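
For anyone who wants the connection-object style discussed earlier in the thread, the compile-then-execute pattern above can be wrapped in a few lines. This is only a rough sketch, not part of pyprql or prql-python: the `PrqlConnection` class and its `prql()` method are hypothetical names, and `fake_compile` is a stand-in that only understands `from <table>`, so that the example runs with nothing but the standard library. In real use you would presumably pass `prql_python.compile` (or a small lambda that sets the target) as `compile_fn`, and a DuckDB connection as `con`.

```python
import sqlite3
from typing import Any, Callable


class PrqlConnection:
    """Hypothetical wrapper: compile PRQL to SQL, then run it on a connection."""

    def __init__(self, con: Any, compile_fn: Callable[[str], str]) -> None:
        self.con = con                # any object with .execute(sql)
        self.compile_fn = compile_fn  # e.g. prql_python.compile in real use

    def prql(self, query: str) -> Any:
        sql = self.compile_fn(query)  # pure string transformation
        return self.con.execute(sql)  # the backend only ever sees SQL


def fake_compile(query: str) -> str:
    """Stand-in compiler (assumption): handles only 'from <table>'."""
    keyword, table = query.split()
    if keyword != "from":
        raise ValueError("stand-in compiler only supports 'from <table>'")
    return f"SELECT * FROM {table}"


# Demo against an in-memory SQLite database instead of DuckDB.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE invoices (customer_id INTEGER, total REAL)")
con.execute("INSERT INTO invoices VALUES (1, 9.5)")

db = PrqlConnection(con, fake_compile)
rows = db.prql("from invoices").fetchall()
print(rows)
```

Keeping the compiler as an injected callable means the wrapper stays backend agnostic: the same class should work with any connection that exposes `execute`, and the SQL dialect is decided entirely by whatever `compile_fn` you hand in.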

I agree that it would be great to have these examples in the documentation.

Thanks @eitsupi , this is great, precisely the workflow I was looking for!

Yes, I found the documentation a bit confusing on this: the PRQL README mentions only pyprql, and I didn't realize that PRQL also provides bindings for most languages. Otherwise, I think we can close this.

Great. We hadn't explained that, so I added it to the docs in #154.

Please reopen / add a new issue for anything!