python syntax for remote sources?
cboettig opened this issue
It appears the current interface provides a pure-Python syntax only for pandas DataFrames. It would be great to have a more general syntax for remote connections (including remote parquet, as in #150).
Following up from #150 (comment): maybe a syntax similar to how existing SQL-based modules work (i.e. establish some kind of connection object and then pass the PRQL to it as a string) could be used here?
Very open to this AFAIU — would it be possible to describe in a bit more detail?
Are you thinking of something like duckdb.prql as a parallel to duckdb.sql?
Yup, I think something like duckdb.prql would be a great option, though I realize PRQL is intended to be backend agnostic so maybe that seems less than ideal.
Maybe the most obvious approach is merely to provide bindings to SQL translation (maybe that's already possible and I've just overlooked it)? This is the approach taken in @eitsupi's very nice prqlr package, e.g. https://eitsupi.github.io/prqlr/articles/knitr.html. It supports the R equivalent of jupyter-magic with various backends, including in-memory data.frames, while for scripts it just generates SQL strings, which a user could pass to duckdb.sql() or whatever other backend db engine they prefer, e.g. something like:
import duckdb, pyprql

con = duckdb.connect()
query = pyprql.compile(f'''
from invoices
filter invoice_date >= @1970-01-16
derive [
  transaction_fees = 0.8,
  income = total - transaction_fees
]
filter income > 1
group customer_id (
  aggregate [
    average total,
    sum_income = sum income,
    ct = count,
  ]
)
''')
con.execute(query).df()
If we only want to compile, pyprql is not necessary.
The package layout is a bit complicated, but you can just use prql-python (prqlr is a package that implements almost the same functionality as prql-python in R).
Since this is just a string transformation, it can easily be used in combination with duckdb.
import duckdb
import polars as pl
import prql_python as prql
df = pl.DataFrame({'a': 42})
opts = prql.CompileOptions(target="sql.duckdb")
duckdb.sql(prql.compile("from df", options=opts)).pl()
(Example of Polars integration borrowed from https://duckdb.org/2023/02/13/announcing-duckdb-070.html)
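The connection-object syntax suggested earlier in the thread could be layered on top of this string transformation. Below is a minimal sketch, not an existing API: PrqlConnection, prql_to_sql and remote_parquet_view_sql are hypothetical names that belong to neither duckdb nor prql-python, and the URL in the usage comment is a placeholder. The only prql-python calls used are CompileOptions and compile, as in the example above.

```python
from typing import Any, Callable


def prql_to_sql(query: str, target: str = "sql.duckdb") -> str:
    """Compile a PRQL query to SQL using the prql-python bindings."""
    # Imported lazily so the wrapper below also works with a custom compiler.
    import prql_python as prql

    opts = prql.CompileOptions(target=target)
    return prql.compile(query, options=opts)


class PrqlConnection:
    """Hypothetical wrapper, not part of duckdb or prql-python."""

    def __init__(self, con: Any, compile_fn: Callable[[str], str] = prql_to_sql):
        self.con = con                # any object with an execute(sql) method
        self.compile_fn = compile_fn  # maps a PRQL string to a SQL string

    def prql(self, query: str) -> Any:
        """Compile the PRQL string, then hand the SQL to the connection."""
        return self.con.execute(self.compile_fn(query))


def remote_parquet_view_sql(name: str, url: str) -> str:
    """Build SQL that exposes a remote Parquet file as a named view.

    Assumes `name` is a trusted identifier; the URL is quoted as a SQL literal.
    """
    escaped = url.replace("'", "''")
    return f"CREATE OR REPLACE VIEW {name} AS SELECT * FROM read_parquet('{escaped}')"


# Usage sketch (requires duckdb, prql-python and pandas; the URL is a placeholder):
#   import duckdb
#   con = duckdb.connect()
#   con.execute("INSTALL httpfs"); con.execute("LOAD httpfs")
#   con.execute(remote_parquet_view_sql("invoices", "https://example.com/invoices.parquet"))
#   PrqlConnection(con).prql("from invoices | take 5").df()
```

Because the wrapper only needs an execute(sql) method, the same pattern should work with other backends by changing the compile target, which keeps PRQL itself backend agnostic.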
I agree that it would be great to have these examples in the documentation.
Thanks @eitsupi, this is great, precisely the workflow I was looking for!
Yes, I found the documentation a bit confusing on this: e.g. the PRQL README mentions only pyprql, and I didn't realize PRQL also provided bindings for most languages. Otherwise I think we can close this.
Great, we didn't explain that, so I added it to the docs in #154.
Please reopen / add a new issue for anything!