Support for duckdb remote filesystem access?

Question

cboettig opened this issue a year ago · comments

I'm running DuckDB queries in Python against a remote source, like so:

import duckdb
con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

query = f'''
SELECT * 
FROM read_parquet("s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*")
LIMIT 10
'''

res = con.execute(query)
df = res.df()

This works rather nicely (this is a public dataset, so it should be reproducible), but I'd love to be able to write the queries in PRQL instead of SQL. Unfortunately, I cannot figure out the necessary syntax to get this to work with pyprql (either in jupyter magic or via the pandas integration). Any pointers?

Maximilian Roos · Answer 1 · Tue Mar 14 2023 05:25:20 GMT+0800 (China Standard Time)

Thanks for the issue, @cboettig.

This is possible, but should be friendlier.

Currently:

from s'SELECT * from read_parquet("s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*")'
take 10

...compiles to...

WITH table_0 AS (
  SELECT
    *
  from
    read_parquet(
      "s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*"
    )
)
SELECT
  *
FROM
  table_0 AS table_1
LIMIT
  10

-- Generated by PRQL compiler version:0.6.1 (https://prql-lang.org)

If it's a parquet file, then there's no need for the read_parquet in DuckDB, and we can just have:

from `taxi_2019_04.parquet`
take 10

...which is just...

SELECT
  *
FROM
  taxi_2019_04.parquet
LIMIT
  10

I think we should probably add a param to from that would compile to read_parquet for DuckDB, so the first query above would become:

from `s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*` format:parquet
take 10

Any thoughts on that?
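
Until something like the proposed format:parquet parameter exists, the same effect can be approximated on the Python side by generating the s-string form shown earlier in this answer. The sketch below is only an illustration: parquet_source and build_query are hypothetical helper names, not part of PRQL or pyprql, and the character check is a deliberately conservative assumption.

```python
def parquet_source(url: str) -> str:
    """Return a PRQL `from` line that reads `url` via DuckDB's read_parquet.

    Hypothetical helper: it only builds the s-string workaround from the thread.
    """
    # Quotes would terminate the s-string or the SQL string early, and braces
    # start an interpolation inside a PRQL s-string, so reject them outright.
    forbidden = set("'\"{}")
    if forbidden & set(url):
        raise ValueError(f"unsupported character in URL: {url!r}")
    return f"from s'SELECT * FROM read_parquet(\"{url}\")'"


def build_query(url: str, *steps: str) -> str:
    """Join the generated `from` line with further PRQL pipeline steps."""
    return "\n".join([parquet_source(url), *steps])


prql = build_query(
    "s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*",
    "take 10",
)
print(prql)
```

The resulting text can be pasted into a %%prql cell or handed to whichever PRQL compiler binding is in use.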

Carl Boettiger · Answer 2 · Tue Mar 14 2023 07:07:46 GMT+0800 (China Standard Time)

thanks, this looks great! Works fine in jupyter magic (I was surprised I didn't have to install & load the optional httpfs extension in duckdb first? maybe because it was already installed from my SQL example?).

%load_ext pyprql.magic
%prql duckdb:///:memory:

%%prql
from s'SELECT * from read_parquet("s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*")'
take 10

Also, I still can't figure out the analogous syntax for use in a pure python script, i.e. with the pandas-based interface. Is that possible already as well?

I do like your proposed syntax above for from!

Maximilian Roos · Answer 3 · Tue Mar 14 2023 07:27:01 GMT+0800 (China Standard Time)

> Also, I still can't figure out the analogous syntax for use in a pure python script, i.e. with the pandas-based interface. Is that possible already as well?

Hmmm, what would the SQL be for this? The default is to query the df — but maybe it allows joining in a parquet file?

> I do like your proposed syntax above for from!

Great, let me open an issue for this, and we can close this when your questions are answered
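
For the pure-Python case, one possible route that bypasses the pandas-based interface entirely is to compile the PRQL to SQL and pass the SQL to a DuckDB connection, as in the original SQL example. This is a sketch under assumptions, not something confirmed in this thread: it assumes the prql-python package is installed and exposes prql_python.compile (newer releases may ship under a different package name), and query_remote_parquet is a hypothetical helper name. The imports sit inside the function so that defining it does not require either package.

```python
PRQL_QUERY = """
from s'SELECT * from read_parquet("s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*")'
take 10
"""


def query_remote_parquet(prql_query: str):
    """Compile PRQL to SQL, run it in an in-memory DuckDB, return a DataFrame."""
    # Third-party imports are local so the module loads without them.
    import duckdb
    import prql_python  # assumption: PRQL compiler bindings are installed

    sql = prql_python.compile(prql_query)  # PRQL text in, SQL text out

    con = duckdb.connect()
    # httpfs provides the s3:// and https:// filesystems used by read_parquet.
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    return con.execute(sql).df()


# Usage (needs network access plus duckdb, pandas and prql-python installed):
# df = query_remote_parquet(PRQL_QUERY)
```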

Carl Boettiger · Answer 4 · Tue Mar 14 2023 07:49:40 GMT+0800 (China Standard Time)

Thanks! I think I am also hitting a deeper issue where the PRQL translator seems to get confused. I could be doing something wrong, but it looks like it wants to have some knowledge of the table schema that it doesn't have. e.g.

%%prql results <<
from s'SELECT * from read_parquet("s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*")'
filter 'phylum' == 'Chordata'
derive longitude = (decimallongitude | round 0)
take 4

query fails with:

[SQL: WITH table_1 AS (
  SELECT
    *
  from
    read_parquet(
      "s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*"
    )
)
SELECT
  *,
  ROUND(*, 0) AS longitude
FROM
  table_1 AS table_0
WHERE
  * = 'Chordata'
LIMIT
  4]

Weirdly, the SQL is incorrect, using * instead of the corresponding column names. If I try quoting the column names, it fails again, this time seeming to treat decimallongitude as VARCHAR, though it is floating point. Maybe I've just written the PRQL wrong; my pure-python SQL version would be:

query = f'''
  SELECT *
  FROM (
    SELECT ROUND(decimallongitude, 0) AS longitude
    FROM read_parquet("s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*")
    WHERE (phylum = 'Chordata')
  )
 LIMIT 4
'''
con.execute(query).df()

Maximilian Roos · Answer 5 · Tue Mar 14 2023 08:21:16 GMT+0800 (China Standard Time)

filter 'phylum' == 'Chordata'

With the quotes, 'phylum' is a string literal rather than a column reference, so this comparison is always false. In the latest version at https://prql-lang.org/playground/ it compiles to false, rather than *.

Removing the quotes I think gets us what we want:

from s'SELECT * from read_parquet("s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*")'
filter phylum == 'Chordata'
derive longitude = (decimallongitude | round 0)
take 4

WITH table_0 AS (
  SELECT
    *
  from
    read_parquet(
      "s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*"
    )
)
SELECT
  *,
  ROUND(decimallongitude, 0) AS longitude
FROM
  table_0 AS table_1
WHERE
  phylum = 'Chordata'
LIMIT
  4

-- Generated by PRQL compiler version:0.6.1 (https://prql-lang.org)
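
The difference between the quoted and unquoted forms can be reproduced without the remote dataset. The sketch below uses Python's built-in sqlite3 module rather than DuckDB, with a made-up three-row table, purely to show that a single-quoted name in a comparison is a string literal; the table and its values are illustrative assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE occurrence (phylum TEXT, decimallongitude REAL)")
con.executemany(
    "INSERT INTO occurrence VALUES (?, ?)",
    [("Chordata", -122.4), ("Arthropoda", 13.7), ("Chordata", 151.2)],
)

# 'phylum' in single quotes is the literal text "phylum", not the column,
# so this compares two different strings and never matches a row.
quoted = con.execute(
    "SELECT * FROM occurrence WHERE 'phylum' = 'Chordata'"
).fetchall()

# Without quotes, phylum refers to the column and the filter works.
unquoted = con.execute(
    "SELECT * FROM occurrence WHERE phylum = 'Chordata'"
).fetchall()

print(len(quoted), len(unquoted))
```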

Carl Boettiger · Answer 6 · Tue Mar 14 2023 08:41:00 GMT+0800 (China Standard Time)

Sorry, something seems to be messed up when this is parsed in my pyprql in python though! What you show is what I'd expect, but not what I get. Note my output above shows ROUND(*, 0) and WHERE * = 'Chordata'. Why the * where it should have the column names? Can you confirm this works in jupyter-magic python for you and not just in the playground translator?

Maximilian Roos · Answer 7 · Tue Mar 14 2023 08:45:08 GMT+0800 (China Standard Time)

Can you confirm it's using the query I have above, without the quotes around phylum? It's different from the query you have.

Carl Boettiger · Answer 8 · Tue Mar 14 2023 08:47:31 GMT+0800 (China Standard Time)

yes, here's what I see:

(note the data is public so this should work copy-pasted into any jupyter instance)

Maximilian Roos · Answer 9 · Tue Mar 14 2023 08:55:34 GMT+0800 (China Standard Time)

Thanks, you definitely are...

Could you try upgrading pyprql? I get the correct query on my end:

[nav] In [7]: %%prql results <<
         ...: from s'SELECT * from read_parquet("s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*")'
         ...: filter phylum == 'Chordata'
         ...: derive longitude = (decimallongitude | round 0)
         ...: take 4
(duckdb.IOException) IO Error: Extension "/Users/maximilian//.duckdb/extensions/v0.7.1/osx_arm64/httpfs.duckdb_extension" not found.
Extension "httpfs" is an existing extension.

Install it first using "INSTALL httpfs".
[SQL: WITH table_0 AS (
  SELECT
    *
  from
    read_parquet(
      "s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*"
    )
)
SELECT
  *,
  ROUND(decimallongitude, 0) AS longitude
FROM
  table_0 AS table_1
WHERE
  phylum = 'Chordata'
LIMIT
  4

-- Generated by PRQL compiler version:0.6.1 (https://prql-lang.org)]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

Carl Boettiger · Answer 10 · Tue Mar 14 2023 09:00:51 GMT+0800 (China Standard Time)

perfect, my bad, I should have thought of upgrading! looks like we are running nicely now!
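
Since the fix here was an upgrade, a quick way to see which versions are installed before reporting a mismatch is the standard-library importlib.metadata module. This is a small sketch; installed_version is a hypothetical helper name, and the package names listed are assumed to be the PyPI distribution names.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Assumed PyPI distribution names for the pieces discussed in this thread.
for name in ("pyprql", "prql-python", "duckdb"):
    print(name, installed_version(name) or "not installed")
```

If pyprql turns out to be old, pip install --upgrade pyprql brings in a newer release.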

Maximilian Roos · Answer 11 · Tue Mar 14 2023 09:18:16 GMT+0800 (China Standard Time)

Excellent! Please continue dropping issues — this has been helpful for both the lang and the python integration, and I'm keen to improve the integration.

Taras Novak · Answer 12 · Tue Mar 14 2023 21:30:24 GMT+0800 (China Standard Time)

easier way of running those queries in vscode: