cldellow / datasette-parquet

Add DuckDB, Parquet, CSV and JSON lines support to Datasette

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Sanity checking the handling of WHERE statements with datasette

mrchrisadams opened this issue · comments

Hi there.

First of all, thanks for making datasette-parquet - I was very happy to discover it exists.

I'm using it on a project to play around with some recently published open data and whenever I try adding a "WHERE" clause to a query that would go to duckdb I'm seeing an error.

When I run load a page without a where clause, all is well.

However once I add a where clause like so:

Screenshot 2023-07-18 at 16 27 51

I end up with a confusing "Binder" error that I'm not sure how to debug:

Screenshot 2023-07-18 at 16 27 56

I'm working to get a test case together, but in the meantime, is the rewrite.py the file to check to see if there is any logic that it might be causing this?

https://github.com/cldellow/datasette-parquet/blob/main/datasette_parquet/rewrite.py#L80-L81

I see some logging statements, and knowing where to look would help for me to create a PR to fix it once I understand the cause of the issue.

Okay, I've had a go at looking at the SQL to see if that's an issue.

I've added some logging, and simplified where clause slightly

# this is the execute defined in winging_it.ProxyCursor.execute()
def execute(self, sql, parameters=None):
        
        with open('duckdb.log', 'w') as f:
            f.write('sql={}\n'.format(sql))
            f.write('parameters={}\n'.format(parameters))

            
        print('# params={} sql={}'.format(parameters, sql))
        sql = rewrite(sql)

        with open('duckdb.updated.log', 'w') as f:
            f.write('sql={}\n'.format(sql))
            f.write('parameters={}\n'.format(parameters))

        sql, parameters = fixup_params(sql, parameters)

        print('## params={} sql={}'.format(parameters, sql))
        t = time.time()
        
        rv = self.cursor.execute(sql, parameters)
        #print('took {}'.format(time.time() - t))
        return rv

Here's the SQL before it gets converted for duckdb:

select
  [Datetime (UTC)],
  Country,
  [Zone Name],
  [Zone Id],
  [Carbon Intensity gCO₂eq/kWh (direct)],
  [Carbon Intensity gCO₂eq/kWh (LCA)],
  [Low Carbon Percentage],
  [Renewable Percentage],
  [Data Source],
  [Data Estimated],
  [Data Estimation Method]
from
  result_zstd
 where 
  [Datetime (UTC)] < "2022-01-01"
limit
  10

Here's the updated sql

SELECT "Datetime (UTC)",
    Country,
    "Zone Name",
    "Zone Id",
    "Carbon Intensity gCO₂eq/kWh (direct)",
    "Carbon Intensity gCO₂eq/kWh (LCA)",
    "Low Carbon Percentage",
    "Renewable Percentage",
    "Data Source",
    "Data Estimated",
    "Data Estimation Method"
FROM result_zstd
WHERE "Datetime (UTC)" < "2022-01-01"
LIMIT 10

The error I see is as follows in the logs:

duckdb.BinderException: Binder Error: Referenced column "2022-01-01" not found in FROM clause!
Candidate bindings: "result_zstd.Country"

For some reason, it looks like duckdb is looking at the wrong part of the query, but I'm not sure why this is happening. I'll get a reproducible test case tonight.

Hey, thanks for the kind comments & detailed report!

What happens if you try single quotes rather than double quotes for the timestamp / date?

SQLite has some unfortunate quoting behaviour that dates from them trying to conform to what an ancient MySQL version did: https://www.sqlite.org/quirks.html#double_quoted_string_literals_are_accepted

If "a quoted string" doesn't actually resolve to a valid identifier, it'll interpret it as a literal (that is, as if you had typed 'a quoted string'). DuckDB doesn't do this fallback behaviour - double quotes are strictly for identifiers, and single quotes are strictly for literals. In this case, since you're specifying a literal, single quotes should be used.

(You might still hit other issues with the integration, of course! But hopefully this gets you one step closer.)

Haha! I think the double quote / single quote string was the issue!

Using the combo of square brackets for column names, and single quote for values in the datasette SQL interface worked without the errors, and I can reliably reproduce the error by switching to use double quotes in the SQL interface.

select
  [Datetime (UTC)],
  Country,
  [Zone Name],
  [Zone Id],
  [Carbon Intensity gCO₂eq/kWh (direct)],
  [Carbon Intensity gCO₂eq/kWh (LCA)],
  [Low Carbon Percentage],
  [Renewable Percentage],
  [Data Source],
  [Data Estimated],
  [Data Estimation Method]
from
  result_zstd
where 
  [Datetime (UTC)] BETWEEN '2022-01-02' AND '2022-01-03' 
  AND [Zone Id] = 'DK'
limit
  101

Result! Thank you!

With this in mind, the error that is returned is pretty cryptic.

Do you have any suggestions for catching the use of double strings in this scenario to warn people?

You could do it with some code a bit like the following, and a better named exception:

import re

def has_double_quoted_string(text):
    pattern = r'"([^"]*)"'
    matches = re.findall(pattern, text)
    return len(matches) > 0

if has_double_quoted_string(sql):
    raise Exception('Sorry, double quoted strings work in SQLite, but DuckDB handles them differently. Please use the single quotes for literal values')

The problem is, I don't think you can guarantee that the rest of datasette is avoiding the use of double quoted strings.

The code above also throws an exception when you have a string representation of a whole query, that begins and ends with double quote strings. For the code above to work, you'd need to know to check for when a double quoted string is only being used for literal, and I don't know how to do that.

I'm not sure how to do that, but I'll have a look around.

okay, this looks like it'll be a hairy regular expression. It might be simpler to hook into the SQL interface to warn users in writing. I'll see how hard that is.

do you have any suggestions for catching the use of double strings in this scenario to warn people.

I really like the idea of trying to nudge the user towards a solution when the error happens.

A rough and ready approach (which is the standard this project aims for) might be to wrap ProxyCursor.execute and ProxyConnection.execute's invocations of the underlying DuckDB execute call.

If they catch a duckdb.BinderException error whose text hints at a missing column:

duckdb.BinderException: Binder Error: Referenced column "2022-01-01" not found in FROM clause!

they can pull out the referenced column, see if it's quoted with double quotes and if it's present in the raw SQL with double quotes. If so, it can throw a new error that wraps the original one and points them towards checking for double quotes vs single quotes.

I don't think you can guarantee that the rest of datasette is avoiding the use of double quoted strings

Well, yes and no. I've opened a PR (simonw/datasette#2004) to fix all of its current misuses of double quotes. In the meantime, rewrite.py has a bunch of code to fix up the existing misuses.

you'd need to know to check for when a double quoted string is only being used for literal, and I don't know how to do that

Yes, I think this is the crux of it. SQLite's silent transformation is bad, as the authors themselves note. A query author typoing a column name can result in a query silently "succeeding". That is, the query will run without errors, but do something entirely unintended, as the clause will now refer to a fixed literal instead of a column name.

Ok, this approach sounds a lot more sensible,

If they catch a duckdb.BinderException error whose text hints at a missing column:

duckdb.BinderException: Binder Error: Referenced column "2022-01-01" not found in FROM clause!

they can pull out the referenced column, see if it's quoted with double quotes and if it's present in the raw SQL with double quotes. If so, it can throw a new error that wraps the original one and points them towards checking for double quotes vs single quotes.

Something like this seems to work and is catching the exception with the message whilst letting everything else through:

        import re
        from duckdb import BinderException 

        try:
            rv = self.cursor.execute(sql, parameters)
        except BinderException as exc:
            # should return 'Binder Error: Referenced column "LITERAL_IN_DOUBLE_QUOTES" '
            referenced_column_message = exc.args[0].split(' not found in FROM clause')[0]

            # a regular expression to find the literal wrapped by the double quoted string
            pattern = r'(?<![A-Za-z0-9_])"(?:[^"]*(?:"[^"]*)*[^"]*)"(?![A-Za-z0-9_])'

            # find the instance of the double quoted literal in the message, so you can refer to it in
            # the new exception message
            matches = re.findall(pattern, referenced_column_message)
            if matches:
                raise Exception((
                        f"It looks like you are using a double quoted string for a value at: {matches[0]}. "
                        "To make this work with DuckDB, wrap it in single quoted strings instead."
                    )
                )

Because we are now just checking that an error string is formatted a given way, this doesn't look too hard to write tests around.

I'll make a PR later this week. thanks @cldellow !

OK, @cldellow - I think this PR is in ok shape to look over now.

Thanks for the PR! I've merged it, keeping the gitpod stuff. I don't use gitpod myself, but it seems reasonable to include it.

I've released v0.6.1 to pypi with the improvements + a shout-out to you in the release notes.