'SECOND' is not recognized as a keyword in the Databricks dialect

beipang opened this issue 2 months ago · comments

Bei Pang commented 2 months ago

Search before asking

I searched the issues and found no similar issues.

What Happened

We are migrating our SQL queries from Redshift to SparkSQL.

SQLFluff raises errors on our use of the DATEDIFF function.

The DATEDIFF documentation shows that this function accepts "SECOND" as a unit.

However, the linter reports errors on our query because of the unit "SECOND".

Expected Behaviour

The linter should accept SECOND as a keyword and return no error.

The same query passes with the redshift dialect:

$ sqlfluff lint --dialect redshift example.sql --config xxx
All Finished 📜 🎉!

Observed Behaviour

The linter throws two errors:

First, it requires "SECOND" to be changed to lower case: CP02 | Unquoted identifiers must be consistently lower case.
Second, after changing it to "second", the linter still raises an error:

RF02 | Unqualified reference 'second' found in select with more
          | than one referenced table/view.
          | [references.qualification]

This behavior suggests that the databricks dialect is treating "SECOND" as an identifier instead of as a keyword.

Proof that 'SECOND' is being treated as an identifier:

Run $ sqlfluff parse --dialect databricks example.sql
and look for the line "[L:  5, P: 16] | naked_identifier: 'SECOND'" in the output:

[L:  1, P:  1]      |file:
[L:  1, P:  1]      |    statement:
[L:  1, P:  1]      |        select_statement:
[L:  1, P:  1]      |            select_clause:
[L:  1, P:  1]      |                keyword:                                      'SELECT'
[L:  1, P:  7]      |                newline:                                      '\n'
[L:  2, P:  1]      |                whitespace:                                   '    '
[L:  2, P:  5]      |                [META] indent:
[L:  2, P:  5]      |                select_clause_element:
[L:  2, P:  5]      |                    column_reference:
[L:  2, P:  5]      |                        naked_identifier:                     'a'
[L:  2, P:  6]      |                comma:                                        ','
[L:  2, P:  7]      |                newline:                                      '\n'
[L:  3, P:  1]      |                whitespace:                                   '    '
[L:  3, P:  5]      |                select_clause_element:
[L:  3, P:  5]      |                    column_reference:
[L:  3, P:  5]      |                        naked_identifier:                     'b'
[L:  3, P:  6]      |                [META] dedent:
[L:  3, P:  6]      |            newline:                                          '\n'
[L:  4, P:  1]      |            from_clause:
[L:  4, P:  1]      |                keyword:                                      'FROM'
[L:  4, P:  5]      |                whitespace:                                   ' '
[L:  4, P:  6]      |                from_expression:
[L:  4, P:  6]      |                    [META] indent:
[L:  4, P:  6]      |                    from_expression_element:
[L:  4, P:  6]      |                        table_expression:
[L:  4, P:  6]      |                            table_reference:
[L:  4, P:  6]      |                                naked_identifier:             'my_table'
[L:  4, P: 14]      |                    [META] dedent:
[L:  4, P: 14]      |            newline:                                          '\n'
[L:  5, P:  1]      |            where_clause:
[L:  5, P:  1]      |                keyword:                                      'WHERE'
[L:  5, P:  6]      |                [META] (implicit) indent:
[L:  5, P:  6]      |                whitespace:                                   ' '
[L:  5, P:  7]      |                expression:
[L:  5, P:  7]      |                    function:
[L:  5, P:  7]      |                        function_name:
[L:  5, P:  7]      |                            function_name_identifier:         'DATEDIFF'
[L:  5, P: 15]      |                        bracketed:
[L:  5, P: 15]      |                            start_bracket:                    '('
[L:  5, P: 16]      |                            [META] indent:
[L:  5, P: 16]      |                            expression:
[L:  5, P: 16]      |                                column_reference:
[L:  5, P: 16]      |                                    naked_identifier:         'SECOND'
[L:  5, P: 22]      |                            comma:                            ','
[L:  5, P: 23]      |                            whitespace:                       ' '
[L:  5, P: 24]      |                            expression:
[L:  5, P: 24]      |                                column_reference:
[L:  5, P: 24]      |                                    naked_identifier:         'timestamp_a'
[L:  5, P: 35]      |                            comma:                            ','
[L:  5, P: 36]      |                            whitespace:                       ' '
[L:  5, P: 37]      |                            expression:
[L:  5, P: 37]      |                                column_reference:
[L:  5, P: 37]      |                                    naked_identifier:         'timestamp_b'
[L:  5, P: 48]      |                            [META] dedent:
[L:  5, P: 48]      |                            end_bracket:                      ')'
[L:  5, P: 49]      |                    whitespace:                               ' '
[L:  5, P: 50]      |                    comparison_operator:
[L:  5, P: 50]      |                        raw_comparison_operator:              '>'
[L:  5, P: 51]      |                    whitespace:                               ' '
[L:  5, P: 52]      |                    numeric_literal:                          '1'
[L:  5, P: 53]      |                [META] dedent:
[L:  5, P: 53]      |    newline:                                                  '\n'
[L:  6, P:  1]      |    [META] end_of_file:
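
To make the check above repeatable, here is a minimal sketch that scans saved `sqlfluff parse` output for a given token and reports which segment type it was parsed as. The helper name `segments_for` and the `SAMPLE` excerpt are only illustrative (they are not part of SQLFluff), and the regular expression assumes the text layout shown in the tree above.

```python
import re

# Matches leaf lines of `sqlfluff parse` text output, for example:
# "[L:  5, P: 16]      |        naked_identifier:         'SECOND'"
# Assumption: the layout is the one shown in the parse tree above.
LEAF = re.compile(
    r"\[L:\s*(?P<line>\d+),\s*P:\s*(?P<pos>\d+)\]\s*\|\s*"
    r"(?P<type>\w+):\s+'(?P<raw>.*)'\s*$"
)


def segments_for(parse_output, raw):
    """Return (line, pos, segment_type) for each leaf whose raw text equals `raw`.

    The comparison is case-insensitive, so 'second' and 'SECOND' both match.
    """
    hits = []
    for text_line in parse_output.splitlines():
        match = LEAF.search(text_line)
        if match and match.group("raw").upper() == raw.upper():
            hits.append(
                (int(match.group("line")), int(match.group("pos")), match.group("type"))
            )
    return hits


# Hypothetical excerpt copied from the parse tree in this report.
SAMPLE = """\
[L:  5, P:  7]      |                            function_name_identifier:         'DATEDIFF'
[L:  5, P: 16]      |                                    naked_identifier:         'SECOND'
[L:  5, P: 22]      |                            comma:                            ','
"""

for line, pos, seg_type in segments_for(SAMPLE, "SECOND"):
    print(f"L{line} P{pos}: {seg_type}")  # prints "L5 P16: naked_identifier"
```

If the dialect handled the unit as expected, the reported segment type would presumably be something other than naked_identifier (the exact type is not confirmed here).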

How to reproduce

To reproduce the issue:

The example.sql:

SELECT
  my_table.a,
  other_table.b
FROM my_table
  LEFT JOIN other_table
    ON DATEDIFF(SECOND, my_table.timestamp_a, other_table.timestamp_b) > 1

Dialect

Databricks and SparkSQL

Version

sqlfluff, version 2.3.5
Python 3.9.18

Configuration

[sqlfluff]
templater = jinja
sql_file_exts = .sql
# L016 - Line Length (unnecessary given our macro logic)
# L029 - Keywords should not be used as identifiers (too many legacy tables rely on this)
# L034 - Statement ordering complexity (not-applicable)
exclude_rules = L016,L029,L034
large_file_skip_byte_limit=50000

[sqlfluff:indentation]
indent_unit = space
tab_space_size = 2
indented_joins = true
indented_using_on = true
template_blocks_indent = false

[sqlfluff:rules]
allow_scalar = True
unquoted_identifiers_policy = all
hanging_indents = False

[sqlfluff:layout:type:comma]
line_position = trailing


[sqlfluff:rules:capitalisation.literals]
capitalisation_policy = upper

[sqlfluff:rules:capitalisation.keywords]
# Inconsistent capitalisation of keywords
capitalisation_policy = upper

[capitalisation.identifiers]
# Inconsistent capitalisation of unquoted identifiers.
extended_capitalisation_policy = lower
unquoted_identifiers_policy = all

[sqlfluff:rules:capitalisation.functions]
extended_capitalisation_policy = upper

[sqlfluff:rules:references.quoting]
ignore_words=time,date
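
A side note on the configuration above: the `[capitalisation.identifiers]` section lacks the `sqlfluff:rules:` prefix that the other rule sections use. My understanding (an assumption, not verified against 2.3.5) is that SQLFluff only reads sections named `sqlfluff` or starting with `sqlfluff:`, so that section may be ignored. The sketch below uses only the Python standard library to list such sections; `unprefixed_sections` and `SAMPLE_CONFIG` are hypothetical names for illustration.

```python
import configparser


def unprefixed_sections(config_text):
    """Return section names that are neither 'sqlfluff' nor start with 'sqlfluff:'."""
    # Interpolation is disabled so that '%' characters in values cannot raise errors.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return [
        name
        for name in parser.sections()
        if name != "sqlfluff" and not name.startswith("sqlfluff:")
    ]


# Hypothetical excerpt of the configuration in this report.
SAMPLE_CONFIG = """\
[sqlfluff]
templater = jinja

[sqlfluff:rules:capitalisation.keywords]
capitalisation_policy = upper

[capitalisation.identifiers]
extended_capitalisation_policy = lower
"""

print(unprefixed_sections(SAMPLE_CONFIG))  # prints ['capitalisation.identifiers']
```

This does not change the main report: the RF02 error shows that "SECOND" is parsed as a column reference regardless of the capitalisation settings.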

Are you willing to work on and submit a PR to address the issue?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct