calogica / dbt-expectations

Port(ish) of Great Expectations to dbt test macros

Home Page:https://calogica.github.io/dbt-expectations/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

[BUG] expect_column_most_common_value_to_be_in_set handling of ties

lookslikeitsnot opened this issue · comments

Is this a new bug in dbt-expectations?

  • I believe this is a new bug in dbt-expectations
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When most common values are non-unique, running expect_column_most_common_value_to_be_in_set only works for 1 of the most common values.

Expected Behavior

expect_column_most_common_value_to_be_in_set should match all most common values

Steps To Reproduce

in dbt-expectations/integration_tests/models/schema_tests/schema.yml, add following column test for model data_test column col_string_b.:

          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['ab']
              top_n: 1

Although all values are distinct in that model column (i.e. all column values are "the most common"), any value in value_set different from 'b' (the first result naturally ordered) will fail.

Relevant log output

with value_counts as (
    select
        col_string_b as value_field,
        count(*) as value_count
    from dbt_expectations_integration_tests.data_test
    group by col_string_b
),
value_counts_ranked as (
    select
        *,
        row_number() over(order by value_count desc) as value_count_rank
    from
        value_counts
),
value_count_top_n as (
    select
        value_field
    from
        value_counts_ranked
    where
        value_count_rank = 1
),
set_values as (
    select
        'ab' as value_field
),
unique_set_values as (
    select distinct value_field
    from
        set_values
),
validation_errors as (
    -- values from the model that are not in the set
    select
        value_field
    from
        value_count_top_n
    where
        value_field not in (select value_field from unique_set_values)
)
select *
from validation_errors

Environment

- OS: Manjaro Linux 22.0.2
- Python: 3.10.9
- dbt: 1.5.0
- dbt-expectations: 0.8.5

Which database adapter are you using with dbt?

Postgres

Note: dbt-expectations currently does not support database adapters other than the ones listed below.

  • Postgres
  • Snowflake
  • BigQuery

Additional Context

Great expectations has an additional ties_okay argument for partial matches. We could also add support for that.

So following tests would succeed

            # Expect error if not all most common values are in the set
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['b']
              top_n: 1
              config:
                error_if: "=0"
                warn_if: "<3"
            # Expect success if not all most common values are in the set but ties_okay is set
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['b']
              top_n: 1
              ties_okay: true
            # Expect error if none of the most common values are in the set and ties_okay is set
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['invalid_value']
              top_n: 1
              ties_okay: true
              config:
                error_if: "=0"
                warn_if: "<4"
            # Expect success if not all most common values are in the set but ties_okay is set
            # and the set contains extra values 
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['b', 'invalid_value']
              top_n: 1
              ties_okay: true
            # Expect success if not all most common values are in the set but ties_okay is set
            # and value is not first one of the column naturally ordered
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['ab']
              top_n: 1
              ties_okay: true
            # Expect success if all most common values are in the set
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['b', 'ab', 'abc', 'abcd']
              top_n: 1
            # Expect success if all most common values are in the set 
            # and the set contains extra values 
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['b', 'ab', 'abc', 'abcd', 'invalid_value']
              top_n: 1
            # Expect error if none of the most common values are in the set 
            # and the set contains extra values 
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['invalid_value1', 'invalid_value2', 'invalid_value3', 'invalid_value4', 'invalid_value5']
              top_n: 1
              config:
                error_if: "=0"
                warn_if: "<4"