calogica / dbt-expectations

Port(ish) of Great Expectations to dbt test macros

Home Page: https://calogica.github.io/dbt-expectations/

[Feature Request] Improving Data Validation Output

oksanagorbachenko opened this issue · comments

Is your feature request related to a problem? Please describe.

The dbt-expectations library offers a comprehensive suite of data validation tests, but it follows a rigid template for presenting results. The core issue is the absence of detailed context when a test fails. For instance, in tests like `expect_column_values_to_not_be_null`, the output simply reports `expression = false` for each failing row, with nothing to indicate which rows those are. To address this limitation, I propose an enhancement that allows users to specify a database column (for example, a unique identifier such as `id`) as an input parameter for the validation. This addition would substantially improve our ability to identify the specific rows causing issues, making troubleshooting far more effective.
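To make the gap concrete, this is roughly the query such a test compiles to today (a simplified sketch; the `orders` model and the expression are illustrative):

```sql
-- Simplified shape of a compiled expression_is_true test today;
-- model name and expression are illustrative.
with grouped_expression as (
    select
        amount >= 0 as expression
    from analytics.orders
),
validation_errors as (
    select *
    from grouped_expression
    where not(expression = true)
)
select * from validation_errors
-- Result: one `expression = false` row per failure, with nothing
-- to indicate which records in `orders` actually failed.
```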

Describe the solution you'd like

The solution I seek involves extending the capabilities of the dbt-expectations library. Specifically, I recommend introducing a parameter that enables users to specify a database column as part of the validation process. When applied, this feature would result in output that includes the designated column's values for rows with failed tests. This enriched output would provide essential context, making it significantly easier to pinpoint the exact rows causing data discrepancies.
Since the main query is generated in the file expression_is_true.sql, the change would look something like this (the new `select_columns` parameter is the addition):

{% macro expression_is_true(model, expression, test_condition="= true", group_by_columns=None, row_condition=None, select_columns=None) %}
  with grouped_expression as (
    select
      {% if group_by_columns is not none %}
        {% for group_by_column in group_by_columns %}
          {{ group_by_column }} as col_{{ loop.index }},
        {% endfor %}
      {% endif %}
      {# new: emit the requested columns alongside the expression #}
      {% if select_columns is not none %}
        {{ select_columns | join(", ") }},
      {% endif %}
      {{ dbt_expectations.truth_expression(expression) }}
...
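With that change, a test invoked with, say, `select_columns=['id']` might compile to something like the following sketch (names are illustrative):

```sql
-- Hypothetical compiled output once select_columns=['id'] is passed
-- through; table and column names are illustrative.
with grouped_expression as (
    select
        id,
        amount >= 0 as expression
    from analytics.orders
),
validation_errors as (
    select *
    from grouped_expression
    where not(expression = true)
)
select * from validation_errors
-- Each failing row now carries its `id`, so the offending records
-- can be located in (or joined back to) the source directly.
```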

Describe alternatives you've considered

One potential alternative would involve manual post-processing of the dbt-expectations output to associate failed tests with the corresponding data. However, this manual approach is time-consuming and prone to errors. The optimal solution is to incorporate this feature directly into the library, ensuring a streamlined validation process and more accurate identification of data issues.
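For illustration, the manual workaround amounts to re-deriving the failing condition by hand, because the persisted failures from an expression test contain only the boolean `expression` column and offer nothing to join on (a sketch; names are illustrative):

```sql
-- Manual post-processing sketch: negate the test expression and
-- re-run it against the source model yourself. Error-prone, since
-- this copy can drift from the expression the test actually uses.
select
    o.id,
    o.amount
from analytics.orders o
where not (o.amount >= 0)  -- hand-negated copy of the test expression
```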

Additional context

Furthermore, it would be beneficial to extend the customization of validation reports even further, for example with the ability to dynamically add fields from related tables to the output. Such customization would enable more informative, context-rich reports, supporting in-depth data analysis and troubleshooting in complex data projects. This level of flexibility in report generation would significantly increase the utility of dbt-expectations and contribute to comprehensive data quality assurance in our project.
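If `select_columns` accepted arbitrary SQL expressions rather than bare column names, one purely hypothetical way to pull in fields from related tables would be a correlated subquery; the compiled test might then look like this, with all names illustrative:

```sql
-- Hypothetical: an entry in select_columns such as
-- "(select c.email from analytics.customers c
--    where c.customer_id = o.customer_id)" would surface related
-- context in the failure output. Names are illustrative.
with grouped_expression as (
    select
        o.id,
        (select c.email
         from analytics.customers c
         where c.customer_id = o.customer_id) as customer_email,
        o.amount >= 0 as expression
    from analytics.orders o
)
select *
from grouped_expression
where not(expression = true)
```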

Yep - one of the super useful features of GX itself is that you can get really useful context as to what is wrong with your data. I'm surprised this hasn't been raised more ... this is needed if you are using the tool for data reconciliation and not just as a guarding test for generating new models. dbt allows persistence of failures, so would be great if we could try to customise what a failure looks like. In some cases, returning the raw columns is not desirable, however, as they could contain PII data, so customisation features would be handy.
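On the PII point, one possible mitigation (again assuming `select_columns` accepted expressions) would be to pass a masked or hashed expression instead of the raw column, so failures stay traceable without persisting sensitive values:

```sql
-- Sketch: hash the identifying column so persisted failures can be
-- matched back (by recomputing the hash) without storing raw PII.
-- Names are illustrative; md5 support varies by warehouse.
select
    md5(cast(email as varchar)) as email_hash,
    amount >= 0 as expression
from analytics.orders
```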