datafold / data-diff

Compare tables within or across databases

Home Page: https://docs.datafold.com

data-diff fails to identify unique composite primary keys

gabrielleberanger opened this issue

Describe the bug

I am running data-diff on a model having a composite primary key, configured as such:

version: 2

models:
  - name: bse_metabase__permissions_group_membership
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns: ['user_id', 'group_id']

However, the data-diff run fails in the _test_duplicate_keys method, even though the table contains no duplicate (user_id, group_id) combinations.

Suggested resolution

data-diff checks for duplicates using CONCAT(<field_1>, <field_2>), but CONCAT(<field_1>, '-', <field_2>) would be more appropriate.

Let's assume that we have the following rows:

  • Row 1: user_id = 1 x group_id = 11
  • Row 2: user_id = 11 x group_id = 1

In the above example:

  • CONCAT(<field_1>, <field_2>) returns 111 for both rows (hence the error raised)
  • CONCAT(<field_1>, '-', <field_2>) returns 1-11 and 11-1, which is more robust
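The collision can be reproduced outside the database. The following is a minimal Python sketch of the idea, not data-diff's actual implementation: the helper name find_duplicate_keys and its parameters are hypothetical and only mimic a duplicate check built on concatenated key columns.

```python
from collections import Counter


def find_duplicate_keys(rows, key_columns, separator=""):
    """Return the concatenated keys that occur more than once.

    Hypothetical helper: it mimics a CONCAT-based duplicate check,
    it is not the code used by data-diff.
    """
    counts = Counter(
        separator.join(str(row[col]) for col in key_columns) for row in rows
    )
    return sorted(key for key, n in counts.items() if n > 1)


# The two rows from the example above: distinct composite keys.
rows = [
    {"user_id": 1, "group_id": 11},
    {"user_id": 11, "group_id": 1},
]
key_columns = ["user_id", "group_id"]

# Without a separator both rows collapse to "111": a false duplicate.
naive = find_duplicate_keys(rows, key_columns)

# With a separator the keys become "1-11" and "11-1": no duplicate.
separated = find_duplicate_keys(rows, key_columns, separator="-")

print(naive)      # ['111']
print(separated)  # []
```

Note that a separator only removes the ambiguity when the key values cannot contain the separator themselves. For text columns, ('a-', 'b') and ('a', '-b') would still both concatenate to a--b, so comparing the columns as a tuple (for example, grouping by both columns) is the more general option.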

@gabrielleberanger Do you still want to work on a pull request for this? I remember us talking about it in the dbt community Slack!

@sungchun12 yes, I do!
I'm starting to work on it right now, so you should receive a PR from me by the end of this week.

Lovely, look forward to reviewing it when it's ready!