datafold / data-diff

Compare tables within or across databases

Home Page: https://docs.datafold.com

Upgrading from v0.8.5rc1 to v0.11.1 slows down the data-diff process

saurasingh opened this issue

Describe the bug
We use data-diff to find the differences between a table in the source, which is MySQL, and the same table in the destination, which is Snowflake.

Here is how we do it.

We create a source connector:

    source_con = data_diff.connect_to_table(
        source_connection,
        table_name=f"{table}",
        key_columns=f"{col}",
        thread_count=thread_count,
    )

We create a destination connector:

    target_con = data_diff.connect_to_table(
        target_connection,
        f"{target_db.upper()}.{target_schema.upper()}.{table.upper()}",
        f"{col.upper()}",
        thread_count=thread_count,
    )

Then we call data_diff.diff_tables as follows:

    diff_table = data_diff.diff_tables(
        source_con,
        target_con,
        bisection_factor=bisection_factor,
        threaded=True,
        max_threadpool_size=max_threadpool_size,
        max_key=max_id,
    )

With version v0.8.5rc1 this works fine and goes into bisection only if there is a difference. When we upgrade to any other version (the last version we upgraded to is v0.11.1), it runs really slowly: it goes into bisection to check for differences but does not find any.
    [2024-04-04, 22:04:34 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 39/100, key-range: (34680062)..(34696282), size <= 1621789
    [2024-04-04, 22:04:35 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 40/100, key-range: (34696282)..(34712502), size <= 1621789
    [2024-04-04, 22:04:35 UTC] {hashdiff_tables.py:260} INFO - . . Diff found 0 different rows.
    [2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 41/100, key-range: (34712502)..(34728722), size <= 1621789
    [2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:260} INFO - . . Diff found 0 different rows.
    [2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 42/100, key-range: (34728722)..(34744942), size <= 1621789
    [2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:260} INFO - . . Diff found 0 different rows.
    [2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 43/100, key-range: (34744942)..(34761162), size <= 1621789
    [2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:260} INFO - . . Diff found 0 different rows.
    [2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 44/100, key-range: (34761162)..(34777382), size <= 1621789
    [2024-04-04, 22:04:49 UTC] {hashdiff_tables.py:260} INFO - . . Diff found 0 different rows.
    [2024-04-04, 22:04:49 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 45/100, key-range: (34777382)..(34793602), size <= 1621789
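To see where the time goes in a log like the one above, it may help to turn the timestamps into per-segment gaps. The following is a minimal sketch, not part of data-diff: `segment_gaps` and `LOG_ENTRY` are hypothetical names, and the regular expression assumes the exact log layout shown in the excerpt (timestamp, source location, level, then "Diffing segment N/M"). The sample text is copied from the last entries of that excerpt.

```python
import re
from datetime import datetime

# Matches only "Diffing segment" entries; "Diff found" entries are skipped.
# Assumes the log layout shown in the excerpt above.
LOG_ENTRY = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}, \d{2}:\d{2}:\d{2}) UTC\] "
    r"\{[^}]+\} \w+ - [. ]*Diffing segment (?P<seg>\d+)/(?P<total>\d+)"
)


def segment_gaps(log_text):
    """Return (segment, seconds until the next segment started) pairs."""
    starts = [
        (int(m["seg"]), datetime.strptime(m["ts"], "%Y-%m-%d, %H:%M:%S"))
        for m in LOG_ENTRY.finditer(log_text)
    ]
    return [
        (seg, (next_ts - ts).total_seconds())
        for (seg, ts), (_, next_ts) in zip(starts, starts[1:])
    ]


SAMPLE = (
    "[2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 43/100, key-range: (34744942)..(34761162), size <= 1621789 "
    "[2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:260} INFO - . . Diff found 0 different rows. "
    "[2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 44/100, key-range: (34761162)..(34777382), size <= 1621789 "
    "[2024-04-04, 22:04:49 UTC] {hashdiff_tables.py:260} INFO - . . Diff found 0 different rows. "
    "[2024-04-04, 22:04:49 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 45/100, key-range: (34777382)..(34793602), size <= 1621789"
)

for segment, seconds in segment_gaps(SAMPLE):
    print(f"segment {segment}: {seconds:.0f}s until next segment")
```

The function works whether the entries are on one line or one per line, because it anchors on each bracketed timestamp rather than on line breaks. Timestamps in this log have one-second resolution, so short segments show up as 0s.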

As a result, the data-diff task runs for a couple of hours on the new version, whereas on the older version it finished in only a few minutes.
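For readers unfamiliar with the expected behavior ("goes into bisection only if there is a difference"), here is a toy illustration of the general checksum-and-bisect idea. It is a sketch under stated assumptions, not data-diff's actual implementation: tables are modeled as in-memory dicts with integer keys, and `checksum`, `diff_keys`, and `threshold` are hypothetical names. Only the `bisection_factor` name is borrowed from the call above.

```python
import hashlib


def checksum(rows, lo, hi):
    """Hash all rows whose key satisfies lo <= key < hi, in key order."""
    digest = hashlib.md5()
    for key in sorted(k for k in rows if lo <= k < hi):
        digest.update(f"{key}:{rows[key]}".encode())
    return digest.hexdigest()


def diff_keys(source, target, lo, hi, bisection_factor=4, threshold=8):
    """Return the keys in [lo, hi) whose rows differ between the two tables."""
    if bisection_factor < 2:
        raise ValueError("bisection_factor must be at least 2")
    if checksum(source, lo, hi) == checksum(target, lo, hi):
        return []  # checksums match: no need to descend into this range
    if hi - lo <= threshold:
        # Small range: compare the rows directly.
        return [k for k in range(lo, hi) if source.get(k) != target.get(k)]
    step = -(-(hi - lo) // bisection_factor)  # ceiling division
    differing = []
    for start in range(lo, hi, step):
        end = min(start + step, hi)
        differing += diff_keys(source, target, start, end, bisection_factor, threshold)
    return differing


# Toy data (assumption): 1000 rows, one changed and one missing in the target.
source = {k: f"row-{k}" for k in range(1000)}
target = dict(source)
target[417] = "changed"
del target[900]

print(diff_keys(source, target, 0, 1000))
```

In this sketch, identical tables cost a single checksum comparison, and sub-ranges are only visited when their parent range's checksums disagree. That is the behavior the report describes for v0.8.5rc1.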

Hi @saurasingh,

Thank you for trying out data-diff and for taking the time to open this issue. We made a hard decision to sunset the data-diff package and won't provide further development or support. Diffing functionality will continue to be available in Datafold Cloud. We have completely rewritten the diffing engine in the cloud over the past few months and have solved the fundamental issues with the original algorithm used in the data-diff package. Feel free to take it for a trial or contact us at support@datafold.com if you have any questions.

-Gleb