Incremental Snapshot Feature

Question

NiyiOdumosu opened this issue 2 years ago

Hello @jpechane and @gunnarmorling, I have been reading the Debezium documentation on the incremental snapshot feature for SQL Server. The documentation says that the feature is incubating. I wanted to know if either of you is working on that feature and, if so, what the expected timeline for delivery is. If you are not working on that feature, can you point me to the engineers who are developing it? Your help is greatly appreciated!

Jiri Pechanec · Answer 1 · Fri Feb 11 2022 16:00:32 GMT+0800 (China Standard Time)

@NiyiOdumosu Hi, the feature is already done. The incubating marker is more of a warning that an API or an implementation might change, but it is probably no longer needed.

Niyi Odumosu · Answer 2 · Sat Feb 12 2022 02:07:00 GMT+0800 (China Standard Time)

Thanks @jpechane ! I will try to test this feature out in a POC. Mind if I reach out to you if I have any questions?

Gunnar Morling · Answer 3 · Sat Feb 12 2022 02:21:24 GMT+0800 (China Standard Time)

Please bring any questions to our mailing list: https://groups.google.com/g/debezium. That way, it will get the most eyeballs, and future readers will benefit from any replies, too. Thanks!

Niyi Odumosu · Answer 4 · Wed Mar 09 2022 22:36:07 GMT+0800 (China Standard Time)

Hello!
In the past, we have used the DBZ SQL Server connector to migrate large volumes of historical data. If a table had billions of records and the connector failed while migrating, we would have to restart the connector, and it would produce data from the beginning of the table again. We hoped that the incremental snapshot feature would solve this by taking snapshots of the table and resuming production from the last committed offset.
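To make the expected behavior concrete, here is a minimal sketch of a chunked snapshot that commits its position after each chunk and resumes from the last committed key. It is only an illustration of the general idea, not Debezium's implementation; the names `read_chunk`, `snapshot`, `chunk_size`, and `fail_after` are hypothetical, and the table is assumed to be a list of rows sorted by primary key.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, str]  # (primary key, payload); hypothetical row shape


def read_chunk(table: List[Row], after_key: Optional[int], chunk_size: int) -> List[Row]:
    """Return up to chunk_size rows whose key is greater than after_key."""
    rows = [r for r in table if after_key is None or r[0] > after_key]
    return rows[:chunk_size]


def snapshot(
    table: List[Row],
    offset: Optional[int],
    chunk_size: int,
    sink: List[Row],
    fail_after: Optional[int] = None,
) -> Optional[int]:
    """Emit rows chunk by chunk into sink and return the last committed key.

    fail_after simulates a crash once that many rows were emitted in this run.
    The offset only advances after a whole chunk has been emitted, so a crash
    in the middle of a chunk causes that chunk to be re-read on restart.
    """
    emitted = 0
    while True:
        chunk = read_chunk(table, offset, chunk_size)
        if not chunk:
            return offset  # nothing left to read
        for row in chunk:
            if fail_after is not None and emitted >= fail_after:
                return offset  # simulated crash: current chunk is not committed
            sink.append(row)
            emitted += 1
        offset = chunk[-1][0]  # commit the highest key of the finished chunk


# Simulate a failure mid-migration, then a restart from the committed offset.
table = [(i, f"row-{i}") for i in range(1, 11)]
produced: List[Row] = []
committed = snapshot(table, None, chunk_size=3, sink=produced, fail_after=7)
final = snapshot(table, committed, chunk_size=3, sink=produced)
duplicates = len(produced) - len({key for key, _ in produced})
print(f"committed before restart={committed}, final={final}, duplicates={duplicates}")
```

In this sketch the restart re-emits only the rows of the chunk that was in flight when the failure happened, so the number of duplicates stays below the chunk size instead of growing with the size of the table.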

Testing Scenario
We simulated a failure by deleting the connector in the middle of the migration. Then we restarted the connector, hoping it would pick up where it left off. Instead, it is producing the rows from the beginning of the table again. Below are the stats. Keep in mind there are 1,408,376 records in the table.

How can we eliminate or at least minimize the duplicates so that the connector doesn't reproduce all the data it produced before it failed? Are there any configurations that we can modify to assist with this?

Any help would be greatly appreciated!
@jpechane @gunnarmorling

Jiri Pechanec · Answer 5 · Tue Mar 15 2022 16:06:49 GMT+0800 (China Standard Time)

@NiyiOdumosu Hi, could you please move the issue to Jira or to the chat/mailing list? We do not use GitHub issues. Thanks. Also, while you are at it, please provide the logs and the offsets value from before the connector restart.

Niyi Odumosu · Answer 6 · Tue Mar 15 2022 21:10:52 GMT+0800 (China Standard Time)

Hey @jpechane, I have posted this question on the Google Group twice and have not received a response. I was not aware there was a Jira backlog I could post it to. Can you please send me the Jira link?