debezium / debezium-design-documents

Incremental Snapshot Feature

NiyiOdumosu opened this issue · comments

Hello @jpechane and @gunnarmorling, I have been reading the Debezium documentation on the incremental snapshot feature for SQL Server. The documentation says that the feature is incubating. I wanted to know if either of you is working on that feature and, if so, what the expected timeline for delivery is. If you are not working on it, can you point me to the engineers who are developing it? Your help is greatly appreciated!

@NiyiOdumosu Hi, the feature is already done. The incubating marker is more of a warning that an API or an implementation might change, but it is probably no longer needed.

Thanks @jpechane ! I will try to test this feature out in a POC. Mind if I reach out to you if I have any questions?

Hello!
In the past, we have used the DBZ SQL Server connector to migrate large volumes of historical data. If a table had billions of records and the connector failed while migrating, we would have to restart the connector, and it would produce data from the beginning of the table again. We hoped that the incremental snapshot feature would solve this by taking snapshots of the table and resuming from the last committed offset.
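To make the hoped-for behavior concrete, here is a minimal Python sketch of a chunked snapshot that commits the last emitted key after every chunk and resumes from it after a failure. It is only an illustration of the idea, not Debezium's implementation: the names `read_chunk`, `snapshot`, `offsets`, and `last_key` are hypothetical, and the table is an in-memory list rather than a SQL Server table.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, str]  # (primary key, payload); hypothetical row shape


def read_chunk(table: List[Row], after_key: Optional[int], chunk_size: int) -> List[Row]:
    """Return up to chunk_size rows whose key is greater than after_key, in key order."""
    rows = sorted(table)
    if after_key is not None:
        rows = [row for row in rows if row[0] > after_key]
    return rows[:chunk_size]


def snapshot(
    table: List[Row],
    offsets: Dict[str, int],
    out: List[Row],
    chunk_size: int,
    fail_after_chunks: Optional[int] = None,
) -> None:
    """Emit the table chunk by chunk, recording the last emitted key in offsets.

    If fail_after_chunks is set, raise after that many chunks to simulate
    a connector failure in the middle of the migration.
    """
    chunks_done = 0
    while True:
        chunk = read_chunk(table, offsets.get("last_key"), chunk_size)
        if not chunk:
            return  # nothing left to emit
        out.extend(chunk)                   # "produce" the rows
        offsets["last_key"] = chunk[-1][0]  # commit progress after the chunk
        chunks_done += 1
        if fail_after_chunks is not None and chunks_done >= fail_after_chunks:
            raise RuntimeError("simulated connector failure")


table = [(i, f"row-{i}") for i in range(1, 11)]
offsets: Dict[str, int] = {}
produced: List[Row] = []

try:
    snapshot(table, offsets, produced, chunk_size=3, fail_after_chunks=2)
except RuntimeError:
    pass  # the "connector" died after two chunks

# Restart: the same offsets are reused, so reading continues after the last committed key.
snapshot(table, offsets, produced, chunk_size=3)
```

In this sketch the restart produces no duplicates only because the offsets survive the failure; if they were lost or reset, the second run would start again from the first row.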

Testing Scenario
We simulated a failure by deleting the connector in the middle of the migration. Then we restarted the connector, hoping it would pick up where it left off. Instead, it produces the rows from the beginning of the table again. Below are the stats. Keep in mind there are 1,408,376 records in the table.

[Image: screenshot of the test statistics]
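For anyone repeating this kind of test, one way to quantify the duplicates is to collect the primary keys of the produced records and compare the total count with the distinct count. The helper below is a hypothetical sketch; `duplicate_stats` and the sample keys are illustrative and are not taken from our actual run.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def duplicate_stats(keys: Iterable[Hashable]) -> Dict[str, int]:
    """Count total, distinct, and duplicate records from their primary keys."""
    counts = Counter(keys)
    total = sum(counts.values())
    distinct = len(counts)
    return {"total": total, "distinct": distinct, "duplicates": total - distinct}


# Illustrative keys only: rows 1 and 2 were produced twice.
stats = duplicate_stats([1, 2, 3, 1, 2])
print(stats)
```

A distinct count equal to the number of rows in the table, with a duplicate count of zero, would indicate that the connector resumed without reproducing earlier rows.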

How can we eliminate or at least minimize the duplicates so that the connector doesn't reproduce all the data it produced before it failed? Are there any configurations that we can modify to assist with this?

Any help would be greatly appreciated!
@jpechane @gunnarmorling

@NiyiOdumosu Hi, could you please move the issue to Jira or to the chat/mailing list? We do not use GitHub issues. Thanks. Also, when you do, please provide the logs and the offsets value from before the connector restart.

Hey @jpechane, I have posted this question on the Google Groups twice and have not received a response. I was not aware there was a Jira backlog I could post it to. Can you please send me the Jira link?