tzolov / spring-debezium-demos

Various experiments and samples with Debezium and Spring Framework.


Incremental Snapshots

Change Data Capture (CDC) is a generic mechanism for capturing changed rows from a database's transaction log and delivering them downstream with low latency. To solve the data synchronization problem, however, one also needs to replicate the full state of the database.

But transaction logs typically do not contain the full history of changes. The initial snapshot accomplishes this by scanning the database tables in a single read transaction. Depending on the database size, this can take hours to complete. Furthermore, until it completes, no new database changes can be streamed.

There are use cases that require high availability of the transaction log events so that databases stay as closely in-sync as possible. To facilitate such use-cases, Debezium includes an option to perform ad hoc snapshots called incremental snapshots.

Unlike the initial snapshot, the incremental snapshot captures tables in chunks rather than all at once, using a watermarking method to track the progress of the snapshot. Furthermore, the incremental snapshot captures the initial state of the tables without interrupting the streaming of transaction log events!

You can initiate an incremental snapshot at any time. The process can also be paused or interrupted at any time and later resumed from the point at which it was stopped.

Table 1. Comparison of the initial vs. incremental snapshot features.

  Snapshot feature                       Initial   Incremental
  ------------------------------------   -------   -----------
  Can be triggered at any time           NO        YES
  Can be paused and resumed              NO        YES
  Log event processing does not stall    NO        YES
  Preserves the order of history         YES       YES
  Does not use locks                     NO        YES

Debezium implements a special signal protocol for starting, stopping, pausing or resuming the incremental snapshots.
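Regardless of the channel used, each signal carries an id, a type, and an optional JSON data payload. A sketch of the payloads for the snapshot signal types (the ids and collection names are illustrative; consult the Debezium signaling documentation for the exact fields each connector supports):

```json
{"id": "signal-1", "type": "execute-snapshot",
 "data": {"data-collections": ["testDB.dbo.orders"], "type": "incremental"}}

{"id": "signal-2", "type": "pause-snapshot"}

{"id": "signal-3", "type": "resume-snapshot"}

{"id": "signal-4", "type": "stop-snapshot",
 "data": {"data-collections": ["testDB.dbo.orders"], "type": "incremental"}}
```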

To enable incremental snapshots, one needs to create a signaling table with a fixed data structure and then use one of the provided Signal Channels to start/stop/pause/resume the process. Use the debezium.properties.signal.enabled.channels property to select the desired signal channels. For example, here we enable both the source channel and the SpringSignalChannelReader channel reader:

debezium.properties.signal.enabled.channels=source,SpringSignalChannelReader

And then use an application event to trigger the snapshot:

import io.debezium.pipeline.signal.SignalRecord;
import org.springframework.context.ApplicationEventPublisher;

private ApplicationEventPublisher publisher = ...;

SignalRecord signal = new SignalRecord("ad-hoc-666", "execute-snapshot",
  "{\"data-collections\": [\"testDB.dbo.orders\", \"testDB.dbo.customers\", \"testDB.dbo.products\"],\"type\":\"incremental\"}",
  null);
this.publisher.publishEvent(signal);

Check the CdcDemoApplication.java for a sample implementation that leverages the SpringSignalChannelReader channel. Note that the incremental snapshot still requires the presence and use of the signaling table, and therefore the source channel as well.

The SNAPSHOT1 and SNAPSHOT2 integration tests leverage the source channel and use JdbcTemplate to write signals into the signaling table.

The exact SQL statement implemented by the stopIncrementalSnapshotFor() method varies among the different databases.
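Writing a signal through the source channel amounts to inserting a row into the signaling table. A plain-JDBC sketch of what the tests do via JdbcTemplate (the table name debezium_signal and the helper methods are illustrative; the id/type/data column layout follows the Debezium signaling-table convention):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

/** Sketch: write an execute-snapshot signal row into the Debezium
 *  signaling table. Names are illustrative, not taken from the repo. */
public class SignalWriter {

    /** Builds the JSON data payload for an incremental execute-snapshot signal. */
    public static String buildSignalData(List<String> dataCollections) {
        String collections = dataCollections.stream()
                .map(c -> "\"" + c + "\"")
                .collect(Collectors.joining(", "));
        return "{\"data-collections\": [" + collections + "], \"type\": \"incremental\"}";
    }

    /** Inserts the signal row; Debezium's source channel picks it up
     *  from the captured signaling table. */
    public static void sendExecuteSnapshot(Connection conn, List<String> dataCollections)
            throws SQLException {
        String sql = "INSERT INTO debezium_signal (id, type, data) VALUES (?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, UUID.randomUUID().toString()); // unique signal id
            ps.setString(2, "execute-snapshot");
            ps.setString(3, buildSignalData(dataCollections));
            ps.executeUpdate();
        }
    }
}
```

A stop-snapshot signal is sent the same way, with the type column set to "stop-snapshot".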

Exactly-Once Delivery

By default, Debezium provides at-least-once event delivery semantics. This means Debezium guarantees that every single change will be delivered and that no change event is missed or skipped. However, in the case of failures, restarts, or DB connection drops, the same event can be delivered more than once.

Exactly-once delivery (or semantics) provides a stronger guarantee: every single message will be delivered, and at the same time there will be no duplicates; every single message is delivered exactly once.

What are the options to ensure exactly-once semantics (EOS) for the records produced by Debezium?

For a fully fledged Debezium deployment with a dedicated Kafka Connect cluster, one can leverage the exactly-once delivery support already provided by Kafka Connect. Kafka Connect is a standalone distributed system, deployed and managed in its own cluster. You can find more about this approach here.

For embedded Debezium Engine integrations, such as this Spring Boot Debezium integration, that run directly in the application space, the Kafka Connect EOS support is not applicable. To ensure exactly-once delivery for such applications, users have to implement their own idempotent handlers or message deduplication.

You can find here a couple of samples and tests that illustrate how such deduplication can be implemented.

In general, any de-duplication implementation requires (1) a unique transaction ID and (2) an efficient caching mechanism for it.

The PostgresCdcDemoApplication.java sample uses the Postgres Log Sequence Number (LSN) as a CDC transaction ID. Furthermore, debezium.properties.transforms.flatten.add.headers=lsn is used to assign the lsn to the message header.

Note
the lsn is Postgres-specific. For MySQL one can opt for pos, and for SQL Server for change_lsn and commit_lsn instead. For other connectors, consult the Debezium connector documentation.

Next, the sample uses a Bloom filter to test whether an LSN has already been received. A Bloom filter is a very efficient, low-footprint probabilistic data structure used to test whether an element is a member of a set. It guarantees zero false-negative answers: if an LSN has never been received, the membership test will always return false. Although a Bloom filter can produce false-positive answers, it can be configured so that their number is very low. In addition to the Bloom filter, we use a simple Java Set<LSN> to filter out the false-positive cases.

One can play with different Bloom filter implementations. Here we opted for Guava Bloom filters, but the eosapp/bloom folder contains two additional experimental implementations. You can also experiment with a different LSN caching mechanism instead of the simple Set, and implement range or time expiration strategies to improve performance.
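The two-tier check described above can be sketched in plain Java. A hand-rolled Bloom filter stands in here for the Guava implementation used by the sample; the sizing, hash mixing, and class name are illustrative:

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Sketch of LSN de-duplication: a Bloom filter as a fast first check,
 *  backed by an exact Set to weed out false positives. */
public class LsnDeduplicator {
    private final BitSet bits;       // Bloom filter bit array
    private final int size;          // number of bits
    private final int hashes;        // number of hash functions
    private final Set<Long> seen = new HashSet<>(); // exact store for FP confirmation

    public LsnDeduplicator(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    /** i-th hash of the LSN, derived by mixing with two odd constants. */
    private int index(long lsn, int i) {
        long h = lsn * 0x9E3779B97F4A7C15L + i * 0xC2B2AE3D27D4EB4FL;
        h ^= (h >>> 33);
        return (int) Math.floorMod(h, (long) size);
    }

    /** May return a false positive, but never a false negative. */
    private boolean mightContain(long lsn) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(lsn, i))) return false;
        }
        return true;
    }

    private void put(long lsn) {
        for (int i = 0; i < hashes; i++) bits.set(index(lsn, i));
    }

    /** Returns true if this LSN was already processed; records it otherwise. */
    public boolean isDuplicate(long lsn) {
        if (mightContain(lsn) && seen.contains(lsn)) return true;
        put(lsn);
        seen.add(lsn);
        return false;
    }
}
```

The Bloom filter answers the common "never seen" case cheaply; only when it says "maybe" does the exact Set get consulted, which is where range or time expiration strategies could be plugged in.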

The PostgresEosTest.java integration test implements an end-to-end test with different debezium.offsetCommitPolicy values. For example, one can observe that with debezium.offsetCommitPolicy=ALWAYS duplications are almost never observed! This comes at a cost to throughput, as it requires storing the CDC offset on every transaction.

Another approach to handling duplications is the Spring Integration (SI) idempotent receiver. The idempotent receiver uses a MessageSelector to extract the duplication-check ID and is backed by a MetadataStore that keeps the processed IDs. The PostgresEos2Test.java test explores this approach.

Signaling and Notifications

The Debezium signaling mechanism provides a way to modify the behavior of a connector, or to trigger a one-time action, such as initiating an ad hoc snapshot of a table.

The Debezium notifications provide a mechanism to obtain status information about the connector. Notifications can be sent to the configured channels.

The Spring Debezium-Signals project implements Spring Boot integrations for the signaling and notifications:

  • SpringNotificationChannel - Debezium notification integration that wraps the io.debezium.pipeline.notification.Notification signals into Spring Application Events.

  • SpringSignalChannelReader - Debezium channel integration that allows wrapping and sending the io.debezium.pipeline.signal.SignalRecord signals as Spring Application Events.

For a complete example check the CdcDemoApplication.java sample.
