bykof / cdc-postgresql-clickhouse

Project Title

Change Data Capture (CDC) with PostgreSQL and ClickHouse.

Getting Started

This project sets up a real-time data pipeline utilizing Change Data Capture (CDC) to stream changes from a PostgreSQL database to a ClickHouse database through Apache Kafka. Using Debezium to capture and stream database changes, and Kafka Connect to sink the data into ClickHouse, this pipeline allows for efficient and reliable data synchronization and analytics.
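
For context, Debezium runs as a source connector inside the Kafka Connect worker, and the ClickHouse sink runs as a second connector in the same worker. The repository ships its own connector configuration, but a generic registration of a Debezium PostgreSQL source against the Kafka Connect REST API looks roughly like the sketch below; the connector name, credentials, and table list here are illustrative assumptions, and the repo's actual config additionally routes both tables into the single demo_data topic.

# Illustrative sketch only -- the repository provides its own connector config.
# Registers a Debezium PostgreSQL source connector via the Kafka Connect REST API.
curl -s -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "postgres-source",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "database.hostname": "postgres",
      "database.port": "5432",
      "database.user": "postgres",
      "database.password": "postgres",
      "database.dbname": "postgres",
      "plugin.name": "pgoutput",
      "table.include.list": "public.demo_table1,public.demo_table2",
      "topic.prefix": "demo"
    }
  }'
# Note: "topic.prefix" is the Debezium 2.x property; older versions use "database.server.name".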

Prerequisites

  • Docker
  • Docker Compose
  • Git

Services

The project consists of the following services:

Required:

  • Postgres
  • ClickHouse
  • Kafka
  • Zookeeper
  • Kafka-Connect

Optional:

  • Kowl (Apache Kafka UI)
  • Adminer (PostgreSQL UI)

Usage

1. Clone the repository and start the Docker containers

git clone https://github.com/leartbeqiraj1/cdc-postgresql-clickhouse.git
cd cdc-postgresql-clickhouse/
docker-compose up -d

2. Check that all components are up and running

docker compose ps
NAME                SERVICE                    STATUS                   PORTS
postgres            postgres                   running             0.0.0.0:5432->5432/tcp, :::5432->5432/tcp
clickhouse-server   clickhouse                 running             9000/tcp, 0.0.0.0:8123->8123/tcp, :::8123->8123/tcp, 9009/tcp
kafka               kafka                      running             0.0.0.0:9092->9092/tcp, :::9092->9092/tcp, 0.0.0.0:9101->9101/tcp, :::9101->9101/tcp
zookeeper           zookeeper                  running             2888/tcp, 0.0.0.0:2181->2181/tcp, :::2181->2181/tcp, 3888/tcp
kafka-connect       kafka-connect              running (healthy)   0.0.0.0:8083->8083/tcp, :::8083->8083/tcp, 9092/tcp
kowl                kowl                       running             0.0.0.0:8080->8080/tcp, :::8080->8080/tcp
adminer             adminer                    running             0.0.0.0:7775->8080/tcp, :::7775->8080/tcp
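
If any container is not in the running state, or kafka-connect has not yet turned healthy, the service logs are the first place to look:

# Follow the logs of a single service, e.g. the Kafka Connect worker
docker compose logs -f kafka-connect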

3. Verify PostgreSQL data

Log in to the Postgres UI (Adminer, at http://localhost:7775) and verify that demo_table1 and demo_table2 exist, each with 2 rows inserted. Username and password are both "postgres".
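
You can also check the seeded rows from the command line; a minimal sketch, assuming the database is named postgres and the tables live in the public schema:

# Count the seeded rows directly inside the postgres container
docker exec -it postgres psql -U postgres -c "SELECT COUNT(*) FROM demo_table1;"
docker exec -it postgres psql -U postgres -c "SELECT COUNT(*) FROM demo_table2;"

Each query should return a count of 2.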

4. Verify connection between Debezium Connector and Kafka

Open the Kowl UI (http://localhost:8080) and check the Topics view to verify that the Debezium connector has successfully published the PostgreSQL data to a Kafka topic. There should be a topic named demo_data containing 4 messages (all 4 rows from the PostgreSQL tables).
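
If you prefer the command line over Kowl, a sketch using the console consumer bundled in the Kafka container (tool names assume a Confluent-style image):

# Read the 4 snapshot messages from the demo_data topic, then exit
docker exec -it kafka kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic demo_data \
  --from-beginning \
  --max-messages 4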

5. Verify connection between ClickHouseSink Connector and Kafka

In Kowl, navigate to Consumer Groups and verify that the ClickHouseSink connector is a stable consumer and has successfully subscribed to the demo_data topic.
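
The same check can be done without the UI; a sketch, where the sink's consumer group and connector names depend on the repo's configuration, so list them first before describing:

# List consumer groups, then describe the one belonging to the ClickHouse sink
docker exec -it kafka kafka-consumer-groups --bootstrap-server localhost:9092 --list
docker exec -it kafka kafka-consumer-groups --bootstrap-server localhost:9092 \
  --describe --group <sink-consumer-group>

# The Kafka Connect REST API also reports connector state
curl -s http://localhost:8083/connectors
curl -s http://localhost:8083/connectors/<connector-name>/status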

6. Verify ClickHouse data

Open the ClickHouse UI (the HTTP interface is exposed on port 8123) and verify that the PostgreSQL data has already been pushed to ClickHouse by the ClickHouseSink connector, by executing the queries below:

SELECT * FROM demo_table1_mv FINAL;
SELECT * FROM demo_table2_mv FINAL;
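
The same queries can be run with clickhouse-client inside the container; a sketch assuming the *_mv tables live in the default database:

# Query the *_mv tables from inside the clickhouse-server container
docker exec -it clickhouse-server clickhouse-client --query "SELECT * FROM demo_table1_mv FINAL"
docker exec -it clickhouse-server clickhouse-client --query "SELECT * FROM demo_table2_mv FINAL"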

Testing

1. Make new inserts or updates in the existing PostgreSQL tables

Go to the Postgres UI and insert new rows or update existing ones.
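
If you prefer the command line, a sketch of an update (the column names here are hypothetical; adjust them to demo_table1's actual schema):

# Hypothetical column names -- adjust to the actual schema of demo_table1
docker exec -it postgres psql -U postgres \
  -c "UPDATE demo_table1 SET name = 'updated value' WHERE id = 1;"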

2. Go back to ClickHouse

Open the ClickHouse UI again; you should see your changes already applied.

Conclusion

Inspired by a blog post on ClickHouse's official website (ClickHouse PostgreSQL Change Data Capture (CDC) - Part 1), I decided to replicate the setup. Due to the absence of extensive blogs, posts, or documentation on how to set up this particular pipeline locally, I was motivated to create this documentation. Through this effort, I hope to contribute to the community's shared knowledge and facilitate a smoother setup process for this CDC pipeline.
