A provider-agnostic framework to evaluate ordinary Change Data Capture (CDC) features, comprising:
- A high-level evaluation plan.
- SQL statements to set up source and destination DB/DW environments.
- SQL statements to fully ingest CSV files into the source tables.
- Python scripts to manage ad-hoc chunks of data in the source tables.
The framework is intended to allow users to evaluate replication/streaming capabilities using the tools of their choice.
Coverage Report
File | Stmts | Miss | Cover |
---|---|---|---|
src/cdc_eval | | | |
cdc_eval_cli.py | 31 | 0 | 100% |
kaggle_online_retail_ii_uci.py | 118 | 0 | 100% |
TOTAL | 149 | 0 | 100% |
- 1. Change Data Capture Evaluation Plan
- 2. Environment setup
- 3. Usage instructions
- 4. How to contribute
The high-level CDC evaluation plan comprises 5 major steps:
- Load big chunks of data into the source database before starting any replication/streaming jobs.
- Start the replication/streaming jobs using the CDC tool of your choice.
- [Optional] Wait for the CDC tool to stream all existing data (aka take an initial snapshot).
- Insert/update/delete records into/from the source tables using the Python scripts provided by the present framework.
- Monitor the CDC tool to make sure it is properly streaming all data changes.
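The final monitoring step can be complemented with a simple check outside the CDC tool. The following is a minimal sketch, not part of the framework: it compares row counts for one table on both sides. The helper names `row_count` and `in_sync`, the two-column table layout, and the in-memory SQLite databases standing in for the real source and destination are all assumptions made for illustration.

```python
import sqlite3


def row_count(conn, table):
    """Return the number of rows in `table` for a DB-API connection."""
    cursor = conn.cursor()
    # Table names cannot be bound as parameters, so only pass trusted names.
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    return cursor.fetchone()[0]


def in_sync(source_conn, destination_conn, table):
    """Return True when both sides hold the same number of rows."""
    return row_count(source_conn, table) == row_count(destination_conn, table)


# In-memory SQLite databases stand in for the real source and destination.
source = sqlite3.connect(":memory:")
destination = sqlite3.connect(":memory:")
for conn in (source, destination):
    conn.execute("CREATE TABLE transactions (invoice TEXT, quantity INTEGER)")

# A change lands on the source first (made-up sample values).
source.execute("INSERT INTO transactions VALUES ('INV-1', 12)")
print(in_sync(source, destination, "transactions"))  # False: not replicated yet

# Simulate the CDC tool applying the same change to the destination.
destination.execute("INSERT INTO transactions VALUES ('INV-1', 12)")
print(in_sync(source, destination, "transactions"))  # True: counts match again
```

Equal row counts only catch missing inserts and deletes. Verifying updates would need a comparison of column values or checksums as well.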
Please refer to python.org/downloads for further details.
This is recommended so that all related files reside in one place, making the next instructions easier to follow.
mkdir ./cdc-evaluation-framework
cd ./cdc-evaluation-framework
All paths starting with ./ in the next steps are relative to the cdc-evaluation-framework folder.
This step is optional, but strongly recommended.
python3 -m venv env
source ./env/bin/activate
COMING SOON!
The Online Retail II data set contains all the transactions of a UK-based and registered non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware. Many of its customers are wholesalers.
Summary:
Industry | Rows | Columns | Verified at |
---|---|---|---|
Retail | 1,067,371 | 8 | 2022-05-23 |
Download the archive.zip file from kaggle.com/datasets/mashlyn/online-retail-ii-uci into cdc-evaluation-framework, unzip it, and move the extracted CSV file into a temporary folder:
unzip archive.zip
mv online_retail_II.csv /tmp/
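Before loading the file, it may be worth checking that the extracted CSV matches the summary table above. The following is a minimal sketch, assuming the file has a single header line. The helper name `summarize_csv` is hypothetical, and `errors="replace"` is a defensive assumption in case the file is not clean UTF-8.

```python
import csv
import os

CSV_PATH = "/tmp/online_retail_II.csv"  # location used by the mv command above


def summarize_csv(path):
    """Return (data_rows, columns) for a CSV file with one header line."""
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # csv.reader yields one item per record, even when a quoted
        # field contains commas or line breaks.
        data_rows = sum(1 for _ in reader)
    return data_rows, len(header)


# Only run the check when the file is actually present.
if os.path.exists(CSV_PATH):
    rows, columns = summarize_csv(CSV_PATH)
    print(f"{rows} rows, {columns} columns")
```

The printed numbers can then be compared against the Rows and Columns values in the summary.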
Please refer to the sql/kaggle-online-retail-ii-uci folder for help.
Use the below instructions to accomplish the first step of the CDC Evaluation Plan.
- Set the server's local_infile system variable to 1:
  set global local_infile = 1;
- Provide the --local-infile=1 flag when connecting the client:
  mysql -u <USER> -p --local-infile=1 <DATABASE>
- Run the below command using the client:
  LOAD DATA LOCAL INFILE '/tmp/online_retail_II.csv'
  INTO TABLE transactions
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  IGNORE 1 ROWS
  (invoice, stock_code, description, quantity, invoice_date, price, @customer_id, country)
  SET customer_id = NULLIF(@customer_id, '');
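The LOAD DATA statement above skips the header line, reads optionally quoted comma-separated fields, and stores an empty customer_id as NULL. The sketch below mirrors that behaviour in Python for illustration only. The column names come from the statement, while the `read_rows` helper and the sample header and data line are made up.

```python
import csv
import io

# Column order taken from the LOAD DATA statement.
COLUMNS = ["invoice", "stock_code", "description", "quantity",
           "invoice_date", "price", "customer_id", "country"]


def read_rows(handle):
    """Yield one dict per data line, mapping an empty customer_id to None."""
    reader = csv.reader(handle)  # handles fields optionally enclosed by '"'
    next(reader)  # IGNORE 1 ROWS: skip the header line
    for values in reader:
        row = dict(zip(COLUMNS, values))
        # NULLIF(@customer_id, ''): an empty string becomes NULL (None here).
        if row["customer_id"] == "":
            row["customer_id"] = None
        yield row


# Made-up sample: one header line and one record without a customer id.
sample = io.StringIO(
    "c1,c2,c3,c4,c5,c6,c7,c8\n"
    'X1,S1,"MUG, WHITE",2,2009-12-01 07:45:00,1.25,,United Kingdom\n'
)
for row in read_rows(sample):
    print(row["description"], row["customer_id"])
```

Without the NULLIF step, the outcome for an empty field depends on the column type and the server's SQL mode, so converting it to NULL explicitly avoids surprises.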
You can use the below command to automate the fourth step of the CDC Evaluation Plan.
cdc-eval kaggle-online-retail-uci \
--data-file datasets/kaggle-online-retail-ii-uci.csv \
--invoices <THE-NUMBER-OF-TRANSACTIONS> \
--db-conn <SQLALCHEMY-CONNECTION-STRING>
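The --db-conn option expects a SQLAlchemy connection string, which generally has the shape dialect+driver://username:password@host:port/database. The sketch below shows one way to assemble such a string with the credentials URL-escaped, which matters when a password contains characters like @ or /. The helper name `build_connection_string`, the `mysql+pymysql` driver, and all credential values are assumptions for illustration, not requirements of the framework.

```python
from urllib.parse import quote_plus


def build_connection_string(user, password, host, port, database,
                            driver="mysql+pymysql"):
    """Assemble a SQLAlchemy-style database URL with escaped credentials."""
    # quote_plus escapes reserved characters so they are not mistaken
    # for URL delimiters.
    return (
        f"{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values only; replace them with real connection details.
print(build_connection_string("cdc_user", "p@ss/word", "localhost", 3306, "retail"))
# mysql+pymysql://cdc_user:p%40ss%2Fword@localhost:3306/retail
```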
You can use the below command, which adds the --operation-mode delete option, to automate the fourth step of the CDC Evaluation Plan in delete mode.
cdc-eval kaggle-online-retail-uci \
--data-file datasets/kaggle-online-retail-ii-uci.csv \
--invoices <THE-NUMBER-OF-TRANSACTIONS> \
--db-conn <SQLALCHEMY-CONNECTION-STRING> \
--operation-mode delete
Please make sure to take a moment and read the Code of Conduct.
Please report bugs and suggest features via GitHub Issues.
Before opening an issue, search the tracker for possible duplicates. If you find a duplicate, please add a comment saying that you encountered the problem as well.
Please make sure to read the Contributing Guide before making a pull request.