kafka kafka-sql data-integration data-ingestion kafka-python

ingestd

The TPC-DI Benchmark

Created by the Transaction Processing Performance Council (TPC) as a benchmark for ETL applications, TPC-DI involves integrating multiple data sources and file formats into a data warehouse schema. The benchmark specifies requirements for both historical and incremental loading, as well as the transformations to be applied.

Dataset

As outlined in the official TPC documentation, the data model for the benchmark is a retail brokerage firm.

Flat files originate from five primary sources:

OLTP: CDC extracts, as pipe-delimited text files
HR: comma-delimited database extract
Prospect List: comma-delimited CSV
Financial Newswire: fixed-width text with variable record layouts
CRM: XML
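The source formats above could be read with the Python standard library alone. The following is a minimal sketch, not the code used in this repository: the function names, the sample records, and the `finwire_layout` field widths are hypothetical, and the real field layouts come from the TPC-DI specification.

```python
import csv
import io
import xml.etree.ElementTree as ET


def parse_delimited(text, delimiter):
    """Split delimited text (pipe for OLTP, comma for HR/Prospect) into rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def parse_fixed_width(line, layout):
    """Slice one fixed-width line using (field_name, width) pairs."""
    record, offset = {}, 0
    for name, width in layout:
        record[name] = line[offset:offset + width].strip()
        offset += width
    return record


def parse_xml_attributes(text, tag):
    """Return the attribute dict of every XML element named `tag`."""
    root = ET.fromstring(text)
    return [dict(elem.attrib) for elem in root.iter(tag)]


# Hypothetical sample inputs, for illustration only.
oltp_sample = "1|I|ACTV|1000\n2|U|INAC|2500\n"
finwire_layout = [("timestamp", 15), ("rec_type", 3), ("name", 10)]
finwire_sample = "20200101-120000CMPAcme Corp "
crm_sample = '<Actions><Action type="NEW" ts="2020-01-01"/></Actions>'

oltp_rows = parse_delimited(oltp_sample, "|")
finwire_record = parse_fixed_width(finwire_sample, finwire_layout)
crm_actions = parse_xml_attributes(crm_sample, "Action")
```

Because the newswire records vary in layout, a real parser would likely pick the `layout` list per line based on a record-type field before slicing.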

The destination data warehouse (DWH) schema contains the following tables:

Facts	Dimensions	Reference
fCashBalances	dCustomers	TradeTypes
fHoldings	dAccounts	StatusTypes
fWatches	dBrokers	TaxRates
fMarketHistory	dSecurities	Industries
fProspects	dCompanies	Financials
fTrade	dDate, dTime

Kafka Ingestion & Integration

Data Serialization, Versioning
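One way to handle serialization and versioning is to wrap each record in an envelope that carries a schema version, so consumers can upgrade older messages on read. The sketch below assumes JSON encoding and a hypothetical field rename between versions 1 and 2. The repository may use a different approach (for example Avro with a schema registry), and all names here are illustrative.

```python
import json

SCHEMA_VERSION = 2  # hypothetical current schema version


def serialize(record, version=SCHEMA_VERSION):
    """Wrap a record in a versioned envelope and encode it as UTF-8 JSON bytes."""
    envelope = {"schema_version": version, "payload": record}
    return json.dumps(envelope, sort_keys=True).encode("utf-8")


def upgrade_v1_to_v2(payload):
    """Hypothetical migration: version 2 renames 'cust_id' to 'customer_id'."""
    upgraded = dict(payload)
    upgraded["customer_id"] = upgraded.pop("cust_id")
    return upgraded


def deserialize(raw):
    """Decode an envelope and return its payload in the current schema."""
    envelope = json.loads(raw.decode("utf-8"))
    version, payload = envelope["schema_version"], envelope["payload"]
    if version == 1:
        payload = upgrade_v1_to_v2(payload)
    elif version != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema version: {version}")
    return payload
```

A function such as `serialize` returns bytes, so it could be supplied as the `value_serializer` of a kafka-python `KafkaProducer`, with `deserialize` playing the matching role on the consumer side.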

About

Data Integration via Confluent Kafka

GNU General Public License v3.0

Languages

Python 100.0%