huwngnosleep / complete_lakehouse_techstack

This project implements an end-to-end tech stack for a data platform and can be used in production.

Stack architecture

The platform follows a data lakehouse architecture. Its main components are:

  • Distributed query/execution engine: Spark Thrift Server
  • Stream processing: Kafka
  • Storage: HDFS
  • Data mart: ClickHouse
  • Orchestration: Airflow
  • Main file format: Parquet with Snappy compression
  • Warehouse table format: Hive and Iceberg
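As a sketch of how the warehouse layer ties together, a client connected to the Spark Thrift Server could declare an Iceberg table backed by Snappy-compressed Parquet. The catalog, table name, and columns below are hypothetical; the two table properties are standard Iceberg settings:

```sql
-- Hypothetical table; only the Iceberg properties are standard.
CREATE TABLE demo.events (
  event_id BIGINT,
  payload  STRING
)
USING iceberg
TBLPROPERTIES (
  'write.format.default' = 'parquet',           -- main file format: Parquet
  'write.parquet.compression-codec' = 'snappy'  -- with Snappy compression
);
```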

A full how-to-run guide will be uploaded later.

First run:

  1. Change the variable IS_RESUME in ./services/metastore/docker-compose.yml to False
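For reference, the relevant fragment of ./services/metastore/docker-compose.yml might look like the following. The service name and surrounding keys are assumptions; only the IS_RESUME variable and file path come from this project:

```yaml
# ./services/metastore/docker-compose.yml (sketch; keys other than IS_RESUME are assumptions)
services:
  metastore:
    environment:
      IS_RESUME: "False"   # first run: initialize the metastore from scratch
```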

  2. Grant full permissions on the local HDFS data directory:

sudo mkdir -p ./services/hadoop/data
sudo chmod -R 777 ./services/hadoop/data
  3. Create the Docker network:

docker network create default_net

  4. Start all services:

bash start_all_service.sh

After finishing all the above steps, change IS_RESUME back to True, then rerun start_all_service.sh.



Languages

Python 82.9% · TSQL 12.4% · Shell 3.8% · Dockerfile 0.8%