deka108 / datagen

Datagen

Generate data for experiments.

Distributed TPC-H data

Requires the TPCH-sqlite repository: git clone git@github.com:lovasoa/TPCH-sqlite.git

  1. Generate TPC-H.db
./tpch gen-tpch.sh [path to the TPCH-sqlite repo] [SCALE] # this generates data under the tpch directory
  2. Run gen_dist_tpch.py to distribute the data in TPC-H.db
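
For illustration only, here is a minimal sketch of what distributing one table of a SQLite database across several nodes could look like. It is not the code in gen_dist_tpch.py; the function name `split_table`, the use of one SQLite connection per node, and the contiguous, near-equal chunking are all assumptions made for this example.

```python
import sqlite3


def split_table(src_conn, table, node_conns):
    """Copy the rows of `table` into each node database in contiguous chunks.

    Hypothetical helper: chunk sizes differ by at most one row.
    The table name is interpolated into SQL, so pass trusted names only.
    """
    # Reuse the original CREATE TABLE statement so every node has the same schema.
    create_sql = src_conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()[0]
    rows = src_conn.execute(f"SELECT * FROM {table}").fetchall()

    base, extra = divmod(len(rows), len(node_conns))
    start = 0
    for i, conn in enumerate(node_conns):
        size = base + (1 if i < extra else 0)  # first `extra` nodes get one more row
        chunk = rows[start:start + size]
        start += size
        conn.execute(create_sql)
        if chunk:
            placeholders = ",".join("?" * len(chunk[0]))
            conn.executemany(
                f"INSERT INTO {table} VALUES ({placeholders})", chunk
            )
        conn.commit()
```

In practice `src_conn` would be opened on the generated TPC-H.db and each entry of `node_conns` on a per-node database file. Loading a whole table with `fetchall()` is only reasonable for small scale factors; a real run would likely stream rows in batches.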

To be done:

  • integrate sqlite, pandas, and numpy for easy data generation in multi-node settings
  • distribute data based on tuple count distributions: equal, left, right, random
  • partition data into nodes based on table and columns (can use consistent hashing)
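
Two of the planned items above can be sketched in a few lines of Python. None of this exists in the repository yet; the names `node_counts` and `HashRing` are hypothetical. The reading of "left" as "more tuples on the first nodes" and "right" as "more tuples on the last nodes" is an assumption, as are the linear weights and the choice of MD5 for the hash ring.

```python
import bisect
import hashlib
import random


def node_counts(total, n_nodes, mode="equal", seed=None):
    """Return how many tuples each node receives (the counts sum to `total`)."""
    if mode == "equal":
        weights = [1] * n_nodes
    elif mode == "left":      # assumed meaning: skewed toward the first nodes
        weights = [n_nodes - i for i in range(n_nodes)]
    elif mode == "right":     # assumed meaning: skewed toward the last nodes
        weights = [i + 1 for i in range(n_nodes)]
    elif mode == "random":
        rng = random.Random(seed)
        weights = [rng.randint(1, 100) for _ in range(n_nodes)]
    else:
        raise ValueError(f"unknown mode: {mode}")

    scale = sum(weights)
    counts = [total * w // scale for w in weights]
    # Integer division rounds down; give the leftover tuples to the heaviest nodes.
    remainder = total - sum(counts)
    heaviest_first = sorted(range(n_nodes), key=lambda i: -weights[i])
    for i in heaviest_first[:remainder]:
        counts[i] += 1
    return counts


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, replicas=64):
        # Each node is placed on the ring `replicas` times to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}#{r}"), node)
            for node in nodes
            for r in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.md5(str(key).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._points)
        return self._ring[idx][1]
```

For example, `node_counts(10, 4, "left")` gives `[4, 3, 2, 1]`, and `HashRing(["node0", "node1", "node2"]).node_for(key)` maps a partition-column value to a node. The point of the ring over a plain `hash(key) % n` is that removing a node only moves the keys that lived on that node.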
