blaze-init / spark-blaze-extension

Blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

Generate TPC-DS dataset through dsgen

yjshen opened this issue

Currently, the TPC-DS sf=1 dataset is generated once and placed in 'dev/tpcds_1g', making our repo huge.

To avoid tracking the data in git and regenerating it on every run, we should generate the dataset in GitHub Actions and cache it.

  • GitHub Actions grants each repository a 10 GB cache, and we currently use less than 1 GB.
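
The generate-once-then-cache idea could look roughly like the Python sketch below. It is only an illustration, not the project's actual CI setup. The helper names (`cache_key`, `ensure_dataset`, `run_dsdgen`) and the `.complete` marker file are hypothetical. It also assumes the generator is the TPC-DS toolkit's `dsdgen` binary, available on `PATH` and invoked with `-SCALE` and `-DIR`. A CI job would restore the data directory from the cache under the computed key, call `ensure_dataset`, and save the directory back only when it was freshly generated.

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Callable


def cache_key(scale_factor: int, generator_version: str) -> str:
    """Derive a stable cache key from the inputs that determine the dataset."""
    digest = hashlib.sha256(f"{scale_factor}:{generator_version}".encode()).hexdigest()
    return f"tpcds-sf{scale_factor}-{digest[:12]}"


def run_dsdgen(data_dir: Path, scale_factor: int = 1) -> None:
    """Invoke the generator.

    Assumption: a `dsdgen` binary from the TPC-DS toolkit is on PATH and
    accepts -SCALE and -DIR; adjust to however the toolkit is built in CI.
    """
    subprocess.run(
        ["dsdgen", "-SCALE", str(scale_factor), "-DIR", str(data_dir)],
        check=True,
    )


def ensure_dataset(data_dir: Path, generate: Callable[[Path], None]) -> bool:
    """Generate the dataset only if a previous run (or a restored cache)
    has not already produced it. Returns True when generation actually ran."""
    marker = data_dir / ".complete"  # hypothetical marker written after success
    if marker.exists():
        return False
    data_dir.mkdir(parents=True, exist_ok=True)
    generate(data_dir)
    marker.touch()
    return True


if __name__ == "__main__":
    # The version string here is a placeholder, not a real toolkit version.
    print("cache key:", cache_key(1, "toolkit-version-placeholder"))
    generated = ensure_dataset(Path("dev/tpcds_1g"), run_dsdgen)
    print("generated" if generated else "reused cached dataset")
```

Keying the cache on the scale factor and generator version means a change to either one produces a new cache entry rather than silently reusing stale data.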