spark-tpcds-datagen

All the things about TPC-DS in Apache Spark


This is a TPC-DS data generator for Apache Spark, split off from spark-sql-perf; it includes pre-built tpcds-kit binaries for Mac/Linux x86_64 platforms.

Note that the current master branch targets the Spark master branch and 3.0.0-preview2 on Scala 2.12.x. If you want to generate TPC-DS test data on Spark 2.4.x, please use branch-2.4.

How to generate TPCDS data

First of all, you need to set up Spark:

$ git clone https://github.com/apache/spark.git

$ cd spark && ./build/mvn clean package -DskipTests

$ export SPARK_HOME=`pwd`

Then, you can generate TPCDS test data in /tmp/spark-tpcds-data:

$ ./bin/dsdgen --output-location /tmp/spark-tpcds-data
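As a quick sanity check, you can inspect the generated data. The commands below are a sketch assuming the default parquet format and the output location used above; store_sales is one of the standard TPC-DS table names:

```shell
# List the generated tables (one directory per table, e.g. store_sales,
# customer, ...).
ls /tmp/spark-tpcds-data

# Inspect one table's schema with spark-shell; the parquet layout is an
# assumption based on the generator's default --format.
echo 'spark.read.parquet("/tmp/spark-tpcds-data/store_sales").printSchema()' \
  | $SPARK_HOME/bin/spark-shell
```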

How to run TPC-DS queries in Spark

You can run the TPC-DS queries against the test data in /tmp/spark-tpcds-data:

$ ./bin/spark-submit \
    --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark \
    sql/core/target/spark-sql_<scala.version>-<spark.version>-tests.jar \
    --data-location /tmp/spark-tpcds-data

Options for the generator

$ ./bin/dsdgen --help
Usage: spark-submit --class <this class> --conf key=value <spark tpcds datagen jar> [Options]
Options:
  --output-location [STR]                Path to an output location
  --scale-factor [NUM]                   Scale factor (default: 1)
  --format [STR]                         Output format (default: parquet)
  --overwrite                            Whether it overwrites existing data (default: false)
  --partition-tables                     Whether it partitions output data (default: false)
  --use-double-for-decimal               Whether it prefers double types (default: false)
  --cluster-by-partition-columns         Whether it clusters output data by partition columns (default: false)
  --filter-out-null-partition-values     Whether it filters out NULL partitions (default: false)
  --table-filter [STR]                   Tables to filter, e.g., catalog_sales,store_sales
  --num-partitions [NUM]                 # of partitions (default: 100)
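These options can be combined. For example, a hypothetical invocation that generates partitioned, scale-factor-10 data (the output path is just an example) might look like:

```shell
# Generate scale-factor-10 TPC-DS data as partitioned parquet,
# overwriting any existing output at the example location.
./bin/dsdgen \
  --output-location /tmp/spark-tpcds-data \
  --scale-factor 10 \
  --format parquet \
  --overwrite \
  --partition-tables \
  --num-partitions 200
```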

Run specific TPC-DS queries only

To run a subset of the TPC-DS queries, type:

$ ./bin/run-tpcds-benchmark --data-location [TPC-DS test data] --query-filter "q2,q5"

Other helper scripts for benchmarks

To quickly generate the TPC-DS test data and run the queries in one go, type:

$ ./bin/report-tpcds-benchmark [output file]

This script formats the performance results and appends them to ./reports/tpcds-avg-results.csv. Note that if SPARK_HOME is defined, the script uses that Spark installation; otherwise, it automatically clones the latest master of the Spark repository and uses it. To check performance differences against a pull request, you can pass a pull request ID in the Spark repository as an option and run the queries against it:

$ ./bin/report-tpcds-benchmark [output file] [pull request ID (e.g., 12942)]

Bug reports

If you hit bugs or have feature requests, please leave a comment on Issues or Twitter (@maropu).


License

Apache License 2.0
