This is the TPC-DS data generator for Apache Spark, split off from spark-sql-perf. It includes pre-built tpcds-kit binaries for Mac/Linux x86_64 platforms. Note that the current master branch is intended to support the Spark master branch and 3.0.0-preview2 on Scala 2.12.x. If you want to generate TPC-DS test data on Spark 2.4.x, please use branch-2.4.
How to generate TPCDS data
First of all, you need to set up Spark:
$ git clone https://github.com/apache/spark.git
$ cd spark && ./build/mvn clean package -DskipTests
$ export SPARK_HOME=`pwd`
Then, you can generate TPC-DS test data in /tmp/spark-tpcds-data:
$ ./bin/dsdgen --output-location /tmp/spark-tpcds-data
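After generation finishes, the output location should contain one directory per TPC-DS table, each holding data in the chosen format. A quick sanity check (the table names in the comment are a sample of the TPC-DS schema):

```shell
# Expect one directory per table, e.g. store_sales, store_returns,
# date_dim, item, customer, inventory, ...
$ ls /tmp/spark-tpcds-data
```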
How to run TPC-DS queries in Spark
You can run the TPC-DS queries against the test data in /tmp/spark-tpcds-data:
$ ./bin/spark-submit \
--class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark \
sql/core/target/spark-sql_<scala.version>-<spark.version>-tests.jar \
--data-location /tmp/spark-tpcds-data
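For example, assuming a Spark build on Scala 2.12 at version 3.0.0 (the actual jar name depends on the Scala and Spark versions of your build), the command would look like:

```shell
# Illustrative only: substitute the Scala/Spark versions of your own build
$ ./bin/spark-submit \
    --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark \
    sql/core/target/spark-sql_2.12-3.0.0-tests.jar \
    --data-location /tmp/spark-tpcds-data
```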
Options for the generator
$ ./bin/dsdgen --help
Usage: spark-submit --class <this class> --conf key=value <spark tpcds datagen jar> [Options]
Options:
--output-location [STR] Path to an output location
--scale-factor [NUM] Scale factor (default: 1)
--format [STR] Output format (default: parquet)
--overwrite Whether it overwrites existing data (default: false)
--partition-tables Whether it partitions output data (default: false)
--use-double-for-decimal Whether it uses double types instead of decimals (default: false)
--cluster-by-partition-columns Whether it clusters output data by partition columns (default: false)
--filter-out-null-partition-values Whether it filters out NULL partitions (default: false)
--table-filter [STR] Tables to filter, e.g., catalog_sales,store_sales
--num-partitions [NUM] # of partitions (default: 100)
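As a sketch, these options can be combined in a single invocation; the scale factor and paths below are illustrative, not recommended values:

```shell
# Generate 10GB of partitioned Parquet data, overwriting any existing output
$ ./bin/dsdgen \
    --output-location /tmp/spark-tpcds-data \
    --scale-factor 10 \
    --format parquet \
    --partition-tables \
    --overwrite
```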
Run specific TPC-DS queries only
To run only a subset of the TPC-DS queries, type:
$ ./bin/run-tpcds-benchmark --data-location [TPC-DS test data] --query-filter "q2,q5"
Other helper scripts for benchmarks
To generate the TPC-DS test data and run the queries in one go, type:
$ ./bin/report-tpcds-benchmark [output file]
This script formats the performance results and appends them to ./reports/tpcds-avg-results.csv. Note that if SPARK_HOME is defined, the script uses that Spark installation; otherwise, it automatically clones the latest Spark master and uses it.
To check the performance difference introduced by a pull request, pass the pull request ID in the Spark repository as an option and run the queries against it:
$ ./bin/report-tpcds-benchmark [output file] [pull request ID (e.g., 12942)]
Bug reports
If you hit a bug or have a request, please leave a comment on Issues or Twitter (@maropu).