julienlau / tpcx-hs

big data benchmark including data intensive workloads and random generated data

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

THE TPC SOFTWARE IS AVAILABLE WITHOUT CHARGE FROM TPC.

TPCx-HS - Version 2

Purpose

This repository takes the legacy TPCx-HS benchmarks and adds extended functionalities plus upgrade to more recent version of spark and schedulers.

Note: TPCx-HS Version 1 and TPCx-HS Version 2 are NOT comparable

TPCx-HS is a Big Data System Benchmark

The Hadoop ecosystem is moving fast beyond batch processing with MapReduce. Introduced in 2016 TPCx-HS V2 is based on TPCx-HS V1 with support for Apache Spark - a popular platform for in-memory data processing that enables real-time analytics on Apache Hadoop. TPCx-HS V2 also supports MapReduce (MR2) and supports publications on traditional on premise deployments and clouds. More information about TPCx-HS v1 can be found at http://www.tpc.org/tpcx-hs/default5.asp?version=1. The TPCx-HS v2 benchmark can be used to assess a broad range of system topologies and implementation methodologies in a technically rigorous and directly comparable, in a vendor-neutral manner.

The current TPCx-HS Version 2 specification can be found on the TPC Documentation Webpage https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp

Usage

  • compile the source: TPCx-HS-SRC/README.md
  • make the jars (at least the spark jar) available using an hdfs, http server or S3/minio.
  • configure your running parameters in TPCx-HS-Runtime-Suite/Benchmark_Parameters.sh
  • run : cd TPCx-HS-Runtime-Suite && ./TPCx-HS-master.sh -s -q k8s -g -1

Releases

  • 2.2.1 : (not a TPC release) improve shell for better error handling and auto-tuning + remove spark conf print in the logs
  • 2.2.0 : (not a TPC release) feat : add support for spark on kubernetes scheduler, spark v3 and custom hadoop defaultFS
  • 2.1.0 : (not a TPC release) add support for spark on mesos scheduler and spark v2
  • 2.0.3 : last release from TPC. Only support spark 1.6 using yarn scheduler.

About

big data benchmark including data intensive workloads and random generated data

License:Apache License 2.0


Languages

Language:Java 66.5%Language:Shell 19.9%Language:Scala 10.6%Language:Dockerfile 3.0%