TPC-H Query 01 Optimized for GPU execution

This fork of the original repository is intended for my (Eyal's) continued work on the GPU-side code. I did not have time to implement all of the variants I had wanted before the ADMS 2018 paper was due, and I was unhappy with other aspects of the code, so I have continued the work a bit further here.

This repository should be a stand-in for the forked one, with the main differences being:

  • More distinct implementations.
  • Faster execution on compressed data (via caching, so from the second run onwards).
  • Support for choosing one of several CUDA devices on your system.
  • Slightly more structure in the code of main.cu.
  • Better separation between general-purpose utility code and code specific to our work.
  • Other changes, both resolved and unresolved.

TPC-H Query 01 execution times:

(to be filled in; for now, use the table in the published paper.)

Prerequisites

  • CUDA v9.0 or later is recommended; CUDA v8.0 will probably work, but has not been tested.
  • A C++14-capable compiler compatible with your version of CUDA; only GCC has been tested.
  • CMake v3.1 or later.
  • A Unix-like environment for the (simple) shell scripts; without one, you may need to perform a few tasks manually.
  • The cuda-api-wrappers library (see the note below this list).
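
The cuda-api-wrappers library lives at github.com/eyalroz/cuda-api-wrappers. The following is only a sketch: it assumes you need to fetch the library manually, whereas the project's CMake configuration may locate or obtain it for you, so check CMakeLists.txt first.

    # Assumption: the dependency is not obtained automatically by the build.
    # Clone it, follow its own build/install instructions, and make sure this
    # project's CMake configuration can find it.
    git clone https://github.com/eyalroz/cuda-api-wrappers.git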

Building the Q1 benchmark binary

Assuming you've cloned the repository into /path/to/tpchQ1:

  • Configure the build and generate build files using cmake /path/to/tpchQ1
  • Build using either your default make'ing tool or with cmake --build /path/to/tpchQ1; this will also generate the input data for Scale Factor 1 (SF 1). A typical sequence is sketched below.
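
Concretely, this can look like the following sketch of an in-source build (the choice of working directory is an assumption, not something the project mandates):

    # Configure and build from the source directory itself (in-source build)
    cd /path/to/tpchQ1
    cmake /path/to/tpchQ1          # generate build files in the current directory
    cmake --build /path/to/tpchQ1  # build; also generates the Scale Factor 1 input data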

TPC-H benchmark data

The binary uses the LINEITEM table from the TPC-H benchmark data set. The table is expected to reside in a subdirectory of the directory from which you run the binary: if we're in /foo/bar and call bin/tpch_q1 with scale factor 123, a lineitem.tbl file must reside in /foo/bar/tpch_data/123.000000. Alternatively, if the binary has already loaded the data on a previous run, it will have cached it as .bin files in the same directory, e.g. /foo/bar/tpch_data/123.000000/shipdate.bin and/or /foo/bar/tpch_data/123.000000/compressed_shipdate.bin, for speedier reading. In that case, the binary will be willing to ignore a missing lineitem.tbl file.
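
For illustration, a run at scale factor 1 from /foo/bar would expect a layout along these lines (a sketch: shipdate.bin and compressed_shipdate.bin are just two examples of cached column files, and which ones exist depends on previous runs):

    /foo/bar/tpch_data/1.000000/lineitem.tbl               # generated table; may be absent if cached files exist
    /foo/bar/tpch_data/1.000000/shipdate.bin               # cached plain column data (created by a previous run)
    /foo/bar/tpch_data/1.000000/compressed_shipdate.bin    # cached compressed column data (created by a previous run)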

Generating the data

  • When building, the data for TPC-H Scale Factor 1 (SF 1) is generated as one of the default targets.
  • You can use the build mechanism to generate data for two more scale factors, SF 10 and SF 100, using make -C /path/to/tpchQ1 data_table_sf_10 or make -C /path/to/tpchQ1 data_table_sf_100 (see the examples below this list).
  • For arbitrary scale factors, invoke the scripts/genlineitem.sh script.
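
For example (a sketch: the make targets are the ones named above, while the genlineitem.sh invocation assumes the script takes the scale factor as its single argument, which you should verify against the script itself):

    # Generate the SF 10 and SF 100 tables via the build system
    make -C /path/to/tpchQ1 data_table_sf_10
    make -C /path/to/tpchQ1 data_table_sf_100

    # Assumed direct invocation for an arbitrary scale factor, e.g. SF 3
    /path/to/tpchQ1/scripts/genlineitem.sh 3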

tpch_01 command-line options

Switch | Value range | Default value | Meaning
--device | 0 ... (number of CUDA devices - 1) | 0 | Use the CUDA device with the specified index.
--print-results | N/A | (off) | Print the computed aggregates to std::cout after every run. Useful for debugging result-stability issues.
--use-filter-pushdown | N/A | (off) | Have the CPU evaluate the TPC-H Q1 WHERE-clause condition, passing only the resulting bit vector to the GPU. It is debatable whether this is actually a "push-down" in the traditional sense of the term.
--use-coprocessing | N/A | (off) | Schedule some of the work on the CPU and the rest on the GPU.
--hash-table-placement | in-registers, local-mem, per-thread-shared-mem, global | in-registers | Memory space and granularity of the aggregation tables; see the paper or the code for an explanation of what this means.
--sf= | Integral or fractional number, limited precision | 1 | Which scale-factor subdirectory to use (to look for the data table or cached column files). For --sf=123.456789, data is expected under tpch_data/123.456789.
--streams= | Positive integral number | 4 | The number of concurrent streams used for scheduling GPU work. You should probably not change this.
--threads-per-block= | Positive integral number, preferably a multiple of 32 | 256 | The number of CUDA threads per block of a scheduled computational kernel.
--tuples-per-thread= | Positive integral number, preferably high | 1024 | The number of tuples each thread processes individually before merging results with other threads.
--tuples-per-kernel= | Positive integral number, preferably a multiple of threads-per-block | 1024 | The number of input tuples handled by each kernel launch.
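
For example, a possible invocation might look like the sketch below; the binary name and whether a given switch takes its value after '=' or as a separate argument should be checked against the actual binary (e.g. its help output), since the table above mixes both forms:

    # Hypothetical run: second CUDA device, SF 10, local-memory aggregation tables, printing results
    bin/tpch_q1 --device 1 --sf=10 --hash-table-placement local-mem --print-results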

What is TPC-H Query 1?

The query text and column information are on the Wiki. For further information about the benchmark of which it is a part, see the Transaction Processing Performance Council's page for TPC-H.

License: BSD 3-Clause "New" or "Revised" License

