ZhengtongYan / QPPNet

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

QPPNet in PyTorch DOI

The README below was written for the release tagged rabbit721.

This code contains a sample implementation for Plan-Structured Deep Neural Network Models for Query Performance Prediction presented at VLDB 2019, and the code for training/testing on

  • TPC-H queries generated using https://github.com/gregrahn/tpch-kit.git and benchmarked with Postgres

    The TPC-H data are generated with ./dbgen -s <scale factor>.

    The TPC-H queries are generated using ./qgen <query_id> -r <seed number> where varying seed random number is used to generate different queries from a template.

    Tables in PostgresSQL are created using dss.ddl and then have indexes created with tpch-postgres-index-ddl.sql.

    The query plan structure and query performance metrics are retrieved as json objects by running the generated queries above with the explain (analyze, format JSON, verbose) statement prepended.

  • TPC-H queries generated using https://github.com/gregrahn/tpch-kit.git and benchmarked with NoisePage

  • TPC-C queries and smallbank dataset generated using OLTP and benchmarked with NoisePage

Prerequisites

  • Linux or macOS
  • Python 3

Getting Started

Installation

  • Cloning the repo:

    git clone https://github.com/rabbit721/QPPNet.git
    cd QPPNet
    
  • Install the required python packages:

    • For pip: pip install -r requirements.txt
    • For conda: conda env create -f environment.yml

Getting the Datasets

  • TPC-H benchmarked with PostgresSQL

    # Under directory dataset/postgres_tpch_dataset
    wget http://www.andrew.cmu.edu/user/jiejiao/data/qpp/postgres/tpch/psqltpch0p1g.zip && unzip psqltpch0p1g.zip
    wget http://www.andrew.cmu.edu/user/jiejiao/data/qpp/postgres/tpch/psqltpch1g.zip && unzip psqltpch1g.zip
    wget http://www.andrew.cmu.edu/user/jiejiao/data/qpp/postgres/tpch/psqltpch10g.zip && unzip psqltpch10g.zip
    
  • TPC-H benchmarked with NoisePage:

    Data files already located under directory dataset/terrier_tpch_dataset as execution_0p1G.csv, execution_1G.csv, and execution_10G.csv

  • TPC-C and smallbank benchmarked with NoisePage:

    Data files already located under directory dataset/oltp_dataset as tpcc_pipeline.csv and sb_pipeline.csv

Examples for Training a model

  • On TPC-H dataset generated using https://github.com/gregrahn/tpch-kit.git with SF=1 and benchmarked with PostgresSQL

    python3 main.py --dataset PSQLTPCH -s 0 -t 250000 --batch_size 128 -epoch_freq 1000 --lr 2e-3 --step_size 1000 --SGD --scheduler --data_dir ./dataset/postgres_tpch_dataset/tpch1g/900-exp_res_by_temp/  
    
  • On TPC-H dataset generated using https://github.com/gregrahn/tpch-kit.git with SF=1 and benchmarked with NoisePage

    python3 main.py --dataset TerrierTPCH -s 0 -t 250000 --batch_size 512 -epoch_freq 1000 --lr 1e-3 --step_size 1000 --SGD --scheduler --data_dir ./dataset/terrier_dataset/execution_1G.csv
    
  • On TPC-C dataset generated using OLTP with SF=1 and benchmarked with NoisePage

    python3 main.py --dataset OLTP -s 0 -t 250000 --batch_size 512 -epoch_freq 1000 --lr 5e-3 --step_size 1000 --SGD --scheduler --data_dir ./dataset/oltp_dataset/tpcc_pipeline.csv
    

Using a pre-trained model

  • Getting a model trained for 4000 epochs on TPC-H SF=1 dataset benchmarked with PostgresSQL:

    wget http://www.andrew.cmu.edu/user/jiejiao/data/qpp/trained_models/psqltpch_epoch4000.zip
    
  • Getting a model trained for 20000 epochs on TPC-H SF=1 dataset benchmarked with NoisePage:

    wget http://www.andrew.cmu.edu/user/jiejiao/data/qpp/trained_models/terriertpch_epoch20000.zip
    
  • Getting a model trained for 10000 epochs on TPC-C dataset generated with OLTP and benchmarked with NoisePage:

    wget http://www.andrew.cmu.edu/user/jiejiao/data/qpp/trained_models/tpcc_epoch10000.zip
    

Examples for Testing a trained model

When testing, please make sure that trained models are saved in ./saved_model and the mean and range values of the train data are provided with the flag '--mean_range_dict' in order to normalize the input.

The '-s' flag is interpreted as an integer and is used to specify the epoch number of the saved model to be tested.

  • Testing a model trained for 4000 epochs on TPC-H SF=1 dataset benchmarked with PostgresSQL on TPC-H SF=10 dataset benchmarked with PostgresSQL.

    python3 main.py --test_time --dataset PSQLTPCH -s 4000 --mean_range_dict mean_range_dict.pickle --data_dir ./dataset/postgres_tpch_dataset/tpch10G/900-exp_res_by_temp/
    
  • Testing a model trained for 10000 epochs on the TPC-C benchmark on the smallbank dataset. Please make sure that models are saved in ./saved_model

    python3 main.py --test_time --dataset OLTP -s 10000 --data_dir ./dataset/oltp_dataset/sb_pipeline.csv