ZhengtongYan / WorkloadCharacterization

analyzing workloads

Repository from GitHub: https://github.com/ZhengtongYan/WorkloadCharacterization

WorkloadCharacterization

PostgreSQL setup

Many of the scripts rely on a running PostgreSQL instance.

In the following scripts, set the user, password, and port appropriately to generate the ground-truth data.
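
As a rough illustration, the connection settings the scripts expect look something like the sketch below; the variable names here are hypothetical, so check the top of each script for the actual ones.

    # Hypothetical sketch of the PostgreSQL connection settings set at the top of
    # the scripts; actual variable names may differ from script to script.
    import psycopg2

    USER = "postgres"   # PostgreSQL user
    PWD = "postgres"    # password for that user
    PORT = 5432         # standard instance (a 512MB-limit instance may run on another port)
    DB_NAME = "imdb"    # database holding the workload's tables

    conn = psycopg2.connect(user=USER, password=PWD, port=PORT,
                            host="localhost", dbname=DB_NAME)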

Creating workload files from SQL queries

This is done in two steps. First, parse the SQL queries to extract expressions (similar to how they look in SCOPE):

  • Use ParsingSQLs.ipynb for general SQL workloads (tested on IMDb, TPC-DS, etc.).

  • (Or ParsingSQLs-zdbs.ipynb, which is hardcoded for ziniu's DB instances.)

  • At this point, expr_df.csv should have been generated. Second, collect the cardinality estimate for each expression using:

  • python3 get_rowcounts.py, with the appropriate global variables set in the script (e.g., the WK global variable, WK=ceb). A minimal sketch of this step appears after this list.

  • python3 get_const_rowcounts.py ---> produces literal_df.csv, which adds cardinality data for single constants to expr_df.csv.

(for ziniu's db instances:)

  • python3 get_rowcounts_zdbs.py
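
A minimal sketch of the cardinality-collection step referenced above (not the exact logic of get_rowcounts.py; the expr_df.csv column names used here are assumptions):

    # Minimal sketch of collecting PostgreSQL cardinality estimates for each
    # extracted expression via EXPLAIN; get_rowcounts.py may differ in details
    # such as column names, query templates, and how results are stored.
    import json

    import pandas as pd
    import psycopg2

    conn = psycopg2.connect(user="postgres", password="postgres", port=5432,
                            host="localhost", dbname="imdb")
    expr_df = pd.read_csv("expr_df.csv")

    est_rows = []
    with conn.cursor() as cur:
        for _, row in expr_df.iterrows():
            # "table_name" and "expr" are assumed column names in expr_df.csv.
            cur.execute(f"EXPLAIN (FORMAT JSON) SELECT * FROM {row['table_name']} "
                        f"WHERE {row['expr']}")
            plan = cur.fetchone()[0]
            if isinstance(plan, str):  # psycopg2 may return the JSON as a string
                plan = json.loads(plan)
            est_rows.append(plan[0]["Plan"]["Plan Rows"])

    expr_df["est_rows"] = est_rows
    expr_df.to_csv("expr_df.csv", index=False)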

Running Evaluation for Generated Data

IMDb version

  • Have the generated data files available (n.csv, etc.).

  • Create a table using a data file and evaluate on it:

    python3 create_table.py --inp_fn data/gen_data/new_data3/n.csv --port 5432 --data_kind gen_shuffle
    python3 eval_data.py --inp_to_eval n --data_kind gen_shuffle --num_queries 100 --port 5432

    * --port 5432 points (on tebow) to the standard PostgreSQL Docker instance; --port 5434 points to the 512MB-memory-limit version
    * --data_kind (a sketch of the two shuffle modes appears after this list)
      * --data_kind gen_shuffle ---> uses the generated data file, but just shuffles its values
      * --data_kind gen_shuffle2 ---> uses the generated data file, replaces NULLs with random values, and shuffles
    
    
  • Create a table using random values from the domain and evaluate it:

    • Example of evaluating the generated data:
    python3 create_table.py --inp_fn n.csv --port 5432 --data_kind random_domain2
    python3 eval_data.py --inp_to_eval n --data_kind random_domain2 --num_queries 100 --port 5432
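
For illustration, the two shuffle-based --data_kind modes described above amount to roughly the following transformations on a data file (a sketch based on the bullet descriptions, not the code in create_table.py, which may implement them differently):

    # Sketch of the two shuffle-based --data_kind modes, based on the bullet
    # descriptions above; create_table.py may differ in the details.
    import numpy as np
    import pandas as pd

    def gen_shuffle(df: pd.DataFrame) -> pd.DataFrame:
        """Use the generated data file, but shuffle the values within each column."""
        out = df.copy()
        for col in out.columns:
            out[col] = np.random.permutation(out[col].values)
        return out

    def gen_shuffle2(df: pd.DataFrame) -> pd.DataFrame:
        """Replace NULLs with random values from the column's domain, then shuffle."""
        out = df.copy()
        for col in out.columns:
            domain = out[col].dropna().values
            null_mask = out[col].isna()
            if len(domain) > 0 and null_mask.any():
                out.loc[null_mask, col] = np.random.choice(domain, size=null_mask.sum())
            out[col] = np.random.permutation(out[col].values)
        return out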

SOSD version

Brief Notes

  • For non-SCOPE workloads, the ParsingSQLs.ipynb file should handle going from SQL strings to the op_df.csv and expr_df.csv files.

  • For SCOPE workloads, this is handled in filter_constants.ipynb.

  • TODO: need to clean up other SCOPE analysis files to be more consistent etc.

  • For CEB and IMDb, we load the cardinalities from the qrep objects; for TPC-DS etc., we use the get_rowcounts.py script with the appropriate WK setting to get the cardinalities. Note: this requires creating a database at the same scale.

  • TODO: tpch / tpcds parsing needs to handle edge cases better; it seems to be accidentally converting OR statements to AND. Sanity check: no single-table cardinality should be 0 in expr_df.csv (see the sketch after this list).

  • The expr_df.csv ---> literal_df.csv step is done in get_const_rowcounts.py.
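
The sanity check from the tpch / tpcds TODO above can be automated with a few lines of pandas (the estimated-cardinality column name "est_rows" is an assumption; adjust it to match the actual file):

    # Sanity check for the tpch / tpcds parsing TODO above: no single-table
    # cardinality should be 0 in expr_df.csv. The "est_rows" column name is an
    # assumption; adjust it to match the actual file.
    import pandas as pd

    expr_df = pd.read_csv("expr_df.csv")
    zero_card = expr_df[expr_df["est_rows"] == 0]
    assert zero_card.empty, \
        f"{len(zero_card)} expressions have zero estimated cardinality:\n{zero_card}"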


License: MIT License

