j-confusatron / DBQueryOptimizer

A system that learns to optimize database query plans.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

DB Query Plan Optimizer

Paper

Traditional database query planning uses dynamic programming to build cardinality estimations from the bottom up for query plan candidates. Cardinality estimates are just that, an estimate. Further, dynamic programming algorithms, such as Selinger's, break down as query complexity increases, requiring the databse engine to instead employ heuristics to make educated guesses.

This project introduces an intelligent system that learns to predict the optimal query plan to execute. Two pipelines are used:

  • A production pipeline integrates tightly with the database to estimate cost for candidate plans and select the plan associated with the lowest cost estimate. All observed stateful data is recorded in a training buffer.
  • A training pipeline pulls samples from the training buffer and further trains a neural network to accurately predict costs. Periodically, the current best model produced by the training pipeline is deployed to production.

By using reinforcement learning techniques, the system is able to produce its own sample data, from which it trains its self. The data produced is abstracted from specific schema details, such that the models produced may generalize their performance to unseen environments.

Build Information

Build Environment

  • Development: Windows 11 / Anaconda 4.10.3 / Python 3.8.13
  • Python Server + Database: Ubuntu 20.04.4 / Postgres 12.10 / Anaconda 4.10.3 / Python 3.8.13

Libraries / Components

  • Postgres 12.x
  • gcc + make
  • Python 3.8
  • Pytorch
  • Numpy
  • Matplotlib
  • Pandas
  • Tqdm

Postgres Function Hook

This project relies on prior work done on Bao (Marcus, et al.) to integrate with Postgres. This is done by building a function hook, that intercepts query plan calls. Instructions on building and deploying the query hook are here.

Datasets

This system was trained using a variation of the IMDB JOB dataset. The production pipeline will produce samples for any database with traffic.

Usage

Server

The server is responsible for interacting with Postgres. Postgres will POST query plan candidates and observed costs to it. It records learning information for the training pipeline.

Config server.cfg

  • server -> Port: Port to run server on
  • server -> ListenOn: 'localhost' or IP to host server on
  • model -> Epsilon: 0-1, likelihood that model will select a random plan
  • model -> ForceArm: 0-n, where n is number of candidate plans; force server to select plan at index; -1 to disable
  • buffer -> FlushAfter: Number of queries to predict before flushing training buffer to disk

Run

python server.py

Query Runner

Run all queries against a Postgres instance.

Config server.cfg

  • db -> database: The database to query
  • db -> user: db user name
  • db -> password: db password
  • db -> host: db host
  • db -> port: db port

Run

python query_runner.py

  • --dataset: One of: imdb-train, imdb-test, so

Training Pipeline

Train a new model with the current training buffer.

Run

python train_model.py

About

A system that learns to optimize database query plans.

License:Apache License 2.0


Languages

Language:Python 100.0%