Abdi-29 / pg_analytics

Analytical table access method for Postgres

Home Page:https://paradedb.com

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

pg_analytics

Important Notice

pg_analytics has been succeeded by pg_lakehouse. Whereas pg_analytics is designed to query over data stored within Postgres, pg_lakehouse queries over external data lakes like S3 and table formats like Iceberg, storing no data inside Postgres. Because we are strong believers of separation of storage and compute and have seen greater demand for pg_lakehouse, we're choosing to focus our development efforts on pg_lakehouse for the next few months.

As a result, pg_analytics has been moved out of the main ParadeDB repo. It is possible that development on pg_analytics will resume in the future.

Overview

pg_analytics is an extension that accelerates analytical query processing inside Postgres. The performance of analytical queries that leverage pg_analytics is comparable to the performance of dedicated OLAP databases — without the need to extract, transform, and load (ETL) the data from your Postgres instance into another system. The purpose of pg_analytics is to be a drop-in solution for fast analytics in Postgres with zero ETL.

The primary dependencies are:

How It Works

These libraries are the building blocks of many modern analytical databases and enable column-oriented storage, efficient data compression, and vectorized query execution within Postgres. Please see our blog post for a deep dive into how it works.

Benchmarks

With pg_analytics installed, ParadeDB is the fastest Postgres-based analytical database and outperforms many specialized OLAP systems. On Clickbench, ParadeDB is 94x faster than regular Postgres, 8x faster than Elasticsearch, and almost ties Clickhouse.

Clickbench Results

For an apples-to-apples comparison, these benchmarks were run on a c6a.4xlarge with 500GB storage. None of the databases were tuned. The (Parquet, single) Clickhouse variant was selected because it most closely matches ParadeDB's Parquet storage.

You can view the ParadeDB ClickBench results against other Postgres-compatible OLAP databases here.

Getting Started

This toy example demonstrates how to get started.

CREATE EXTENSION pg_analytics;
-- Create a parquet table
CREATE TABLE t (a int) USING parquet;
-- pg_analytics supercharges the performance of any
-- Postgres query run on a parquet table
INSERT INTO t VALUES (1), (2), (3);
SELECT COUNT(*) FROM t;

Parquet Tables

You can interact with parquet tables the same way as with normal Postgres tables. However, there are a few exceptions.

Append Only

Because column-oriented storage formats are not designed for fast updates, parquet tables are append only. This means that parquet tables do not support UPDATE and DELETE clauses. Data that is frequently updated should be stored in regular Postgres heap tables.

Storage Optimization

The VACUUM FULL <table_name> command is used to optimize a table's storage by bin-packing small Parquet files into larger files, which can significantly improve query time and compression.

Roadmap

pg_analytics is currently in beta.

Known Limitations

As pg_analytics becomes production-ready, many of these will be resolved.

  • UPDATE and DELETE
  • Some Postgres types, notably json and timestamptz
  • Write-ahead log (WAL) support
  • Collations
  • INSERT ... ON CONFLICT clauses

Development

Install Rust

To develop the extension, first install Rust v1.76.0 using rustup. We will soon make the extension compatible with newer versions of Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup install 1.76.0

# We recommend setting the default version to 1.76.0 for consistency across your system
rustup default 1.76.0

Note: While it is possible to install Rust via your package manager, we recommend using rustup as we've observed inconcistencies with Homebrew's Rust installation on macOS.

Then, install the PostgreSQL version of your choice using your system package manager. Here we provide the commands for the default PostgreSQL version used by this project:

Install Postgres

# macOS
brew install postgresql@16

# Ubuntu
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
sudo apt-get update && sudo apt-get install -y postgresql-16 postgresql-server-dev-16

If you are using Postgres.app to manage your macOS PostgreSQL, you'll need to add the pg_config binary to your path before continuing:

export PATH="$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin"

Install pgrx

Then, install and initialize pgrx:

# Note: Replace --pg16 with your version of Postgres, if different (i.e. --pg15, --pg14, etc.)
cargo install --locked cargo-pgrx --version 0.12.0-alpha.1

# macOS arm64
cargo pgrx init --pg16=/opt/homebrew/opt/postgresql@16/bin/pg_config

# macOS amd64
cargo pgrx init --pg16=/usr/local/opt/postgresql@16/bin/pg_config

# Ubuntu
cargo pgrx init --pg16=/usr/lib/postgresql/16/bin/pg_config

If you prefer to use a different version of Postgres, update the --pg flag accordingly.

Note: While it is possible to develop using pgrx's own Postgres installation(s), via cargo pgrx init without specifying a pg_config path, we recommend using your system package manager's Postgres as we've observed inconsistent behaviours when using pgrx's.

Configure Shared Preload Libraries

This extension uses Postgres hooks to intercept Postgres queries. In order to enable these hooks, the extension must be added to shared_preload_libraries inside postgresql.conf. If you are using Postgres 16, this file can be found under ~/.pgrx/data-16.

# Inside postgresql.conf
shared_preload_libraries = 'pg_analytics'

Run Without Optimized Build

The extension can be developed with or without an optimized build. An optimized build improves query times by 10-20x but also significantly increases build times.

To launch the extension without an optimized build, run

cargo pgrx run

Run With Optimized Build

First, switch to latest Rust Nightly (as of writing, 1.77) via:

rustup update nightly
rustup override set nightly

Then, reinstall pgrx for the new version of Rust:

cargo install --locked cargo-pgrx --version 0.12.0-alpha.1 --force

Finally, run to build in release mode with SIMD:

cargo pgrx run --release

Note that this may take several minutes to execute.

To revert back to the stable version of Rust, run:

rustup override unset

Run Benchmarks

We support two types of benchmarks: ClickBench and TPC-H. ClickBench is ideal for testing analytical queries, while TPC-H is ideal for testing JOINs. To run the benchmarks, cd into the benchmarks/clickbench or benchmarks/tpch directory and run ./benchmark.sh -t <flag>. The -t flag is the version to benchmark, either x.y.z or latest to pull a version from DockerHub, or local to build the Docker image locally.

About

Analytical table access method for Postgres

https://paradedb.com

License:GNU Affero General Public License v3.0


Languages

Language:Rust 93.2%Language:Shell 6.8%