misleem / dbtvault_greenplum_demo

dbtVault + Greenplum demo

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

1. Create Greenplum cluster on Yandex.Cloud

Data Warehouse Analyst – Analytics Engineer @ OTUS.ru

Request access to Manged Greenplum Service (beta), than create a cluster.

Configuration I have used:

2. Generate data with DBGen

Spin up a virtual machine, install libs, generate data.

Or just get what I have prepared for you from s3://otus-dwh/tpch-dbgen/ (Yandex Object Storage)

ssh -i ~/.ssh/key dbgen@{ip}  # ssh to VM

sudo apt-get install -y gcc git awscli postgresql # install libs

git clone https://github.com/electrum/tpch-dbgen.git # TPCH generator
make makefile.suite

./dbgen -v -h -s 10 # generate data

for i in `ls *.tbl`; do sed 's/|$//' $i > ${i/tbl/csv}; echo $i; done; # convert to a CSV format compatible with PostgreSQL

aws configure # to sync with S3

aws --endpoint-url=https://storage.yandexcloud.net s3 sync . s3://otus-dwh/tpch-dbgen/ \
	--exclude=* \
	--include=*.csv \
	--acl=public-read \
	--dryrun

Read more at:

3. COPY to database

First create table definitions.

Then load data into it.

-- DDL scripts to create table

CREATE TABLE customer
(C_CUSTKEY INT, 
C_NAME VARCHAR(25),
C_ADDRESS VARCHAR(40),
C_NATIONKEY INTEGER,
C_PHONE CHAR(15),
C_ACCTBAL DECIMAL(15,2),
C_MKTSEGMENT CHAR(10),
C_COMMENT VARCHAR(117))
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (C_CUSTKEY);

CREATE TABLE lineitem
(L_ORDERKEY BIGINT,
L_PARTKEY INT,
L_SUPPKEY INT,
L_LINENUMBER INTEGER,
L_QUANTITY DECIMAL(15,2),
L_EXTENDEDPRICE DECIMAL(15,2),
L_DISCOUNT DECIMAL(15,2),
L_TAX DECIMAL(15,2),
L_RETURNFLAG CHAR(1),
L_LINESTATUS CHAR(1),
L_SHIPDATE DATE,
L_COMMITDATE DATE,
L_RECEIPTDATE DATE,
L_SHIPINSTRUCT CHAR(25),
L_SHIPMODE CHAR(10),
L_COMMENT VARCHAR(44))
WITH (appendonly=true, orientation=column, compresstype=ZSTD)
DISTRIBUTED BY (L_ORDERKEY,L_LINENUMBER)
PARTITION BY RANGE (L_SHIPDATE)
(start('1992-01-01') INCLUSIVE end ('1998-12-31') INCLUSIVE every (30),
default partition others);

CREATE TABLE nation
(N_NATIONKEY INTEGER, 
N_NAME CHAR(25), 
N_REGIONKEY INTEGER, 
N_COMMENT VARCHAR(152))
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (N_NATIONKEY);

CREATE TABLE orders
(O_ORDERKEY BIGINT,
O_CUSTKEY INT,
O_ORDERSTATUS CHAR(1),
O_TOTALPRICE DECIMAL(15,2),
O_ORDERDATE DATE,
O_ORDERPRIORITY CHAR(15), 
O_CLERK  CHAR(15), 
O_SHIPPRIORITY INTEGER,
O_COMMENT VARCHAR(79))
WITH (appendonly=true, orientation=column, compresstype=ZSTD)
DISTRIBUTED BY (O_ORDERKEY)
PARTITION BY RANGE (O_ORDERDATE)
(start('1992-01-01') INCLUSIVE end ('1998-12-31') INCLUSIVE every (30),
default partition others);

CREATE TABLE part
(P_PARTKEY INT,
P_NAME VARCHAR(55),
P_MFGR CHAR(25),
P_BRAND CHAR(10),
P_TYPE VARCHAR(25),
P_SIZE INTEGER,
P_CONTAINER CHAR(10),
P_RETAILPRICE DECIMAL(15,2),
P_COMMENT VARCHAR(23))
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (P_PARTKEY);

CREATE TABLE partsupp
(PS_PARTKEY INT,
PS_SUPPKEY INT,
PS_AVAILQTY INTEGER,
PS_SUPPLYCOST DECIMAL(15,2),
PS_COMMENT VARCHAR(199))
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (PS_PARTKEY,PS_SUPPKEY);

CREATE TABLE region
(R_REGIONKEY INTEGER, 
R_NAME CHAR(25),
R_COMMENT VARCHAR(152))
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (R_REGIONKEY);

CREATE TABLE supplier 
(S_SUPPKEY INT,
S_NAME CHAR(25),
S_ADDRESS VARCHAR(40),
S_NATIONKEY INTEGER,
S_PHONE CHAR(15),
S_ACCTBAL DECIMAL(15,2),
S_COMMENT VARCHAR(101))
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (S_SUPPKEY);

On a VM with installed psql execute COPY pointing to local csv files:

export GREENPLUM_URI="postgres://greenplum:<pass>@<host>:5432/postgres"
psql $GREENPLUM_URI

\copy customer from  '/home/dbgen/tpch-dbgen/data/customer.csv' WITH (FORMAT csv, DELIMITER '|');
\copy lineitem from  '/home/dbgen/tpch-dbgen/data/lineitem.csv' WITH (FORMAT csv, DELIMITER '|');
\copy nation from  '/home/dbgen/tpch-dbgen/data/nation.csv' WITH (FORMAT csv, DELIMITER '|');
\copy orders from  '/home/dbgen/tpch-dbgen/data/orders.csv' WITH (FORMAT csv, DELIMITER '|');
\copy part from  '/home/dbgen/tpch-dbgen/data/part.csv' WITH (FORMAT csv, DELIMITER '|');
\copy partsupp from  '/home/dbgen/tpch-dbgen/data/partsupp.csv' WITH (FORMAT csv, DELIMITER '|');
\copy region from  '/home/dbgen/tpch-dbgen/data/region.csv' WITH (FORMAT csv, DELIMITER '|');
\copy supplier from  '/home/dbgen/tpch-dbgen/data/supplier.csv' WITH (FORMAT csv, DELIMITER '|');

4. Run dbtVault + Greenplum demo

1. First read the official guide:

dbtVault worked example

2. Clone repo with dbt project

Clone demo repo: https://github.com/kzzzr/dbtvault_greenplum_demo

git clone https://github.com/kzzzr/dbtvault_greenplum_demo.git

3. Configure database connection

Example profiles.yml

config:
  send_anonymous_usage_stats: False
  use_colors: True
  partial_parse: True

dbtvault_greenplum_demo:
  outputs:
    dev:
      type: postgres
      threads: 2
      host: {yc-greenplum-host}
      port: 5432
      user: greenplum
      pass: {yc-greenplum-pass}
      dbname: postgres
      schema: public
  target: dev

4. Make sure you run on dbt==0.19.0

You may use repo's Pipfile with pipenv or install dbt yourself

pipenv install
pipenv shell

dbt debug # check if OK

5. Install dependencies

Initial repo is intended to run on Snowflake only.

I have forked it and adapted to run on Greenplum/PostgreSQL. Check out what has been changed: 47e0261cea67c3284ea409c86dacdc31b1175a39

packages.yml:

packages:
  # - package: Datavault-UK/dbtvault
  #   version: 0.7.3
  - git: "https://github.com/kzzzr/dbtvault.git"
    revision: master
    warn-unpinned: false

Install package:

dbt deps

6. Adapt models to Greenplum/PostgreSQL

Check out the commit history.

  • a97a224 - adapt prepared staging layer for greenplum - Artemiy Kozyr (HEAD -> master, kzzzr/master)
  • dfc5866 - configure raw layer for greenplum - Artemiy Kozyr
  • bba7437 - configure data sources for greenplum - Artemiy Kozyr
  • aa25600 - configure package (adapted dbt_vault) for greenplum - Artemiy Kozyr
  • eafed95 - configure dbt_project.yml for greenplum - Artemiy Kozyr

7. Run models step-by-step

Load one day to Data Vault structures:

dbt run -m tag:raw
dbt run -m tag:stage
dbt run -m tag:hub
dbt run -m tag:link
dbt run -m tag:satellite
dbt run -m tag:t_link

8. Load next day

Simulate next day load by incrementing load_date varible:

# dbt_profiles.yml

vars:
  load_date: '1992-01-08' # increment by one day '1992-01-09'

About

dbtVault + Greenplum demo

License:Apache License 2.0