alpinegizmo / flink-sql-demo

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Scenario and Data

  • What do we show in this demo

    • Flink SQL processing data from different storage systems
    • Flink SQL using Hive Metastore as an external, persistent catalog
    • Batch/Stream unification of queries in action
    • Different ways to join dynamic data
    • Creating Tables with DDL
    • Maintaining materialize views with continuous SQL queries in Kafka and MySQL
  • Scenario is an online store receiving orders.

  • The order system consists of six tables (derived from the well-known TPC-H benchmark)

    • PROD_ORDERS one row for each order
    • PROD_LINEITEM: individual items of an order
    • PROD_CUSTOMER: customer data
    • PROD_NATION: denormalized nation data
    • PROD_REGION: denormalized region data
    • PROD_RATES: exchange rates for currencies
    • PROD_RATES_HISTORY: history of all currency exchange rates

Depending on their update characteristic (frequency, insert-only) tables are stored in different systems:

  • Kafka: PROD_ORDERS, PROD_LINEITEM, PROD_RATES_HISTORY
  • MySQL: PROD_CUSTOMER, PROD_NATION, PROD_REGION, PROD_RATES

Get the Data

Please download the demo data from Google Drive and extract the zip archive into the ./data folder (as ./data/*.tbl).

https://drive.google.com/file/d/15LWUBGZenWaW3R_WNxqvcTQOuA_Kgtjv

The folder will be mounted by the Docker containers.

Build the Docker Images

docker-compose build

Start the Docker Environment

docker-compose up -d

Show the Demo Environment

The demo environment consists of the following components.

Kafka + Data Provider (+ Zookeeper)

  • Show Kafka data streams
docker-compose exec kafka kafka-console-consumer.sh --bootstrap-server kafka:9092 --from-beginning --topic orders
docker-compose exec kafka kafka-console-consumer.sh --bootstrap-server kafka:9092 --from-beginning --topic lineitem
docker-compose exec kafka kafka-console-consumer.sh --bootstrap-server kafka:9092 --from-beginning --topic rates

MySQL

  • Show data in MySQL
docker-compose exec mysql mysql -Dsql-demo -usql-demo -pdemo-sql

SHOW TABLES;
DESCRIBE PROD_CUSTOMER;
SELECT * FROM PROD_CUSTOMER LIMIT 10;
quit;

Grafana

Minio (S3-compatible Storage)

Flink JobManager and TaskManager

SQL client + Hive Metastore

  • SQL client provides SQL environment and submits queries to Flink for execution
  • Hive Metastore presistently stores catalog data

The Demo

Start the SQL Client

docker-compose exec sql-client ./sql-client.sh

Show Tables

All dynamic (Kafka-backed) tables are in the default catalog:

SHOW TABLES;
DESCRIBE prod_orders;
SELECT * FROM prod_orders;

All static (MySQL-backed) tables are in the Hive catalog:

USE CATALOG hive;
SHOW TABLES;
DESCRIBE prod_nation;
SELECT * FROM prod_nation;

Running Queries on Static and Dynamic Tables

  • Flink SQL unifies batch and stream processing.
  • Many queries can be executed on static and dynamic table.

Take a Snapshot of a Dynamic Table

  • Create a table backed by a file in S3.
USE CATALOG hive;
CREATE TABLE dev_orders (
  o_orderkey      INTEGER,
  o_custkey       INTEGER,
  o_orderstatus   STRING,
  o_totalprice    DOUBLE,
  o_currency      STRING,
  o_ordertime     TIMESTAMP(3),
  o_orderpriority STRING,
  o_clerk         STRING, 
  o_shippriority  INTEGER,
  o_comment       STRING,
  WATERMARK FOR o_ordertime AS o_ordertime - INTERVAL '5' MINUTE
) WITH (
  'connector.type' = 'filesystem',
  'connector.path' = 's3://sql-demo/orders.tbl',
  'format.type' = 'csv',
  'format.field-delimiter' = '|'
);
  • Write a sample of a dynamic streaming table to the static table.
INSERT INTO dev_orders SELECT * FROM default_catalog.default_database.prod_orders;
SELECT * FROM dev_orders;
SELECT COUNT(*) AS rowCnt FROM dev_orders;

Run a Query on the Snapshotted Data

  • Using the batch engine for better efficiency
SET execution.type=batch;
  • Compute revenue per currency and minute
SELECT
  CEIL(o_ordertime TO MINUTE) AS `minute`,
  o_currency AS `currency`,
  SUM(o_totalprice) AS `revenue`,
  COUNT(*) AS `orderCnt`
FROM dev_orders
GROUP BY
  o_currency,
  CEIL(o_ordertime TO MINUTE);
  • Run the same query as a continuous streaming query
SET execution.type=streaming;
  • Show why we should streamify the query
SET execution.result-mode=changelog;
  • We can streamify the query a bit with a TUMBLE window
SELECT
  TUMBLE_END(o_ordertime, INTERVAL '1' MINUTE) AS `minute`,
  o_currency AS `currency`,
  SUM(o_totalprice) AS `revenue`,
  COUNT(*) AS `orderCnt`
FROM dev_orders
GROUP BY
  o_currency,
  TUMBLE(o_ordertime, INTERVAL '1' MINUTE);
  • We can execute this query also with the batch engine
  • Reset to previous settings
SET execution.result-mode=table;
SET execution.type=batch;

Move query to streaming data

  • We can run the same query on the table backed by a Kafka topic
SET execution.type=streaming;
  • We only change the table in the FROM clause
SELECT
  TUMBLE_END(o_ordertime, INTERVAL '1' MINUTE) AS `minute`,
  o_currency AS `currency`,
  SUM(o_totalprice) AS `revenue`,
  COUNT(*) AS `orderCnt`
FROM default_catalog.default_database.prod_orders
GROUP BY
  o_currency,
  TUMBLE(o_ordertime, INTERVAL '1' MINUTE);

Five Ways to Join Tables

  • Dynamic tables can be joined just like static tables.
    • However, joins can be inefficient, if you ignore the dynamic property of tables.
    • There are common patterns to join that can be efficiently executed with low resource consumption.

Regular Join on Static Tables

SET execution.type=batch;
USE CATALOG hive;
  • show customers and their orders by region and priority for a specific day
  • query joins the snapshot of the orders table
SELECT
  r_name AS `region`,
  o_orderpriority AS `priority`,
  COUNT(DISTINCT c_custkey) AS `number_of_customers`,
  COUNT(o_orderkey) AS `number_of_orders`
FROM dev_orders
JOIN prod_customer ON o_custkey = c_custkey
JOIN prod_nation ON c_nationkey = n_nationkey
JOIN prod_region ON n_regionkey = r_regionkey
WHERE
  FLOOR(o_ordertime TO DAY) = TIMESTAMP '2020-04-01 0:00:00.000'
  AND NOT o_orderpriority = '4-NOT SPECIFIED'
GROUP BY r_name, o_orderpriority
ORDER BY r_name, o_orderpriority;

Regular Join on Dynamic Tables

  • We can run the same query on the prod_orders table
  • Change dev_orders to prod_orders
  • Remove ORDER BY clause
SET execution.type=streaming;
USE CATALOG default_catalog;
SELECT
  r_name AS `region`,
  o_orderpriority AS `priority`,
  COUNT(DISTINCT c_custkey) AS `number_of_customers`,
  COUNT(o_orderkey) AS `number_of_orders`
FROM prod_orders
JOIN prod_customer ON o_custkey = c_custkey
JOIN prod_nation ON c_nationkey = n_nationkey
JOIN prod_region ON n_regionkey = r_regionkey
WHERE
  FLOOR(o_ordertime TO DAY) = TIMESTAMP '2020-04-01 0:00:00.000'
  AND NOT o_orderpriority = '4-NOT SPECIFIED'
GROUP BY r_name, o_orderpriority;
  • Notable properties of this join
    • The batch tables are read just once. Updates are not considered!
    • All input tables are completely materialized in state.

Interval Join

  • A common requirement is to join events of two (or more) dynamic tables that are related with each other in a temporal context, for example events that happened around the same time.
  • Flink SQL features special optimizations for such joins.
  • First switch to the default catalog (which contains all dynamic tables)
USE CATALOG default_catalog;
  • Join prod_lineitem and prod_orders to compute open lineitems with urgent order priority.
SELECT
  o_ordertime AS `ordertime`,
  o_orderkey AS `order`,
  l_linenumber AS `linenumber`,
  l_partkey AS `part`,
  l_suppkey AS `supplier`,
  l_quantity AS `quantity`
FROM prod_lineitem
JOIN prod_orders ON o_orderkey = l_orderkey
WHERE
  l_ordertime BETWEEN o_ordertime - INTERVAL '5' MINUTE AND o_ordertime AND
  l_linestatus = 'O' AND
  o_orderpriority = '1-URGENT';
  • Notable properties of the join
    • Event-time semantics (processing time also supported)
    • Small state requirements. Only the 5 minute tails of the streams need to be held in state.

Enrichment Join with Lookup Table in MySQL

  • Enriching an insert-only, dynamic table with data from a static or slowly changing table

  • Example query: enrich prod_lineitem table with current exchange rate from prod_rates table to compute EUR-normalize amounts.

USE CATALOG default_catalog;
SELECT
  l_proctime AS `querytime`,
  l_orderkey AS `order`,
  l_linenumber AS `linenumber`,
  l_currency AS `currency`,
  rs_rate AS `cur_rate`, 
  (l_extendedprice * (1 - l_discount) * (1 + l_tax)) / rs_rate AS `open_in_euro`
FROM prod_lineitem
JOIN hive.`default`.prod_rates FOR SYSTEM_TIME AS OF l_proctime ON rs_symbol = l_currency
WHERE
  l_linestatus = 'O';
  • Check PROD_RATES table in MySQL and update a rate
docker-compose exec mysql mysql -Dsql-demo -usql-demo -pdemo-sql
SELECT * FROM PROD_RATES;
UPDATE PROD_RATES SET RS_TIMESTAMP = '2020-04-01 01:00:00.000', RS_RATE = 1.234 WHERE RS_SYMBOL='USD';
  • Notable properties of the join
    • Processing time semantics: rates are looked up from MySQL table as of the processing time of a row
    • MySQL table can be updated while the job is running

Enrichment Join against Temporal Table

  • Same use case as before.
  • Instead of looking up rates from MySQL, we ingest updates from another Kafka table.
  • Kafka table is accessed via a TemporalTableFunction prod_rates_temporal which looks up the most-recent exchange rate for a currency.
USE CATALOG default_catalog;
SELECT
  l_ordertime AS `ordertime`,
  l_orderkey AS `order`,
  l_linenumber AS `linenumber`,
  l_currency AS `currency`,
  rs_rate AS `cur_rate`, 
  (l_extendedprice * (1 - l_discount) * (1 + l_tax)) / rs_rate AS `open_in_euro`
FROM
  prod_lineitem,
  LATERAL TABLE(prod_rates_temporal(l_ordertime))
WHERE rs_symbol = l_currency AND
  l_linestatus = 'O';
  • Notable properties of the join
    • Event time semantics: rates are looked up from temporal table (backed by a Kafka topic) as of the order time of the item.
    • Rates change as records are produced into Kafka topic

Matching Patterns on Dynamic Tables

  • Matching patterns on new data with low latency is a common use case for stream processing

  • SQL:2016 introduced the MATCH_RECOGNIZE clause to match patterns on tables

  • The combination of MATCH_RECOGNIZE and dynamic tables is very powerful

  • Flink supports MATCH_RECOGNIZE since several releases

  • Find customers that have changed their delivery behavior.

    • Search for a pattern where the last x lineitems had regular shippings
    • But now the customer whats to pay on delivery (collect on delivery = CoD)
USE CATALOG default_catalog;
CREATE VIEW lineitem_with_customer AS
SELECT * FROM (
  SELECT
    o_custkey AS `custkey`,
    l_orderkey AS `orderkey`,
    l_linenumber AS `linenumber`,
    l_ordertime AS `ordertime`,
    l_shipinstruct AS `shipinstruct`
  FROM prod_lineitem
  JOIN prod_orders ON o_orderkey = l_orderkey
  WHERE
    l_ordertime BETWEEN o_ordertime - INTERVAL '5' MINUTE AND o_ordertime
);
SELECT *
FROM lineitem_with_customer
MATCH_RECOGNIZE(
  PARTITION BY custkey
  ORDER BY ordertime
  MEASURES
    COUNT(OTHER.orderkey) AS `last regular shipings`,
    LAST(OTHER.orderkey) AS `last regular order`,
    LAST(OTHER.linenumber) AS `last regular linenumber`,
    COD.orderkey AS `cod order`,
    COD.linenumber AS `cod linenumber`
  PATTERN (OTHER{5,} COD)
  DEFINE
    OTHER AS NOT shipinstruct = 'COLLECT COD',
    COD AS shipinstruct = 'COLLECT COD'
);

Maintaining Materialized Views

  • Continuous queries on dynamic tables update their result with low latency.
  • Results can be maintained in external systems similar to materialized views.

Writing Results to Kafka

  • Let's write some data back to Kafka
  • We use the window aggregation query that we've discussed before
USE CATALOG default_catalog;
SELECT
  TUMBLE_END(o_ordertime, INTERVAL '1' MINUTE) AS `minute`,
  o_currency AS `currency`,
  SUM(o_totalprice) AS `revenue`,
  COUNT(*) AS `orderCnt`
FROM default_catalog.default_database.prod_orders
GROUP BY
  o_currency,
  TUMBLE(o_ordertime, INTERVAL '1' MINUTE);
  • Create a Kafka-backed table minute_stats with JSON encoding
CREATE TABLE minute_stats (
  `minute` TIMESTAMP(3),
  `currency` STRING,
  `revenueSum` DOUBLE,
  `orderCnt` BIGINT,
  WATERMARK FOR `minute` AS `minute` - INTERVAL '10' SECOND
) WITH (
  'connector.type' = 'kafka',       
  'connector.version' = 'universal',
  'connector.topic' = 'minute_stats',
  'connector.properties.zookeeper.connect' = 'not-needed',
  'connector.properties.bootstrap.servers' = 'kafka:9092',
  'connector.startup-mode' = 'earliest-offset',
  'format.type' = 'json'
);
  • Automatic topic creation is enabled in Kafka

  • Write query result into table

INSERT INTO minute_stats
SELECT
  TUMBLE_END(o_ordertime, INTERVAL '1' MINUTE) AS `minute`,
  o_currency AS `currency`,
  SUM(o_totalprice) AS `revenue`,
  COUNT(*) AS `orderCnt`
FROM default_catalog.default_database.prod_orders
GROUP BY
  o_currency,
  TUMBLE(o_ordertime, INTERVAL '1' MINUTE);
  • Check results in Kafka
docker-compose exec kafka kafka-console-consumer.sh --bootstrap-server kafka:9092 --from-beginning --topic minute_stats

Writing Results to MySQL

  • We would like to store and mainain the result of the following query in MySQL.
USE CATALOG hive;
SELECT
  r_name AS `region`,
  HOUR(o_ordertime) AS `hour_of_day`,
  COUNT(DISTINCT c_custkey) AS `number_of_customers`,
  COUNT(o_orderkey) AS `number_of_orders`
FROM (
  SELECT CAST(o_ordertime AS TIMESTAMP) o_ordertime, o_orderkey, o_custkey
  FROM default_catalog.default_database.prod_orders)
JOIN prod_customer ON o_custkey = c_custkey
JOIN prod_nation ON c_nationkey = n_nationkey
JOIN prod_region ON n_regionkey = r_regionkey
GROUP BY 
  r_name, 
  HOUR(o_ordertime);
  • Create a table in Flink
CREATE TABLE region_stats (
  region               STRING,
  hour_of_day          BIGINT,
  number_of_customers  BIGINT,
  number_of_orders     BIGINT
) WITH (
  'connector.type' = 'jdbc',
  'connector.url' = 'jdbc:mysql://mysql:3306/sql-demo',
  'connector.table' = 'REGION_STATS',
  'connector.driver' = 'com.mysql.jdbc.Driver',
  'connector.username' = 'sql-demo',
  'connector.password' = 'demo-sql',
  'connector.write.flush.interval' = '2s',
  'connector.write.flush.max-rows' = '1000'
);
  • The table is configured to emit updates every second (or every 1000 updates).

  • Write query result to MySQL

INSERT INTO region_stats
SELECT
  r_name AS `region`,
  HOUR(o_ordertime) AS `hour_of_day`,
  COUNT(DISTINCT c_custkey) AS `number_of_customers`,
  COUNT(o_orderkey) AS `number_of_orders`
FROM (
  SELECT CAST(o_ordertime AS TIMESTAMP) o_ordertime, o_orderkey, o_custkey
  FROM default_catalog.default_database.prod_orders)
JOIN prod_customer ON o_custkey = c_custkey
JOIN prod_nation ON c_nationkey = n_nationkey
JOIN prod_region ON n_regionkey = r_regionkey
GROUP BY 
  r_name, 
  HOUR(o_ordertime);
  • Monitor query result in Grafana: http://localhost:3000
  • Go to dashboard region stats and set refresh rate to 1s

About


Languages

Language:Java 62.0%Language:Dockerfile 32.1%Language:Shell 5.9%