Confluent Cloud for Apache Flink® demos

This repository contains Flink SQL demo queries that can be used on Confluent Cloud for Apache Flink®.

Running these queries requires four topics to be set up with the Datagen source connector, producing AVRO records registered in Schema Registry:

  • Topic clickstream -> Datagen template SHOE_CLICKSTREAM
  • Topic customers -> Datagen template SHOE_CUSTOMERS
  • Topic orders -> Datagen template SHOE_ORDERS
  • Topic shoes -> Datagen template SHOES
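
Once the connectors are producing data, these topics show up as tables in your Flink workspace. As a quick sanity check (a minimal sketch, assuming your catalog and database have already been selected):

SHOW TABLES;
DESCRIBE `orders`;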

Explore your data

SELECT * FROM orders;
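
Because the underlying topic is continuously produced, this query keeps running until you stop it. To take a quick sample instead (a minimal sketch using standard Flink SQL):

SELECT * FROM orders LIMIT 10;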

Count number of unique orders per hour

SELECT window_start, window_end, COUNT(DISTINCT order_id) as nr_of_orders
  FROM TABLE(
    TUMBLE(TABLE orders, DESCRIPTOR(`$rowtime`), INTERVAL '1' HOUR))
  GROUP BY window_start, window_end;

Create table including underlying Kafka topic

CREATE TABLE sales_per_hour (
 window_start TIMESTAMP(3),
 window_end   TIMESTAMP(3),
 nr_of_orders BIGINT
);

Materialize the number of unique orders per hour into the newly created topic

INSERT INTO sales_per_hour
 SELECT window_start, window_end, COUNT(DISTINCT order_id) as nr_of_orders
    FROM TABLE(
      TUMBLE(TABLE `demo`.`webshop_team`.`orders`, DESCRIPTOR(`$rowtime`), INTERVAL '1' HOUR))
    GROUP BY window_start, window_end;
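
The INSERT INTO statement runs continuously, appending a row for every completed window. To check that results are arriving (a minimal sanity check):

SELECT * FROM sales_per_hour;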

Deduplicate table

SELECT id, name, brand
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY `$rowtime` DESC) AS row_num
  FROM `shoes`)
WHERE row_num = 1;

Enrich clickstream data by joining with deduplicated table

WITH uniqueShoes AS (
  SELECT id, name, brand
  FROM (
    SELECT *,
      ROW_NUMBER() OVER (PARTITION BY id ORDER BY `$rowtime` DESC) AS row_num
    FROM `demo`.`lkc-6wk6y2`.`shoes`)
  WHERE row_num = 1
)
SELECT 
  c.`$rowtime`,
  c.product_id,
  s.name, 
  s.brand
FROM 
  clickstream c
  INNER JOIN uniqueShoes s ON c.product_id = s.id;

NOTE: Make sure that you use the correct fully qualified names, because they are specific to your setup

Find spans where the average view time is less than 30

SELECT *
FROM clickstream
    MATCH_RECOGNIZE (
        PARTITION BY user_id
        ORDER BY `$rowtime`
        MEASURES
            FIRST(A.`$rowtime`) AS start_tstamp,
            LAST(A.`$rowtime`) AS end_tstamp,
            AVG(A.view_time) AS avgViewTime
        ONE ROW PER MATCH
        AFTER MATCH SKIP PAST LAST ROW
        PATTERN (A+ B)
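        -- B is intentionally left undefined, so it matches any row; the match
        -- ends at the first row that breaks A's running-average condition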
        DEFINE
            A AS AVG(A.view_time) < 30
    ) MR;
