beyond1920 / flink-faker

A data generator source connector for Flink SQL based on java-faker.

flink-faker

flink-faker is an Apache Flink table source that generates fake data based on the Java Faker expression provided for each column.

Check out this demo web application for some example Java Faker expressions.

This project is inspired by voluble.

Package

mvn clean package

Usage

As ScanTableSource

CREATE TEMPORARY TABLE heros (
  `name` STRING,
  `power` STRING, 
  `age` INT
) WITH (
  'connector' = 'faker', 
  'fields.name.expression' = '#{superhero.name}',
  'fields.power.expression' = '#{superhero.power}',
  'fields.power.null-rate' = '0.05',
  'fields.age.expression' = '#{number.numberBetween ''0'',''1000''}'
);

SELECT * FROM heros;

As LookupTableSource

CREATE TEMPORARY TABLE location_updates (
  `character_id` INT,
  `location` STRING,
  `proctime` AS PROCTIME()
)
WITH (
  'connector' = 'faker', 
  'fields.character_id.expression' = '#{number.numberBetween ''0'',''100''}',
  'fields.location.expression' = '#{harry_potter.location}'
);

CREATE TEMPORARY TABLE characters (
  `character_id` INT,
  `name` STRING
)
WITH (
  'connector' = 'faker', 
  'fields.character_id.expression' = '#{number.numberBetween ''0'',''100''}',
  'fields.name.expression' = '#{harry_potter.characters}'
);

SELECT 
  c.character_id,
  l.location,
  c.name
FROM location_updates AS l
JOIN characters FOR SYSTEM_TIME AS OF proctime AS c
ON l.character_id = c.character_id;

Currently, the faker source supports the following data types:

  • CHAR
  • VARCHAR
  • STRING
  • TINYINT
  • SMALLINT
  • INTEGER
  • BIGINT
  • FLOAT
  • DOUBLE
  • DECIMAL
  • BOOLEAN
  • TIMESTAMP
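
As a minimal sketch, a table mixing several of these types could look like the following. The table and column names are hypothetical, and the sketch assumes the connector converts the string returned by each expression into the declared column type. number.randomDouble is a Java Faker method that takes the maximum number of decimals, the minimum, and the maximum.

```sql
-- Hypothetical table illustrating a few of the supported types.
CREATE TEMPORARY TABLE typed_example (
  `id` BIGINT,
  `price` DOUBLE,
  `active` BOOLEAN
) WITH (
  'connector' = 'faker',
  'fields.id.expression' = '#{number.numberBetween ''0'',''1000000''}',
  -- at most 2 decimals, between 1 and 150
  'fields.price.expression' = '#{number.randomDouble ''2'',''1'',''150''}',
  -- assumption: the generated string 'true' or 'false' is parsed as a BOOLEAN
  'fields.active.expression' = '#{regexify ''(true|false){1}''}'
);

SELECT * FROM typed_example;
```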

Connector Options

Connector Option | Default | Description
number-of-rows | None | The number of rows to produce. If this option is set, the source is bounded; otherwise it is unbounded and runs indefinitely.
rows-per-second | 10000 | The maximum rate at which the source produces records.
fields.<field>.expression | None | The Java Faker expression to generate the values for this field.
fields.<field>.null-rate | 0.0 | Fraction of rows for which this field is null.
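
As a minimal sketch of how these options combine, the following table is both bounded and throttled. The table name bounded_heros and the chosen values are illustrative assumptions; only the option keys come from the options listed above.

```sql
-- Hypothetical table: emits 100 rows in total, at no more than 10 rows per second.
CREATE TEMPORARY TABLE bounded_heros (
  `name` STRING
) WITH (
  'connector' = 'faker',
  'number-of-rows' = '100',   -- bounded: the source finishes after 100 rows
  'rows-per-second' = '10',   -- upper limit on the emission rate
  'fields.name.expression' = '#{superhero.name}'
);

SELECT * FROM bounded_heros;
```

With these settings the query should finish after roughly ten seconds. Removing number-of-rows would make the source unbounded, so the query would run until cancelled.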

On Timestamps

For columns of type TIMESTAMP, the corresponding Java Faker expression needs to return a timestamp formatted as EEE MMM dd HH:mm:ss zzz yyyy. Typically, you would use one of the following expressions:

CREATE TEMPORARY TABLE timestamp_example (
  `timestamp1` TIMESTAMP(3),
  `timestamp2` TIMESTAMP(3)
)
WITH (
  'connector' = 'faker', 
  'fields.timestamp1.expression' = '#{date.past ''15'',''SECONDS''}',
  'fields.timestamp2.expression' = '#{date.past ''15'',''5'',''SECONDS''}'
);

SELECT * FROM timestamp_example;

For timestamp1, Java Faker generates a random timestamp that lies at most 15 seconds in the past. For timestamp2, it generates a random timestamp that lies at least 5 and at most 15 seconds in the past.

"One Of" Columns

The Java Faker expression to pick a random value from a list of options is not straightforward to get right. Actually, I did not manage to get Options.option to work at all. As a workaround, I recommend using regexify for this use case.

CREATE TEMPORARY TABLE orders (
  `order_id` INT,
  `order_status` STRING
)
WITH (
  'connector' = 'faker', 
  'fields.order_id.expression' = '#{number.numberBetween ''0'',''100''}',
  'fields.order_status.expression' = '#{regexify ''(RECEIVED|SHIPPED|CANCELLED){1}''}'
);

SELECT * FROM orders;

License

Copyright © 2020 Konstantin Knauf

Distributed under Apache License, Version 2.0.
