aws-samples / aws-analytics-reference-architecture

Home Page: https://aws.amazon.com/blogs/opensource/adding-cdk-constructs-to-the-aws-analytics-reference-architecture/

Data generator should provide BYOD feature

vgkowski opened this issue

Currently, BatchReplayer consumes a PreparedDataset to generate data. We can provide a new construct that prepares a user-provided dataset for replay while the CDK application is being provisioned.

This construct can take a source dataset as an input parameter and run a synchronous AWS Glue job that modifies the dataset and makes it consumable by the BatchReplayer.

Pre-requisites for BatchReplayer are listed in the PreparedDataset construct documentation.
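To illustrate what "synchronous" could mean here, below is a minimal sketch that starts a Glue job run and blocks until it finishes. It assumes a client that behaves like `boto3.client("glue")` (only `start_job_run` and `get_job_run` are used). The function name, the job name, and the argument keys in the usage note are hypothetical and are not part of the existing library.

```python
import time

# States after which a Glue job run no longer progresses.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_glue_job_sync(glue, job_name, arguments, poll_seconds=30, sleep=time.sleep):
    """Start a Glue job run and block until it reaches a terminal state.

    `glue` is assumed to behave like boto3.client("glue").
    Raises RuntimeError if the run ends in any state other than SUCCEEDED.
    """
    run_id = glue.start_job_run(JobName=job_name, Arguments=arguments)["JobRunId"]
    while True:
        state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            break
        # Wait before polling again to avoid hammering the API.
        sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Glue job run {run_id} ended in state {state}")
    return run_id
```

A call could look like `run_glue_job_sync(boto3.client("glue"), "prepare-dataset", {"--source_location": "s3://my-bucket/raw/"})`, where the job name, argument key, and bucket are placeholders. Failing loudly on a non-successful run would let the CDK deployment stop instead of leaving the BatchReplayer with a half-prepared dataset.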

* A PreparedDataset has the following properties:

Add some quality checks to prevent the preparation from failing.
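As one possible shape for such checks, here is a minimal sketch that inspects a small sample of source rows before the preparation job is launched. The issue does not specify which checks are needed. Both the choice of check (the datetime column is present and parseable) and the names `check_source_sample` and `datetime_column` are assumptions made for illustration.

```python
from datetime import datetime


def check_source_sample(rows, datetime_column):
    """Return a list of human-readable problems found in a sample of rows.

    `rows` is a list of dicts; `datetime_column` names the column expected to
    hold ISO 8601 datetime strings. An empty list means no problem was found.
    """
    if not rows:
        return ["sample is empty"]
    problems = []
    for index, row in enumerate(rows):
        value = row.get(datetime_column)
        if value is None:
            problems.append(f"row {index}: missing '{datetime_column}'")
            continue
        try:
            datetime.fromisoformat(value)
        except (TypeError, ValueError):
            problems.append(f"row {index}: '{value}' is not an ISO 8601 datetime")
    return problems
```

Collecting every problem instead of stopping at the first one gives the user a complete report in a single run, which matters when each retry means redeploying the CDK application.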

Ensure the PySpark script is packaged into the core library.