aws-samples / aws-analytics-reference-architecture

Currently, BatchReplayer consumes a PreparedDataset to generate data. We could provide a new construct that prepares custom data for replay while the CDK application is being provisioned.

This construct would take a source dataset as an input parameter and run a synchronous AWS Glue job that transforms the dataset so that the BatchReplayer can consume it.
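As a rough illustration of the input side of such a construct, the sketch below maps a set of properties to Glue job arguments. The interface `DatasetPreparationProps`, its fields, the function `toGlueArguments`, and the argument names are all hypothetical and not part of the library. The only thing it relies on is that Glue passes job parameters as string key/value pairs whose keys start with `--`.

```typescript
// Hypothetical input properties for a data-preparation construct.
// None of these names come from the library; they are illustrative only.
interface DatasetPreparationProps {
  sourceLocation: string; // S3 URI of the raw source dataset
  targetLocation: string; // S3 URI where the prepared dataset is written
  dateTimeColumn: string; // column used to order records for replay
}

// Translate the properties into Glue job arguments.
// Glue job parameters are string pairs whose keys are prefixed with "--".
function toGlueArguments(props: DatasetPreparationProps): Record<string, string> {
  return {
    '--source_location': props.sourceLocation,
    '--target_location': props.targetLocation,
    '--datetime_column': props.dateTimeColumn,
  };
}

// Example usage with placeholder values.
const glueArguments = toGlueArguments({
  sourceLocation: 's3://my-bucket/raw/',
  targetLocation: 's3://my-bucket/prepared/',
  dateTimeColumn: 'event_time',
});
console.log(glueArguments);
```

In a real construct these arguments would be handed to the Glue job that runs the PySpark script; the wiring of the job and the wait for its completion are left out here.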

The prerequisites for BatchReplayer are listed in the PreparedDataset construct documentation:

aws-analytics-reference-architecture/core/src/data-generator/prepared-dataset.ts

Line 47 in a000619

* A PreparedDataset has following properties:

Add some quality checks to prevent the preparation from failing.
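A minimal sketch of what such quality checks could look like, run before the Glue job is started. The `SourceDatasetDescription` shape and the three checks are assumptions chosen for illustration; the actual checks would follow the prerequisites in the PreparedDataset documentation.

```typescript
// Hypothetical description of a source dataset; field names are assumptions.
interface SourceDatasetDescription {
  location: string;       // expected to be an S3 URI
  columns: string[];      // column names found in the dataset
  dateTimeColumn: string; // column the replay would be ordered by
}

// Return a list of human-readable problems.
// An empty list means every check passed.
function validateSourceDataset(dataset: SourceDatasetDescription): string[] {
  const problems: string[] = [];
  if (!/^s3:\/\//.test(dataset.location)) {
    problems.push('location must be an S3 URI: ' + dataset.location);
  }
  if (dataset.columns.length === 0) {
    problems.push('dataset has no columns');
  }
  if (dataset.columns.indexOf(dataset.dateTimeColumn) === -1) {
    problems.push('datetime column not found: ' + dataset.dateTimeColumn);
  }
  return problems;
}

// Example usage: fail early instead of letting the preparation job fail.
const problems = validateSourceDataset({
  location: 's3://my-bucket/raw/',
  columns: ['event_time', 'customer_id', 'amount'],
  dateTimeColumn: 'event_time',
});
if (problems.length > 0) {
  throw new Error('Dataset preparation aborted: ' + problems.join('; '));
}
```

Collecting all problems in a list rather than throwing on the first one lets the user fix every issue in a single pass.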

Ensure the PySpark script is packaged into the core library.

Data generator should provide BYOD feature