Glue DataBrew L2 Construct

Question

jaidisido opened this issue 2 years ago

Description

AWS Glue DataBrew is a data preparation service that makes it easy for data analysts and data scientists to clean and normalize data to prepare it for analytics and machine learning. It offers 250+ transformations (e.g. correct invalid values, filter out anomalies, run data quality...) that can be automated and applied to data. At the moment, only L1 constructs are available for Glue DataBrew. They are as follows:

CfnProject (AWS::DataBrew::Project):
- An interactive data preparation workspace where a collection of related items (data, transformations, recipes...) are managed
CfnDataset (AWS::DataBrew::Dataset):
- Dataset simply means a set of data—rows or records that are divided into columns or fields
CfnRecipe (AWS::DataBrew::Recipe):
- A set of instructions or steps for data that you want DataBrew to act on. A recipe can contain many steps, and each step can contain many actions (e.g. filter, groupby...)
CfnJob (AWS::DataBrew::Job):
- Transforms data by running the instructions that were set up in the recipe
CfnRuleset (AWS::DataBrew::Ruleset):
- Set of rules that can be used in a profile job to validate data quality
CfnSchedule (AWS::DataBrew::Schedule):
- Schedule for one or more Glue DataBrew jobs. Can be a specific date/time or at regular intervals

One reason L2 constructs would be justified is how AWS Glue DataBrew recipes are published. Every time the user modifies a Glue DataBrew recipe, they must publish a new recipe version. At the moment, this can only be done from the AWS console, CLI or SDK. It's not possible to publish a new recipe version via IaC (i.e. CFN). One possible implementation would be to deploy a custom resource for each recipe that automatically publishes a new version whenever the recipe is modified in the CDK code. The BucketDeployment construct, for example, takes an equivalent approach.
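As a rough illustration of the custom-resource idea, the sketch below shows only the decision logic such a resource's handler might use: publish on creation, and again on update only when the recipe steps actually differ. It is a minimal sketch under stated assumptions, not an existing CDK API. `RecipeStep`, `canonicalize`, `handleRecipeEvent` and the injected `publish` callback are all hypothetical names, the step shape is simplified, the operation names are illustrative, and `publish` stands in for a call to the DataBrew PublishRecipe API so the example runs without AWS access.

```typescript
// Hypothetical, simplified recipe step; not the full AWS::DataBrew::Recipe schema.
interface RecipeStep {
  operation: string;
  parameters: { [key: string]: string };
}

// Lifecycle events a CloudFormation custom resource receives.
type RequestType = "Create" | "Update" | "Delete";

// Stand-in for a call to the DataBrew PublishRecipe API (assumption: injected by the caller).
type PublishFn = (recipeName: string) => void;

// Serialize steps with sorted parameter keys so that key order alone never looks like a change.
function canonicalize(steps: RecipeStep[]): string {
  return JSON.stringify(
    steps.map((step) => ({
      operation: step.operation,
      parameters: Object.keys(step.parameters)
        .sort()
        .map((key) => [key, step.parameters[key]]),
    }))
  );
}

// Publish on Create, and on Update only when the steps differ; do nothing on Delete.
// Returns true when a publish was requested.
function handleRecipeEvent(
  requestType: RequestType,
  recipeName: string,
  newSteps: RecipeStep[],
  oldSteps: RecipeStep[] | undefined,
  publish: PublishFn
): boolean {
  if (requestType === "Delete") {
    return false;
  }
  const changed =
    oldSteps === undefined || canonicalize(oldSteps) !== canonicalize(newSteps);
  if (requestType === "Create" || changed) {
    publish(recipeName);
    return true;
  }
  return false;
}

// Usage: record publish requests in an array instead of calling AWS.
const published: string[] = [];
const record: PublishFn = (name) => {
  published.push(name);
};

const v1: RecipeStep[] = [
  { operation: "UPPER_CASE", parameters: { sourceColumn: "name" } },
];
const v2: RecipeStep[] = [
  ...v1,
  { operation: "DELETE", parameters: { sourceColumns: "[\"notes\"]" } },
];

handleRecipeEvent("Create", "sales-cleanup", v1, undefined, record); // first version
handleRecipeEvent("Update", "sales-cleanup", v1, v1, record); // unchanged, no publish
handleRecipeEvent("Update", "sales-cleanup", v2, v1, record); // step added, publish
console.log(published.length); // 2
```

In an actual construct, this logic would presumably run in a Lambda function behind a CloudFormation custom resource, with the recipe steps passed as resource properties so that a change in the CDK code triggers an Update event.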

Roles

Role	User
Proposed by	@jaidisido
Author(s)	@alias, @alias, @alias
API Bar Raiser	@alias
Stakeholders	@alias, @alias, @alias

See RFC Process for details

Workflow

The author is responsible for progressing the RFC according to this checklist
and for applying the relevant labels to this issue so that the RFC table in
the README gets updated.

Conor · Answer 1 · Thu Nov 03 2022 06:21:04 GMT+0800 (China Standard Time)

This would be awesome. We've found managing DataBrew via CDK/Cfn unusable due to the issues you've highlighted here.
I'm currently deleting the job/recipe whenever a recipe update is needed, which is pretty painful as it needs two deployments: one to delete, one to recreate.

I've looked into creating a custom resource to handle a proper publish of the updated recipe, but we've since decided to just move to Glue Studio instead.

Hopefully this gets sorted one day!

awsmjs · Answer 2 · Fri Dec 15 2023 06:48:54 GMT+0800 (China Standard Time)

Closing this ticket. We believe the functionality is beneficial, but it does not intersect with the core framework and should be vended and maintained separately.