aws / aws-cdk-rfcs

RFCs for the AWS CDK

Glue DataBrew L2 Construct

jaidisido opened this issue

Description

AWS Glue DataBrew is a data preparation service that makes it easy for data analysts and data scientists to clean and normalize data to prepare it for analytics and machine learning. It provides 250+ transformations (e.g. correct invalid values, filter out anomalies, run data quality...) that can be automated and applied to data. At the moment, the CDK supports only L1 constructs for Glue DataBrew. They are as follows:

  • CfnProject (AWS::DataBrew::Project):
    • An interactive data preparation workspace where a collection of related items (data, transformations, recipes...) are managed
  • CfnDataset (AWS::DataBrew::Dataset):
    • A set of data: rows or records that are divided into columns or fields
  • CfnRecipe (AWS::DataBrew::Recipe):
    • A set of instructions or steps for data that you want DataBrew to act on. A recipe can contain many steps, and each step can contain many actions (e.g. filter, groupby...)
  • CfnJob (AWS::DataBrew::Job):
    • Transforms data by running the instructions that were set up in the recipe
  • CfnRuleset (AWS::DataBrew::Ruleset):
    • A set of rules that can be used in a profile job to validate data quality
  • CfnSchedule (AWS::DataBrew::Schedule):
    • A schedule for one or more Glue DataBrew jobs. It can be a specific date/time or a regular interval

One reason L2 constructs would be justified is the way AWS Glue DataBrew recipes are published. Every time the user modifies a Glue DataBrew recipe, they must publish a new recipe version. At the moment, this can only be done from the AWS console, CLI or SDK. It is not possible to publish a new recipe version via IaC (i.e. CFN). One possible implementation would be to deploy a custom resource for each recipe that automatically publishes a new version whenever the recipe is modified in the CDK code. The BucketDeployment construct, for example, already uses an equivalent custom-resource approach.
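As a rough illustration of the custom resource idea above, here is a minimal sketch of what the handler behind such a resource could look like. It is not an existing CDK API. It assumes the handler receives a standard CloudFormation custom resource event (`RequestType`, `ResourceProperties`), that the construct passes two hypothetical properties (`RecipeName` and `StepsFingerprint`), and that the return value follows the shape used by the CDK custom resource provider framework. The `databrew` argument is expected to behave like `boto3.client("databrew")`, whose `publish_recipe` operation is the only AWS call used.

```python
import hashlib
import json


def recipe_fingerprint(steps):
    """Return a stable SHA-256 hash of the recipe steps.

    If this value is passed as a custom resource property, CloudFormation
    sends an Update event only when the steps actually change.
    """
    canonical = json.dumps(steps, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def on_event(event, databrew):
    """Handle a custom resource lifecycle event.

    `databrew` is any object exposing publish_recipe(Name=..., Description=...),
    for example boto3.client("databrew").
    """
    props = event["ResourceProperties"]
    recipe_name = props["RecipeName"]        # hypothetical property name
    fingerprint = props["StepsFingerprint"]  # hypothetical property name

    if event["RequestType"] in ("Create", "Update"):
        # Publish the current working version of the recipe as a new version.
        databrew.publish_recipe(
            Name=recipe_name,
            Description=f"Published from CDK, steps sha256={fingerprint[:12]}",
        )
    # Delete: nothing to do in this sketch.

    # A stable physical ID keeps an Update from being treated as a replacement.
    return {"PhysicalResourceId": f"{recipe_name}-publisher"}
```

The fingerprint is the piece that ties publishing to code changes: the resource properties stay identical across deployments unless the steps differ, so unchanged recipes are not republished.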

Roles

Role            User
Proposed by     @jaidisido
Author(s)       @alias, @alias, @alias
API Bar Raiser  @alias
Stakeholders    @alias, @alias, @alias

See RFC Process for details

Workflow

  • Tracking issue created (label: status/proposed)
  • API bar raiser assigned (ping us at #aws-cdk-rfcs if needed)
  • Kick off meeting
  • RFC pull request submitted (label: status/review)
  • Community reach out (via Slack and/or Twitter)
  • API signed-off (label api-approved applied to pull request)
  • Final comments period (label: status/final-comments-period)
  • Approved and merged (label: status/approved)
  • Execution plan submitted (label: status/planning)
  • Plan approved and merged (label: status/implementing)
  • Implementation complete (label: status/done)

The author is responsible for progressing the RFC according to this checklist
and for applying the relevant labels to this issue so that the RFC table in
the README gets updated.

commented

This would be awesome. We've found managing DataBrew via CDK/CFN unusable due to the issues you've highlighted here.
I'm currently deleting the job/recipe whenever a recipe update is needed, which is pretty painful as it needs two deployments: one to delete, one to recreate.

I've looked into creating a custom resource to handle a proper publish of the updated recipe, but we've since decided to just move to Glue Studio instead.

Hopefully this gets sorted one day!

Closing this ticket. We believe the functionality is beneficial, but it does not intersect with the core framework and should be vended and maintained separately.