rafaelportugal / dataflow-dynamic-schema

An example for dynamically mutating BigQuery schemas within a streaming Beam pipeline.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Dataflow Dynamic Schema

Commonly when consuming data from 3rd parties, you may find yourself subject to having to react to upstream schema changes. A rudimentary way to handle this scenario may be to coordinate releases with the partner or sink records which have mutated schema into a dead-letter table. However, it is also possible to handle this scenario dynamically in real-time within the pipeline itself. This repository contains an example pipeline which, on failed inserts to BigQuery, will mutate the output table with any additive changes on the incoming record. This frees you up from the operational overhead of having to manage these changes and keeps your schema up-to-date in the face of constant change.

This pipeline is largely based on the How to handle mutating JSON schemas in a streaming pipeline, with Square Enix blog post.

alt text

Pipeline

DynamicSchemaPipeline - A pipeline which consumes Avro messages from Pub/Sub and outputs records to BigQuery. If the Avro contains additional fields, the output BigQuery table will be mutated to automatically add the changes.

Getting Started

Requirements

  • Java 8
  • Maven 3

Building the Project

Build the entire project using the maven compile command.

mvn clean && mvn compile

About

An example for dynamically mutating BigQuery schemas within a streaming Beam pipeline.

License:Apache License 2.0


Languages

Language:Java 100.0%