data-meetup-beam

Example of Apache Beam pipeline that was presented in Arquivei's Data Meetup

How does it work?

Configuration

The script files inside /code/pipeline/scripts contain all the variables used to run the pipeline locally or on Dataflow; a sketch of how these variables might be consumed follows the list below.

  • Local run variables - run_direct.sh - replace GCS_BUCKET with your bucket:
CONFIG_APACHEBEAM_RUNNER=direct
CONFIG_APACHEBEAM_TEMPLOCATION=gs://GCS_BUCKET/templocation
CONFIG_APACHEBEAM_STAGINGLOCATION=gs://GCS_BUCKET/staginglocation
  • Dataflow run variables - run_dataflow.sh - replace GCS_BUCKET and GCP_PROJECT with your bucket and project:
CONFIG_APACHEBEAM_RUNNER=dataflow
CONFIG_APACHEBEAM_JOBNAME=meetup-pipeline
CONFIG_APACHEBEAM_REGION=us-central1
CONFIG_APACHEBEAM_TEMPLOCATION=gs://GCS_BUCKET/templocation
CONFIG_APACHEBEAM_STAGINGLOCATION=gs://GCS_BUCKET/staginglocation
CONFIG_APACHEBEAM_NUMWORKERS=1
CONFIG_APACHEBEAM_MAXNUMWORKERS=1
CONFIG_APACHEBEAM_WORKERMACHINETYPE=n1-standard-2
CONFIG_APACHEBEAM_EXPERIMENTS=shuffle_mode
CONFIG_APACHEBEAM_PROJECT=GCP_PROJECT
CONFIG_APACHEBEAM_NETWORK=main
CONFIG_APACHEBEAM_SUBNETWORK=regions/us-central1/subnetworks/${CONFIG_APACHEBEAM_PROJECT}-main
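
As a rough illustration, here is a minimal sketch of how the CONFIG_APACHEBEAM_* variables could be mapped onto Beam pipeline options from Scala. The object name, the selection of variables, and the exact flag mapping are assumptions for illustration, not the repository's actual code; it also assumes the Dataflow runner dependency is on the classpath so that options such as stagingLocation are registered.

// Hypothetical sketch (not the repository's actual Main.scala): translate the
// CONFIG_APACHEBEAM_* environment variables into Beam pipeline options.
import org.apache.beam.sdk.options.{PipelineOptions, PipelineOptionsFactory}

object OptionsFromEnv {
  def build(): PipelineOptions = {
    // Turn each exported variable into the equivalent --flag, skipping the
    // ones that are not set (e.g. the Dataflow-only flags on a local run).
    val args = Seq(
      // "direct" -> DirectRunner, "dataflow" -> DataflowRunner
      sys.env.get("CONFIG_APACHEBEAM_RUNNER").map(r => s"--runner=${r.capitalize}Runner"),
      sys.env.get("CONFIG_APACHEBEAM_PROJECT").map(v => s"--project=$v"),
      sys.env.get("CONFIG_APACHEBEAM_REGION").map(v => s"--region=$v"),
      sys.env.get("CONFIG_APACHEBEAM_TEMPLOCATION").map(v => s"--tempLocation=$v"),
      sys.env.get("CONFIG_APACHEBEAM_STAGINGLOCATION").map(v => s"--stagingLocation=$v")
    ).flatten
    PipelineOptionsFactory.fromArgs(args: _*).withValidation().create()
  }
}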

Inside the Main.scala file in the /code/pipeline/src/main/scala/com/arquivei/pipeline folder there are three pipeline options; you need to choose and uncomment just ONE:

// pipeline from GCS to GCS
// val pipeline = new PipelineFromGCSToGCS(config)

// pipeline from BigQuery to GCS
// val pipeline = new PipelineFromBqToGcs(config)

// pipeline from BigQuery to Kafka
// val pipeline = new PipelineFromBqToKafka(config)
  • PipelineFromGCSToGCS is responsible for reading the Wikipedia titles file from a GCS path, transforming the data by applying a subject filter, and saving the output to another GCS location (see the sketch after this list).

  • PipelineFromBqToGcs is responsible for reading the Wikipedia titles data from a BigQuery table, transforming the data by applying a subject filter, and saving the output to a GCS bucket.

  • PipelineFromBqToKafka is responsible for reading the Wikipedia titles data from a BigQuery table, transforming the data by applying a subject filter, and sending the output to a Kafka topic.
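
For orientation, the following is a minimal, self-contained sketch of what the GCS-to-GCS variant with a subject filter could look like when using the Beam Java SDK from Scala. The object name, the bucket paths, and the "data" keyword are illustrative assumptions, not the repository's actual PipelineFromGCSToGCS.

// Hypothetical sketch, not the repository's actual PipelineFromGCSToGCS.
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.{Filter, SerializableFunction}

object SubjectFilterSketch {
  def main(args: Array[String]): Unit = {
    val options = PipelineOptionsFactory.fromArgs(args: _*).withValidation().create()
    val pipeline = Pipeline.create(options)

    pipeline
      // Read one Wikipedia title per line from the input bucket.
      .apply("ReadTitles", TextIO.read().from("gs://GCS_BUCKET/input/wikipedia_titles.txt"))
      // Keep only the titles that match the subject of interest ("data" is illustrative).
      .apply("FilterBySubject", Filter.by(new SerializableFunction[String, java.lang.Boolean] {
        override def apply(title: String): java.lang.Boolean =
          title.toLowerCase.contains("data")
      }))
      // Write the filtered titles back to GCS.
      .apply("WriteFiltered", TextIO.write().to("gs://GCS_BUCKET/output/filtered"))

    pipeline.run().waitUntilFinish()
  }
}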

The example file with Wikipedia titles can be found at /code/pipeline/example/wikipedia_titles.txt and copied to your GCS bucket with the following command, run from the repository root:

gsutil cp code/pipeline/example/wikipedia_titles.txt gs://GCS_BUCKET/input/

Running

  • Local run - run_direct.sh:
sh code/pipeline/scripts/run_direct.sh 
  • Dataflow run - run_dataflow.sh:
sh code/pipeline/scripts/run_dataflow.sh

Setting up a local environment with Scala + Gradle + IntelliJ on Ubuntu

  • IntelliJ install
sudo snap install intellij-idea-community --classic
  • Scala install with sdkman
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install scala 2.13.7
  • Gradle install
sudo apt install openjdk-11-jdk
wget https://services.gradle.org/distributions/gradle-7.0.2-bin.zip -P /tmp
sudo unzip -d /opt/gradle /tmp/gradle-7.0.2-bin.zip
sudo ln -s /opt/gradle/gradle-7.0.2 /opt/gradle/latest
  • Set the local environment to enable Gradle
  1. run sudo gedit /etc/profile.d/gradle.sh
  2. paste the following commands into the opened file and save:
export GRADLE_HOME=/opt/gradle/latest
export PATH=${GRADLE_HOME}/bin:${PATH}
  3. run sudo chmod +x /etc/profile.d/gradle.sh && source /etc/profile.d/gradle.sh

About

License: BSD 3-Clause "New" or "Revised" License


Languages

Scala 89.8%, Shell 10.2%