
apache_beam-python

A demo project on batch data-parallel processing using Apache Beam and Python

Team Members



  • Rajeshwari Rudravaram 💻
  • Sri Sudheera Chitipolu 💻
  • Pooja Gundu 💻
  • Raju Nooka 💻
  • Sai Rohith Gorla 💻
  • Rohith Reddy Avisakula 💻

Introduction

  • Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and Runners for executing them on distributed processing backends, including Apache Flink, Apache Spark, Google Cloud Dataflow, and Hazelcast Jet.
  • Data processing can serve either analytics or ETL (Extract, Transform, and Load) purposes. Beam does not rely on any one execution engine, and it is both data agnostic and programming-language agnostic.

Working Process

Apache Beam

  • Apache Beam provides a simple, powerful programming model for building both batch and streaming parallel data processing pipelines.
  • Batch pipeline: the type of pipeline used to process data in batches.
  • Streaming data pipeline: the type of pipeline that handles millions of events at scale in real time.
  • The Apache Beam SDK (Software Development Kit) for Python provides access to Apache Beam's capabilities from the Python programming language.
  • Using the Apache Beam SDK, one can build a program that defines a pipeline.

The key abstractions of the Beam programming model are:

  • PCollection: represents a collection of data, which could be bounded or unbounded in size.
  • PTransform: represents a computation that transforms input PCollections into output PCollections.
  • Pipeline: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.
  • PipelineRunner: specifies where and how the pipeline should execute.
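To make these abstractions concrete, here is a minimal sketch (our own illustrative example, not taken from the project notebooks) that builds and runs a tiny pipeline on the default DirectRunner:

import apache_beam as beam

# The Pipeline manages the DAG of PTransforms and PCollections;
# the `with` block runs it on the default DirectRunner (the PipelineRunner).
with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Create' >> beam.Create(['beam', 'batch', 'stream'])  # a bounded PCollection
        | 'ToUpper' >> beam.Map(str.upper)                      # a PTransform
        | 'Print' >> beam.Map(print)
    )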

Google Colaboratory

  • Colab is a Python development environment that runs in the browser using Google Cloud.

Roles and Responsibilities

  • Raju Nooka - GroupByKey Transformation
  • Sri Sudheera Chitipolu - GroupBy Transformation
  • Rohith Reddy Avisakula - GroupIntoBatches
  • Pooja Gundu - Word Count of a dataset
  • Sai Rohith Gorla - BatchElements
  • Rajeshwari Rudravaram - Minimal Word Count of a dataset


Sri Sudheera Chitipolu Demo link

Demonstration on GroupBy Transformation

Prerequisites

  • Apache Beam
  • Python
  • Google Colab
  • Kaggle for the dataset

Process and Commands

  • Open a browser such as Firefox or Safari
  • Search for Google Colab
  • Click the first link, which is Google Colab
  • Sign in with a Google account
  • When the window with recent notebooks appears, open a new notebook

Note: Google Colab works similarly to a Jupyter notebook

  • After writing and executing the code, save the file locally or to GitHub
  • Run the following command to install Apache Beam
!pip install --quiet -U apache-beam
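The demonstrated program itself appears in the notebook; as a stand-in, here is a minimal GroupBy sketch (our own toy words, not the Kaggle dataset) that groups elements by their first letter:

import apache_beam as beam

# GroupBy groups the elements of a PCollection by a key function;
# the sample words below are hypothetical.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create(['apple', 'avocado', 'banana', 'cherry'])
        | beam.GroupBy(lambda word: word[0])
        | beam.Map(print)  # e.g. ('a', ['apple', 'avocado'])
    )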

Program Output

References


Rajeshwari Rudravaram Demo Link

Sub-topic : Minimal Word Count


  • Minimal word count is an implementation of word count for a given dataset.
  • I will be creating a simple data processing pipeline that reads a text file and counts the number of occurrences of every word in that dataset.
  • I have worked on "Minimal Word Count" for the file 'key_benifits.csv' from the Shopify app store dataset.
  • The dataset I have chosen is from Kaggle. Here is the link: Shopify app store
  • I have chosen Google Colaboratory to run the code.

Key Concepts for this Project:

  1. Reading data from text file
  2. Specifying 'inline' transforms
  3. Counting a PCollection
  4. Writing data to text file

Demonstration on Minimal Word Count

Prerequisites

  • Apache Beam
  • Python
  • Google Colab

Commands to check the versions on your machine

python --version
pip --version
  • Note: the Python version must be greater than 3.6.0

Command to install Apache beam

python -m pip install apache-beam

Command to install execution engines and other optional dependencies

pip install apache-beam[gcp,aws,test,docs]

Guidelines for Minimal Word Count

  • Install Apache Beam in the Colab notebook and upload the input file to the notebook

  • Import the required libraries and run the following commands

  • Command to check the list of files on your notebook

  • Command to add the result to an output file (a sketch of the full pipeline follows this list)
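Putting these guidelines together, here is a minimal word-count sketch under the key concepts listed above (the word-splitting regex and output prefix are our assumptions, not the exact notebook code):

import re
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText

# Read the file, split lines into words, count each word, write the results.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Read' >> ReadFromText('key_benifits.csv')
        | 'ExtractWords' >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
        | 'CountWords' >> beam.combiners.Count.PerElement()
        | 'Format' >> beam.MapTuple(lambda word, count: f'{word}: {count}')
        | 'Write' >> WriteToText('output.txt')
    )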

References



Pooja Gundu Demo link

Sub-topic : WordCount

  • I have worked on "Word Count" for the file 'superbowl-ads.csv' This file contains data about all advertisements shown during the Super Bowl(most watched US Program) across the years from 1967 to 2020.
  • The dataset, I have choosen is Kaggle. Here is the link superbowl-ads.csv
  • I have choosen the Google colaboratory to run the code.

Prerequisites

  • Apache beam
  • Google Colab
  • Google drive account
  • Python

Process and the commands used

  • Create a Google Colab account and open a new notebook; rename it as required.
  • First, install apache-beam using the command below
!pip install --quiet -U apache-beam

  • Also, install all the required dependencies (execution engines) using the command
!pip install apache-beam[gcp,aws,test,docs]

  • Program that performs the word count operation (see the sketch below)
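The actual program lives in the linked notebook; here is a sketch of one common way to write it, using explicit pair-and-sum transforms (the regex and file names are assumptions):

import re
import apache_beam as beam

# Pair each word with 1, then sum the 1s per word to get the counts.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.io.ReadFromText('superbowl-ads.csv')
        | beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
        | beam.Map(lambda word: (word, 1))
        | beam.CombinePerKey(sum)
        | beam.MapTuple(lambda word, count: f'{word}: {count}')
        | beam.io.WriteToText('output.txt')
    )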

  • Output of the program

  • The command that lists all the files

! ls

  • First, upload your .csv file to your Google Drive account. The email used should be the same for both the Google Drive and Google Colab accounts.

  • To import the .csv file, run the commands below; otherwise you will get a "file not found" error because the file has not been imported into Google Colab.

# Code to read csv file into colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate E-Mail ID
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
  • To get the id of the file, right-click on the file in your Google Drive account, select the share link option, then copy the link.
  • Remove the part that contains https://
  • Keep only the id part.
# Get File from Drive using file-ID
# Get the file
downloaded = drive.CreateFile({'id':'1b73yN7MjGytqSP5wimYAQmtByOvGGe8Y'}) # replace the id with id of file you want to access
downloaded.GetContentFile('superbowl-ads.csv') 
  • Command to display the contents of the output file
!cat output.txt-00000-of-00001 # output file

References



Raju Nooka Demo link

Sub-topic : GroupByKey

  • I have worked on "GroupByKey" for the file 'vgsales.csv' This dataset contains a list of video games with sales greater than 100,000 copies.
  • The dataset, I have choosen is Kaggle. Here is the link vgsales.csv
  • I have choosen the Google colaboratory to run the code.

Prerequisites

  • Python
  • Apache beam
  • Google Colab

Commands Used

  • First, install apache-beam using the below command.
!pip install --quiet -U apache-beam

  • Install the other dependencies
!pip install apache-beam[gcp,aws,test,docs]

  • Program that performs the GroupByKey operation (see the sketch below)
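The notebook has the full program; a minimal GroupByKey sketch over hypothetical (genre, game) pairs, rather than real vgsales.csv rows, looks like this:

import apache_beam as beam

# GroupByKey collects all values that share a key; the sample pairs are hypothetical.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([('Sports', 'Wii Sports'),
                       ('Racing', 'Mario Kart'),
                       ('Sports', 'FIFA 16')])
        | beam.GroupByKey()
        | beam.Map(print)  # e.g. ('Sports', ['Wii Sports', 'FIFA 16'])
    )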

  • Output of the program

  • The command that lists all the files

! ls

  • First, upload your .csv file to your Google Drive account.
  • The email used should be the same for both the Google Drive and Google Colab accounts.
  • Import the .csv file into Google Colab using the commands below.
# Code to read csv file into colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate E-Mail ID
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
  • To get the id of the file, right-click on the file in your Google Drive account, select the share link option, then copy the link.
  • Remove the part that contains https://
  • Keep only the id part.
# Get File from Drive using file-ID
# Get the file
downloaded = drive.CreateFile({'id':'1b73yN7MjGytqSP5wimYAQmtByOvGGe8Y'}) # replace the id with id of file you want to access
downloaded.GetContentFile('vgsales.csv')
  • Command to display the contents of the output file
!cat output.txt-00000-of-00001 # output file

References

Rohith Avisakula Demo link

Sub-topic : GroupIntoBatches

Prerequisites

  • Python
  • Apache beam
  • Google Colaboratory

Commands Used

  • Install apache beam using the below command.
pip install apache-beam
  • Next, install the required dependencies using the command below.
!pip install apache-beam[gcp,aws,test,docs]
  • The command that lists all the files.
! ls
  • First, sign in to the Google Drive account and Google Colab with the same credentials and upload the .csv file to the Google Drive account.
  • Import the .csv file into Google Colab.
# Code to read csv file into colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate E-Mail ID
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
# Get File from Drive using file-ID
# Get the file
downloaded = drive.CreateFile({'id':'1b73yN7MjGytqSP5wimYAQmtByOvGGe8Y'}) # replace the id with id of file you want to access
downloaded.GetContentFile('your-file.csv') # replace with the name of your file
  • Command to display the result file
! cat results.txt-00000-of-00001
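Before the screenshots, here is a minimal GroupIntoBatches sketch (hypothetical keyed data, batch size 2) showing what the transform does:

import apache_beam as beam

# GroupIntoBatches emits, for each key, batches of at most `batch_size` values.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([('spam', 1), ('spam', 2), ('spam', 3), ('eggs', 4)])
        | beam.GroupIntoBatches(2)
        | beam.Map(print)  # e.g. ('spam', [1, 2]) then ('spam', [3])
    )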

Screenshots for commands

  • For the installation of Apache Beam.

  • For installing the required dependencies and libraries.

  • Program for GroupIntoBatches.

  • For importing the file into Colaboratory.

  • For displaying the list of files.

  • For the output of the file.

References

Sai Rohith Gorla Demo link

Sub-topic : BatchElements

Prerequisites

  • Python
  • Apache beam
  • Google Colab

Commands Used

  • First, install apache-beam using the below command.
!pip install --quiet -U apache-beam
  • Install the other dependencies
!pip install apache-beam[gcp,aws,test,docs]

  • Program that performs the BatchElements operation (see the sketch below)
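The full program is in the notebook; a minimal BatchElements sketch (our own toy data and batch-size choices) looks like this:

import apache_beam as beam

# BatchElements buffers individual elements into batches whose size
# falls between min_batch_size and max_batch_size.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create(list(range(10)))
        | beam.BatchElements(min_batch_size=2, max_batch_size=4)
        | beam.Map(print)  # e.g. [0, 1, 2, 3]
    )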

  • Output of the program

  • The command that lists all the files

! ls

  • First, upload your .csv file to your Google Drive account.
  • The email used should be the same for both the Google Drive and Google Colab accounts.
  • Import the .csv file into Google Colab using the commands below.
# Code to read csv file into colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate E-Mail ID
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
  • To get the id of the file, right-click on the file in your Google Drive account, select the share link option, then copy the link.
  • Remove the part that contains https://
  • Keep only the id part.
# Get File from Drive using file-ID
# Get the file
downloaded = drive.CreateFile({'id':'1b73yN7MjGytqSP5wimYAQmtByOvGGe8Y'}) # replace the id with id of file you want to access
downloaded.GetContentFile('superbowl-ads.csv') 
  • Command to display the contents of the output file
!cat output.txt-00000-of-00001 # output file

References
