apache-beam-python-GroupByKey

This repo demonstrates the GroupByKey transformation in Apache Beam using Python.

Raju Nooka

Sub-topic: GroupByKey

I worked on GroupByKey using the file 'vgsales.csv'. This dataset contains a list of video games with sales greater than 100,000 copies.

The dataset I have chosen is from Kaggle. Here is the link: vgsales.csv
I have chosen Google Colaboratory to run the code.

My Google Colab Notebook on GroupByKey Transformation
Demonstration Video:
Source repo

Apache Beam

A framework for creating data processing pipelines.
The pipeline itself is executed on a distributed processing engine, called a runner (such as Flink or Spark).
Pipelines: A pipeline describes a data processing job as a series of computations: reading input, processing it, and writing output.

GroupByKey

GroupByKey is a Beam transform for processing collections of key/value pairs.
The input to GroupByKey is a collection of key/value pairs that represents a multimap, where the collection contains multiple pairs that have the same key but different values.
It is a good way to aggregate data that has something in common.
For example: a data set consists of words from a text file and the line numbers on which they appear. We want to group together all the line numbers (values) that share the same word (key), letting us see all the places in the text where a particular word appears.
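
The grouping described above can be illustrated without Beam. The following is a plain-Python sketch of what GroupByKey computes, not the Beam implementation; the helper name group_by_key and the sample pairs are made up for illustration.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect all values that share a key into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Hypothetical (word, line number) pairs, as in the example above.
word_line_pairs = [
    ("cat", 1),
    ("dog", 5),
    ("and", 1),
    ("cat", 5),
    ("dog", 2),
    ("cat", 9),
]

for word, line_numbers in group_by_key(word_line_pairs).items():
    print(word, line_numbers)
```

In Beam itself, the values for each key come back as an iterable, and their order is not guaranteed.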

Prerequisites

Python
Apache Beam
Google Colab

Commands Used

First, install apache-beam using the command below.

!pip install --quiet -U apache-beam

Install the other dependencies

!pip install apache-beam[gcp,aws,test,docs]

Program that performs the GroupByKey operation
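
A minimal sketch of such a program is shown here; it is not necessarily the code from the notebook. It assumes that 'vgsales.csv' is in the working directory, that its first line is a header, and that the columns start with Rank, Name, Platform. The choice of platform as the key, and the names to_platform_name and run, are illustrative assumptions.

```python
import csv


def to_platform_name(line):
    """Parse one CSV line and return a (platform, name) key/value pair."""
    # csv.reader handles quoted fields, e.g. game names that contain commas.
    row = next(csv.reader([line]))
    # Assumed column layout: Rank, Name, Platform, ...
    return row[2], row[1]


def run(input_path="vgsales.csv", output_prefix="output.txt"):
    # Imported here so the parsing helper above can be tried without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read CSV" >> beam.io.ReadFromText(input_path, skip_header_lines=1)
            | "To key/value" >> beam.Map(to_platform_name)
            | "Group by platform" >> beam.GroupByKey()
            # Turn the grouped iterable into a list so it prints readably.
            | "To list" >> beam.Map(lambda kv: (kv[0], list(kv[1])))
            | "Write" >> beam.io.WriteToText(output_prefix)
        )


# In a Colab cell, once vgsales.csv has been downloaded:
# run()
```

WriteToText appends a shard suffix to the prefix, so a single-shard run typically produces a file such as output.txt-00000-of-00001.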
Output of the program
The command that lists all the files

! ls

First, upload your .csv file to your Google Drive account.
Use the same Google account for both Google Drive and Google Colab.
Then import the .csv file into Google Colab.

# Code to read the CSV file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate the Google account and create a Drive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

To get the ID of the file, right-click the file in your Google Drive account, select the share link option, then copy the link.
Remove the https:// part and the rest of the link, and keep only the ID part.
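
The ID can also be extracted from the copied link in code. This is a small sketch that assumes the link has one of two common Drive share-link shapes (.../file/d/<ID>/view... or ...open?id=<ID>); the helper name extract_drive_file_id is made up for illustration.

```python
import re
from urllib.parse import urlparse, parse_qs


def extract_drive_file_id(share_link):
    """Return the file ID from a Google Drive share link, or None if not found."""
    # Assumed form 1: https://drive.google.com/file/d/<ID>/view?usp=sharing
    match = re.search(r"/d/([A-Za-z0-9_-]+)", share_link)
    if match:
        return match.group(1)
    # Assumed form 2: https://drive.google.com/open?id=<ID>
    ids = parse_qs(urlparse(share_link).query).get("id")
    return ids[0] if ids else None


link = "https://drive.google.com/file/d/1b73yN7MjGytqSP5wimYAQmtByOvGGe8Y/view?usp=sharing"
print(extract_drive_file_id(link))
```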

# Get the file from Drive using its file ID
downloaded = drive.CreateFile({'id': '1b73yN7MjGytqSP5wimYAQmtByOvGGe8Y'})  # replace the ID with the ID of the file you want to access
# Save the file contents locally under the file name used in this README
downloaded.GetContentFile('vgsales.csv')

Command to display the contents of the output file

!cat output.txt-00000-of-00001 # output file
