nrajubn / apache-beam-python-GroupByKey

This repo is all about doing a GroupByKey transformation

apache-beam-python-GroupByKey

Raju Nooka

Sub-topic : GroupByKey

  • I worked on GroupByKey using the file 'vgsales.csv'. This dataset contains a list of video games with sales greater than 100,000 copies.
  • The dataset comes from Kaggle. Here is the link: vgsales.csv
  • I chose Google Colaboratory to run the code.

Apache Beam

  • A framework for creating data processing pipelines.
  • The pipelines themselves are executed by a runner on a distributed processing engine (such as Flink or Spark).
  • Pipelines: a pipeline describes a data processing job as a series of steps: reading input, processing it, and writing output.

GroupByKey

  • GroupByKey is a Beam transform for processing collections of key/value pairs.
  • The input to GroupByKey is a collection of key/value pairs that represents a multimap, where the collection contains multiple pairs that have the same key but different values. The output contains one pair per unique key, whose value is the collection of all values that shared that key.
  • It is a good way to aggregate data that has something in common.
  • For example: a data set consists of words from a text file and the line numbers on which they appear. We want to group together all the line numbers (values) that share the same word (key), letting us see all the places in the text where a particular word appears.

Prerequisites

  • Python
  • Apache Beam
  • Google Colab

Commands Used

  • First, install Apache Beam with the command below.
!pip install --quiet -U apache-beam

  • Then install the optional extras (GCP, AWS, test, and docs dependencies).
!pip install apache-beam[gcp,aws,test,docs]

  • Program that performs the GroupByKey operation

  • Output of the program

  • The command that lists all the files in the working directory

!ls

  • First, upload your .csv file to your Google Drive account.
  • Use the same Google account for both Google Drive and Google Colab.
  • Import the .csv file into Google Colab.
# Code to read a csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate with your Google account
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
  • To get the id of the file, right-click the file in Google Drive, select the share link option, then copy the link.
  • Remove the leading part of the link that starts with https://.
  • Keep only the id part.
# Get the file from Drive using its file id
downloaded = drive.CreateFile({'id':'1b73yN7MjGytqSP5wimYAQmtByOvGGe8Y'}) # replace the id with the id of the file you want to access
downloaded.GetContentFile('vgsales.csv') # local file name to save the download under
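
Instead of trimming the share link by hand, the id can be pulled out with a small helper. This is a sketch under assumptions: the function name `extract_drive_file_id` is made up for illustration, and it only handles two common link shapes (".../file/d/<id>/..." and "...?id=<id>"); other link formats would need extra patterns.

```python
import re
from typing import Optional

# Two common Google Drive share-link shapes (assumed, not exhaustive).
_ID_PATTERNS = (
    re.compile(r"/file/d/([A-Za-z0-9_-]+)"),
    re.compile(r"[?&]id=([A-Za-z0-9_-]+)"),
)


def extract_drive_file_id(link: str) -> Optional[str]:
    """Return the file id embedded in a Drive share link, or None if not found."""
    for pattern in _ID_PATTERNS:
        match = pattern.search(link)
        if match:
            return match.group(1)
    return None
```

The returned string is what goes into the 'id' field passed to drive.CreateFile.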
  • Command to display the contents of the output file
!cat output.txt-00000-of-00001 # output file
