GUNDUPOOJA / apache_beam_python-wordcount

Apache beam python Maximum transformation

apache_beam_python-WordCount

Apache Beam Python WordCount transformation

Pooja Gundu

Sub-topic: WordCount

  • I have worked on "Word Count" for the file 'superbowl-ads.csv'. This file contains data about all advertisements shown during the Super Bowl (the most-watched US program) across the years from 1967 to 2020.
  • The dataset I have chosen is from Kaggle. Here is the link: superbowl-ads.csv
  • I have chosen Google Colaboratory to run the code.

Apache Beam

  • Apache Beam is an open-source, unified programming model for defining and executing data processing pipelines, including ETL, batch, and stream (continuous) processing.
  • Original author: Google; released in 2016.
  • Written in Java, Python, and Go.
  • ETL (extract, transform, load) allows businesses to gather data from multiple sources and consolidate it into a single, centralized location.

Data pipelines:

  • Data pipelines consolidate data from all your disparate sources into one common destination, enabling quick data analysis for business insights.
  • They also ensure consistent data quality, which is crucial for reliable business insights.

Batch pipeline:

  • A batch pipeline processes data in batches, that is, as a finite dataset collected before the run (for example, a CSV file).

Streaming Data pipelines:

  • These pipelines handle millions of events at scale in real time.

Google Colab

  • It is a hosted, collaborative notebook tool similar to Jupyter Notebook.
  • Python and many common libraries are pre-installed, so very little needs to be installed explicitly.
  • Sign in to your Google Colab account; after writing and testing the code, save the file to GitHub or to your local machine.

Prerequisites

  • Apache Beam
  • Python
  • Google Colab
  • Google Drive account

Installation steps

  • Check the Python and pip versions on your machine using these commands:
python --version
pip --version
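
Inside a notebook, the same check can also be done from Python. The sketch below is one possible way, using only the standard library; the helper name `installed_version` is made up for this example.

```python
import sys
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Version of the Python interpreter that is running this cell
python_version = ".".join(str(part) for part in sys.version_info[:3])
print("python:", python_version)
print("pip:", installed_version("pip"))
print("apache-beam:", installed_version("apache-beam"))  # None until it is installed
```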

Process and commands used for the word count transformation

  • Create a Google Colab account and open a new notebook; rename it as required.
  • First, install apache-beam using the command below (the leading ! runs it as a shell command from a notebook cell)
!python -m pip install apache-beam

To install extra dependencies, use the command below (the quotes keep the shell from interpreting the square brackets)

!pip install 'apache-beam[gcp,aws,test,docs]'

  • For more information, click this link

  • Program that performs the word count operation
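
A minimal sketch of such a program, assuming the Beam Python SDK is installed as above and the CSV file is in the working directory. It is not necessarily the notebook's original code: the function names, step labels, word-splitting rule, and `word: count` output format are illustrative choices.

```python
import re


def extract_words(line):
    """Split one line of text into lowercase words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", line.lower())


def run_wordcount(input_path, output_prefix):
    """Count word occurrences in input_path and write 'word: count' lines."""
    # Imported here so extract_words can be tried even without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "ReadLines" >> beam.io.ReadFromText(input_path)
            | "ExtractWords" >> beam.FlatMap(extract_words)
            | "CountPerWord" >> beam.combiners.Count.PerElement()
            | "FormatResult" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
            | "WriteCounts" >> beam.io.WriteToText(output_prefix)
        )
```

Calling `run_wordcount('superbowl-ads.csv', 'output.txt')` in a Colab cell should write sharded files whose names start with `output.txt`. Each CSV line is tokenized as a whole, so column values such as URLs are counted as words too.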

  • Output of the program

  • The command that lists all the files

! ls

  • First upload your .csv file to your Google Drive account. The same email should be used for both the Google Drive and Google Colab accounts.
  • To import the .csv file, run the commands below; otherwise you will get a file-not-found error, because the file has not been imported into Google Colab.
# Code to read a CSV file into Colaboratory
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate the signed-in Google account
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)  # Drive client used to fetch the file
  • To get the ID of the file, right-click the file in your Google Drive account, select the share-link option, then copy the link.
  • Remove the https:// prefix and the rest of the URL around the ID.
  • Keep only the ID part.
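
The ID-extraction steps above can also be done in code. This is a sketch that assumes the share link looks like `https://drive.google.com/file/d/<id>/view?usp=sharing` or `https://drive.google.com/open?id=<id>`; the function name `drive_file_id` and the example link built around the file ID are hypothetical.

```python
import re
from urllib.parse import urlparse, parse_qs


def drive_file_id(share_link):
    """Extract the file ID from a Google Drive share link (assumed formats only)."""
    parsed = urlparse(share_link)
    # Assumed form 1: .../file/d/<id>/view?usp=sharing
    match = re.search(r"/d/([A-Za-z0-9_-]+)", parsed.path)
    if match:
        return match.group(1)
    # Assumed form 2: .../open?id=<id>
    ids = parse_qs(parsed.query).get("id")
    if ids:
        return ids[0]
    raise ValueError(f"No file ID found in link: {share_link}")


# Example link (assumed format) wrapped around a file ID
link = "https://drive.google.com/file/d/1b73yN7MjGytqSP5wimYAQmtByOvGGe8Y/view?usp=sharing"
print(drive_file_id(link))
```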
# Get the file from Drive using its file ID
downloaded = drive.CreateFile({'id': '1b73yN7MjGytqSP5wimYAQmtByOvGGe8Y'})  # replace the ID with the ID of the file you want to access
downloaded.GetContentFile('superbowl-ads.csv')  # save it to the Colab working directory
  • Command to display the contents of the output file
!cat output.txt-00000-of-00001 # output file
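
If the pipeline writes more than one shard, a `cat` of `output.txt-00000-of-00001` alone would miss the rest. Here is a minimal sketch that reads every shard, assuming they all share the `output.txt` prefix seen above; `read_output_shards` is a name invented for this example.

```python
import glob


def read_output_shards(prefix="output.txt"):
    """Return all lines from files named like <prefix>-00000-of-00001, in shard order."""
    lines = []
    for path in sorted(glob.glob(f"{prefix}-*-of-*")):
        with open(path, encoding="utf-8") as shard:
            lines.extend(line.rstrip("\n") for line in shard)
    return lines


for line in read_output_shards()[:10]:  # show the first ten lines only
    print(line)
```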
