Apache Beam Python WordCount Transformation
- I have worked on "Word Count" for the file 'superbowl-ads.csv'. This file contains data about all advertisements shown during the Super Bowl (the most-watched US TV program) from 1967 to 2020.
- The dataset I have chosen is from Kaggle. Here is the link: superbowl-ads.csv
- I have chosen Google Colaboratory to run the code.
- Pooja's Google Colab Notebook on WordCount Transformation
- Demonstration Video:
- source repo
- Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing.
- Original author: Google; released in 2016.
- Written in Java, Python, and Go.
- ETL allows businesses to gather data from multiple sources and consolidate it into a single, centralized location.
- Consolidating data from all your disparate sources into one common destination enables quick data analysis for business insights.
- They also ensure consistent data quality, which is absolutely crucial for reliable business insights.
- This project uses a batch pipeline, the type of pipeline that processes data in batches.
- These pipelines handle millions of events at scale in real time.
- Google Colab is a collaboration tool similar to Jupyter Notebook.
- Everything is pre-installed; there is no need to explicitly install anything.
- Sign in to your Google Colab account and, after writing and testing the code, save the file to GitHub or to your local machine.
- Apache Beam
- Python
- Google Colab
- Google Drive account
- Check the versions on your machine using commands
python --version
pip --version
- Create a Google Colab account, open a new notebook, and rename it as required.
- First, install apache-beam using the command below:
python -m pip install apache-beam
pip install apache-beam[gcp,aws,test,docs]
- For more information, click this link
- The command below lists all the files:
! ls
- First, upload your .csv file to your Google Drive account. The email used should be the same for both the Google Drive and Google Colab accounts.
- To import the .csv file, run the commands below; otherwise you will get a file-not-found error because the file has not been imported into Google Colab.
# Code to read csv file into colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate E-Mail ID
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
- To get the id of the file, right-click the file in your Google Drive account, select the share-link option, and copy the link.
- Remove the part that contains https:// and keep only the id part.
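For example, given a share link of the usual form (the full link below is illustrative; only the id appears in the code that follows), the id is the path segment after `/d/`:

```python
# Extract the file id from a Google Drive share link.
# The id is the path segment between "/d/" and the next "/".
link = "https://drive.google.com/file/d/1b73yN7MjGytqSP5wimYAQmtByOvGGe8Y/view?usp=sharing"
file_id = link.split("/d/")[1].split("/")[0]
print(file_id)  # 1b73yN7MjGytqSP5wimYAQmtByOvGGe8Y
```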
# Get File from Drive using file-ID
# Get the file
downloaded = drive.CreateFile({'id':'1b73yN7MjGytqSP5wimYAQmtByOvGGe8Y'}) # replace the id with id of file you want to access
downloaded.GetContentFile('superbowl-ads.csv')
- Command to display the result written to the output file:
!cat output.txt-00000-of-00001 # output file