sudheera96 / abeam_python_Groupby

An Apache Beam Python project demonstrating the GroupBy transformation on a dataset from Kaggle.

Home Page:https://sudheera96.github.io/abeam_python_Groupby/

abeam_python_Groupby

What is Apache Beam?

Apache Beam is a data processing platform. The processing can serve analytics or ETL (Extract, Transform, Load) workloads. Apache Beam doesn't rely on any single execution engine. The input can be streaming data or batch data, and it can come from sources such as a relational database or an in-memory database. Apache Beam is therefore agnostic about the execution platform, the data, and the programming language: you can write your logic in Java, Python, or Go.

Terminology

  • Pipeline: the end-to-end data processing job.
  • PCollection: the data a pipeline works on. Reading the input produces a PCollection, and applying any transformation to that data produces a new PCollection.
  • PTransform: the logic applied to the data (https://beam.apache.org/documentation/programming-guide/#transforms).
  • Runner (PipelineRunner): specifies where and how the pipeline should execute.

Quickstart

Check versions

python --version
pip --version

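Python must be 3.6 or higher and pip must be 7.0.0 or newer. Recent Apache Beam releases have dropped support for older Python 3 versions, so check the Beam installation guide for the current minimum.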
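The same check can be done from inside Python, which is handy in a notebook. A minimal sketch, assuming the 3.6 minimum stated above; the names `MIN_PYTHON` and `python_is_supported` are illustrative only.

```python
import sys

# Minimum version taken from the requirement stated in this README.
MIN_PYTHON = (3, 6)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
```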

Install Apache Beam

python -m pip install apache-beam
  • Extra Requirements

To install extra dependencies, run the following command (the quotes keep shells such as zsh from interpreting the square brackets):

pip install "apache-beam[gcp,aws,test,docs]"

For more detail go to this link
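To confirm that the installation worked before building a pipeline, one option is to ask Python whether the module can be found. A small sketch using only the standard library; the helper name `is_installed` is an assumption made for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # The pip package is called apache-beam, but the module is apache_beam.
    print("apache_beam installed:", is_installed("apache_beam"))
```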

Google Colab

Google Colab has Python preinstalled, which makes it easy to start using Apache Beam.

  • Open a web browser such as Firefox or Safari
  • Search for Google Colab
  • Click the first link, which is Google Colab
  • Sign in with a Google account
  • When the window listing recent notebooks appears, open a notebook

Note: Google Colab works much like Jupyter Notebook

  • After writing and running your code, save the file locally or to GitHub

See my netflixGroupBy.ipynb Colab Python notebook.

Links

Sri Sudheera Colab File

Sri Sudheera Project input file

Sri Sudheera Project output file

Resources

Apache Beam Group By

Kaggle data set

Apache Beam Colab

About

License:Apache License 2.0


Languages

Language:Jupyter Notebook 100.0%