jihoon-yang / pm4pyspark-source

Bachelor Thesis by Jihoon Yang at PADS chair of RWTH Aachen University

Home Page: http://www.pads.rwth-aachen.de/cms/PADS/Studium/Projekte/~strr/Bachelor-Thesis-Big-Data-Process-Minin/lidx/1/

Big Data Process Mining in Python
Integration of Spark in PM4PY for Preprocessing Event Data and Discover Process Models

PM4Py is a process mining library for Python that aims at seamless integration with any kind of database and technology.

PM4PySpark integrates Apache Spark into PM4Py. These Big Data connectors for PM4Py are built to handle huge amounts of event data, with a particular focus on the Spark ecosystem:

  • Loading CSV files into Apache Spark
  • Loading Parquet files into Apache Spark and writing them back out
  • Efficiently calculating the Directly Follows Graph (DFG) on top of Apache Spark DataFrames
  • Managing filtering operations (timeframe, attributes, start/end activities, paths, variants, cases) on top of Apache Spark
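To make the DFG and filtering items above concrete, here is a minimal sketch in plain Python, without Spark. It is not the PM4PySpark implementation: the toy log `EVENTS` and the helpers `group_by_case`, `directly_follows_graph` and `filter_start_activities` are hypothetical names introduced only for illustration. The DFG counts how often one activity is immediately followed by another within the same case; a start-activity filter keeps only the cases whose first event is one of the allowed activities.

```python
from collections import Counter, defaultdict

# Hypothetical toy event log: (case id, activity, timestamp) tuples.
EVENTS = [
    ("c1", "register", 1),
    ("c1", "check", 2),
    ("c1", "pay", 3),
    ("c2", "register", 1),
    ("c2", "pay", 2),
    ("c3", "check", 1),
    ("c3", "pay", 2),
]


def group_by_case(events):
    """Return {case_id: [activity, ...]} with activities ordered by timestamp."""
    cases = defaultdict(list)
    for case_id, activity, timestamp in events:
        cases[case_id].append((timestamp, activity))
    # sorted() is stable, so events with equal timestamps keep their input order.
    return {
        case_id: [activity for _, activity in sorted(evts, key=lambda e: e[0])]
        for case_id, evts in cases.items()
    }


def directly_follows_graph(events):
    """Count each (activity, next_activity) pair that occurs within a case."""
    dfg = Counter()
    for trace in group_by_case(events).values():
        # Pair every activity with its direct successor in the same trace.
        dfg.update(zip(trace, trace[1:]))
    return dfg


def filter_start_activities(events, allowed):
    """Keep only events of cases whose first activity is in `allowed`."""
    traces = group_by_case(events)
    keep = {case_id for case_id, trace in traces.items() if trace[0] in allowed}
    return [event for event in events if event[0] in keep]


dfg = directly_follows_graph(EVENTS)
filtered = filter_start_activities(EVENTS, {"register"})

for (source, target), count in sorted(dfg.items()):
    print(f"{source} -> {target}: {count}")
```

On a Spark DataFrame the same idea would plausibly be expressed as a window partitioned by case and ordered by timestamp, followed by a grouped count, so that the work is distributed across the cluster rather than held in one Python process.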

Languages

Language:Python 100.0%