zuzannapiekarczyk / Decision-Trees-in-PySpark-Project

This project focuses on leveraging decision trees in PySpark for both classification and regression tasks.

Decision-Trees-in-PySpark-Project

Project Overview

This project applies decision trees in PySpark to both classification and regression tasks. It is divided into two main parts:

Theoretical Section:

  • Introduction to Decision Trees: the principles behind decision trees, how they partition the data, how they operate, and their advantages and limitations.
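
As a rough illustration of the partitioning idea, the sketch below scores candidate thresholds on a single feature by weighted Gini impurity and keeps the best one. Gini is the default impurity for PySpark's DecisionTreeClassifier, but whether the notebooks discuss this exact criterion is an assumption. The helper names `gini` and `best_split` are hypothetical, and the four sample rows are petal lengths taken from the Iris dataset.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best binary split on one feature.

    Hypothetical helper for illustration; real libraries bin feature values
    and search over all features, not just one.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated by a threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Four petal lengths from the Iris dataset with their species.
petal_length = [1.4, 1.3, 4.7, 4.5]
species = ["setosa", "setosa", "versicolor", "versicolor"]

threshold, impurity = best_split(petal_length, species)
print(f"split at petal_length <= {threshold:.2f}, weighted Gini = {impurity:.2f}")
```

A tree repeats this search recursively on each resulting subset until a stopping rule (for example, a maximum depth) is reached.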

Practical Section:

a. Classification Trees:

  1. Exploratory Data Analysis (EDA) of Iris Dataset:

    • Perform EDA on the Iris dataset, including visualizations and correlation maps, then split the dataset into training and testing sets.
  2. Classification Tree Model:

    • Create a decision tree classification model using PySpark.
    • Train the model on the training set and evaluate its performance on the test set.
    • Assess the model using precision, accuracy, and a confusion matrix.
    • Visualize the decision tree's predictions.

b. Regression Trees:

  1. Random Number Dataset Generation:

    • Generate a dataset of random numbers for regression purposes.
    • Conduct EDA on the generated dataset, including visualizations, then split it into training and testing sets.
  2. Regression Tree Model:

    • Develop a decision tree regression model using PySpark.
    • Train the model on the training set and evaluate its performance on the test set.
    • Assess the model using regression metrics such as RMSE, MAE, and R² (precision, accuracy, and confusion matrices apply to classification, not regression).
    • Visualize the decision tree's predictions.

Feel free to explore the code, adapt it to different datasets, and experiment with various decision tree parameters to enhance model performance. Happy exploring!

License: GNU General Public License v3.0


Languages

Language: Jupyter Notebook 100.0%