This project uses decision trees in PySpark for both classification and regression tasks. It is organized into the following parts:
- Introduction to Decision Trees: The principles of decision trees, how they partition data, how they operate, and their advantages and limitations.
- Exploratory Data Analysis (EDA) of the Iris Dataset:
  - Perform EDA on the Iris dataset, including visualizations and correlation maps, then split the dataset into training and testing sets.
- Classification Tree Model:
  - Create a decision tree classification model using PySpark.
  - Train the model on the training set and evaluate its performance on the test set.
  - Assess the model using precision, accuracy, and a confusion matrix.
  - Visualize the decision tree's predictions.
- Random Number Dataset Generation:
  - Generate a dataset of random numbers for regression purposes.
  - Conduct EDA on the generated dataset, including visualizations, then split it into training and testing sets.
- Regression Tree Model:
  - Develop a decision tree regression model using PySpark.
  - Train the model on the training set and evaluate its performance on the test set.
  - Assess the model using regression metrics such as RMSE, MAE, and R².
  - Visualize the decision tree's predictions.
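
The classification and regression steps listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code. The column names (`species`, `label`, `features`, `x`, `y`), the 80/20 split, the seed, `maxDepth=4`, the linear-plus-noise target, and the helper names `make_random_rows`, `fit_classifier`, and `fit_regressor` are all assumptions. The regression tree is scored with RMSE, MAE, and R², since precision, accuracy, and confusion matrices apply only to classification. The PySpark imports sit inside the functions so that the data-generation helper can be used without a Spark installation.

```python
import random


def make_random_rows(n=200, seed=42):
    """Hypothetical random dataset: y = 3x + 2 plus Gaussian noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        y = 3.0 * x + 2.0 + rng.gauss(0.0, 1.0)
        rows.append((x, y))
    return rows


def fit_classifier(df, feature_cols, species_col="species", max_depth=4, seed=42):
    """Train and evaluate a classification tree on a Spark DataFrame.

    Assumes `df` has numeric feature columns and a string class column.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    stages = [
        # Map class names to numeric labels, then pack features into one vector.
        StringIndexer(inputCol=species_col, outputCol="label"),
        VectorAssembler(inputCols=feature_cols, outputCol="features"),
        DecisionTreeClassifier(
            featuresCol="features", labelCol="label", maxDepth=max_depth
        ),
    ]
    train, test = df.randomSplit([0.8, 0.2], seed=seed)
    model = Pipeline(stages=stages).fit(train)
    predictions = model.transform(test)

    metrics = {}
    for name in ("accuracy", "weightedPrecision"):
        evaluator = MulticlassClassificationEvaluator(
            labelCol="label", predictionCol="prediction", metricName=name
        )
        metrics[name] = evaluator.evaluate(predictions)

    # Confusion matrix in long form: one row per (actual, predicted) pair.
    confusion = predictions.groupBy("label", "prediction").count()
    return model, metrics, confusion


def fit_regressor(spark, rows, max_depth=4, seed=42):
    """Train and evaluate a regression tree on (x, y) rows."""
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import DecisionTreeRegressor

    df = spark.createDataFrame(rows, ["x", "y"])
    assembled = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)
    train, test = assembled.randomSplit([0.8, 0.2], seed=seed)
    tree = DecisionTreeRegressor(
        featuresCol="features", labelCol="y", maxDepth=max_depth
    )
    model = tree.fit(train)
    predictions = model.transform(test)

    metrics = {}
    for name in ("rmse", "mae", "r2"):
        evaluator = RegressionEvaluator(
            labelCol="y", predictionCol="prediction", metricName=name
        )
        metrics[name] = evaluator.evaluate(predictions)
    return model, metrics
```

With a session created via `SparkSession.builder.getOrCreate()`, calling `fit_regressor(spark, make_random_rows())` would return the fitted tree and its test-set metrics. The learned splits of a fitted tree model can be inspected through its `toDebugString` property.
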
Feel free to explore the code, adapt it to different datasets, and experiment with various decision tree parameters to enhance model performance. Happy exploring!
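
As a starting point for that kind of experimentation, a small parameter sweep might look like the sketch below. The search space values, the `features` and `label` column names, and the helper names `candidate_settings` and `best_setting` are assumptions made for illustration. `maxDepth` and `minInstancesPerNode` are real `DecisionTreeClassifier` parameters. Scores are computed on a separate validation split so that the test set stays untouched.

```python
from itertools import product

# Hypothetical search space; the values are illustrative, not tuned.
SEARCH_SPACE = {"maxDepth": [2, 4, 6], "minInstancesPerNode": [1, 5]}


def candidate_settings(space=SEARCH_SPACE):
    """Expand the search space into one dict per parameter combination."""
    names = list(space)
    combos = product(*(space[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


def best_setting(train_df, valid_df, evaluator, space=SEARCH_SPACE):
    """Fit one tree per combination and return the best (score, params) pair.

    Assumes both DataFrames already contain `features` and `label` columns.
    """
    from pyspark.ml.classification import DecisionTreeClassifier

    results = []
    for params in candidate_settings(space):
        tree = DecisionTreeClassifier(
            featuresCol="features", labelCol="label", **params
        )
        score = evaluator.evaluate(tree.fit(train_df).transform(valid_df))
        results.append((score, params))

    # Accuracy-like metrics are maximized; error metrics are minimized.
    pick = max if evaluator.isLargerBetter() else min
    return pick(results, key=lambda item: item[0])
```

PySpark also ships `ParamGridBuilder` and `CrossValidator` in `pyspark.ml.tuning`, which cover the same idea with k-fold cross-validation and may be preferable to a hand-written loop on small datasets such as Iris.
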