- Overview
- Models Used
- Data Preprocessing
- Data Loading
- Exploratory Data Analysis (EDA)
- Models Training and Evaluation
- 6.1 Model Training
- 6.2 Model Evaluation
- Interpretation of Report
- Data Visualization
- Findings
- 9.1 Data Exploration
- 9.2 Model Performance
- Output
- Conclusion
This project predicts product sales from advertising expenditure, focusing on TV advertising. Machine learning techniques are used to analyze the data and estimate how spending relates to sales, helping businesses optimize advertising strategies and maximize sales potential.
- Random Forest Regression: Utilized for its ability to handle complex relationships and provide robust predictions.
- Linear Regression: Employed as a baseline for comparison due to its simplicity and interpretability.
Data preprocessing involves:
- Handling missing values (if any).
- Encoding categorical variables (if applicable).
- Scaling numerical features.
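The steps above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code: the toy values are made up, and median imputation is an assumed strategy. Categorical encoding is omitted because the listed columns are all numeric.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the advertising data (values are made up).
df = pd.DataFrame({
    "TV": [230.1, 44.5, np.nan, 151.5],
    "Radio": [37.8, 39.3, 45.9, 41.3],
    "Newspaper": [69.2, 45.1, 69.3, 58.5],
    "Sales": [22.1, 10.4, 9.3, 18.5],
})

# Handle missing values: fill numeric gaps with the column median (an assumption).
df = df.fillna(df.median(numeric_only=True))

# Scale the numeric features to zero mean and unit variance.
feature_cols = ["TV", "Radio", "Newspaper"]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[feature_cols])

print(X_scaled.shape)
```

In a real pipeline the scaler would normally be fit on the training split only and then applied to the test split, so that test data does not leak into the scaling statistics.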
The advertising dataset is loaded from a CSV file containing columns for 'TV', 'Radio', 'Newspaper' advertising expenditures, and 'Sales'.
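A minimal loading sketch is shown below. The file name `advertising.csv` is hypothetical (the document does not give the path), so an in-memory string with illustrative rows stands in for the file to keep the example runnable.

```python
import io

import pandas as pd

# Stand-in for the CSV file; the real path is not given in this document.
csv_text = """TV,Radio,Newspaper,Sales
230.1,37.8,69.2,22.1
44.5,39.3,45.1,10.4
17.2,45.9,69.3,9.3
"""

# With a real file this would be, for example: pd.read_csv("advertising.csv")
df = pd.read_csv(io.StringIO(csv_text))

# Fail early if an expected column is missing.
expected = ["TV", "Radio", "Newspaper", "Sales"]
missing = [c for c in expected if c not in df.columns]
if missing:
    raise ValueError(f"missing columns: {missing}")

print(df.head())
```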
EDA is performed to:
- Visualize relationships between features and target variable ('Sales').
- Identify correlations and distributions of features.
- Detect outliers or anomalies in the data.
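The correlation and outlier checks could look roughly like this. The data here is synthetic (generated so that 'Sales' depends on 'TV'), and the 1.5 × IQR rule is one common outlier heuristic chosen for illustration, not necessarily the one used in the project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the advertising data (not the real dataset).
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=200)
df = pd.DataFrame({
    "TV": tv,
    "Radio": rng.uniform(0, 50, size=200),
    "Newspaper": rng.uniform(0, 100, size=200),
})
df["Sales"] = 7.0 + 0.05 * tv + rng.normal(0, 1.5, size=200)

# Correlation of each feature with the target, strongest first.
corr_with_sales = df.corr()["Sales"].drop("Sales").sort_values(ascending=False)
print(corr_with_sales)

# Flag outliers in 'Sales' with the 1.5 * IQR rule.
q1, q3 = df["Sales"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["Sales"] < q1 - 1.5 * iqr) | (df["Sales"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers")
```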
- Random Forest Regression:
  - GridSearchCV used to optimize hyperparameters.
  - Best model selected based on cross-validated negative MSE score.
- Linear Regression:
  - Simple model trained as a baseline for comparison.
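A sketch of this training setup, under stated assumptions: the data is synthetic, the grid values, `cv=3`, the 80/20 split, and the random seeds are illustrative choices, and only the hyperparameter names and the negative MSE scoring come from this document.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic single-feature data standing in for TV spend and sales.
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(200, 1))
y = 7.0 + 0.05 * X[:, 0] + rng.normal(0, 1.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative grid; the project's actual search space is not documented.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1],
}

# scoring="neg_mean_squared_error": scikit-learn maximizes scores, so MSE is negated.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)
best_rf = search.best_estimator_

# Baseline model for comparison.
baseline = LinearRegression().fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best negative MSE:", search.best_score_)
```

Because the score is a negated MSE, values closer to zero are better; a best score of -1.6148 corresponds to a cross-validated MSE of 1.6148.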
Evaluation metrics computed include:
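- Mean Squared Error (MSE): Average of the squared prediction errors; large errors are penalized heavily, and lower is better.
- Mean Absolute Error (MAE): Average absolute prediction error, expressed in the same units as 'Sales'.
- R-squared (R2): Proportion of the variance in 'Sales' explained by the model; values closer to 1 indicate a better fit.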
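As a small worked example, the three metrics can be computed with scikit-learn as below. The `y_true` and `y_pred` arrays are made-up values chosen so the results are easy to verify by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values: errors are 0.5, 0.0, -1.0, -0.5.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

mse = mean_squared_error(y_true, y_pred)   # (0.25 + 0 + 1 + 0.25) / 4
mae = mean_absolute_error(y_true, y_pred)  # (0.5 + 0 + 1 + 0.5) / 4
r2 = r2_score(y_true, y_pred)              # 1 - 1.5 / 20

print(f"MSE={mse:.3f} MAE={mae:.3f} R2={r2:.3f}")
# prints: MSE=0.375 MAE=0.500 R2=0.925
```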
- Comparison of model performance based on evaluation metrics.
- Analysis of coefficients (for Linear Regression) and feature importances (for Random Forest) to interpret relationships between 'TV advertising' and 'Sales'.
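The coefficient and feature-importance analysis might be sketched as follows. This is an illustration on synthetic data with three features (generated so that 'TV' dominates). The reported Linear Regression result has a single coefficient, so the project's own models may have been fit on 'TV' alone.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: 'TV' has a strong effect, 'Radio' a weak one, 'Newspaper' none.
rng = np.random.default_rng(0)
feature_names = ["TV", "Radio", "Newspaper"]
X = rng.uniform(0, 100, size=(300, 3))
y = 5.0 + 0.05 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 0.5, size=300)

lin = LinearRegression().fit(X, y)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Coefficient: change in predicted sales per unit of spend (linear model).
# Importance: each feature's share of the forest's impurity reduction (sums to 1).
for name, coef, imp in zip(feature_names, lin.coef_, rf.feature_importances_):
    print(f"{name:<10} coef={coef:+.4f}  importance={imp:.3f}")
```

Coefficients carry a sign and a unit, while feature importances only rank features, so the two are complementary rather than directly comparable.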
- Random Forest Model:
  - Best Parameters: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
  - Best Negative MSE Score: -1.6148
  - Evaluation Metrics on Test Set (Random Forest):
    - Mean Squared Error (MSE): 1.4591
    - Mean Absolute Error (MAE): 0.9170
    - R-squared (R2): 0.9528
- Linear Regression Model:
  - Coefficients: 0.0555
  - Intercept: 7.0071
  - Evaluation Metrics on Test Set (Linear Regression):
    - Mean Squared Error (MSE): 6.1011
    - R-squared (R2): 0.8026
- Model Comparison: The Random Forest model outperforms the Linear Regression model in predicting sales from TV advertising, with a test MSE of 1.4591 vs. 6.1011 and an R2 of 0.9528 vs. 0.8026. The gap suggests that the relationship between TV spend and sales is not purely linear, which a tree ensemble can capture and a straight-line fit cannot.
- Interpretation: For businesses, these models show how changes in advertising spending (specifically on TV) relate to sales. In the Linear Regression model, the coefficient of 0.0555 means each additional unit of TV spend is associated with about 0.0555 additional units of sales. Predicting sales under different spending levels helps optimize advertising budgets.
- Conclusion: Based on these results, businesses can use the Random Forest model to make more reliable predictions about the effectiveness of their advertising campaigns, thereby maximizing their sales potential.
- Visual representations include scatter plots, pair plots, and bar plots to illustrate relationships and distributions.
- Plots of model predictions vs. actual sales to assess performance visually.
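A predicted-vs-actual plot can be sketched with matplotlib as follows. The `y_test` and `y_pred` arrays are synthetic stand-ins, and the styling choices are illustrative rather than the project's own.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for test-set sales and model predictions (values are made up).
rng = np.random.default_rng(0)
y_test = rng.uniform(5, 25, size=40)
y_pred = y_test + rng.normal(0, 1.2, size=40)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_test, y_pred, alpha=0.7, label="Predictions")

# Points on the diagonal are perfect predictions.
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
ax.plot(lims, lims, linestyle="--", color="gray", label="Ideal (predicted = actual)")

ax.set_xlabel("Actual Sales")
ax.set_ylabel("Predicted Sales")
ax.set_title("Predicted vs. actual sales")
ax.legend()
```

The tighter the points cluster around the dashed diagonal, the better the model. In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()`.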
- Strong positive correlation observed between 'TV advertising' and 'Sales'.
- 'TV' expenditure shows the highest influence on 'Sales' compared to 'Radio' and 'Newspaper'.
- Random Forest outperforms Linear Regression in terms of predictive accuracy.
- Lower MSE (1.4591 vs. 6.1011) and higher R-squared (0.9528 vs. 0.8026) indicate Random Forest captures the relationship more effectively.
- Predicted sales values for new data points using both Random Forest and Linear Regression models.
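Generating predictions for new data points could look like the sketch below. The models are refit here on synthetic data so the example is self-contained, and the new TV spend values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic single-feature training data standing in for TV spend and sales.
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(200, 1))
y = 7.0 + 0.05 * X[:, 0] + rng.normal(0, 1.5, size=200)

rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42).fit(X, y)
lin = LinearRegression().fit(X, y)

# Hypothetical new TV budgets; shape must be (n_samples, n_features).
new_tv = np.array([[50.0], [150.0], [250.0]])
rf_pred = rf.predict(new_tv)
lin_pred = lin.predict(new_tv)

for spend, a, b in zip(new_tv[:, 0], rf_pred, lin_pred):
    print(f"TV={spend:6.1f}  RF={a:6.2f}  LR={b:6.2f}")
```

For the Linear Regression model reported above, a prediction is simply 7.0071 + 0.0555 × TV; for example, a TV spend of 100 gives 7.0071 + 5.55 = 12.5571.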
- Random Forest is recommended for predicting sales based on 'TV advertising' due to its superior performance.
- Insights gained can guide advertising strategies to optimize spending and maximize sales.