YouTube Adview Prediction

Project Overview

This project focuses on predicting the number of adviews on YouTube videos based on various metrics and attributes such as views, likes, dislikes, comments, duration, and category. The goal is to build a regression model that accurately predicts the adview count, allowing advertisers to make informed decisions and optimize their marketing strategies.

Dataset

The dataset used for this project is available in the train.csv file, which contains information about approximately 15,000 YouTube videos. It includes attributes such as vidid (unique identification ID for each video), adview (number of adviews), views, likes, dislikes, comments, published date, duration, and category of each video. The dataset is already provided in the GitHub repository along with the project files.

Steps and Tasks

Import the necessary libraries and load the dataset:

Use popular libraries like numpy, pandas, matplotlib, seaborn, scikit-learn, and keras.
Load the dataset using pandas and explore its shape and data types.
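
As a minimal sketch of the loading step, the snippet below reads a CSV with pandas and inspects its shape and data types. The three sample rows and the exact column names (for example `published`) are assumptions made for illustration; with the real file the call would simply be `pd.read_csv("train.csv")`.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for train.csv. The column names
# follow the attribute list in the Dataset section but are assumptions about
# the real header, and the values are made up.
sample_csv = io.StringIO(
    "vidid,adview,views,likes,dislikes,comments,published,duration,category\n"
    "VID_1,40,100000,850,36,109,2016-09-14,PT7M37S,F\n"
    "VID_2,2,1700,56,2,6,2016-10-01,PT9M30S,D\n"
    "VID_3,1,2000,25,0,2,2016-07-02,PT2M16S,C\n"
)

# With the real file this would be: df = pd.read_csv("train.csv")
df = pd.read_csv(sample_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # data type pandas inferred for each column
print(df.head())  # first few rows for a quick sanity check
```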

Visualize the dataset:

Use matplotlib and seaborn to create visually appealing plots.
Plot distributions, such as histograms and boxplots, to understand the data distributions for each attribute.
Create a heatmap to visualize the correlations between different features.
Identify and remove outliers using appropriate techniques.
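
One common way to handle the outlier step is the interquartile range (IQR) rule. The sketch below is an assumption about technique, not necessarily what the notebook does, and the helper name `remove_iqr_outliers` and the toy values are hypothetical.

```python
import pandas as pd


def remove_iqr_outliers(frame, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    mask = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]


# Toy adview column with one extreme value (made-up numbers).
toy = pd.DataFrame({"adview": [1, 2, 2, 3, 3, 4, 5, 5000]})
cleaned = remove_iqr_outliers(toy, "adview")
print(len(toy), "->", len(cleaned))
```

A boxplot of the same column before and after filtering is a quick way to check that only the extreme points were dropped.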

Data Cleaning:

Handle missing values by applying techniques like imputation or deletion.
Remove features that do not contribute to the prediction task, such as the vidid identifier.
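
The cleaning step could look roughly like the following sketch, which uses median imputation and drops the identifier column. The toy frame and the "F" placeholder for a non-numeric entry are made up for illustration; whether the real data contains such placeholders is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame with two kinds of gaps: a non-numeric placeholder and a NaN.
raw = pd.DataFrame({
    "vidid": ["VID_1", "VID_2", "VID_3", "VID_4"],
    "views": ["100", "250", "F", "400"],  # "F" is a hypothetical placeholder
    "likes": [10, np.nan, 30, 40],
    "adview": [1, 2, 3, 4],
})

clean = raw.copy()
for col in ["views", "likes"]:
    # Turn anything that is not a number into NaN, then impute the median.
    clean[col] = pd.to_numeric(clean[col], errors="coerce")
    clean[col] = clean[col].fillna(clean[col].median())

# The ID is unique per video, so it carries no signal for a regression model.
clean = clean.drop(columns=["vidid"])
print(clean)
```

Deleting incomplete rows instead of imputing would be `clean.dropna()` applied after the `pd.to_numeric` conversion; which option is better depends on how many rows are affected.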

Data Transformation:

Convert categorical attributes into numerical values using techniques like label encoding or one-hot encoding.
Perform necessary transformations, such as converting date and time attributes into suitable formats.
Utilize feature engineering techniques to create new features that may enhance the model's performance.
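
The transformations above can be sketched as follows. The sketch assumes that durations are stored as ISO 8601 strings such as `PT7M37S` and that dates are plain `YYYY-MM-DD` strings; both are assumptions about the data, and `duration_to_seconds` is a hypothetical helper.

```python
import re

import pandas as pd

# Matches durations like PT1H2M3S, PT7M37S or PT45S (assumed format).
_DURATION = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?")


def duration_to_seconds(text):
    """Convert an ISO 8601 style duration string to a number of seconds."""
    match = _DURATION.fullmatch(text)
    if match is None:
        raise ValueError(f"unexpected duration format: {text!r}")
    hours, minutes, seconds = (int(p) if p else 0 for p in match.groups())
    return hours * 3600 + minutes * 60 + seconds


videos = pd.DataFrame({
    "category": ["A", "C", "A", "B"],
    "published": ["2016-09-14", "2016-10-01", "2017-01-05", "2015-12-31"],
    "duration": ["PT7M37S", "PT1H2M3S", "PT45S", "PT10M"],
})

# Label encoding: each category letter becomes an integer code.
videos["category_code"] = videos["category"].astype("category").cat.codes

# Parse the date and derive a simple engineered feature from it.
videos["published"] = pd.to_datetime(videos["published"])
videos["published_year"] = videos["published"].dt.year

# Duration as a single numeric feature.
videos["duration_sec"] = videos["duration"].map(duration_to_seconds)
print(videos)
```

For one-hot encoding instead of label encoding, `pd.get_dummies(videos, columns=["category"])` produces one indicator column per category.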

Data Normalization and Splitting:

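Normalize the numerical attributes so that all features are on a similar scale, fitting the scaler on the training data only so that no information leaks from the validation or test sets.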
Split the dataset into training, validation, and test sets in an appropriate ratio, such as 80:10:10 or 70:15:15.
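
A minimal sketch of an 80:10:10 split followed by min-max scaling is shown below. The random feature matrix is a stand-in for the cleaned dataset, and the choice of `MinMaxScaler` is an assumption; `StandardScaler` would be used the same way.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the cleaned features and the adview target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(100, 4))
y = rng.uniform(0, 50, size=100)

# First hold out 20%, then split that part in half: 80 / 10 / 10.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)

# Fit the scaler on the training set only, then reuse it for the other sets.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(len(X_train), len(X_val), len(X_test))
```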

Model Training and Evaluation:

Train various regression models, including linear regression, support vector regressor, decision tree regressor, random forest regressor, and artificial neural networks.
Use the scikit-learn library to fit the models on the training data and calculate error metrics (e.g., mean squared error, mean absolute error) to evaluate their performance.
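
As an illustration of fitting models and computing error metrics with scikit-learn, the sketch below trains a linear regression and a support vector regressor on synthetic data. The data, the SVR settings, and the `scores` dictionary are assumptions chosen for the example, not values from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.svm import SVR

# Synthetic regression data: a linear signal plus a little noise.
rng = np.random.default_rng(1)
coef = np.array([2.0, -1.0, 0.5])
X_train = rng.normal(size=(200, 3))
y_train = X_train @ coef + rng.normal(scale=0.1, size=200)
X_val = rng.normal(size=(50, 3))
y_val = X_val @ coef + rng.normal(scale=0.1, size=50)

models = {
    "linear_regression": LinearRegression(),
    "svr": SVR(kernel="rbf", C=10.0),  # illustrative settings only
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    scores[name] = {
        "mse": mean_squared_error(y_val, pred),
        "mae": mean_absolute_error(y_val, pred),
    }

for name, metrics in scores.items():
    print(name, metrics)
```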

Decision Tree Regressor and Random Forest Regressor:

Import decision tree regressor and random forest regressor models from the scikit-learn library.
Configure appropriate hyperparameters for these models.
Train the models using the training data and evaluate their performance by calculating error metrics.
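
The tree-based models follow the same fit and predict pattern. In the sketch below the hyperparameter values (`max_depth`, `min_samples_leaf`, `n_estimators`) are illustrative assumptions, not tuned settings from the project, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data, the kind of relationship trees can capture.
rng = np.random.default_rng(2)
X_train = rng.uniform(-3, 3, size=(300, 2))
y_train = (np.sin(X_train[:, 0]) + X_train[:, 1] ** 2
           + rng.normal(scale=0.1, size=300))
X_val = rng.uniform(-3, 3, size=(80, 2))
y_val = (np.sin(X_val[:, 0]) + X_val[:, 1] ** 2
         + rng.normal(scale=0.1, size=80))

tree = DecisionTreeRegressor(max_depth=6, min_samples_leaf=5, random_state=0)
forest = RandomForestRegressor(n_estimators=100, max_depth=8, random_state=0)

tree_forest_mae = {}
for name, model in {"decision_tree": tree, "random_forest": forest}.items():
    model.fit(X_train, y_train)
    tree_forest_mae[name] = mean_absolute_error(y_val, model.predict(X_val))

print(tree_forest_mae)
```

Limiting the depth and the minimum leaf size is one way to keep a single tree from memorizing the training set; the forest averages many such trees to reduce variance.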

Artificial Neural Network:

Utilize the Keras library to build an artificial neural network.
Define the model architecture, including the number of layers, neurons, activation functions, and optimization algorithm.
Train the neural network on the training data and evaluate its performance using error metrics.
Experiment with different architectures and hyperparameters to improve the model's performance.
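
A small Keras network for regression might be set up as in the sketch below. The layer sizes, activation, optimizer, epoch count, and synthetic data are all assumptions for illustration; the architecture used in the notebook may differ.

```python
import numpy as np
import keras
from keras import layers

# Synthetic stand-in for the scaled features and the adview target.
rng = np.random.default_rng(3)
X_train = rng.normal(size=(200, 5)).astype("float32")
y_train = X_train.sum(axis=1).astype("float32")
X_val = rng.normal(size=(40, 5)).astype("float32")
y_val = X_val.sum(axis=1).astype("float32")

# Two hidden layers and a single linear output unit for regression.
model = keras.Sequential([
    keras.Input(shape=(5,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=5, batch_size=32, verbose=0,
)

# evaluate returns the loss followed by the compiled metrics.
val_loss, val_mae = model.evaluate(X_val, y_val, verbose=0)
val_predictions = model.predict(X_val, verbose=0)
print(val_loss, val_mae)
```

Experimenting with architectures then amounts to changing the layer list or the `compile` arguments and comparing the validation error across runs.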

Model Selection:

Compare the error metrics and generalization performance of different models.
Select the model with the lowest error and good generalization performance on the validation set.
Use regression metrics, such as root mean squared error (RMSE) and the R² score, to assess the model's performance.
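
The comparison described above can be sketched as a loop that scores each candidate on the validation set and keeps the one with the lowest error. The candidate list, the synthetic data, and the use of RMSE as the selection criterion are assumptions for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the prepared training and validation sets.
rng = np.random.default_rng(4)
coef = np.array([1.5, -2.0, 0.7])
X_train = rng.normal(size=(300, 3))
y_train = X_train @ coef + rng.normal(scale=0.2, size=300)
X_val = rng.normal(size=(60, 3))
y_val = X_val @ coef + rng.normal(scale=0.2, size=60)

candidates = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=6, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    results[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_val, pred))),
        "r2": float(r2_score(y_val, pred)),
    }

# Select the candidate with the lowest validation RMSE.
best_name = min(results, key=lambda n: results[n]["rmse"])
print(results)
print("selected:", best_name)
```

Keeping the test set out of this loop means its error remains an unbiased estimate for the model that is finally chosen.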

Save the Model and Make Predictions:

Save the selected model for future use using appropriate functions or methods.
Make predictions on the test set using the saved model to estimate the number of adviews.
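
For a scikit-learn model, saving and reloading can be done with joblib, as in the sketch below. The file name `adview_model.joblib`, the temporary directory, and the synthetic data are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the training and test sets.
rng = np.random.default_rng(5)
X_train = rng.normal(size=(100, 3))
y_train = X_train.sum(axis=1)
X_test = rng.normal(size=(10, 3))

selected_model = RandomForestRegressor(n_estimators=20, random_state=0)
selected_model.fit(X_train, y_train)
original_pred = selected_model.predict(X_test)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; any writable path works.
    model_path = os.path.join(tmp_dir, "adview_model.joblib")
    joblib.dump(selected_model, model_path)

    # Later, or in another process: load the model and predict.
    restored_model = joblib.load(model_path)
    restored_pred = restored_model.predict(X_test)

print(np.allclose(original_pred, restored_pred))
```

A Keras model would instead be saved with its own `save` method and restored with `keras.models.load_model`.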

Tools and Libraries Used

Python: Programming language used for data analysis and machine learning.
NumPy: Library for numerical operations.
pandas: Library for data analysis and manipulation.
Matplotlib: Library for data visualization.
seaborn: Library for statistical data visualization.
scikit-learn: Library for machine learning models and evaluation.
Keras: Deep learning library for building neural networks.

Conclusion

By training several regression models and an artificial neural network on the YouTube adview dataset and comparing their errors, we can select a model that estimates the number of adviews for a YouTube video. The selected model can assist advertisers in making data-driven decisions and optimizing their marketing strategies.

For a detailed implementation and analysis of the project, please refer to the notebook provided.

atharvajoshi01 / Youtube-Adview-Prediction