data-analysis data-analytics data-science data-visualization jupyter-notebook pandas python

ALX-Udacity Data Analytics Nanodegree Project-II

📝Introduction

The dataset is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

📝Data Cleaning & Transformation

Quality issues

There are columns with missing values namely - expanded_urls
Name, doggo, floofer, pupper and puppo columns have value with the name None
Values in source column are not human readable
Use of _instead of space in p1, p2 and p3 column values. Also, values upper case sometimes lowercase other times in p1, p2 and p3 columns
column names are not clearly descriptive
There are image duplicate predictions present for duplicate jpg_url with different tweet ids and rest all the data same.
tweet id title is different, id here tweet_id in others.
There are image duplicate predictions present for duplicate jpg_url with different tweet ids and rest all the data same

Tidiness issues

One variable in 4 columns - dog_stage(doggo, floofer, pupper, puppo)
Merge the tables - archive_tweet and df

📝Insights

Comparison Of Average Number Of Retweets And Favorites In A Month
Comparison Of Average Rating Of Dog Stages
Algorithm's First Prediction Efficiency for top 10 most frequent predictions

📝Conclusion & Observation

Algorithm has proved to be least efficient for Chihuahuas.
The most predictions are recorded for Golden Retrievers and least for Malamutes
Therefore, we can say that, this algorithm favors the Golden Retrievers in this dataset
From the bar graph, it can be clearly seen that puppo has the highest average rating while pupper has the lowest.
From the pie chart, we can see that there are 63.48% of puppers in the overall population of the dataset followed by doggo, then floofer and at the last, puppo. - - This indicates the possible reason of lowest average rating of pupper and highest average rating of doggo because of huge differences in their proportion in the dataset
As we can see that on an average, number of favorites are quite higher in January.
February has the second highest both average number of favorites and retweets.
December has the least number of favourites which is in close range with september, october & november.

About

The dataset is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc.

data-analysis data-analytics data-science data-visualization jupyter-notebook pandas python

Languages

Language:Jupyter Notebook 100.0%