abidor13 / Amazon_Vine_Analysis

We were given access to approximately 50 datasets, each containing reviews of a specific product, including reviews written by members of the paid Amazon Vine program. We used PySpark to perform the ETL process: extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into pgAdmin.

Amazon_Vine_Analysis

Overview of the analysis of the Vine program:

Since our work with Jennifer on the SellBy project was so successful, we were tasked with analyzing more Amazon reviews written by members of the paid Amazon Vine program. The Amazon Vine program is a service that allows manufacturers and publishers to receive reviews for their products. Companies like SellBy pay a small fee to Amazon and provide products to Amazon Vine members, who are then required to publish a review.

To keep the analysis accurate, we were given access to approximately 50 datasets, each containing reviews of a specific product. Our objective was to pick one of these datasets and use PySpark to perform the ETL process: extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into pgAdmin.

Next, using PySpark, we determined whether there is any bias toward favorable reviews from Vine members in the chosen dataset.

Results:

Below are a few tables we created initially. We then loaded these tables into pgAdmin.

  • First access the entire data:

  • Select a Customers table:

  • Select a Products table:

  • Select a Review ID table:

We ran into an issue loading the Vine dataframe into its matching table in pgAdmin. Casting the star_rating column to an integer resolved the problem.

  • Select a Vine table

Once we had access to the Vine table, we focused on that dataframe, since it contained most of the information we were interested in.

Over the next steps of the project, we analyzed the Vine table further. We determined the total number of reviews, the number of 5-star reviews, and the percentage of 5-star reviews for the two types of review (paid vs. unpaid). Below are screenshots of our findings.

  • We retrieved all the rows where the total votes count is equal to or greater than 20. This selects reviews that are more likely to be helpful and avoids division-by-zero errors later on.

  • Next we retrieved only the "Helpful" reviews.

  • We also created one table containing all the rows where the review was written as part of the Vine program (paid), and another table for the unpaid reviews, the ones that are not part of the Vine program.
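
The three filtering steps above can be sketched in plain Python, standing in for the PySpark DataFrame filters used in the project. The column names helpful_votes, total_votes, and vine are assumed to follow the Amazon review dataset schema. The Review class, the function names, the sample rows, and the 50% helpfulness threshold are hypothetical and chosen only for illustration; they are not taken from the notebook.

```python
from dataclasses import dataclass


@dataclass
class Review:
    review_id: str
    star_rating: int
    helpful_votes: int
    total_votes: int
    vine: str  # "Y" for Vine (paid) reviews, "N" for unpaid reviews


MIN_TOTAL_VOTES = 20  # threshold stated in the text
HELPFUL_RATIO = 0.5   # ASSUMPTION: the text does not state the cutoff for "helpful"


def filter_helpful(reviews, min_total_votes=MIN_TOTAL_VOTES, helpful_ratio=HELPFUL_RATIO):
    """Keep reviews with enough votes, then keep those mostly voted helpful."""
    # Step 1: total_votes >= 20, which also guarantees a non-zero divisor below.
    voted = [r for r in reviews if r.total_votes >= min_total_votes]
    # Step 2: share of helpful votes at or above the assumed threshold.
    return [r for r in voted if r.helpful_votes / r.total_votes >= helpful_ratio]


def split_by_vine(reviews):
    """Step 3: separate paid (Vine) reviews from unpaid ones."""
    paid = [r for r in reviews if r.vine == "Y"]
    unpaid = [r for r in reviews if r.vine == "N"]
    return paid, unpaid


# Made-up sample rows, not data from the project.
sample = [
    Review("R1", 5, 25, 30, "Y"),
    Review("R2", 4, 3, 40, "N"),   # enough votes, but few are helpful
    Review("R3", 5, 0, 0, "N"),    # no votes at all, removed in step 1
    Review("R4", 2, 18, 20, "N"),
]

helpful = filter_helpful(sample)
paid, unpaid = split_by_vine(helpful)
print([r.review_id for r in paid], [r.review_id for r in unpaid])
```

Applying the vote-count filter first is what makes the ratio in the second step safe: a row with zero total votes never reaches the division.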

Finally, our analysis helped us answer the questions below.

  • How many Vine reviews and non-Vine reviews were there?

    • Based on our calculations, there were 969 Vine reviews and 43,745 non-Vine reviews.
  • How many Vine reviews were 5 stars? How many non-Vine reviews were 5 stars?

    • 430 Vine reviews were 5 stars, while 19,233 non-Vine reviews were 5 stars.
  • What percentage of Vine reviews were 5 stars? What percentage of non-Vine reviews were 5 stars?

    • 44.38% of the Vine reviews are 5-star reviews.
    • 43.97% of the non-Vine reviews are 5-star reviews.
    • Paid (Vine) 5-star reviews accounted for only 0.2% of all the reviews provided.
    • Non-Vine 5-star reviews accounted for 0.73% of all the reviews provided.
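
The two within-group percentages can be reproduced from the counts reported above. The following is a minimal sketch in plain Python; the function name five_star_summary and the dictionary keys are hypothetical, and only the four counts come from the results.

```python
def five_star_summary(total_reviews, five_star_reviews):
    """Return the counts and the share of 5-star reviews as a percentage."""
    if total_reviews == 0:
        raise ValueError("total_reviews must be greater than zero")
    pct = round(five_star_reviews / total_reviews * 100, 2)
    return {
        "total": total_reviews,
        "five_star": five_star_reviews,
        "pct_five_star": pct,
    }


# Counts taken from the results listed above.
vine = five_star_summary(total_reviews=969, five_star_reviews=430)
non_vine = five_star_summary(total_reviews=43745, five_star_reviews=19233)

print(vine["pct_five_star"])      # 44.38
print(non_vine["pct_five_star"])  # 43.97
```

The gap between the two groups is roughly 0.4 percentage points.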

Summary:

Based on the results above, there is no clear evidence of positivity bias for reviews in the Vine program. The percentages of 5-star reviews are very close whether or not the reviewer is part of the Vine program.

Another analysis that helps support this finding is to compare the 5-star reviews in the table to the total number of reviews provided. The majority of the reviews came from customers who are not in the Vine program.

We could also add the 4-star reviews to the analysis. Although 4 stars is not a perfect rating, it is close, and most customers would still purchase a product rated 4 stars or higher.


