The purpose of this project is to analyze the Amazon Vine program and determine whether reviews from Vine members are biased toward favorable ratings compared with reviews from non-Vine members. PySpark was used to perform the ETL process and to connect to an Amazon Web Services RDS instance. In PySpark, the data was transformed and loaded into tables on that instance, which were managed through pgAdmin. Lastly, a Jupyter notebook was used to calculate different metrics on the Vine reviews. The dataset used here contains US reviews for Toys.
Percentage of Vine Reviews that were 5 stars
Percentage of non-Vine reviews that were 5 stars
The metrics revealed that 2.6% of the reviews in the Vine program are 5-star reviews, while non-Vine reviews account for 97.4% of the total. Based on these results, there is no positivity bias for reviews in the Vine program. The Vine program, which requires reviews, does not appear to incline customers to give good reviews once they receive their orders. However, almost all of the toy reviewers in this dataset are not part of the Vine program, so it is not the ideal dataset for exploring positivity bias. An additional analysis could be conducted at the opposite end of the spectrum by repeating the analysis for 1-star reviews. The results would help to confirm whether any bias can be inferred from the Toys dataset.
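As an illustration of how the two percentages listed above, and the suggested 1-star follow-up, might be computed, here is a minimal sketch in plain Python. It is not the notebook code from this project. The field names `vine` (with values "Y"/"N") and `star_rating`, the helper name `star_share_by_vine`, and the sample rows are all assumptions made for the example.

```python
def star_share_by_vine(reviews, stars=5):
    """Return, per Vine flag, the percentage of that group's reviews
    that have the given star rating.

    Assumes each review is a dict with a "vine" flag and an integer
    "star_rating" (hypothetical field names for this sketch).
    """
    totals = {}   # number of reviews per Vine flag
    matches = {}  # number of reviews per Vine flag with the requested rating
    for review in reviews:
        flag = review["vine"]
        totals[flag] = totals.get(flag, 0) + 1
        if review["star_rating"] == stars:
            matches[flag] = matches.get(flag, 0) + 1
    # Each percentage uses its own group's total as the denominator.
    return {
        flag: 100.0 * matches.get(flag, 0) / total
        for flag, total in totals.items()
    }


# Made-up sample rows for illustration only (not taken from the Toys dataset).
sample_reviews = [
    {"vine": "Y", "star_rating": 5},
    {"vine": "Y", "star_rating": 3},
    {"vine": "N", "star_rating": 5},
    {"vine": "N", "star_rating": 5},
    {"vine": "N", "star_rating": 1},
    {"vine": "N", "star_rating": 4},
]

five_star = star_share_by_vine(sample_reviews)
one_star = star_share_by_vine(sample_reviews, stars=1)
print(five_star)
print(one_star)
```

Because each group is divided by its own review count, the Vine and non-Vine percentages are independent of each other and do not need to sum to 100%. The `stars` argument lets the same helper be reused for the 1-star comparison.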