
# yelpredict

The goal of yelpredict is to provide a fast and accurate classification model that predicts a Yelp review’s rating from its text. The package follows tidy principles and works with the pipe operator. The code is vectorized, so it can process thousands of reviews in seconds, and it was approximately 84.5% accurate on a balanced dataset of over 200,000 Yelp reviews.

## Installation

You can install the development version from GitHub with:

``` r
# install.packages("devtools")
devtools::install_github("chris31415926535/yelpredict")
```

## Example

Here’s a simple example that runs some test reviews through the model. I’ve written three straightforward reviews, and then one tricky one with lots of negations to try to fool the model.

``` r
library(yelpredict)
library(tibble)
library(magrittr)

review_examples <- tribble(
  ~review_text, ~star_rating,
  "This place was awful!", 1,
  "The service here was great I loved it it was amazing.", 5,
  "Meh, it was pretty good I guess but not the best.", 4,
  "Not bad, not bad at all, really the opposite of terrible. I liked it.", 5
)

review_examples %>%
  knitr::kable()
```
| review_text                                                            | star_rating |
|:-----------------------------------------------------------------------|------------:|
| This place was awful!                                                  |           1 |
| The service here was great I loved it it was amazing.                  |           5 |
| Meh, it was pretty good I guess but not the best.                      |           4 |
| Not bad, not bad at all, really the opposite of terrible. I liked it.  |           5 |

The first step is to “flatten” the true ratings down from integer star ratings to binary positive/negative ratings. I’ve written a simple function called flatten_stars() to take care of this: you pipe the input data to it and tell it which column contains the numeric ratings, and it does the rest.

``` r
review_examples %>%
  flatten_stars(star_rating) %>%
  knitr::kable()
```
| review_text                                                            | rating |
|:-----------------------------------------------------------------------|:-------|
| This place was awful!                                                  | NEG    |
| The service here was great I loved it it was amazing.                  | POS    |
| Meh, it was pretty good I guess but not the best.                      | POS    |
| Not bad, not bad at all, really the opposite of terrible. I liked it.  | POS    |
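
Conceptually, all this step needs is a cutoff between positive and negative star counts. Here’s a minimal sketch of how such a helper could be written with dplyr’s curly-curly tidy evaluation; the 4-or-more-stars cutoff is an assumption inferred from the output above (the 4-star review maps to POS), not the package’s actual source.

``` r
library(dplyr)

flatten_stars_sketch <- function(data, stars_col) {
  data %>%
    # assumed cutoff: 4 or more stars counts as a positive review
    mutate(rating = if_else({{ stars_col }} >= 4, "POS", "NEG")) %>%
    select(-{{ stars_col }})
}

review_examples %>%
  flatten_stars_sketch(star_rating)  # same POS/NEG column as above
```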

The next step is to prepare the text by computing its mean AFINN sentiment score, counting its negations (“but”s and “not”s), and finding which word-length quintile it falls into. The prepare_yelp() function takes care of this: you pipe the data to it and specify the column containing the text.

``` r
review_examples %>%
  flatten_stars(star_rating) %>%
  prepare_yelp(review_text) %>%
  knitr::kable()
```
| review_text                                                            | rating | afinn_mean | buts_nots | qtile |
|:-----------------------------------------------------------------------|:-------|-----------:|----------:|------:|
| This place was awful!                                                  | NEG    |  -3.000000 |         0 |     1 |
| The service here was great I loved it it was amazing.                  | POS    |   3.333333 |         0 |     1 |
| Meh, it was pretty good I guess but not the best.                      | POS    |   2.333333 |         2 |     1 |
| Not bad, not bad at all, really the opposite of terrible. I liked it.  | POS    |  -1.750000 |         1 |     1 |
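
None of this requires anything exotic; below is a rough, assumed reimplementation using tidytext, dplyr, and stringr. It is illustrative only: the negation-counting rule is a guess (it gives 2 rather than 1 on the last review, so the package’s real rule differs), and the real quintile cut points are presumably fixed from the package’s training data rather than recomputed with ntile().

``` r
library(dplyr)
library(stringr)
library(tidytext)

afinn <- get_sentiments("afinn")  # word-level sentiment scores, -5 to +5

# mean AFINN score per review, over the words found in the lexicon
afinn_scores <- review_examples %>%
  mutate(doc_id = row_number()) %>%
  unnest_tokens(word, review_text) %>%
  inner_join(afinn, by = "word") %>%
  group_by(doc_id) %>%
  summarise(afinn_mean = mean(value), .groups = "drop")

review_examples %>%
  mutate(
    doc_id    = row_number(),
    # naive negation count; disagrees with the package on the last review,
    # so the actual rule is evidently more subtle
    buts_nots = str_count(str_to_lower(review_text), "\\b(but|not)\\b"),
    # ntile() recomputes quintile boundaries on whatever data it's given;
    # the package presumably uses boundaries fixed from its training data
    qtile     = ntile(str_count(review_text, "\\S+"), 5)
  ) %>%
  left_join(afinn_scores, by = "doc_id") %>%
  select(-doc_id)
```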

Next we invoke the model to get the probability that each review is positive. We do this by piping the flattened and prepared data to get_prob(), which takes no mandatory arguments and appends the log-odds and probability for each review.

``` r
review_examples %>%
  flatten_stars(star_rating) %>%
  prepare_yelp(review_text) %>%
  get_prob() %>%
  knitr::kable()
```
| review_text                                                            | rating | afinn_mean | buts_nots | qtile | log_odds |      prob |
|:-----------------------------------------------------------------------|:-------|-----------:|----------:|------:|---------:|----------:|
| This place was awful!                                                  | NEG    |  -3.000000 |         0 |     1 |   -4.199 | 0.0147886 |
| The service here was great I loved it it was amazing.                  | POS    |   3.333333 |         0 |     1 |    3.401 | 0.9677358 |
| Meh, it was pretty good I guess but not the best.                      | POS    |   2.333333 |         2 |     1 |    0.141 | 0.5351917 |
| Not bad, not bad at all, really the opposite of terrible. I liked it.  | POS    |  -1.750000 |         1 |     1 |   -3.729 | 0.0234536 |
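
Although the fitted coefficients live inside the package, get_prob() presumably applies a logistic regression: a linear combination of the features gives the log-odds, and the inverse logit converts that to a probability. A sketch of that plumbing, with invented coefficients; only the mechanics are the point:

``` r
library(dplyr)

b0 <- -0.5  # intercept          (made-up value)
b1 <-  1.2  # weight: afinn_mean (made-up value)
b2 <- -0.4  # weight: buts_nots  (made-up value)
b3 <-  0.1  # weight: qtile      (made-up value)

review_examples %>%
  flatten_stars(star_rating) %>%
  prepare_yelp(review_text) %>%
  mutate(
    log_odds = b0 + b1 * afinn_mean + b2 * buts_nots + b3 * qtile,
    prob     = plogis(log_odds)  # inverse logit: exp(x) / (1 + exp(x))
  )
```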

The last modeling step is to predict the rating, which we do with the suitably named predict_rating() function. By default it predicts a positive rating when the probability is greater than 0.5, but this threshold can be changed. (In testing, 0.5 gave the best results.)

``` r
review_examples %>%
  flatten_stars(star_rating) %>%
  prepare_yelp(review_text) %>%
  get_prob() %>%
  predict_rating() %>%
  knitr::kable()
```
| review_text                                                            | rating | afinn_mean | buts_nots | qtile | log_odds |      prob | pred |
|:-----------------------------------------------------------------------|:-------|-----------:|----------:|------:|---------:|----------:|:-----|
| This place was awful!                                                  | NEG    |  -3.000000 |         0 |     1 |   -4.199 | 0.0147886 | NEG  |
| The service here was great I loved it it was amazing.                  | POS    |   3.333333 |         0 |     1 |    3.401 | 0.9677358 | POS  |
| Meh, it was pretty good I guess but not the best.                      | POS    |   2.333333 |         2 |     1 |    0.141 | 0.5351917 | POS  |
| Not bad, not bad at all, really the opposite of terrible. I liked it.  | POS    |  -1.750000 |         1 |     1 |   -3.729 | 0.0234536 | NEG  |
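
The default behavior amounts to a one-line threshold rule. A minimal equivalent; the `cutoff` argument name here is an assumption, not the package’s actual interface:

``` r
library(dplyr)

predict_rating_sketch <- function(data, cutoff = 0.5) {
  data %>%
    # label a review positive when its predicted probability clears the cutoff
    mutate(pred = if_else(prob > cutoff, "POS", "NEG"))
}
```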

Finally, if so desired, you can compute the overall accuracy with the get_accuracy() function, which takes the name of the column containing the true ratings as its input.

``` r
review_examples %>%
  flatten_stars(star_rating) %>%
  prepare_yelp(review_text) %>%
  get_prob() %>%
  predict_rating() %>%
  get_accuracy(rating) %>%
  knitr::kable()
```
| accuracy |
|---------:|
|     0.75 |

As expected, the model is 75% accurate on this toy dataset: it got confused by the review with lots of negations.
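
For completeness, the accuracy computation reduces to a single mean over matching labels; a sketch assuming the column names shown in the tables above:

``` r
library(dplyr)

get_accuracy_sketch <- function(data, truth_col) {
  data %>%
    # share of rows where the prediction matches the true label
    summarise(accuracy = mean(pred == {{ truth_col }}))
}

review_examples %>%
  flatten_stars(star_rating) %>%
  prepare_yelp(review_text) %>%
  get_prob() %>%
  predict_rating() %>%
  get_accuracy_sketch(rating)
# returns 0.75: three of the four predictions match the true labels
```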

