Abstract

In this project, I adopt a Random Forest supervised machine learning approach to predict Yelp review star ratings. Using a combination of user, review and business information, and applying sentiment analysis to the review text, I find that the most important predictors are text sentiment, user average stars and business average stars. The model achieves a training accuracy of 63.7% and an accuracy of 63.1% on a test dataset of 30,000 reviews.

Data Methodology

I approached this project using the CRISP-DM process, as it felt the most intuitive and adoptable. The iterative nature of this approach allowed me to ensure all steps were aligned with the overarching project objectives (business understanding), and its uniform structure held me accountable to stay on track. Of the six stages, I did not use the deployment phase, owing to the absence of stakeholder engagement requirements. To illustrate my use of this framework, I have structured this report according to each of its stages.

In the initial business understanding phase, I familiarised myself with the objectives of the project and constructed a project plan with a personal timeline, which ensured I completed the project on time.

Data Understanding

Upon further exploration of the 5 provided Yelp datasets, the most relevant variables were identified in 3, which were then merged: the Yelp Review, Yelp User and Yelp Business datasets. One difficulty that became apparent when merging was the randomness of the small review and user datasets: users were inconsistent between the two, so not all information could be matched. To circumvent this problem, the small reviews dataset was merged with the full user information dataset, which allowed all users (give number) to be captured. Subsequently, some variables, such as average review stars, were taken from the business dataset and merged in.
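As a sketch of how such a merge can be expressed in R (the column names review_id, user_id, business_id and average_stars are assumptions about the Yelp schema, and the tiny dataframes below are stand-ins for the real datasets):

```r
# Hypothetical sketch of the merging step: join the small review dataset to the
# full user dataset, then attach selected business-level variables.
reviews <- data.frame(review_id = c("r1", "r2"),
                      user_id = c("u1", "u2"),
                      business_id = c("b1", "b1"),
                      stars = c(4, 2))
users <- data.frame(user_id = c("u1", "u2"),
                    average_stars = c(3.8, 4.1))
businesses <- data.frame(business_id = "b1",
                         business_avg_stars = 3.5)

merged <- merge(reviews, users, by = "user_id")         # attach user information
merged <- merge(merged, businesses, by = "business_id") # attach business averages
```

Because the small review dataset drives the join, every review keeps its matching user and business information.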

Data Processing

Initial data processing involved omitting NA values from the newly constructed dataframe, as very few such values were found. Variable data types which had previously been wrongly classified (stars, state) were corrected at this stage, and the elite and friends variables were transformed into count variables, allowing them to be used within a model.

library(dplyr)
library(stringr)

#Elite as numerical: clean inconsistent year strings, then count the elite years
merged_na1 <- merged_na %>%
  mutate(elite = str_replace(elite, "20,20,2021", "2020,2021")) %>%
  mutate(elite = str_replace(elite, "20,20 ", "2020")) %>%
  mutate(elite = str_replace(elite, "20202021", "2020,2021")) %>%
  mutate(elite = na_if(elite, "")) %>%
  mutate(elite = ifelse(is.na(elite), 0, str_count(elite, ",") + 1)) %>%
  mutate(elite = as.numeric(elite))

#Friends as numerical: count the comma-separated friend IDs
merged_na1 <- merged_na1 %>%
  mutate(friends = ifelse(friends == "None", 0, str_count(friends, ",") + 1))

#State as categorical
merged_na1$state <- factor(merged_na1$state)

A days_difference variable was also generated to count the days between the date a review was posted and the date the user created their Yelp account.
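A minimal base-R sketch of this calculation (the column names date and yelping_since are assumptions about the merged dataframe, with dummy values for illustration):

```r
# Days between the review date and the account-creation date
df <- data.frame(date = as.Date(c("2021-05-10", "2020-01-01")),
                 yelping_since = as.Date(c("2015-03-02", "2019-12-31")))
df$days_difference <- as.numeric(difftime(df$date, df$yelping_since, units = "days"))
```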

To further understand the significance of these variables and whether they were relevant to star predictions, star distributions were examined against these potential predictors.

At this stage, the effect of the 11 different compliment variables and the useful/funny/cool votes on the number of stars a review had was also uncertain. However, visualising the data shows that these variables can play some role in prediction:

Sentiment Analysis

The most valuable part of the preprocessing stage was determining a sentiment score for the review text. Many papers have explored the use of sentiment analysis in predicting Yelp review ratings (Elkouri, 2015; Faisol et al., 2020; Agrawal, 2017), focusing solely on developing supervised learning models for star ratings based on the review text. Thus, it was critical for this project to utilise this information in its prediction. This project uses the sentimentr package to provide an aggregate sentiment score for each review text. The package calculates text polarity at the sentence level and, for a longer extract, averages the sentence sentiment scores to output an overall score and standard deviation.

Crucially, this package differs from others in that it accounts for negation and other valence shifters, such as amplifiers, which are a common occurrence in reviews. As can be seen in the density graph below, the scores in this data are skewed slightly to the right and fall within a -1 to +1 range.

library(sentimentr)
library(magrittr)  # for the %$% exposition operator

#Sentence-level sentiment, aggregated to one score per review
sentiment_1 <- merged_na3 %>%
  get_sentences() %$%
  sentiment_by(text, by = list(review_id))

The sentiment scores correlate with each of the star categories as one would expect:

Had time permitted, this project would also have accounted for the effect of certain punctuation (e.g. '!') on the overall sentiment of the text.

Modelling and Evaluation

To predict the number of stars user(i) gives for business(j), I have chosen to deploy a classification tree due to the categorical nature of stars. Specifically, I have used a random forest to reduce the high variance and tendency to overfit that regular decision trees are prone to. This is done through bootstrap aggregation (bagging), where estimates are averaged over several independently drawn trees. Both bagging and random forests follow this methodology; however, a random forest has the distinct advantage that it considers only a random subset of the predictors at each split, reducing the correlation between trees, which is a source of variance.

The final cleaned dataset contains 300,000 (randomly selected) reviews from the merged Yelp dataset; the observations are split, holding 90% as the training set and 10% as the test set. The dataset includes the stars variable and 27 predictors, picked out in the pre-processing stage. Due to uncertainty as to which of the variables would be the best predictors, the random forest is run with all of them. After tuning the number of trees (ntree) and the number of variables to be considered at each split (mtry), the main predictive model used is a random forest classification with ntree=1000 and mtry=6.
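The 90/10 split described above might be sketched as follows (final_data stands in for the cleaned 300,000-review dataframe; a small dummy frame is used here so the snippet is self-contained):

```r
# 90/10 train/test split; `final_data` is a stand-in for the cleaned dataset
final_data <- data.frame(stars.x = factor(sample(1:5, 1000, replace = TRUE)))
set.seed(1)
train_idx <- sample(seq_len(nrow(final_data)), size = floor(0.9 * nrow(final_data)))
train_data <- final_data[train_idx, , drop = FALSE]
test_data  <- final_data[-train_idx, , drop = FALSE]
```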

library(randomForest)

set.seed(1)
model_RF1000_6 <- randomForest(stars.x ~ ., data = train_data, ntree = 1000,
                               mtry = 6, importance = TRUE, do.trace = TRUE)
save(model_RF1000_6, file = 'randomforest_1000_6')
pred_RF_test1000_6 <- predict(model_RF1000_6, test_data)

#Accuracy on training data (drop the class.error column before summing)
train_conf <- model_RF1000_6$confusion[, -ncol(model_RF1000_6$confusion)]
accuracytrain1000_6 <- sum(diag(train_conf)) / sum(train_conf)
print(paste("Accuracy on Train Data:", round(accuracytrain1000_6, 4)))

#Accuracy on test data
conf_matrix1000_6 <- table(Actual = test_data$stars.x, Predicted = pred_RF_test1000_6)
accuracy1000_6 <- sum(diag(conf_matrix1000_6)) / sum(conf_matrix1000_6)
print(paste("Accuracy on Test Data:", round(accuracy1000_6, 4)))

The out-of-bag (OOB) error generated by each model was used to decide the number of trees and split variables considered. The OOB error is an approximately unbiased estimate of model performance, based on the observations outside each bootstrapped sample. The results are shown below:

Trees OOB Error (%)
100 37.23
500 36.44
1000 36.25

Increasing the number of trees to 1000 decreases the OOB error. Ideally, one would choose ntree where the OOB error starts to stabilise, but this is too computationally expensive to determine, so ntree=1000 is used.

The default mtry is the square root of the number of variables used within the classification tree, which in this case would be 5 (the integer part of the square root of 27). To find the optimal mtry, the range 3-7 was tested with ntree=100.
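This tuning step might be sketched as a simple loop over candidate mtry values, recording each model's final OOB error (a sketch, not the exact code used; train_data is replaced by a small synthetic frame so the snippet is self-contained):

```r
library(randomForest)

# Fit a small forest for each candidate mtry and record its final OOB error
set.seed(1)
train_data <- data.frame(stars.x = factor(sample(1:5, 500, replace = TRUE)),
                         matrix(rnorm(500 * 8), ncol = 8))
oob_errors <- sapply(3:7, function(m) {
  set.seed(1)
  rf <- randomForest(stars.x ~ ., data = train_data, ntree = 100, mtry = m)
  rf$err.rate[nrow(rf$err.rate), "OOB"]  # OOB error rate after the last tree
})
names(oob_errors) <- paste0("mtry=", 3:7)
```

The mtry value with the lowest (or first stabilising) OOB error would then be carried forward to the full model.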

The OOB error stabilises when mtry=6. This justifies running a random forest model with ntree=1000 and mtry=6.

This RF model had an accuracy rate of 63.71% on the training data and an OOB error of 36.29%. This is summarised below alongside the other specifications run:

Model                                      Training Accuracy (%)  Test Accuracy (%)  OOB (%)
Main: Random Forest (ntree=1000, mtry=6)   63.71                  63.08              36.29
Random Forest (ntree=500, mtry=6)          63.61                  63.05              36.39
Random Forest (ntree=100, mtry=5)          62.77                  62.68              37.23
Random Forest (ntree=500, mtry=5)          63.56                  62.80              36.44
Random Forest (ntree=1000, mtry=5)         63.75                  63.01              36.25
Bagging (ntree=500)                        NA                     55.70              NA

The main model outperforms the other variations and bagging on the test data accuracy.

Upon further examination, it was possible to break down the importance of each of the variables.

This graph shows the mean decrease in accuracy that would be observed if each variable were removed (permuted) in the model. It is evident that the sentiment score is crucial to prediction and has been the key feature behind the performance of this model.
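For reference, this kind of importance plot can be produced directly from a fitted randomForest object via its permutation importance (a sketch on synthetic data; in practice the fitted model from the modelling section would be used):

```r
library(randomForest)

# Extract and plot permutation importance (mean decrease in accuracy).
# A tiny model on synthetic data stands in for the full fitted forest.
set.seed(1)
d <- data.frame(y = factor(sample(1:3, 300, replace = TRUE)),
                x1 = rnorm(300), x2 = rnorm(300))
fit <- randomForest(y ~ ., data = d, ntree = 100, importance = TRUE)
imp <- importance(fit, type = 1)  # type = 1: mean decrease in accuracy
varImpPlot(fit, type = 1)
```

Note that importance = TRUE must be set when training, otherwise only the Gini-based measure is available.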

Challenges

The main challenge I encountered revolved around sentiment analysis, particularly in locating a suitable package to facilitate the analysis. In my research, I found that sentiment analysis was commonly applied at the individual word level, using a pre-existing dictionary to assign scores and count frequencies. However, adapting this approach to assess sentiment at the sentence and extract levels, consistent with the format of reviews, posed a significant challenge when using existing packages such as tidytext and vader. After trying these packages, I felt these methods of interpreting sentiment would not be effective for multi-sentence reviews. Further research led me to the sentimentr package, which fit my needs.

References

Agrawal, A., 2017. Yelp analytics (Doctoral dissertation, Rutgers University-Camden Graduate School).

Elkouri, A., 2015. Predicting the sentiment polarity and rating of Yelp reviews. arXiv preprint arXiv:1512.06303.

Faisol, H., Djajadinata, K. and Muljono, M., 2020, September. Sentiment analysis of Yelp review. In 2020 International Seminar on Application for Technology of Information and Communication (iSemantic) (pp. 179-184). IEEE.