Predictive Models for the Online News Popularity Data Set

The purpose of this repository is to build predictive models and automate R Markdown reports. The analyses use the Online News Popularity Data Set from UCI. Additional information about this data can be accessed here.

The data contains the following variables:

  1. url: URL of the article (non-predictive)
  2. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
  3. n_tokens_title: Number of words in the title
  4. n_tokens_content: Number of words in the content
  5. n_unique_tokens: Rate of unique words in the content
  6. n_non_stop_words: Rate of non-stop words in the content
  7. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
  8. num_hrefs: Number of links
  9. num_self_hrefs: Number of links to other articles published by Mashable
  10. num_imgs: Number of images
  11. num_videos: Number of videos
  12. average_token_length: Average length of the words in the content
  13. num_keywords: Number of keywords in the metadata
  14. data_channel_is_lifestyle: Is data channel 'Lifestyle'?
  15. data_channel_is_entertainment: Is data channel 'Entertainment'?
  16. data_channel_is_bus: Is data channel 'Business'?
  17. data_channel_is_socmed: Is data channel 'Social Media'?
  18. data_channel_is_tech: Is data channel 'Tech'?
  19. data_channel_is_world: Is data channel 'World'?
  20. kw_min_min: Worst keyword (min. shares)
  21. kw_max_min: Worst keyword (max. shares)
  22. kw_avg_min: Worst keyword (avg. shares)
  23. kw_min_max: Best keyword (min. shares)
  24. kw_max_max: Best keyword (max. shares)
  25. kw_avg_max: Best keyword (avg. shares)
  26. kw_min_avg: Avg. keyword (min. shares)
  27. kw_max_avg: Avg. keyword (max. shares)
  28. kw_avg_avg: Avg. keyword (avg. shares)
  29. self_reference_min_shares: Min. shares of referenced articles in Mashable
  30. self_reference_max_shares: Max. shares of referenced articles in Mashable
  31. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
  32. weekday_is_monday: Was the article published on a Monday?
  33. weekday_is_tuesday: Was the article published on a Tuesday?
  34. weekday_is_wednesday: Was the article published on a Wednesday?
  35. weekday_is_thursday: Was the article published on a Thursday?
  36. weekday_is_friday: Was the article published on a Friday?
  37. weekday_is_saturday: Was the article published on a Saturday?
  38. weekday_is_sunday: Was the article published on a Sunday?
  39. is_weekend: Was the article published on the weekend?
  40. LDA_00: Closeness to LDA topic 0
  41. LDA_01: Closeness to LDA topic 1
  42. LDA_02: Closeness to LDA topic 2
  43. LDA_03: Closeness to LDA topic 3
  44. LDA_04: Closeness to LDA topic 4
  45. global_subjectivity: Text subjectivity
  46. global_sentiment_polarity: Text sentiment polarity
  47. global_rate_positive_words: Rate of positive words in the content
  48. global_rate_negative_words: Rate of negative words in the content
  49. rate_positive_words: Rate of positive words among non-neutral tokens
  50. rate_negative_words: Rate of negative words among non-neutral tokens
  51. avg_positive_polarity: Avg. polarity of positive words
  52. min_positive_polarity: Min. polarity of positive words
  53. max_positive_polarity: Max. polarity of positive words
  54. avg_negative_polarity: Avg. polarity of negative words
  55. min_negative_polarity: Min. polarity of negative words
  56. max_negative_polarity: Max. polarity of negative words
  57. title_subjectivity: Title subjectivity
  58. title_sentiment_polarity: Title polarity
  59. abs_title_subjectivity: Absolute subjectivity level
  60. abs_title_sentiment_polarity: Absolute polarity level
  61. shares: Number of shares (target)
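
The six data_channel_is_* indicators listed above can be collapsed into a single label column, which is convenient for subsetting by channel. The following is a minimal sketch in base R, not the authors' code: the data frame `news` and its rows are made up for illustration, and the column name `channel` is an assumption chosen to match the `newData$channel` column used by the rendering code in this README.

```r
# Toy stand-in for the real data: only the six channel indicators (hypothetical rows).
news <- data.frame(
  data_channel_is_lifestyle     = c(1, 0, 0, 0),
  data_channel_is_entertainment = c(0, 1, 0, 0),
  data_channel_is_bus           = c(0, 0, 0, 0),
  data_channel_is_socmed        = c(0, 0, 0, 0),
  data_channel_is_tech          = c(0, 0, 1, 0),
  data_channel_is_world         = c(0, 0, 0, 0)
)

channel_cols <- grep("^data_channel_is_", names(news), value = TRUE)

# Index of the first indicator equal to 1 in each row; NA when no channel is flagged.
first_flag <- apply(news[channel_cols] == 1, 1, function(r) which(r)[1])

# Strip the common prefix to get a short label such as "tech".
news$channel <- sub("^data_channel_is_", "", channel_cols)[first_flag]

# Subset for one report, here the Tech channel.
tech <- news[!is.na(news$channel) & news$channel == "tech", ]
```

Rows with no indicator set keep `NA` as their label in this sketch, so they are excluded from every per-channel subset.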

In this project, the data were subset by the data_channel_is_* indicators so that one R Markdown report could be generated automatically for each channel. The predictive models include linear regression models, a random forest model, and a boosted tree model. Each model was fit on a training data set and then evaluated on a testing data set. The best model was the one with the lowest RMSE on the testing data.
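
As a rough illustration of the train/test workflow and RMSE-based selection described above, here is a small base R sketch. It is not the project's modelling code: the data are synthetic, the 70/30 split and the seed are assumptions, and only two linear models are compared. A random forest or boosted tree could be added to the `fits` list and scored in the same way.

```r
set.seed(1)  # reproducible split; the seed value is arbitrary

# Synthetic stand-in for one channel's subset (not the real data).
n <- 200
channel_data <- data.frame(
  n_tokens_title = rpois(n, 10),
  num_imgs       = rpois(n, 3)
)
channel_data$shares <- 1000 + 150 * channel_data$num_imgs + rnorm(n, sd = 200)

# 70/30 train/test split (the proportion is an assumption).
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- channel_data[train_idx, ]
test  <- channel_data[-train_idx, ]

# Root mean squared error of predictions against observed values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Two candidate linear models, fit on the training data only.
fits <- list(
  title_only = lm(shares ~ n_tokens_title, data = train),
  full       = lm(shares ~ n_tokens_title + num_imgs, data = train)
)

# Score each candidate on the held-out test set and keep the lowest RMSE.
test_rmse <- sapply(fits, function(f) rmse(test$shares, predict(f, newdata = test)))
best <- names(which.min(test_rmse))
```

Scoring on the held-out rows rather than the training rows keeps the comparison from favouring whichever model overfits the most.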

List of packages used:

  1. caret To run the regression and ensemble methods with a train/test split and cross-validation.
  2. dplyr A part of the tidyverse used for manipulating data.
  3. GGally To create ggcorr() and ggpairs() correlation plots.
  4. glmnet To fit penalized (lasso and elastic-net) regression models for variable selection.
  5. ggplot2 A part of the tidyverse used for creating graphics.
  6. gridExtra To arrange multiple grid objects (plots) on one page.
  7. gt To create formatted display tables.
  8. knitr To get nice table printing formats, mainly for the contingency tables.
  9. leaps To identify different best models of different sizes.
  10. rmarkdown To render the reports in several output formats.
  11. MASS To access forward and backward selection algorithms.
  12. randomForest To access random forest algorithms.
  13. tidyr A part of the tidyverse used for data cleaning.

Links to the results:

The analysis for Lifestyle articles is available here.
The analysis for Entertainment articles is available here.
The analysis for Business articles is available here.
The analysis for Social media articles is available here.
The analysis for Tech articles is available here.
The analysis for World articles is available here.

Code used to create the analysis.

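library(rmarkdown)  # render()
library(tibble)     # tibble()

# One report per distinct data channel
selectID <- unique(newData$channel)

# Output file name for each channel
output_file <- paste0(selectID, "")

# One parameter list per channel, passed to the Rmd as params$channel
params <- lapply(selectID, FUN = function(x){list(channel = x)})

reports <- tibble(output_file, params)

# Render Project_3.Rmd once per row of reports
apply(reports, MARGIN = 1,
      FUN = function(x){
        render(input = "./Project_3.Rmd",
               output_format = "github_document",
               output_file = x[[1]],
               params = x[[2]])
      })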
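
To see what the loop above hands to render() without actually knitting anything, the pairing of file names and parameter lists can be dry-run in base R. This is only a sketch under assumptions: the channel labels are hypothetical stand-ins for the values in newData$channel, the ".md" suffix is assumed, and `jobs` is a name introduced here for illustration.

```r
# Hypothetical channel labels; the real values come from newData$channel.
selectID <- c("lifestyle", "tech")

output_file <- paste0(selectID, ".md")  # the ".md" suffix is an assumption
params <- lapply(selectID, function(x) list(channel = x))

# Pair each file name with its parameter list, one element per report.
jobs <- Map(function(f, p) list(output_file = f, params = p), output_file, params)

# Dry run: print what would be rendered instead of calling render().
for (job in jobs) {
  message("would render ", job$output_file, " with channel = ", job$params$channel)
}
```

Inside the R Markdown file, the value passed this way is read as params$channel, which is how each report can filter the data to its own channel.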


Group project - Smitali Patnaik & Paula Bailey







