eapost / Online-News-Popularity

Predictive models from news data

Online-News-Popularity

The data for this assignment describe articles published on the popular website Mashable (www.mashable.com). The dataset does not share the original content, only some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs. All sites and related data were downloaded on January 8, 2015. The relative performance values were estimated by the authors using a Random Forest classifier and a rolling window as the assessment method; see Fernandes et al. (2015) for more details on how these values were set.

The main variable of the study is the number of shares, which measures the popularity of a post. We are interested in identifying the ingredients of a successful post and what it takes for a post to go viral. Each student will handle a random sub-sample of 10000 observations, to be used for training their model and for inference. All students will use a common evaluation/test dataset of 10000 observations.
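As a rough illustration of how a random sub-sample of 10000 observations could be drawn reproducibly, here is a minimal base R sketch. The data frame `full`, its size `n_full`, and the seed are assumptions made up for the example; the real sub-samples are presumably distributed with the assignment.

```r
set.seed(2015)  # arbitrary seed, chosen only so the draw is reproducible

# Simulated stand-in for the full set of posts (hypothetical size and values).
n_full <- 30000
full <- data.frame(id = seq_len(n_full),
                   shares = rpois(n_full, lambda = 1500))

# Draw 10000 distinct rows (sampling without replacement is the default).
train_idx <- sample(n_full, size = 10000)
train <- full[train_idx, ]

dim(train)
```

Fixing the seed means the same rows are selected on every run, which keeps later model comparisons on the sub-sample repeatable.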

  1. You should first do some exploratory data analysis. Visualizing the data should give you some insight into certain particularities of this dataset. Pairwise comparisons will also help you learn about the associations implied by the data.
  2. The main aim is to identify the best model for predicting the popularity of a post. Select the appropriate features for your model. Be careful: your model should not be over-parameterized.
  3. Check the assumptions of the model and revise your procedure.
  4. Use 10-fold cross-validation to select your model and assess the out-of-sample predictive ability of the model.
  5. Use the test dataset to select your model and assess the out-of-sample predictive ability of the model.
  6. Compare results obtained by different methods under 2, 3 and 4.
  7. Select your final model and features and justify your choice.
  8. Interpret the parameters and the predictive performance of the final model.
  9. Describe the typical profile of a post and the characteristics of a viral post.
  10. Write a report summarizing your results.
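The 10-fold cross-validation step could be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated rather than taken from the real sub-sample, the two candidate formulas are hypothetical, and modelling `log(shares)` with `lm` is one possible choice, not something the assignment prescribes.

```r
set.seed(1)  # arbitrary seed so the fold assignment is reproducible

# Simulated stand-in for a training sub-sample (not the real data).
n <- 500
train <- data.frame(
  n_tokens_content = rpois(n, 500),
  num_hrefs        = rpois(n, 10),
  is_weekend       = rbinom(n, 1, 0.15)
)
train$shares <- round(exp(7 + 0.02 * train$num_hrefs +
                          0.3 * train$is_weekend +
                          rnorm(n, sd = 0.8))) + 1

# Candidate models for log(shares); the formulas are illustrative only.
candidates <- list(
  small = log(shares) ~ num_hrefs,
  large = log(shares) ~ num_hrefs + n_tokens_content + is_weekend
)

# Assign each row to one of 10 folds of equal size.
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))

# Mean out-of-fold RMSE (on the log scale) for one model formula.
cv_rmse <- function(formula, data, fold) {
  errs <- sapply(sort(unique(fold)), function(j) {
    fit  <- lm(formula, data = data[fold != j, ])   # fit on the other folds
    held <- data[fold == j, ]                       # held-out fold
    pred <- predict(fit, newdata = held)
    sqrt(mean((log(held$shares) - pred)^2))
  })
  mean(errs)
}

cv_results <- sapply(candidates, cv_rmse, data = train, fold = fold)
print(cv_results)
```

The candidate with the lower cross-validated error would be preferred at this stage. Using the same `fold` vector for every candidate keeps the comparison fair, since all models are scored on identical held-out rows.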

Attribute Information:

Number of Attributes: 61 (58 explanatory attributes, 2 non-explanatory, 1 goal field response)

  1. url: URL of the article (non-explanatory)
  2. timedelta: Days between the article publication and the dataset acquisition (non-explanatory)
  3. n_tokens_title: Number of words in the title
  4. n_tokens_content: Number of words in the content
  5. n_unique_tokens: Rate of unique words in the content
  6. n_non_stop_words: Rate of non-stop words in the content
  7. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
  8. num_hrefs: Number of links
  9. num_self_hrefs: Number of links to other articles published by Mashable
  10. num_imgs: Number of images
  11. num_videos: Number of videos
  12. average_token_length: Average length of the words in the content
  13. num_keywords: Number of keywords in the metadata
  14. data_channel_is_lifestyle: Is data channel 'Lifestyle'?
  15. data_channel_is_entertainment: Is data channel 'Entertainment'?
  16. data_channel_is_bus: Is data channel 'Business'?
  17. data_channel_is_socmed: Is data channel 'Social Media'?
  18. data_channel_is_tech: Is data channel 'Tech'?
  19. data_channel_is_world: Is data channel 'World'?
  20. kw_min_min: Worst keyword (min. shares)
  21. kw_max_min: Worst keyword (max. shares)
  22. kw_avg_min: Worst keyword (avg. shares)
  23. kw_min_max: Best keyword (min. shares)
  24. kw_max_max: Best keyword (max. shares)
  25. kw_avg_max: Best keyword (avg. shares)
  26. kw_min_avg: Avg. keyword (min. shares)
  27. kw_max_avg: Avg. keyword (max. shares)
  28. kw_avg_avg: Avg. keyword (avg. shares)
  29. self_reference_min_shares: Min. shares of referenced articles in Mashable
  30. self_reference_max_shares: Max. shares of referenced articles in Mashable
  31. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
  32. weekday_is_monday: Was the article published on a Monday?
  33. weekday_is_tuesday: Was the article published on a Tuesday?
  34. weekday_is_wednesday: Was the article published on a Wednesday?
  35. weekday_is_thursday: Was the article published on a Thursday?
  36. weekday_is_friday: Was the article published on a Friday?
  37. weekday_is_saturday: Was the article published on a Saturday?
  38. weekday_is_sunday: Was the article published on a Sunday?
  39. is_weekend: Was the article published on the weekend?
  40. LDA_00: Closeness to LDA topic 0
  41. LDA_01: Closeness to LDA topic 1
  42. LDA_02: Closeness to LDA topic 2
  43. LDA_03: Closeness to LDA topic 3
  44. LDA_04: Closeness to LDA topic 4
  45. global_subjectivity: Text subjectivity
  46. global_sentiment_polarity: Text sentiment polarity
  47. global_rate_positive_words: Rate of positive words in the content
  48. global_rate_negative_words: Rate of negative words in the content
  49. rate_positive_words: Rate of positive words among non-neutral tokens
  50. rate_negative_words: Rate of negative words among non-neutral tokens
  51. avg_positive_polarity: Avg. polarity of positive words
  52. min_positive_polarity: Min. polarity of positive words
  53. max_positive_polarity: Max. polarity of positive words
  54. avg_negative_polarity: Avg. polarity of negative words
  55. min_negative_polarity: Min. polarity of negative words
  56. max_negative_polarity: Max. polarity of negative words
  57. title_subjectivity: Title subjectivity
  58. title_sentiment_polarity: Title polarity
  59. abs_title_subjectivity: Absolute subjectivity level
  60. abs_title_sentiment_polarity: Absolute polarity level
  61. shares: Number of shares (target response)
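Given the attribute list above, a first data-preparation pass might look like the sketch below. The data frame `posts` is a tiny made-up example (including invented URLs and values) that reuses a few of the listed column names; the log transform of `shares` is an assumption that is often convenient for count-like responses, not a requirement stated in the assignment.

```r
# Tiny simulated data frame using a few of the column names listed above.
posts <- data.frame(
  url                 = paste0("http://mashable.com/example-", 1:4, "/"),  # made-up URLs
  timedelta           = c(700, 650, 300, 20),
  n_tokens_title      = c(9, 12, 10, 8),
  weekday_is_saturday = c(0, 1, 0, 0),
  weekday_is_sunday   = c(0, 0, 1, 0),
  is_weekend          = c(0, 1, 1, 0),
  shares              = c(593, 1200, 8400, 711),
  stringsAsFactors    = FALSE
)

# Drop the two non-explanatory attributes before modelling.
non_explanatory <- c("url", "timedelta")
model_data <- posts[, !(names(posts) %in% non_explanatory)]

# If the indicators are coded as described, is_weekend is the sum of the
# Saturday and Sunday indicators, so the three columns are linearly dependent.
weekend_ok <- all(model_data$is_weekend ==
                  model_data$weekday_is_saturday + model_data$weekday_is_sunday)

# Assumed transformation: model the response on the log scale.
model_data$log_shares <- log(model_data$shares)

str(model_data)
```

When `weekend_ok` is `TRUE`, including `is_weekend` together with both weekend-day indicators adds no information, which is worth keeping in mind when guarding against an over-parameterized model.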

Languages

Language:R 100.0%