Óscar Muñoz-García Blog: State of the Art of predictive models from social media content

In marketing and business, predicting real-world outcomes is a challenging task that normally requires indicators from heterogeneous data sources. For instance, traditional media content analysis has been used to forecast the financial market [1,2,3], and several works have demonstrated connections between online content and customer behaviour (e.g., purchase decisions). Because social media feeds can be effective indicators of real-world performance [4], different forecasting models have been studied that use online chatter to predict real-world outcomes related to the sales of different kinds of goods, such as movies [4,5,6] or books [7]. Predictive models range from gross income predictions [4,5,6,8,9] to revenue estimations per product distributor (i.e., stores that offer a product or service) [5] and spike predictions in sales ranks [7]. In addition, social media plays an increasingly important role in how customers discover and engage with various forms of content, including traditional media such as TV. Along these lines, correlations have been found between online buzz and TV ratings [10].

Many social media channels have started to be exploited to obtain the indicators that feed such prediction models (e.g., Twitter [4], blog feeds [5,7], review texts [8], online news [6]). Indicators are based on volume, sentiment analysis, low-level textual features, and combinations of these with economic data or product metadata.

Volume-based indicators can be simple or composed. Among the simple predictors we find the raw count of posts referring to a brand [5,6,7,10], the number of mentions for a brand (i.e., count of entity references, taking into account that one post can mention the same entity multiple times) [6], or the number of unique authors that refer to the brand [10]. Among composed predictors we find the post rate [4] (which denotes the rate at which publications about particular topics are created, i.e., the number of posts about a topic divided by time) and the post-per-source [10] (which measures the average number of posts published about a topic in particular feed sources, e.g., a set of forums). These volume-based indicators have been demonstrated to be effective. For example, spikes in references to books in blogs are likely to be followed by spikes in their sales [7].
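
As a rough illustration, the volume-based indicators above could be computed along the following lines. This is only a sketch under stated assumptions: the `Post` record, its field names, and the naive substring matching used to detect a brand reference are hypothetical and are not taken from any of the cited works.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Post:
    author: str  # hypothetical field: account that published the post
    source: str  # hypothetical field: feed the post came from, e.g. a forum
    text: str    # hypothetical field: raw text of the post


def volume_indicators(posts, brand, period_hours):
    """Compute simple and composed volume indicators for one brand.

    Brand references are detected by case-insensitive substring matching,
    which is an assumption made for brevity, not a real entity recogniser.
    """
    needle = brand.lower()
    # Keep only the posts that refer to the brand at least once.
    matching = [p for p in posts if needle in p.text.lower()]

    # Simple indicators.
    post_count = len(matching)
    # One post can mention the same entity several times.
    mention_count = sum(p.text.lower().count(needle) for p in matching)
    unique_authors = len({p.author for p in matching})

    # Composed indicators.
    post_rate = post_count / period_hours  # posts per hour
    per_source = Counter(p.source for p in matching)
    # Average number of matching posts per feed source that has any.
    posts_per_source = post_count / len(per_source) if per_source else 0.0

    return {
        "post_count": post_count,
        "mention_count": mention_count,
        "unique_authors": unique_authors,
        "post_rate": post_rate,
        "posts_per_source": posts_per_source,
    }
```

Calling `volume_indicators(posts, "Inception", period_hours=24)` on a list of `Post` objects returns all five indicators in one dictionary, which can then be used as a feature vector for a forecasting model.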

Sentiment analysis-based indicators rest on the hypothesis that products that are talked about positively will produce better results than those discussed negatively, because positive and negative opinions influence people as they propagate through a social network. Basic sentiment-based predictors include the numbers of positive, negative or non-neutral posts (i.e., positive plus negative) about a brand [5]. Composite indicators include the positive and negative ratios [6] (i.e., the number of positive or negative posts divided by the total number of posts), and the mean or the variance of sentiment values [5]. Other important composite sentiment-based indicators include the Net Promoter Score (NPS), the polarity index and the subjectivity index. NPS is commonly used to gauge the loyalty of a firm’s customer relationships [6]. It can be obtained by dividing the difference between the numbers of positive and negative posts by the total number of posts. The polarity index is calculated in different ways: by dividing the number of posts with positive sentiment by the number of posts with negative sentiment [4,5], or by dividing the number of posts with positive sentiment by the number of non-neutral posts [6]. Subjectivity is measured by dividing the number of non-neutral posts by the number of neutral posts or by the total number of posts [6].
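
The composite sentiment indicators described above reduce to simple ratios over post counts. The following sketch shows one way to compute them. The function name, the dictionary keys, and the choice to return `None` when a denominator is zero are assumptions made for illustration.

```python
def sentiment_indicators(positive, negative, neutral):
    """Derive composite sentiment indicators from per-class post counts."""
    total = positive + negative + neutral
    non_neutral = positive + negative

    def ratio(numerator, denominator):
        # Return None instead of raising when the denominator is zero.
        return numerator / denominator if denominator else None

    return {
        "positive_ratio": ratio(positive, total),
        "negative_ratio": ratio(negative, total),
        # NPS as defined in the text: (positive - negative) / total.
        "nps": ratio(positive - negative, total),
        # Polarity index, variant 1: positive / negative.
        "polarity_pos_over_neg": ratio(positive, negative),
        # Polarity index, variant 2: positive / non-neutral.
        "polarity_pos_over_non_neutral": ratio(positive, non_neutral),
        # Subjectivity index, relative to neutral posts or to all posts.
        "subjectivity_vs_neutral": ratio(non_neutral, neutral),
        "subjectivity_vs_total": ratio(non_neutral, total),
    }
```

For instance, with 60 positive, 20 negative and 20 neutral posts, the NPS is (60 - 20) / 100 = 0.4 and the two polarity variants give 3.0 and 0.75 respectively, which shows that the variants are not interchangeable and should not be mixed within one model.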

Low-level textual feature-based indicators, combined with metadata features, have also been shown to perform well [8]. Such textual features include term n-grams, part-of-speech n-grams and dependency relations.
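
To make the notion of term n-grams concrete, here is a minimal sketch that extracts them from a text and counts them as features. The whitespace tokenisation and the function names are assumptions. Part-of-speech n-grams and dependency relations would additionally require a tagger and a parser, which are left out here.

```python
from collections import Counter


def term_ngrams(text, n):
    """Return the list of term n-grams (as tuples) found in a text.

    Tokenisation is a plain lowercase whitespace split, chosen for brevity.
    """
    tokens = text.lower().split()
    # A text shorter than n tokens yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=2):
    """Count all n-grams of size 1..max_n, usable as sparse features."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(term_ngrams(text, n))
    return counts
```

The resulting counts can be fed to a regression model as a sparse feature vector, alongside metadata features.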

All these indicators can be combined with other numerical and categorical predictors, such as product metadata [5,6,8,9,10], advertising investment [10], overall budget [6,8], number of product distributors [5,6], or even the time value of money [6].

The forecasting models used range from linear or logistic regression models [4,6,8,10] to k-nearest neighbour models (k-NN) [6]. Gruhl et al. [7] base their models on time-series analysis and construct a moving average predictor [11], a weighted least squares predictor, and a Markov predictor. Sharda and Delen [9] convert the forecasting problem into a classification problem by discretising the continuous predicted variables to a finite number of categories, and then they use a neural network model for performing the classification.
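
Two of the ideas above can be sketched in a few lines: a moving average predictor over a time series, and the discretisation step that turns a continuous target into a class label. This is an illustrative sketch only. It does not reproduce the exact predictors of Gruhl et al. [7] or the class boundaries of Sharda and Delen [9], and the function names and example boundaries are hypothetical.

```python
import bisect


def moving_average_forecast(series, window):
    """Predict the next value as the mean of the last `window` observations."""
    if window <= 0 or len(series) < window:
        raise ValueError("window must be positive and no longer than the series")
    return sum(series[-window:]) / window


def discretise(value, boundaries):
    """Map a continuous value to a class index.

    `boundaries` is a sorted list of cut points. Class 0 holds values below
    the first cut point, and class len(boundaries) holds values at or above
    the last one.
    """
    return bisect.bisect_right(boundaries, value)


# Hypothetical example: daily sales ranks and made-up revenue classes.
next_rank = moving_average_forecast([10, 12, 14, 16], window=3)
revenue_class = discretise(75.0, boundaries=[10, 50, 100])
```

Once the target is discretised, any classifier can be trained on the class indices. The neural network used in [9] is one option among many.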

The scale of the data is a key aspect when analysing online content. For example, the work presented in [4] uses 2.98 million tweets from 1.2 million users, with feeds extracted hourly over three months; the Nielsen study on social TV [10] uses data from 250 TV programs and 150 million social media sites; and in [7] the authors analyse the daily rank values of 2,340 books over a period of four months.

Bibliography

[1] G. Fung, J. Yu, and W. Lam. "Stock prediction: Integrating text mining approach using real-time news". In Proceedings of IEEE Int. Conference on Computational Intelligence for Financial Engineering, 2003, pp. 395–402.

[2] W. S. Chan. "Stock price reaction to news and no-news: Drift and reversal after headlines". Journal of Financial Economics, vol. 70, 2003, pp. 223–260.

[3] P. C. Tetlock, M. Saar-Tsechansky, and S. Macskassy. "More than words: Quantifying language to measure firms’ fundamentals". In Proceedings of 9th Annual Texas Finance Festival, May 2007.

[4] Sitaram Asur and Bernardo A. Huberman. "Predicting the Future with Social Media". In WI-IAT '10 Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, 2010. IEEE Computer Society, Washington, DC, USA.

[5] Gilad Mishne and Natalie Glance. "Predicting Movie Sales from Blogger Sentiment". In Proceedings of AAAI-CAAW-06, the Spring Symposia on Computational Approaches to Analyzing Weblogs, 2006.

[6] Wenbin Zhang and Steven Skiena. "Improving Movie Gross Prediction through News Analysis". In Web Intelligence and Intelligent Agent Technologies, 2009 (WI-IAT '09), Milan, Italy, 2009.

[7] Daniel Gruhl, R. Guha, Ravi Kumar, Jasmine Novak, and Andrew Tomkins. "The predictive power of online chatter". In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, 2005, pp. 78–87.

[8] Mahesh Joshi, Dipanjan Das, Kevin Gimpel, and Noah A. Smith. "Movie Reviews and Revenues: An Experiment in Text Regression". In Proceedings of NAACL-HLT, 2010.

[9] R. Sharda and D. Delen. "Predicting box-office success of motion pictures with neural networks". Expert Systems with Applications, 2006.

[10] Subramanyam Radha. "The Relationship Between Social Media Buzz and TV Ratings". Nielsen, Oct 2011. http://blog.nielsen.com/nielsenwire/online_mobile/the-relationship-between-social-media-buzz-and-tv-ratings/

[11] G. E. P. Box, G. M. Jenkins, and G. C. Reinsel. "Time Series Analysis, Forecasting and Control". Prentice Hall, 1994.

Monday, 27 February 2012
