Tuesday, 11 December 2012
Detecting browser fingerprint evolution for identifying unique users
Web analytics provides the measurement model for digital marketing, making it possible to analyse and measure the effectiveness of advertising campaigns. Data gathered through web analytics (e.g., the number of people who have viewed a banner) is typically compared against key performance indicators (e.g., the reach of a campaign) and used to improve the audience response to marketing campaigns (e.g., by moving the banner to a site with a larger audience). The most significant KPIs depend on counting unique visitors.
Uniquely identifying users is also needed for behavioural targeting, which involves tracking the online activities of users in order to deliver tailored ads to them.
The most widely used technique for uniquely identifying users from captured web activity combines cookies and web bugs. This technique is being undermined by several factors, such as strict privacy restrictions implemented by web browsers, or the use of new devices for navigating the web that do not support cookies (e.g., many set-top boxes and certain video game consoles). Furthermore, several security programmes, such as antispyware tools, remove cookies periodically, making it difficult to trace recurring visits to websites. Thus, these security measures, enabled to protect the privacy of users, distort basic aggregated metrics obtained with web analytics, from which valuable business insights can be derived, such as the number of unique visitors of a website or the bounce rate.
An alternative to cookies for uniquely identifying users consists of capturing distinctive technical attributes of the system such users navigate the web with (i.e., their browser fingerprint). While the effectiveness of this technique has been demonstrated, it is not entirely accurate, since browser fingerprints are built from attributes that evolve over time. Changes in the values of fingerprint attributes therefore lead to returning visitors being incorrectly counted as new users.
In summary, existing techniques for counting unique visitors are losing effectiveness because of privacy restrictions and new devices for navigating the web. The fingerprinting technique copes with such restrictions and devices, but is quite sensitive to changes in the attributes of the web browser, which makes the count of unique visitors imprecise. The paper "Detecting browser fingerprint evolution for identifying unique users" describes an algorithm, based on the fingerprinting technique, that identifies unique visitors accurately regardless of changes in browser attributes. To do so, the algorithm detects the evolution of fingerprints, effectively grouping distinct fingerprints that correspond to the same user.
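To illustrate the grouping idea, the sketch below assigns each incoming fingerprint to the most similar known visitor, or creates a new visitor when nothing is close enough. This is a minimal sketch of the general approach, not the paper's algorithm: the attribute list, the equal-weight similarity measure, and the 0.8 threshold are all illustrative assumptions.

```python
# Minimal sketch: grouping evolving browser fingerprints by similarity.
# Attribute set and threshold are illustrative, not the paper's values.

FINGERPRINT_ATTRIBUTES = [
    "user_agent", "screen_resolution", "timezone",
    "installed_fonts", "plugins", "language",
]

def similarity(fp_a: dict, fp_b: dict) -> float:
    """Fraction of attributes with identical values in both fingerprints."""
    matches = sum(1 for attr in FINGERPRINT_ATTRIBUTES
                  if fp_a.get(attr) == fp_b.get(attr))
    return matches / len(FINGERPRINT_ATTRIBUTES)

def assign_visitor(new_fp: dict, known: list[list[dict]],
                   threshold: float = 0.8) -> int:
    """Attach new_fp to the visitor whose latest fingerprint is most
    similar, or register a new visitor if no candidate is close enough."""
    best_idx, best_score = -1, 0.0
    for idx, history in enumerate(known):
        score = similarity(new_fp, history[-1])  # compare to latest version
        if score > best_score:
            best_idx, best_score = idx, score
    if best_score >= threshold:
        known[best_idx].append(new_fp)  # fingerprint evolved: same user
        return best_idx
    known.append([new_fp])              # genuinely new visitor
    return len(known) - 1
```

A real implementation would presumably weight attributes by how discriminating and how volatile they are (e.g., the list of installed fonts changes more often than the timezone).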
Related paper:
Saturday, 26 May 2012
Comparing user generated content published in different social media sources
User generated content published in social media is typically characterised by short texts, misspellings, and set-phrases, among other characteristics that challenge content analysis.
The paper "Comparing user generated content published in different social media sources" shows the differences of the language used in heterogeneous social media sources, by analysing the distribution of the part-of-speech categories extracted from the analysis of the morphology of a sample of texts published in such sources. In addition, we evaluate the performance of three natural language processing techniques (i.e., language identification, sentiment analysis, and topic identification) showing the differences on accuracy when applying such techniques to different types of user generated content.
Related paper:
Related slides:
Monday, 23 April 2012
Identifying Topics in Social Media Posts using DBpedia
In the context of communication and advertising companies, topic recognition in social media posts can provide several benefits, increasing the effectiveness of investment in social media advertising, which currently suffers from a significant degree of inefficiency. We envision that automatic topic identification will lead to more efficient investment in media advertising, since advertisement actions will be focused on the appropriate channels and directed to the most suitable set of users.
Although some efforts have been made to structure social media information, such as TwitLogic, there is still a need for approaches able to cope with the different channels of the social web and with the challenges they pose. Social media posts are characterised by text that varies in length from short sentences in microblogs to medium-size articles in web logs. Very often, text published in social media contains misspellings, is written entirely in uppercase or lowercase letters, or is composed of set phrases, which leads to incorrectly identified topics. As an example, in Spanish the absence of an accent in a word may give that word a completely different meaning. In such cases, it is very important for the topic identification method to take into account the context of the post.
In the paper "Identifying Topics in Social Media Posts using DBpedia", we present a method that combines NLP (natural language processing), tag-based and semantic-based techniques for identifying the topics in posts published in social media. Such method exploits the semantics of the resources published in the web of data. More specifically, the method makes use of DBpedia, a semantic representation of part of Wikipedia information.
Therefore, the topics identified by our method are expressed in terms of DBpedia resources. We consider DBpedia resources a good starting point for defining keyword meanings, because a large part of the knowledge base is related to classes in the DBpedia Ontology. Moreover, the DBpedia ontology currently has 1,667,000 instances. In addition, DBpedia resources are linked to other linked data sources and ontologies such as GeoNames, YAGO, OpenCyc, and WordNet, providing more semantic information in the form of relations such as typeOf and sameAs. Therefore, by linking social media with DBpedia resources we can profit not only from the DBpedia Ontology and the facts in its knowledge base, but also from the interlinked semantic information.
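To illustrate the kind of semantic information the method draws on, the sketch below retrieves the rdf:type relations of a DBpedia resource through the public SPARQL endpoint. This only shows how DBpedia data can be queried; it is not the topic identification method itself, and the example resource is chosen arbitrarily.

```python
# Sketch: fetching the types of a DBpedia resource through the public
# SPARQL endpoint. Illustrates the semantic data the method builds on,
# not the topic identification method itself.
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"

def resource_types(resource_uri: str) -> list[str]:
    """Return the rdf:type URIs of a DBpedia resource."""
    query = f"SELECT ?type WHERE {{ <{resource_uri}> a ?type }}"
    url = ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"})
    with urllib.request.urlopen(url) as response:
        results = json.load(response)
    return [b["type"]["value"] for b in results["results"]["bindings"]]

# e.g. classes such as dbo:Film for the resource "Casablanca (film)"
print(resource_types("http://dbpedia.org/resource/Casablanca_(film)"))
```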
Monday, 27 February 2012
State of the Art of predictive models from social media content
In the world of marketing and business, predicting real-world outcomes is a challenging task that normally requires indicators from heterogeneous data sources. For instance, traditional media content analysis has been used to forecast the financial market [1,2,3], and several works have demonstrated connections between online content and customer behaviour (e.g., purchase decisions). As social media feeds can be effective indicators of real-world performance [4], different forecasting models have been studied for using online chatter to predict real-world outcomes related to the sales of different kinds of goods, such as movies [4,5,6] or books [7]. Predictive models range from gross income predictions [4,5,6,8,9] to revenue estimations per product distributor (i.e., stores that offer a product or service) [5], or spike predictions in sales ranks [7]. Moreover, social media plays an increasingly important role in how customers discover and engage with various forms of content, including traditional media such as TV. Along these lines, correlations have been found between online buzz and TV ratings [10].
Many of these social media channels have started to be exploited to obtain the indicators that feed such prediction models (e.g., Twitter [4], blog feeds [5,7], review texts [8], online news [6]). Indicators are based on volume, sentiment analysis, low-level textual features, and combinations of these with economic data or product metadata.
- Volume-based indicators can be simple or composite. Among the simple predictors we find the raw count of posts referring to a brand [5,6,7,10], the number of mentions of a brand (i.e., the count of entity references, taking into account that one post can mention the same entity multiple times) [6], and the number of unique authors that refer to the brand [10]. Among composite predictors we find the post rate [4] (the rate at which publications about a particular topic are created, i.e., the number of posts about the topic divided by time) and the posts-per-source [10] (the average number of posts published about a topic in particular feed sources, e.g., a set of forums). These volume-based indicators have been shown to be effective. For example, spikes in references to books in blogs are likely to be followed by spikes in their sales [7].
- Sentiment analysis-based indicators rest on the hypothesis that products talked about positively will perform better than those discussed negatively, because positive and negative opinions influence people as they propagate through a social network. Basic sentiment-based predictors include the numbers of positive, negative, or non-neutral posts (i.e., positive plus negative) about a brand [5]. Composite indicators include the positive and negative ratios [6] (i.e., the number of positive or negative posts divided by the total number of posts), and the mean or the variance of sentiment values [5]. Other important composite sentiment-based indicators include the Net Promoter Score (NPS), the polarity index, and the subjectivity index. NPS is commonly used to gauge the loyalty of a firm's customer relationships [6]; it can be obtained by dividing the difference between positive and negative posts by the total number of posts. The polarity index is calculated in different manners: by dividing the number of posts with positive sentiment by the number of posts with negative sentiment [4,5], or by dividing the number of posts with positive sentiment by the number of non-neutral posts [6]. Subjectivity is measured by dividing the number of non-neutral posts by the number of neutral or total publications [6]. (A sketch computing several of these indicators follows after this list.)
- Low-level textual feature-based indicators, combined with metadata features, have also been shown to achieve good performance [8]. Such textual features include term n-grams, part-of-speech n-grams, and dependency relations.
All these indicators can be combined with other numerical and categorical predictors, such as product metadata [5,6,8,9,10], advertising investment [10], overall budget [6,8], the number of product distributors [5,6], or even the Time Value of Money [6].
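The sketch below computes a few of the indicators defined above from a list of per-post sentiment labels. The label names and sample data are invented; the formulas follow the definitions given in the text (post rate [4], NPS [6], the non-neutral variant of the polarity index [6], and subjectivity [6]).

```python
# Sketch computing some of the volume- and sentiment-based indicators
# described above from per-post sentiment labels. Label names and the
# sample data are illustrative, not taken from any of the papers.

def indicators(labels: list[str], hours: float) -> dict[str, float]:
    """Compute indicators from one sentiment label per post."""
    pos = labels.count("positive")
    neg = labels.count("negative")
    total = len(labels)
    non_neutral = pos + neg
    return {
        "post_rate": total / hours,                  # posts per hour [4]
        "nps": (pos - neg) / total,                  # Net Promoter Score [6]
        "polarity": pos / non_neutral if non_neutral else 0.0,  # [6]
        "subjectivity": non_neutral / total,         # non-neutral / total [6]
    }

sample = ["positive", "neutral", "negative", "positive", "neutral"]
print(indicators(sample, hours=24.0))
# {'post_rate': 0.208..., 'nps': 0.2, 'polarity': 0.666..., 'subjectivity': 0.6}
```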
The forecasting models used range from linear or logistic regression models [4,6,8,10] to k-nearest neighbour (k-NN) models [6]. Gruhl et al. [7] base their models on time-series analysis and construct a moving average predictor [11], a weighted least squares predictor, and a Markov predictor. Sharda and Delen [9] convert the forecasting problem into a classification problem by discretising the continuous predicted variables into a finite number of categories, and then use a neural network to perform the classification.
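As a minimal illustration of the simplest of these model families, the sketch below fits a linear regression of the kind used in [4,6,8,10]. The feature choice (post rate, polarity index, advertising budget) and all figures are invented for illustration.

```python
# Sketch: fitting a linear regression forecasting model of the kind
# used in [4,6,8,10]. Features and revenue figures are invented.
import numpy as np
from sklearn.linear_model import LinearRegression

# columns: post rate, polarity index, advertising budget (M$)
X = np.array([[120, 0.7, 30],
              [300, 0.9, 55],
              [80,  0.4, 20],
              [210, 0.6, 40]])
y = np.array([45.0, 110.0, 18.0, 70.0])  # opening-weekend gross (M$)

model = LinearRegression().fit(X, y)
print(model.predict([[150, 0.8, 35]]))   # forecast for a new release
```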
The scale of the data is a key aspect when analysing online content. To give an idea, the work presented in [4] uses 2.98 million tweets from 1.2 million users, with feeds extracted hourly over three months; the Nielsen study on social TV [10] uses data from 250 TV programmes and 150 million social media sites; and in [7] the authors analyse the daily rank values of 2,340 books over a period of four months.
Bibliography
- [1] G. Fung, J. Yu, and W. Lam, "Stock prediction: Integrating text mining approach using real-time news". In Proceedings of the IEEE International Conference on Computational Intelligence for Financial Engineering, 2003, pp. 395-402.
- [2] W. S. Chan, "Stock price reaction to news and no-news: Drift and reversal after headlines". Journal of Financial Economics, vol. 70, 2003, pp. 223-260.
- [3] P. C. Tetlock, M. Saar-Tsechansky, and S. Macskassy, "More than words: Quantifying language to measure firms' fundamentals". In Proceedings of the 9th Annual Texas Finance Festival, May 2007.
- [4] Sitaram Asur and Bernardo A. Huberman, "Predicting the Future with Social Media". In Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT '10), Volume 01, 2010. IEEE Computer Society, Washington, DC, USA.
- [5] Gilad Mishne and Natalie Glance, "Predicting Movie Sales from Blogger Sentiment". In Proceedings of AAAI-CAAW-06, the Spring Symposium on Computational Approaches to Analyzing Weblogs, 2006.
- [6] Wenbin Zhang and Steven Skiena, "Improving Movie Gross Prediction through News Analysis". In Web Intelligence and Intelligent Agent Technologies (WI-IAT '09), 2009, Milan, Italy.
- [7] Daniel Gruhl, R. Guha, Ravi Kumar, Jasmine Novak, and Andrew Tomkins, "The predictive power of online chatter". In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005, pp. 78-87.
- [8] Mahesh Joshi, Dipanjan Das, Kevin Gimpel, and Noah A. Smith, "Movie Reviews and Revenues: An Experiment in Text Regression". In Proceedings of NAACL-HLT, 2010.
- [9] R. Sharda and D. Delen, "Predicting box-office success of motion pictures with neural networks". Expert Systems with Applications, 2006.
- [10] Subramanyam Radha, "The Relationship Between Social Media Buzz and TV Ratings". Nielsen, Oct 2011. http://blog.nielsen.com/nielsenwire/online_mobile/the-relationship-between-social-media-buzz-and-tv-ratings/
- [11] G. E. P. Box, G. M. Jenkins, and G. C. Reinsel, "Time Series Analysis, Forecasting and Control". Prentice Hall, 1994.
Wednesday, 8 February 2012
Towards social TV
Television consumption is shifting from a scenario in which interaction with the TV set is limited to changing channels, to one of active, public, and spontaneous participation by the audience in response to the broadcast.
This shift makes it possible to implement new audience measurement techniques, described in the following presentation:
Related articles: