Monday, 23 April 2012

Identifying Topics in Social Media Posts using DBpedia

Topic recognition (a.k.a. topic identification) refers to the task of identifying the central ideas in a text. In the context of social media, topic recognition may be useful for many different purposes, such as automatically summarising the content published in a channel, mining the interests of a given user, etc.

For communication and advertising companies, topic recognition in social media posts can increase the effectiveness of investment in social media advertising, which currently suffers from a significant degree of inefficiency. We envision that automatic topic identification will lead to more efficient investment in media advertising, since advertising actions can be focused on the appropriate channels and directed to the most suitable set of users.

Although some efforts have been made to structure social media information, such as Twitlogic, there is still a need for approaches that can cope with the different channels of the social web and with the challenges they pose. Social media posts vary in length from short sentences in microblogs to medium-size articles in web logs. Very often, text published in social media contains misspellings, is written entirely in uppercase or lowercase letters, or is composed of set phrases, all of which can lead to incorrectly identified topics. In Spanish, for example, the absence of an accent can give a word a completely different meaning. In such cases, it is very important for the topic identification method to take the context of the post into account.
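To make the accent problem concrete, here is a minimal Python sketch. It is not taken from the paper: the mini-lexicon and the function names are hypothetical, chosen only to show how a token typed without accents (or in uppercase) can correspond to several distinct Spanish words, which is why context is needed to pick one.

```python
import unicodedata

# Hypothetical mini-lexicon (an assumption for illustration only):
# Spanish words whose meaning depends on the accent.
LEXICON = {
    "papa": "potato / pope",
    "papá": "dad",
    "sabana": "savannah",
    "sábana": "bed sheet",
}


def strip_accents(word: str) -> str:
    """Remove combining accent marks, e.g. 'papá' becomes 'papa'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def candidate_readings(token: str) -> list[str]:
    """Return every lexicon word that matches the token once case and
    accents are ignored. More than one result means the token is ambiguous."""
    key = strip_accents(token.lower())
    return sorted(word for word in LEXICON if strip_accents(word) == key)


if __name__ == "__main__":
    for token in ("PAPA", "sabana"):
        for word in candidate_readings(token):
            print(token, "->", word, "(" + LEXICON[word] + ")")
```

A token with more than one candidate reading cannot be resolved on its own; the rest of the post has to decide which word was meant.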

In the paper "Identifying Topics in Social Media Posts using DBpedia", we present a method that combines natural language processing (NLP), tag-based, and semantic-based techniques for identifying the topics in posts published in social media. This method exploits the semantics of the resources published in the web of data. More specifically, it makes use of DBpedia, a semantic representation of part of the information in Wikipedia.

The topics identified by our method are therefore expressed in terms of DBpedia resources. We consider DBpedia resources a good starting point for defining keyword meanings because a large part of the knowledge base is related to classes in the DBpedia Ontology, which currently has 1,667,000 instances. In addition, DBpedia resources are linked to other linked data sources and ontologies such as GeoNames, YAGO, OpenCyc, and WordNet, providing more semantic information in the form of relations such as typeOf and sameAs. By linking social media with DBpedia resources, we can thus profit not only from the DBpedia Ontology and the knowledge base facts but also from the interlinked semantic information.
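As a rough illustration of what "topics expressed as DBpedia resources" can look like, the following Python sketch picks a DBpedia resource for an ambiguous keyword by counting how many context words of the post overlap with a hand-written word set for each candidate. This is a simplified stand-in and not the method from the paper: the candidate table, the context word sets, and the function names are all assumptions made for the example, and a real system would obtain candidates from DBpedia itself.

```python
import re

# Hypothetical candidate table (an assumption for illustration only):
# keyword -> {DBpedia resource URI: context words that suggest it}.
CANDIDATES = {
    "apple": {
        "http://dbpedia.org/resource/Apple": {"fruit", "tree", "juice", "pie"},
        "http://dbpedia.org/resource/Apple_Inc.": {"iphone", "mac", "ipad", "store"},
    },
}


def tokenize(text: str) -> list[str]:
    """Lowercase the post and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def identify_topics(post: str) -> list[str]:
    """Return the DBpedia URIs whose context words best overlap the post.

    A keyword with no supporting context word is left unresolved.
    """
    tokens = tokenize(post)
    context = set(tokens)
    topics = []
    for token in tokens:
        candidates = CANDIDATES.get(token)
        if not candidates:
            continue
        # Score each candidate by the number of shared context words.
        best_uri, best_score = max(
            ((uri, len(words & context)) for uri, words in candidates.items()),
            key=lambda pair: pair[1],
        )
        if best_score > 0 and best_uri not in topics:
            topics.append(best_uri)
    return topics


if __name__ == "__main__":
    print(identify_topics("Just bought the new iPhone at the Apple store"))
    print(identify_topics("Baking an apple pie"))
```

Because the output is a resource URI rather than a bare keyword, it can in principle be followed to the related classes and sameAs links mentioned above.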
Related paper:
