Discriminative subprofile-specific representations for author profiling in social media

Abstract

The Author Profiling (AP) task aims to reveal as much information as possible about an author (e.g., age and gender) from that author's documents. AP is crucial for several applications, ranging from customized advertising to computer forensics, psychology, and entertainment. Nonetheless, the AP task is far from solved, particularly in social media domains, where the nature of the documents hinders the applicability of state-of-the-art text mining tools (e.g., because of spelling and grammar errors, huge vocabularies, and the presence of many out-of-vocabulary terms). Currently, most work on AP for social media has been devoted to developing descriptive features, which are used under standard representations such as the Bag-of-Words (BoW). Nevertheless, BoW-like representations have some well-known shortcomings, namely: (i) the sparsity and high dimensionality of the representation, and (ii) the failure to capture relationships among terms other than mere co-occurrence. This paper studies alternative document representations that can deal with these issues. We propose a document representation that captures discriminative and subprofile-specific information about terms. Under the proposed representation, terms are first represented in a vector space that captures discriminative information. Term representations are then aggregated to represent the content of a document. In this manner, documents are represented in a low-dimensional, discriminative, and non-sparse space. We evaluate the effectiveness of the proposed representation on several corpora from the social media domain, comparing it to the standard BoW representation and to a wide variety of state-of-the-art AP approaches. Experimental results reveal that the proposed representation outperforms most of the reference methodologies. Furthermore, we show that the proposed representation is in agreement with previous studies on handcrafted attributes for AP.
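
The two-step idea in the abstract (build a discriminative vector per term, then aggregate term vectors into a document vector) can be illustrated with a minimal sketch. This is not the paper's method: the abstract does not specify the term weighting or how subprofiles are obtained, so the sketch assumes a simple relative-frequency weighting per profile and ignores subprofiles entirely. The function names `term_vectors` and `document_vector` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def term_vectors(docs, labels):
    """Map each term to its relative frequency across profiles.

    Assumption: relative frequency per profile stands in for the
    discriminative weighting, which the abstract does not specify.
    """
    profiles = sorted(set(labels))
    index = {p: i for i, p in enumerate(profiles)}
    counts = defaultdict(lambda: [0.0] * len(profiles))
    for doc, label in zip(docs, labels):
        for term, tf in Counter(doc.lower().split()).items():
            counts[term][index[label]] += tf
    # Normalize each term's counts so its vector sums to 1.
    vectors = {t: [c / sum(row) for c in row] for t, row in counts.items()}
    return profiles, vectors


def document_vector(doc, vectors, n_profiles):
    """Aggregate (here: average) the vectors of the known terms in a document."""
    known = [vectors[t] for t in doc.lower().split() if t in vectors]
    if not known:
        return [0.0] * n_profiles  # no known terms: fall back to a zero vector
    return [sum(col) / len(known) for col in zip(*known)]


# Toy training data (hypothetical): two age-group profiles.
docs = ["homework lol exam", "mortgage meeting lol",
        "exam homework class", "meeting mortgage taxes"]
labels = ["teen", "adult", "teen", "adult"]

profiles, vectors = term_vectors(docs, labels)
doc_vec = document_vector("homework lol zzz", vectors, len(profiles))
print(profiles, doc_vec)
```

In this sketch the document vector has one dimension per profile, regardless of vocabulary size, which is what makes the representation low-dimensional and dense compared with BoW. A classifier would then be trained on these document vectors rather than on term counts.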

Keywords: Author profiling, Web mining, Text classification, Social media

Review history: Received 8 December 2014, Revised 21 March 2015, Accepted 13 June 2015, Available online 2 July 2015, Version of Record 19 October 2015.

Official paper URL: https://doi.org/10.1016/j.knosys.2015.06.024