A reliable cross-site user generated content modeling method based on topic model

Abstract

Social network sites (SNSs) have become significant platforms for content sharing in daily life. With the emergence of different kinds of SNSs and users’ diverse content sharing needs, users generally share content across multiple SNSs. Constructing models that characterize users’ content sharing practices in a composite context constituted by multiple SNSs (cross-site user generated content modeling) has become an emerging research topic in web data mining and human behavior research. However, previous methods such as the Dirichlet Multinomial Mixture model (DMM), the Biterm Topic Model (BTM), Twitter-LDA and MultiLDA have limited representation ability or rest on unreliable assumptions, and therefore cannot accurately characterize user generated content (UGC) from the perspective of multiple SNSs. In this paper, we first conduct an empirical study of the characteristics of users’ content sharing practices in the cross-site context, based on which we propose a more reliable cross-site UGC model named CrossSite-LDA (C-LDA). We then compare the performance of C-LDA with four state-of-the-art models on two data sets sampled from Weibo–Douban and Facebook–Twitter. Results show that C-LDA outperforms the existing models on perplexity, word coherence, topic KL divergence, UCI and UMass metrics, which suggests its superior accuracy in modeling users’ content characteristics in the cross-site context.

Keywords: Multiple social network sites, User generated content modeling, Topic model, Weibo, Douban

Review history: Received 12 March 2020, Revised 11 August 2020, Accepted 1 September 2020, Available online 21 September 2020, Version of Record 23 September 2020.

Official paper URL: https://doi.org/10.1016/j.knosys.2020.106435