Incorporating word embeddings into topic modeling of short text

作者:Wang Gao, Min Peng, Hua Wang, Yanchun Zhang, Qianqian Xie, Gang Tian

摘要

Short texts have become the prevalent format of information on the Internet. Inferring the topics of this type of messages becomes a critical and challenging task for many applications. Due to the length of short texts, conventional topic models (e.g., latent Dirichlet allocation and its variants) suffer from the severe data sparsity problem which makes topic modeling of short texts difficult and unreliable. Recently, word embeddings have been proved effective to capture semantic and syntactic information about words, which can be used to induce similarity measures and semantic correlations among words. Enlightened by this, in this paper, we design a novel model for short text topic modeling, referred as Conditional Random Field regularized Topic Model (CRFTM). CRFTM not only develops a generalized solution to alleviate the sparsity problem by aggregating short texts into pseudo-documents, but also leverages a Conditional Random Field regularized model that encourages semantically related words to share the same topic assignment. Experimental results on two real-world datasets show that our method can extract more coherent topics, and significantly outperform state-of-the-art baselines on several evaluation metrics.

论文关键词:Short text, Topic model, Word embeddings, Conditional Random Fields

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10115-018-1314-7