A word embedding-based approach to cross-lingual topic modeling

Abstract

Cross-lingual topic analysis aims to extract latent topics from corpora in different languages. Early approaches rely on costly multilingual resources (e.g., a parallel corpus), which are hard to come by in many real cases. Some works require only a translation dictionary as the link between languages; however, when given an inadequate dictionary (e.g., one with small coverage), the cross-lingual topic model degenerates into a monolingual topic model and generates less diverse topics. It is therefore imperative to investigate cross-lingual topic models that require fewer bilingual resources. Recently, space-mapping techniques have been proposed that align the word embeddings of different languages into a high-quality cross-lingual word embedding by referring to a small number of translation pairs. This work proposes a cross-lingual topic model, called Cb-CLTM, which incorporates cross-lingual word embeddings. To leverage both the word semantics and the cross-language links encoded in the cross-lingual word embedding, Cb-CLTM treats each word as a continuous embedding vector rather than a discrete word type. Experiments demonstrate that, when the cross-lingual word space exhibits strong isomorphism, Cb-CLTM generates more coherent topics with higher diversity and induces better cross-lingual document representations for downstream tasks such as cross-lingual document clustering and classification. When the cross-lingual word space is less isomorphic, Cb-CLTM generates less coherent topics yet still prevails in topic diversity and document classification.
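As an illustration of the space-mapping idea mentioned above, the following minimal sketch aligns two embedding spaces with an orthogonal Procrustes map learned from a handful of translation pairs. This is one widely used mapping technique, chosen here as an assumption; the abstract does not state which mapping Cb-CLTM relies on. The function name `learn_orthogonal_map` and the toy data are hypothetical, and the toy target space is constructed as an exact rotation of the source space, i.e., the perfectly isomorphic case.

```python
import numpy as np


def learn_orthogonal_map(src, tgt):
    """Return the orthogonal W minimizing ||src @ W - tgt||_F (Procrustes).

    src, tgt: arrays of shape (n_pairs, dim); row i of each is the
    embedding of the two words in the i-th translation pair.
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


# Toy data (hypothetical): the target space is an exact rotation of the
# source space, so the two spaces are perfectly isomorphic.
rng = np.random.default_rng(0)
dim, n_pairs = 4, 20
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src_vecs = rng.normal(size=(n_pairs, dim))
tgt_vecs = src_vecs @ rotation

# Learn the map from the translation pairs and project the source words
# into the target space.
W = learn_orthogonal_map(src_vecs, tgt_vecs)
mapped = src_vecs @ W
max_error = float(np.abs(mapped - tgt_vecs).max())
```

In this idealized setting the learned map recovers the rotation up to floating-point error. When the two spaces are less isomorphic, no single orthogonal map fits every pair, which is consistent with the weaker topic coherence reported for that case.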