Robustness, replicability and scalability in topic modelling

Authors:

Highlights:

• We identify three key properties a topic model must exhibit for effective application in the social sciences: statistical robustness, descriptive power across all dimensions and reflection of reality.

• We propose a simple approach for estimating the statistical robustness of topic models that is based on pairwise similarity scores between documents.

• Applying that approach, we find that the neural network-based Doc2Vec is more stable than the other topic models tested, Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorisation (NMF).

• We further propose a principal component analysis-based approach for assessing the descriptive power of topic models. Applying that approach, we find that Doc2Vec performs best, although LDA also performs well.

• We provide grounds for applying neural embedding approaches in the social sciences and show how traditional visualisation techniques can be applied directly to dense vector representations.
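The robustness estimate based on pairwise document similarity can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of cosine similarity, the use of Pearson correlation to compare two runs, and the names `pairwise_similarities` and `run_agreement` are all assumptions made for illustration. The idea it shows is that two runs of a model can be compared through the similarity each assigns to every pair of documents, which avoids having to align topic or embedding dimensions between runs.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarities(doc_vectors):
    """Similarity for every unordered pair of documents, in a fixed pair order."""
    return [
        cosine(doc_vectors[i], doc_vectors[j])
        for i, j in combinations(range(len(doc_vectors)), 2)
    ]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def run_agreement(run_a, run_b):
    """Hypothetical stability score: correlation of pairwise similarities across two runs.

    run_a and run_b hold one vector per document, in the same document order.
    The vectors may come from any model (topic proportions or dense embeddings).
    """
    return pearson(pairwise_similarities(run_a), pairwise_similarities(run_b))


# Toy data: the second run is the first with its two dimensions swapped,
# mimicking a model whose topics come out in a different order on a re-run.
run_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
run_b = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
score = run_agreement(run_a, run_b)
```

In this toy case the two runs assign identical similarities to every document pair, so the score is at its maximum even though the raw vectors differ. Under these assumptions, a model whose score stays high across repeated runs would count as more stable.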

Keywords: Scientometrics, Topic modelling, Stability, Robustness, Similarity, Informetrics

Review history: Received 11 May 2021, Revised 17 October 2021, Accepted 4 November 2021, Available online 20 November 2021, Version of Record 20 November 2021.

Paper URL: https://doi.org/10.1016/j.joi.2021.101224