Two time-efficient gibbs sampling inference algorithms for biterm topic model

摘要

Biterm Topic Model (BTM) is an effective topic model proposed to handle short texts. However, its standard gibbs sampling inference method (StdBTM) costs much more time than that (StdLDA) of Latent Dirichlet Allocation (LDA). To solve this problem we propose two time-efficient gibbs sampling inference methods, SparseBTM and ESparseBTM, for BTM by making a tradeoff between space and time consumption in this paper. The idea of SparseBTM is to reduce the computation in StdBTM by both recycling intermediate results and utilizing the sparsity of count matrix \(\mathbf {N}^{\mathbf {T}}_{\mathbf {W}}\). Theoretically, SparseBTM reduces the time complexity of StdBTM from O(|B| K) to O(|B| K w ) which scales linearly with the sparsity of count matrix \(\mathbf {N}^{\mathbf {T}}_{\mathbf {W}}\) (K w ) instead of the number of topics (K) (K w < K, K w is the average number of non-zero topics per word type in count matrix \(\mathbf {N}^{\mathbf {T}}_{\mathbf {W}}\)). Experimental results have shown that in good conditions SparseBTM is approximately 18 times faster than StdBTM. Compared with SparseBTM, ESparseBTM is a more time-efficient gibbs sampling inference method proposed based on SparseBTM. The idea of ESparseBTM is to reduce more computation by recycling more intermediate results through rearranging biterm sequence. In theory, ESparseBTM reduces the time complexity of SparseBTM from O(|B|K w ) to O(R|B|K w ) (0 < R < 1, R is the ratio of the number of biterm types to the number of biterms). Experimental results have shown that the percentage of the time efficiency improved by ESparseBTM on SparseBTM is between 6.4% and 39.5% according to different datasets.