Optimal string clustering based on a Laplace-like mixture and EM algorithm on a set of strings

作者:

Highlights:

摘要

In this study, we address the problem of clustering string data in an unsupervised manner by developing a theory of a mixture model and an EM algorithm for strings based on probability theory on a topological monoid of strings developed in our previous studies. We begin with introducing a parametric probability distribution on a set of strings, which has location and dispersion parameters of a string and positive real number. We develop an iteration algorithm for estimating the parameters of the mixture model of the distributions introduced and demonstrate that our algorithm converges to the EM algorithm, which cannot be explicitly written for this mixture model, with probability one and strongly consistently estimates its parameters as the numbers of observed strings and iterations increase. We finally derive a procedure for unsupervised string clustering that is asymptotically optimal in the sense that the posterior probability of making correct classifications is maximized.

论文关键词:Unsupervised string clustering,Topological monoid of strings,Probability theory,Statistical asymptotics,Convergence of a sequence of algorithms

论文评审过程:Received 24 June 2017, Revised 3 March 2019, Accepted 9 July 2019, Available online 17 July 2019, Version of Record 8 August 2019.

论文官网地址:https://doi.org/10.1016/j.jcss.2019.07.003