Sequence-based clustering for Web usage mining: A new experimental framework and ANN-enhanced K-means algorithm

Abstract

We develop a general sequence-based clustering method by proposing new sequence representation schemes in association with Markov models. The resulting sequence representations allow vector-based distances (dissimilarities) between Web user sessions to be calculated and can therefore be used as inputs to various clustering algorithms. We develop an evaluation framework in which the performance of the algorithms is compared in terms of whether the clusters (groups of Web users who follow the same Markov process) are correctly identified using a replicated clustering approach. A series of experiments is conducted to investigate whether clustering performance is affected by different sequence representations and different distance measures, as well as by other factors such as the number of actual Web user clusters, the number of Web pages, the similarity between clusters, the minimum session length, the number of user sessions, and the number of clusters to form. A new, fuzzy ART-enhanced K-means algorithm is also developed and its superior performance is demonstrated.
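
The abstract does not spell out the representation schemes themselves, so the following is only a minimal sketch of the general idea under stated assumptions: each session is mapped to a vector of normalized first-order (Markov) transition frequencies, and sessions are compared by Euclidean distance. The function names `transition_vector` and `euclidean`, the page labels, and the toy sessions are all hypothetical and are not taken from the paper.

```python
import math
from itertools import product


def transition_vector(session, pages):
    """Map one session (a list of page labels) to a flat vector of
    normalized first-order transition frequencies.

    Assumption: this is one plausible Markov-style representation,
    not necessarily one of the schemes proposed in the paper.
    """
    # One vector slot per ordered page pair (src, dst).
    index = {pair: i for i, pair in enumerate(product(pages, repeat=2))}
    vec = [0.0] * len(index)
    # Count each consecutive page-to-page transition in the session.
    for src, dst in zip(session, session[1:]):
        vec[index[(src, dst)]] += 1.0
    total = sum(vec)
    # Normalize so sessions of different lengths are comparable.
    return [v / total for v in vec] if total else vec


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy data: three pages and three user sessions.
pages = ["A", "B", "C"]
s1 = ["A", "B", "A", "B"]
s2 = ["A", "B", "A", "B", "A"]
s3 = ["C", "C", "B", "C"]

v1, v2, v3 = (transition_vector(s, pages) for s in (s1, s2, s3))

d12 = euclidean(v1, v2)  # sessions with similar navigation patterns
d13 = euclidean(v1, v3)  # sessions with different navigation patterns
print(d12 < d13)
```

Vectors of this kind can be passed to any distance-based clustering routine, such as K-means. The choice of representation and distance measure is exactly what the experiments described in the abstract vary.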

Keywords: Web usage mining, Clustering methods, Simulation, Artificial intelligence, Markov chain

Review history: Received 5 June 2007, Revised 18 December 2007, Accepted 28 January 2008, Available online 6 February 2008.

Official paper URL: https://doi.org/10.1016/j.datak.2008.01.002