A fuzzy document clustering approach based on domain-specified ontology

Abstract

Document clustering supports tasks such as automatic document organization, topic extraction, and fast information retrieval or filtering. Numerous document clustering methods have been developed. Despite these advances, document clustering still presents challenges, such as optimizing feature selection for low-dimensional document representation and incorporating mutual information between documents into the clustering algorithm. This paper focuses on these two problems. First, we construct a domain-specific ontology that provides a controlled vocabulary describing the hazards related to dairy products. Synonyms of the controlled vocabulary in the document set are considered relatively prevalent and fundamentally important for feature selection. Second, in combination with the vector space model (VSM), we perform singular value decomposition (SVD) to map all of the term-document vectors into a concept space. We then obtain the mutual information between documents by calculating the similarity of every pair of document vectors in the orthogonal matrix of right singular vectors. Because the mutual information matrix is also a fuzzy compatibility relation, a fuzzy equivalence relation can be derived by calculating its max–min transitive closure. Finally, based on the fuzzy equivalence relation, all of the data sequences are allocated to clusters under the guidance of a cluster validation index. Our method both reduces the dimensionality of the original data and considers the correlation between terms. The experimental results show that encoding the ontologies in the aggregation process can provide better clustering results. Moreover, the proposed method has been applied to food safety supervision, which benefits government and society.
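
The SVD and fuzzy-relation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: cosine similarity as the document similarity measure, clipping negative similarities to 0 so the values lie in [0, 1], and a fixed cut level `lam` in place of the cluster validation index mentioned in the abstract. The function names (`document_similarity`, `maxmin_closure`, `lambda_cut_clusters`, `cluster_documents`) and the toy matrix `A` are hypothetical, and the ontology-based feature selection step is omitted.

```python
import numpy as np


def document_similarity(A, k):
    """Cosine similarity between documents in a rank-k SVD concept space.

    A is a term-document matrix (rows = terms, columns = documents).
    """
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    docs = Vt[:k].T                        # one row per document, k coordinates
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                # guard against all-zero document rows
    unit = docs / norms
    S = np.clip(unit @ unit.T, 0.0, 1.0)   # assumption: negative cosines become 0
    S = np.maximum(S, S.T)                 # enforce exact symmetry
    np.fill_diagonal(S, 1.0)               # enforce reflexivity
    return S


def maxmin_closure(R):
    """Max-min transitive closure of a reflexive, symmetric fuzzy relation."""
    R = np.asarray(R, dtype=float)
    while True:
        # (R o R)[i, j] = max over k of min(R[i, k], R[k, j])
        comp = np.max(np.minimum(R[:, :, None], R[None, :, :]), axis=1)
        new = np.maximum(R, comp)
        if np.array_equal(new, R):
            return new
        R = new


def lambda_cut_clusters(T, lam):
    """Label the equivalence classes of the crisp relation T >= lam."""
    n = T.shape[0]
    labels = -np.ones(n, dtype=int)
    next_label = 0
    for i in range(n):
        if labels[i] == -1:
            labels[T[i] >= lam] = next_label
            next_label += 1
    return labels


def cluster_documents(A, k, lam):
    """SVD concept space -> fuzzy compatibility -> closure -> lambda-cut."""
    S = document_similarity(A, k)
    T = maxmin_closure(S)
    return lambda_cut_clusters(T, lam)


# Toy term-document matrix: documents 0 and 1 share terms 0 and 1,
# documents 2 and 3 share terms 2 and 3.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])
labels = cluster_documents(A, k=2, lam=0.5)
print(labels)
```

In this sketch the cut level `lam` controls granularity: a higher value splits the documents into more, tighter clusters, and a lower value merges them. A validation index, as the abstract describes, would be used to choose among the candidate cut levels.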

Keywords: Domain-specified ontology, Document clustering, Feature selection, Singular value decomposition (SVD), Fuzzy equivalence relation

Review history: Received 26 November 2013, Revised 10 April 2015, Accepted 25 April 2015, Available online 26 June 2015, Version of Record 10 November 2015.

Official URL: https://doi.org/10.1016/j.datak.2015.04.008