Semantically-grounded construction of centroids for datasets with textual attributes

Authors:

Highlights:

Abstract

Centroids are key components in many data analysis algorithms such as clustering or microaggregation. A centroid is regarded as the central value that minimises the distance to all the objects in a dataset or cluster. Methods for centroid construction are mainly devoted to datasets with numerical and categorical attributes, focusing on the numerical and distributional properties of data. Textual attributes, by contrast, consist of term lists referring to concepts with a specific semantic content (i.e., meaning), which cannot be evaluated by means of classical numerical operators. Hence, the centroid of a dataset with textual attributes should be the term that minimises the semantic distance to the members of the set. Semantically-grounded methods for constructing centroids of datasets with textual attributes are scarce and, as will be discussed in this paper, they are hampered by their limited semantic analysis of data. In this paper, we propose a method that, by exploiting the knowledge provided by background ontologies (like WordNet), is able to construct the centroid of multivariate datasets described by means of textual attributes. Special effort has been put into minimising the semantic distance between the centroid and the input data. As a result, our method is able to provide optimal centroids (i.e., those that minimise the distance to all the objects in the dataset) according to the exploited background ontology and a semantic similarity measure. Our proposal has been evaluated by means of a real dataset consisting of short textual answers provided by visitors to a natural park. Results show that our centroids retain the semantic content of the input data better than those of related works.
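The idea of a centroid as the term that minimises the summed semantic distance to the input terms can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy taxonomy `PARENT` stands in for a background ontology such as WordNet, `path_distance` (edge counting) stands in for whichever semantic similarity measure is used, and the candidate set (the input terms plus their taxonomic ancestors) is a simplifying choice.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent); a stand-in for a real ontology.
PARENT = {
    "hiking": "sport",
    "cycling": "sport",
    "sport": "activity",
    "birdwatching": "observation",
    "observation": "activity",
    "activity": None,  # root
}


def ancestors(term):
    """Return the chain [term, parent, grandparent, ..., root]."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def path_distance(a, b):
    """Number of edges on the shortest taxonomy path between a and b."""
    steps_from_a = {t: i for i, t in enumerate(ancestors(a))}
    for j, t in enumerate(ancestors(b)):
        if t in steps_from_a:  # first shared ancestor reached from b
            return steps_from_a[t] + j
    raise ValueError("terms share no common ancestor")


def semantic_centroid(terms):
    """Pick the candidate with the smallest summed distance to all terms.

    Candidates are the input terms plus all their ancestors (an assumption
    of this sketch). Repeated terms are weighted by their frequency.
    """
    counts = Counter(terms)
    candidates = {c for t in counts for c in ancestors(t)}

    def cost(candidate):
        return sum(n * path_distance(candidate, t) for t, n in counts.items())

    # sorted() only makes tie-breaking deterministic.
    return min(sorted(candidates), key=cost)


print(semantic_centroid(["hiking", "cycling", "birdwatching"]))  # sport
```

In this toy example the selected centroid, `sport`, does not appear among the input terms: it has a summed distance of 5, against 6 for `hiking`, `cycling` and `activity`. When one term dominates the input, that term itself can be selected instead.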

Keywords: Data analysis, Centroid, Clustering, Semantic similarity, Ontologies

Review history: Received 29 November 2011, Revised 22 March 2012, Accepted 29 April 2012, Available online 8 May 2012.

Official paper URL: https://doi.org/10.1016/j.knosys.2012.04.030