Element matching across data-oriented XML sources using a multi-strategy clustering model

Authors:

Abstract

We describe a family of heuristics-based clustering strategies that support the merging of XML data from multiple sources. As part of this research, we have developed a comprehensive classification of the schematic and semantic conflicts that can occur when reconciling related XML data from multiple sources. Because element clustering is compute-intensive, especially when comparing large numbers of data elements that exhibit great representational diversity, performance is a critical yet so far neglected aspect of the merging process. We have developed five heuristics for clustering data in a multi-dimensional metric space. Equivalence of data elements within each cluster is determined using several distance functions that calculate the semantic distances among the elements.

The research described in this article was conducted within the context of the Integration Wizard (IWIZ) project at the University of Florida. IWIZ enables users to access and retrieve information from multiple XML-based sources through a consistent, integrated view. The results of our qualitative analysis of the clustering heuristics validate the feasibility of our approach as well as its superior performance compared with other similarity search techniques.
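To make the idea of distance-based element clustering concrete, the following is a minimal Python sketch. It is not one of the five heuristics or the distance functions from the article, which the abstract does not specify. The element representation (a dict with `name` and `type`), the weights, the threshold, and the function names `name_distance`, `element_distance`, and `cluster_elements` are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def name_distance(a: str, b: str) -> float:
    """Distance in [0, 1] from character-level similarity of two element names."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def element_distance(e1: dict, e2: dict,
                     w_name: float = 0.7, w_type: float = 0.3) -> float:
    """Weighted sum of a name distance and a data-type mismatch penalty.

    The weights are hypothetical; a real system would tune or learn them.
    """
    type_penalty = 0.0 if e1["type"] == e2["type"] else 1.0
    return w_name * name_distance(e1["name"], e2["name"]) + w_type * type_penalty


def cluster_elements(elements: list, threshold: float = 0.35) -> list:
    """Greedy single-pass clustering.

    Each element joins the first cluster whose representative (its first
    member) lies within `threshold`; otherwise it starts a new cluster.
    """
    clusters = []
    for elem in elements:
        for cluster in clusters:
            if element_distance(elem, cluster[0]) <= threshold:
                cluster.append(elem)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([elem])
    return clusters


# Hypothetical elements drawn from two XML sources.
elements = [
    {"name": "author", "type": "string"},
    {"name": "price", "type": "decimal"},
    {"name": "authors", "type": "string"},
    {"name": "Price", "type": "decimal"},
]
clusters = cluster_elements(elements)
```

Comparing each element only against cluster representatives, rather than against every other element, is one simple way to reduce the number of distance computations, which is the kind of performance concern the abstract raises.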

Keywords: Element matching, Information integration, Object clustering, Reconciliation, XML

Article history: Received 24 September 2002, Revised 5 February 2003, Accepted 11 June 2003, Available online 26 August 2003.

Official article URL: https://doi.org/10.1016/j.datak.2003.06.001