Combining schema and instance information for integrating heterogeneous data sources

Abstract

Determining the correspondences among heterogeneous data sources is critical to integrating them, but it is a complex and resource-consuming task that demands automated support. We propose an iterative procedure for detecting both schema-level and instance-level correspondences among heterogeneous data sources. Cluster analysis techniques are used first to identify similar schema elements (i.e., relations and attributes). Based on the identified schema-level correspondences, classification techniques are then used to identify matching tuples. Statistical analysis techniques are next applied to the resulting preliminary integrated data set to evaluate the relationships among schema elements more accurately. Any improvement in the schema-level correspondences triggers another iteration of the procedure. We have performed an empirical evaluation using real-world heterogeneous data sources and report some promising results (i.e., incremental improvement in the identified correspondences) that demonstrate the utility of the proposed procedure.
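
The loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: attribute-name similarity stands in for the cluster analysis, a simple value-agreement rule stands in for the tuple classifier, and the agreement rate over matched tuples stands in for the statistical analysis. All function names, thresholds, and sample rows are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def name_similarity(a, b):
    """String similarity of two attribute names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_attributes(score, attrs_a, attrs_b, threshold):
    """Greedy one-to-one pairing of attributes whose score meets the threshold."""
    pairs = sorted(product(attrs_a, attrs_b), key=lambda p: score[p], reverse=True)
    mapping, used_b = {}, set()
    for a, b in pairs:
        if score[(a, b)] >= threshold and a not in mapping and b not in used_b:
            mapping[a] = b
            used_b.add(b)
    return mapping


def match_tuples(rows_a, rows_b, mapping, min_agree=0.5):
    """Pair rows that agree on at least min_agree of the mapped attributes."""
    matches = []
    for ra, rb in product(rows_a, rows_b):
        agree = sum(ra[a] == rb[b] for a, b in mapping.items())
        if mapping and agree / len(mapping) >= min_agree:
            matches.append((ra, rb))
    return matches


def value_agreement(matches, a, b):
    """Fraction of matched row pairs in which attributes a and b hold equal values."""
    if not matches:
        return 0.0
    return sum(ra[a] == rb[b] for ra, rb in matches) / len(matches)


def integrate(rows_a, rows_b, threshold=0.5, max_iter=10):
    """Alternate schema-level and instance-level matching until the mapping is stable."""
    attrs_a, attrs_b = list(rows_a[0]), list(rows_b[0])
    # Step 1: initial schema-level scores from attribute names only.
    score = {(a, b): name_similarity(a, b) for a, b in product(attrs_a, attrs_b)}
    mapping = match_attributes(score, attrs_a, attrs_b, threshold)
    matches = []
    for _ in range(max_iter):
        # Step 2: instance-level matching based on the current mapping.
        matches = match_tuples(rows_a, rows_b, mapping)
        # Step 3: re-score attribute pairs using the matched tuples.
        score = {(a, b): max(name_similarity(a, b), value_agreement(matches, a, b))
                 for a, b in product(attrs_a, attrs_b)}
        new_mapping = match_attributes(score, attrs_a, attrs_b, threshold)
        if new_mapping == mapping:  # no schema-level improvement, so stop
            break
        mapping = new_mapping
    return mapping, matches


# Hypothetical sample data: two sources describing the same two people.
rows_a = [{"name": "Ann", "phone": "555-1", "city": "Oslo"},
          {"name": "Bob", "phone": "555-2", "city": "Rome"}]
rows_b = [{"name": "Ann", "telephone": "555-1", "town": "Oslo"},
          {"name": "Bob", "telephone": "555-2", "town": "Rome"}]

mapping, matches = integrate(rows_a, rows_b)
print(mapping)
print(len(matches))
```

In this toy run the names "city" and "town" are too dissimilar to be paired in the first pass. Once tuples have been matched on the name and phone attributes, their identical values raise the score of that pair, and the second pass adds it to the mapping. This mirrors, in a very reduced form, the incremental improvement the abstract reports.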

Keywords: Heterogeneous databases, Data integration, Semantic correspondence

Article history: Received 26 March 2006, Accepted 10 June 2006, Available online 10 July 2006.

DOI: https://doi.org/10.1016/j.datak.2006.06.004