ScLink: supervised instance matching system for heterogeneous repositories

Authors: Khai Nguyen, Ryutaro Ichise

Abstract

Instance matching is the task of finding co-referent instances, that is, instances that describe the same real-world object, across two different repositories. Heterogeneity, meaning differences in the attributes of objects and in the schemas of repositories, makes this problem challenging and limits the accuracy of existing solutions. To match instances across heterogeneous repositories, a matching system can follow a configuration that specifies the equivalent properties, suitable similarity metrics, and other important parameters. This configuration can be created manually or learned automatically. We present ScLink, an instance matching system that generates the configuration automatically. ScLink implements two novel supervised learning algorithms, cLearn and minBlock. cLearn applies an apriori-like heuristic to find the optimal combination of matching properties and similarity metrics. minBlock finds a blocking model that aims to optimally reduce the number of pairwise alignments of instances between the input repositories. In addition, ScLink introduces other techniques to address scalability on large repositories. Experimental results on standard and very large datasets show that minBlock and cLearn are very effective and efficient. cLearn is also significantly better than existing configuration learning algorithms. It drastically boosts the accuracy of ScLink and makes the system outperform state-of-the-art systems, even when trained on a small amount of labeled data.
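
To make the roles of a configuration and a blocking step concrete, the following is a minimal illustrative sketch and not ScLink's implementation. It assumes a hand-written configuration (in ScLink this would be learned by cLearn), simple token-based blocking on one property pair (standing in for the model that minBlock would learn), and Jaccard token similarity as the only metric. All names here (`CONFIG`, `block_candidates`, `match`, the property names) are hypothetical.

```python
def tokens(value):
    """Lowercased whitespace tokens of a property value."""
    return set(value.lower().split())


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical configuration: which source property is compared with which
# target property, and the score needed to accept a pair as a match.
CONFIG = {
    "property_pairs": [("name", "label"), ("city", "location")],
    "threshold": 0.5,
}


def block_candidates(source, target, key_pair):
    """Token blocking (assumed scheme): keep only the pairs that share at
    least one token on the chosen property pair."""
    src_key, tgt_key = key_pair
    index = {}
    for tid, inst in target.items():
        for tok in tokens(inst.get(tgt_key, "")):
            index.setdefault(tok, set()).add(tid)
    candidates = set()
    for sid, inst in source.items():
        for tok in tokens(inst.get(src_key, "")):
            for tid in index.get(tok, ()):
                candidates.add((sid, tid))
    return candidates


def match(source, target, config):
    """Score every blocked candidate by the mean similarity over the
    configured property pairs and keep those reaching the threshold."""
    pairs = config["property_pairs"]
    matches = []
    for sid, tid in sorted(block_candidates(source, target, pairs[0])):
        scores = [
            jaccard(source[sid].get(p, ""), target[tid].get(q, ""))
            for p, q in pairs
        ]
        score = sum(scores) / len(scores)
        if score >= config["threshold"]:
            matches.append((sid, tid, score))
    return matches
```

In this sketch, blocking only discards pairs that share no token on the first property pair, and the configuration alone decides which surviving candidates count as co-referent. A learned configuration would additionally choose among several similarity metrics and property combinations.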

Keywords: Instance matching, Blocking, Schema-independent, Supervised, Configuration

Paper URL: https://doi.org/10.1007/s10844-016-0426-3