Entity matching in heterogeneous databases: A logistic regression approach

Authors:

Highlights:

Abstract

This paper examines a widely encountered data heterogeneity problem, often faced in real-world decision support situations, called the entity heterogeneity problem. It arises when the same real-world entity type is represented using different identifiers in different applications. Supporting real-world decisions often requires identifying which entity in one application is the same as an entity in a second application. Previous research has proposed decision models to resolve this problem. However, implementing those models requires either estimating probability parameters by manually matching a large sample of the existing data or estimating a distance measure based on user-specified weights. This paper proposes an alternative technique, based on logistic regression, for estimating the matching probabilities used in the matching decision model. The approach has been implemented and tested on real-world data. Comparison of the results with those from earlier approaches indicates that the proposed approach performs quite well and is a viable approach in practical situations.
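As a rough illustration of the general idea, and not the paper's actual model or data, the sketch below fits a logistic regression to attribute-agreement indicators for record pairs and uses it to estimate a matching probability. The feature layout (agreement on name, address, phone), the toy training pairs, and the helper names `fit_logistic` and `match_probability` are all assumptions made for this example.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log-loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def match_probability(w, x):
    """Estimated probability that a record pair refers to the same entity."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


# Hypothetical labeled record pairs: each row flags agreement (1) or
# disagreement (0) on name, address, and phone; the label is 1 for a match.
X_train = [
    [1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1],
    [0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0],
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

weights = fit_logistic(X_train, y_train)
p_agree = match_probability(weights, [1, 1, 1])     # all attributes agree
p_disagree = match_probability(weights, [0, 0, 0])  # no attributes agree
print(f"P(match | all agree) = {p_agree:.3f}")
print(f"P(match | none agree) = {p_disagree:.3f}")
```

In a setting like the one the abstract describes, the estimated probabilities would then feed a matching decision rule, for example by comparing them against a chosen threshold. The threshold and the decision rule itself are not specified here.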

Keywords:

Review history: Received 5 October 2006, Revised 14 September 2007, Accepted 4 October 2007, Available online 18 October 2007.

Official paper URL: https://doi.org/10.1016/j.dss.2007.10.007