A supervised gradient-based learning algorithm for optimized entity resolution

Authors:

Highlights:

Abstract

The task of probabilistic record linkage is to find and link records that refer to the same entity across several disparate data sources. Accurate record linking (entity resolution) is an important task for the healthcare industry, government, law enforcement, and the private sector. However, finding exact matches for an entity can be challenging because records in real-world data sources contain typographical, phonetic, and other types of errors (noise). Over the years, many comparison functions have been developed to relate pairs of records and produce a similarity score. With a pair of predefined thresholds, one may decide whether a record pair matches, does not match, or requires further clerical review. Nevertheless, finding appropriate comparison functions, identity descriptors (fields), threshold values, and efficient classifiers remains a challenging task. In this study, we propose a supervised gradient-based learning model that adjusts its structure and parameters based on the matching scores produced by many comparison functions applied to many fields, in order to classify the records efficiently. The design of this structure is transparent and can potentially reveal which comparison functions and fields are most significant for linking the records correctly. To train this structure, we propose a novel performance index that helps the model learn to separate matched from non-matched records. Experiments on synthetic datasets affected by different levels of noise and on real-world datasets show the effectiveness of the algorithm, which can significantly reduce the number of false positives, false negatives, and records selected for review.
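
The two-threshold decision described in the abstract can be illustrated with a minimal sketch. This is not the paper's model: the similarity function (Python's `difflib.SequenceMatcher`), the field names, the hand-set weights, and the threshold values are all assumptions chosen for illustration, whereas the paper learns its structure and parameters by gradient descent.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in comparison function: case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_pair(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted average of per-field similarities.

    The weights are fixed by hand here (an assumption); they are not learned.
    """
    total = sum(weights.values())
    weighted = sum(w * similarity(rec_a[f], rec_b[f]) for f, w in weights.items())
    return weighted / total


def classify(score: float, review_threshold: float, autolink_threshold: float) -> str:
    """Map a score to one of three decisions using two thresholds."""
    if score >= autolink_threshold:
        return "match"
    if score >= review_threshold:
        return "review"  # sent to clerical review
    return "non-match"


# Hypothetical records and parameters, for illustration only.
weights = {"name": 2.0, "city": 1.0}
a = {"name": "Jonathan Smith", "city": "Toronto"}
b = {"name": "Jonathon Smith", "city": "Toronto"}
decision = classify(score_pair(a, b, weights), review_threshold=0.6, autolink_threshold=0.9)
```

Pairs scoring between the two thresholds are the ones routed to clerical review, so narrowing that band without adding false positives or false negatives is the kind of improvement the abstract reports.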

Keywords: Record linkage, Entity resolution, Field selection, Comparison functions, Clerical review threshold, Autolink threshold, Gradient-descent, Decision model

Review history: Received 14 September 2016, Revised 23 August 2017, Accepted 14 October 2017, Available online 16 October 2017, Version of Record 13 November 2017.

Official paper URL: https://doi.org/10.1016/j.datak.2017.10.004