The role of transitive closure in evaluating blocking methods for dirty entity resolution

Authors: Mahdi Niknam, Behrouz Minaei-Bidgoli, Rouhollah Dianat

Abstract

Entity resolution (ER) is the process of identifying duplicate records that refer to the same real-world entity and linking them together within one or more datasets. As a first step toward reducing the number of required record comparisons, blocking methods group records that are likely to match. Properly evaluating blocking methods in order to select the best one directly affects final ER performance. The currently available metrics for evaluating blocking techniques assess only their actual potential. In dirty datasets, however, new pairs can be deduced from the identified ones because of the transitive closure between matching record pairs. This study proposes a modification of the current metrics that takes into account both transitive closure and the potential of blocking methods, giving a more accurate evaluation. A comparison of the existing and proposed metrics for ten available blocking algorithms on two dirty datasets shows that the proposed metrics correlate significantly with final ER performance.
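
The idea of deducing new pairs through transitive closure can be illustrated with a small sketch. This is not the metric proposed in the paper, only a hedged example under stated assumptions: the record identifiers, the helper names `transitive_closure_pairs` and `pair_completeness`, and the toy data are all hypothetical, and pair completeness (the fraction of true matching pairs that are found) stands in here for a generic blocking metric.

```python
from collections import defaultdict
from itertools import combinations


def transitive_closure_pairs(pairs):
    """Return every pair implied by the transitive closure of `pairs`.

    Hypothetical helper: records are grouped with a union-find structure,
    then all pairs inside each group are enumerated as sorted tuples.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the groups of the two records in every identified pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the members of each connected group.
    clusters = defaultdict(list)
    for record in list(parent):
        clusters[find(record)].append(record)

    # Every pair within a group is implied by transitivity.
    closed = set()
    for members in clusters.values():
        closed.update(combinations(sorted(members), 2))
    return closed


def pair_completeness(found_pairs, true_pairs):
    """Fraction of true matching pairs contained in `found_pairs`."""
    return len(found_pairs & true_pairs) / len(true_pairs)


# Toy ground truth (assumed): r1, r2, r3 are one entity; r4, r5 another.
true_matches = {("r1", "r2"), ("r1", "r3"), ("r2", "r3"), ("r4", "r5")}

# Matching pairs that a hypothetical blocking method places in a common block.
blocked_matches = {("r1", "r2"), ("r2", "r3")}

direct = pair_completeness(blocked_matches, true_matches)
closed = transitive_closure_pairs(blocked_matches)
with_closure = pair_completeness(closed, true_matches)

print(direct, with_closure)
```

In this toy case the pair (r1, r3) is never compared directly, yet it follows from (r1, r2) and (r2, r3), so a metric that ignores transitive closure understates what the blocking method makes recoverable.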

Keywords: Blocking methods, Entity resolution, Evaluation, Identification of duplicate records, Record linkage, Transitive closure

Review process:

Paper URL: https://doi.org/10.1007/s10844-021-00676-3