Identity matching and information acquisition: Estimation of optimal threshold parameters

Highlights:

• We analyze statistical challenges of calibrating matching software through sampling.

• We address a gap in the literature through an in-depth analysis of various estimators.

• Our analysis reveals that the Cost-Based estimator exhibits the ‘best’ performance.

• We report the effects of nonlinearity and nonmonotonicity on the convergence rates.

• With large samples, convergence rates are faster than those of the basic population mean estimator.

Abstract

With the growing volume of data collected and stored from customer interactions, which have recently shifted towards online channels, an important challenge faced by today's businesses is dealing appropriately with data quality problems. A key step in the data cleaning process is the matching and merging of customer records to assess the identity of individuals. The practical importance of this research is exemplified by a large client firm that deals with private label credit cards. The firm needed to know whether new customers had existing histories within the company in order to decide on the appropriate parameters of possible card offerings. The company incurs substantial costs if it incorrectly “matches” an incoming application with an existing customer (Type I error), and also if it falsely assumes that there is no match (Type II error). While a good deal of generic identity matching software is available that provides a “strength” score for each potential match, the question of how to use the scores for new applications is of great interest and is addressed in this work. The academic significance lies in the analysis of the score thresholds that are typically used in decision making. That is, upper and lower thresholds are set: matches are accepted above the former and rejected below the latter, and more information is gathered between the two. We show, for the first time, that the optimal thresholds can be considered to be parameters of a matching distribution, and a number of estimators of these parameters are developed and analyzed. Extensive computations then show the effects of various factors on the convergence rates of the estimates.
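
The two-threshold decision rule described in the abstract can be illustrated with a minimal Python sketch. The function name, the action labels, and the threshold values in the usage note are hypothetical and chosen only for illustration. The treatment of scores exactly equal to a threshold (assigned here to the information-gathering region) is likewise an assumption, since the abstract does not specify it. The sketch shows only how given thresholds would be applied; it does not estimate them.

```python
def classify_match(score: float, lower: float, upper: float) -> str:
    """Apply a two-threshold rule to a match "strength" score.

    All names and labels are hypothetical. Scores exactly equal to a
    threshold fall into the information-gathering region (an assumption).
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score > upper:
        return "accept"   # treat the application as matching an existing customer
    if score < lower:
        return "reject"   # treat the application as a new customer
    return "gather_more_information"  # score lies between the two thresholds
```

For example, with illustrative thresholds `lower=0.3` and `upper=0.8`, a score of 0.9 would be accepted, a score of 0.1 rejected, and a score of 0.5 would trigger further information gathering.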