A Framework for Measuring Differences in Data Characteristics

Authors:

Highlights:

Abstract

A data mining algorithm builds a model that captures interesting aspects of the underlying data. We develop a framework for quantifying the difference, called the deviation, between two datasets in terms of the models they induce. In addition to being a quantitative, intuitively interpretable measure of difference, the deviation between two datasets can also be computed very fast. Our framework covers a wide variety of models including frequent itemsets, decision tree classifiers, and clusters, and captures standard measures of deviation such as the misclassification rate and the chi-squared metric as special cases. We also show how statistical techniques can be applied to the deviation measure to assess whether the difference between two models is significant (i.e., whether the underlying datasets have statistically significant differences in their characteristics), and discuss several practical applications.
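To make the idea of comparing two datasets through the models they induce more concrete, the following is a minimal sketch for the frequent-itemset case. It is not the paper's definition: the abstract gives no formula, so the choice of a fixed list of itemsets (`regions`), the `support` helper, and the sum of absolute support differences are all assumptions made purely for illustration.

```python
from typing import FrozenSet, List, Set

Transaction = Set[str]


def support(dataset: List[Transaction], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions in `dataset` that contain every item in `itemset`."""
    if not dataset:
        return 0.0
    return sum(1 for t in dataset if itemset <= t) / len(dataset)


def deviation(d1: List[Transaction], d2: List[Transaction],
              regions: List[FrozenSet[str]]) -> float:
    """Illustrative deviation: sum of absolute support differences over `regions`.

    Both the difference function (absolute difference) and the aggregation
    (sum) are assumptions for this sketch, not taken from the paper.
    """
    return sum(abs(support(d1, r) - support(d2, r)) for r in regions)


# Two toy transaction datasets (hypothetical data).
d1 = [{"a", "b"}, {"a"}, {"b"}, {"a", "b"}]
d2 = [{"a"}, {"a"}, {"b"}, {"c"}]

# Itemsets over which the two datasets are compared (hypothetical choice).
regions = [frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})]

print(deviation(d1, d2, regions))
print(deviation(d1, d1, regions))
```

Under these assumptions, a dataset has zero deviation from itself, and the value grows as the supports of the chosen itemsets diverge. Each dataset is scanned once per itemset, which gives some intuition for why a measure of this kind can be cheap to compute.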

Keywords:

Review history: Received 10 September 1999, Revised 14 October 1999, Available online 11 June 2002.

Paper URL: https://doi.org/10.1006/jcss.2001.1808