Assessing data quality – A probability-based metric for semantic consistency

Authors:

Highlights:

• A probability-based metric for the data quality dimension semantic consistency is proposed.

• It can account for rules that are expected to be fulfilled with specific probabilities.

• The metric values are based on statistical tests and have a clear interpretation.

• The practical applicability of the metric is demonstrated in a real-world setting.

• Here, the metric identified consistency problems and supported decision-making.

Abstract

We present a probability-based metric for semantic consistency that is based on a set of uncertain rules. In contrast to existing metrics for semantic consistency, our metric can account for rules that are expected to be fulfilled with specific probabilities. The resulting metric values represent the probability that the assessed dataset is free of internal contradictions with regard to the uncertain rules and thus have a clear interpretation. The metric values are determined using statistical tests and the concept of the p-value, which allows each metric value to be interpreted as a probability. We demonstrate the practical applicability and effectiveness of the metric in a real-world setting by analyzing a customer dataset of an insurance company. In this setting, the metric was applied to identify semantic consistency problems in the data and to support decision-making, for instance, when offering individual products to customers.
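
To illustrate how a p-value can serve as a consistency score for a single uncertain rule, the following is a minimal sketch under stated assumptions, not the paper's actual metric definition. It assumes that records to which a rule applies fulfil it independently with a known expected probability, and it uses a two-sided exact binomial test. The abstract does not specify which statistical test is used, and the function names `binom_pmf` and `rule_consistency_p_value` are hypothetical.

```python
from math import exp, lgamma, log


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability P(X = k) for X ~ Bin(n, p), computed in log space."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))


def rule_consistency_p_value(fulfilled: int, applicable: int, expected_prob: float) -> float:
    """Two-sided exact binomial p-value for one uncertain rule.

    Assumption (not from the paper): each of the `applicable` records fulfils
    the rule independently with probability `expected_prob`. The p-value sums
    the probabilities of all outcomes that are at most as likely as the
    observed number of fulfilments.
    """
    if applicable <= 0:
        raise ValueError("applicable must be positive")
    if not 0 <= fulfilled <= applicable:
        raise ValueError("fulfilled must lie between 0 and applicable")
    if not 0.0 < expected_prob < 1.0:
        raise ValueError("expected_prob must lie strictly between 0 and 1")

    observed = binom_pmf(fulfilled, applicable, expected_prob)
    threshold = observed * (1 + 1e-9)  # small tolerance for floating-point ties
    total = 0.0
    for k in range(applicable + 1):
        prob = binom_pmf(k, applicable, expected_prob)
        if prob <= threshold:
            total += prob
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical numbers: a rule expected to hold for 90% of applicable
    # records is fulfilled by only 850 of 1000 records.
    print(rule_consistency_p_value(850, 1000, 0.9))
```

Under these assumptions, a value close to 1 indicates that the observed number of fulfilments is in line with the expected probability, while a value close to 0 points to a possible consistency problem for that rule.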

Keywords: Data quality, Data quality assessment, Data quality metric, Data consistency

Review history: Received 7 October 2017, Revised 28 March 2018, Accepted 28 March 2018, Available online 6 April 2018, Version of Record 5 May 2018.

Official paper URL: https://doi.org/10.1016/j.dss.2018.03.011