A novel data repairing approach based on constraints and ensemble learning

Authors:

Highlights:

Abstract

Data repairing is an important task in data mining. This paper proposes a novel data repairing approach that combines constraints with ensemble learning. First, functional dependencies (FDs) are used as constraints to identify inconsistent records. For each FD, all repeated values in the correct records are discovered. Noisy attributes in erroneous records are then detected using the correct records and these repeated values. To correct the detected noise, a supervised ensemble learning model is constructed for each attribute. The ensemble consists of a Bayes classifier, a decision tree, and a multilayer perceptron (MLP), combined by majority voting. The proposed approach repairs data automatically, without any user interaction, and can detect more than one noisy value in a record. Experimental results show that the approach outperforms similar repairing algorithms (HoloClean and KATARA) in terms of both precision and recall.
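The two building blocks named in the abstract, FD-based inconsistency detection and majority voting, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fd_violations` and `majority_vote`, the toy `zip -> city` dependency, and the sample records are all assumptions made for illustration, and the three votes merely stand in for the outputs of the Bayes, decision tree, and MLP classifiers.

```python
from collections import Counter, defaultdict


def fd_violations(records, lhs, rhs):
    """Return indices of records involved in a violation of the FD lhs -> rhs.

    Records that agree on every lhs attribute must agree on rhs; any group
    that does not is flagged in full (hypothetical helper, not from the paper).
    """
    groups = defaultdict(list)
    for i, rec in enumerate(records):
        key = tuple(rec[a] for a in lhs)
        groups[key].append(i)

    bad = []
    for idxs in groups.values():
        # More than one distinct rhs value for the same lhs key breaks the FD.
        if len({records[i][rhs] for i in idxs}) > 1:
            bad.extend(idxs)
    return sorted(bad)


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


# Toy data (assumed): the FD zip -> city is violated by the "10001" group.
records = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
    {"zip": "60601", "city": "Chicago"},
]

inconsistent = fd_violations(records, ["zip"], "city")

# Stand-in predictions from three base classifiers for one noisy cell.
repaired = majority_vote(["New York", "New York", "Newark"])
```

The sketch flags every record in a conflicting group. Deciding which record within the group actually carries the noisy value is the step the abstract attributes to the correct records and their repeated values, and it is not modeled here.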

Keywords: Data repairing, Noise detection, Functional dependency, Ensemble learning

Review history: Received 18 August 2019, Revised 25 December 2019, Accepted 1 May 2020, Available online 8 May 2020, Version of Record 1 June 2020.

Official paper URL: https://doi.org/10.1016/j.eswa.2020.113511