Tutorial on practical tips of the most influential data preprocessing algorithms in data mining

Authors:

Highlights:

Abstract

Data preprocessing is a major and essential stage whose main goal is to obtain final data sets that can be considered correct and useful for subsequent data mining algorithms. This paper summarizes the most influential data preprocessing algorithms according to their usage, popularity and the extensions proposed in the specialized literature. For each algorithm, we provide a description, a discussion of its impact, and a review of current and future research on it. These algorithms cover missing values imputation, noise filtering, dimensionality reduction (including feature selection and space transformations), instance reduction (including selection and generation), discretization, and preprocessing for imbalanced data. All of them are among the most important topics in data preprocessing research and development. This paper emphasizes the most well-known preprocessing methods and their practical study; they were selected following a recent, general book on data preprocessing that does not cover them in depth. The manuscript also presents an illustrative study in two sections, using different data sets, that provides useful tips for applying preprocessing algorithms. First, we graphically present the effects of the preprocessing methods on two benchmark data sets, so that the reader can gain insight into their different characteristics and outcomes. Second, we use a real-world problem presented in the ECBDL'2014 Big Data competition to provide a thorough analysis of the application of some preprocessing techniques, their combination and their performance. As a result, five different cases are analyzed, providing tips that may be useful for readers.

Keywords: Data preprocessing, Data reduction, Missing values imputation, Noise filtering, Dimensionality reduction, Instance reduction, Discretization, Data mining

Review history: Received 24 April 2015, Revised 11 December 2015, Accepted 14 December 2015, Available online 21 December 2015, Version of Record 9 March 2016.

Official paper URL: https://doi.org/10.1016/j.knosys.2015.12.006