Towards high dimensional instance selection: An evolutionary approach

Authors:

Highlights:

• An efficient genetic algorithm (EGA) is proposed for the data reduction problem.

• Compared with GA, EGA contains four novel components.

• The experimental results show that EGA performs the best in terms of classification accuracy.

• EGA can produce the largest reduction rates and requires much less computational time than GA.

Abstract

Data reduction is an important pre-processing step in the knowledge discovery in databases (KDD) process. It can be performed by applying instance selection algorithms that filter out unrepresentative or noisy data from a given (training) dataset. However, the performance of instance selection on very high dimensional data has not yet been fully examined. In this paper, we introduce a novel efficient genetic algorithm (EGA), which fits “biological evolution” into the evolutionary process: after long-term evolution, individuals find the most efficient way to allocate resources and evolve. The experimental study is based on four very high dimensional datasets ranging from 200 to 18,236 dimensions. Four state-of-the-art algorithms, IB3, DROP3, ICF, and GA, are compared with EGA. The experimental results show that, with EGA, the k-NN and SVM classifiers achieve classification performance closest to that of the baseline classifiers without instance selection. In particular, EGA outperforms the four algorithms in terms of average classification accuracy. Moreover, EGA produces the largest reduction rates (the same as GA) and requires relatively less computational time than the other four algorithms.
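To make the idea of GA-based instance selection concrete, the following is a minimal sketch of a plain genetic algorithm that evolves a keep/drop mask over the training instances. It is not the EGA described in the paper, whose four novel components are not detailed in this abstract. The function names, the 1-NN fitness, the weighting parameter `alpha`, and all hyperparameters are assumptions chosen for illustration only.

```python
import random


def one_nn_accuracy(train, test):
    """Fraction of test points whose nearest training point has the same label."""
    if not train:
        return 0.0
    correct = 0
    for x, y in test:
        # Nearest neighbour by squared Euclidean distance.
        nearest = min(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x)))
        correct += nearest[1] == y
    return correct / len(test)


def reduction_rate(mask):
    """Share of training instances removed by the selection mask."""
    return 1 - sum(mask) / len(mask)


def fitness(mask, train, valid, alpha=0.5):
    """Assumed fitness: weighted sum of 1-NN accuracy and reduction rate."""
    subset = [inst for inst, keep in zip(train, mask) if keep]
    return alpha * one_nn_accuracy(subset, valid) + (1 - alpha) * reduction_rate(mask)


def ga_select(train, valid, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Evolve a boolean mask over `train`; True means the instance is kept."""
    rng = random.Random(seed)
    n = len(train)
    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: the better half survives unchanged (elitism).
        pop.sort(key=lambda m: fitness(m, train, valid), reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with probability p_mut per gene.
            child = [(not g) if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda m: fitness(m, train, valid))
```

Each instance is a `(features, label)` pair, and `ga_select(train, valid)` returns the best mask found. Evaluating every candidate mask requires a full nearest-neighbour pass over high dimensional data, which is one reason the computational time of GA-style instance selection is a concern.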

Keywords: Data reduction, Instance selection, Data mining, Machine learning, Genetic algorithms, High dimensional data

Review history: Received 28 March 2013, Revised 23 December 2013, Accepted 28 January 2014, Available online 5 February 2014.

Official paper URL: https://doi.org/10.1016/j.dss.2014.01.012