Large scale instance selection by means of federal instance selection

作者:

Highlights:

摘要

Instance selection is becoming more and more relevant due to the huge amount of data that is constantly being produced. However, although current algorithms are useful for fairly large datasets, many scaling problems are found when the number of instances is hundreds of thousands or millions. Most of the widely used instance selection algorithms are of complexity at least O(n2), n being the number of instances. When we face very large problems, the scalability becomes an issue, and most of the algorithms are not applicable.This paper presents a methodology for scaling up instance selection algorithms by means of a parallel procedure that performs instance selection on small subsets of the original dataset. The results obtained with the application of instance selection to small subsets are combined using a voting scheme. The method achieves a very good performance in terms of testing error and storage reduction, while the execution time of the process is decreased very significantly. The parallel algorithm also removes any kind of constraint imposed by memory size, as the whole dataset does not need to be stored in memory.The usefulness of our method is shown by an extensive comparison using 35 datasets of medium and large sizes from the UCI Machine Learning Repository. Additionally, our method is applied to eight very large datasets with very good results and fast execution time.

论文关键词:Instance selection,Instance-based learning,Parallel algorithms,Very large problems,Scalability

论文评审过程:Received 3 August 2010, Revised 14 March 2012, Accepted 14 March 2012, Available online 23 March 2012.

论文官网地址:https://doi.org/10.1016/j.datak.2012.03.002