Centralized vs. distributed feature selection methods based on data complexity measures

Authors:

Highlights:

• A methodology for distributing the process of feature selection based on several data complexity measures is proposed.

• We address both strategies for partitioning the datasets: horizontal (i.e. by samples) and vertical (i.e. by features).

• We present an experimental study on 11 datasets (five of them microarrays), evaluated in terms of the number of selected features, classification accuracy, and running time.

• The novel procedures significantly reduce the running time while maintaining (or even improving) classification performance.
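
The two partitioning strategies named in the highlights can be illustrated with a minimal sketch. It is not the authors' procedure: the function names, the random shuffling of samples, and the round-robin assignment of features are assumptions made for illustration only, and the data complexity measures the paper relies on are not modelled here.

```python
import random

def horizontal_partition(data, n_parts, seed=0):
    """Split by samples: each part keeps all features for a subset of rows.
    Shuffling before splitting is an illustrative assumption."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    return [[data[i] for i in indices[k::n_parts]] for k in range(n_parts)]

def vertical_partition(data, n_parts):
    """Split by features: each part keeps all rows for a subset of columns.
    Round-robin column assignment is an illustrative assumption.
    Returns (column_indices, sub_dataset) pairs so selected features
    can be mapped back to the original feature indices."""
    n_features = len(data[0])
    parts = []
    for k in range(n_parts):
        cols = list(range(k, n_features, n_parts))
        parts.append((cols, [[row[j] for j in cols] for row in data]))
    return parts

# Toy dataset: 6 samples, 4 features.
data = [[r * 10 + c for c in range(4)] for r in range(6)]
h_parts = horizontal_partition(data, n_parts=3)  # 3 parts of 2 samples each
v_parts = vertical_partition(data, n_parts=2)    # 2 parts of 2 features each
```

In a distributed setting, a feature selection method would then run independently on each part, and the per-part results would be merged into a single feature subset.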

Keywords: Distributed learning, Feature selection, Data complexity measures, Classification

Review history: Received 10 February 2016, Revised 6 September 2016, Accepted 26 September 2016, Available online 28 September 2016, Version of Record 20 December 2016.

Paper URL: https://doi.org/10.1016/j.knosys.2016.09.022