Comprehensive relative importance analysis and its applications to high dimensional gene expression data analysis

Abstract

Identifying important genes is challenging not only because gene expression data are high dimensional, but also because the expressions of genes from the same pathway are often highly correlated. A large number of feature selection methods have been proposed to select a subset of genes for interpreting and predicting certain phenotypes. Among them, L1 penalization-based methods, such as lasso, adaptive lasso and elastic net, have received the most attention. However, the L1 penalty employed by these methods is known to have difficulty selecting a group of highly correlated features. The problem of identifying important, highly correlated features, on the other hand, is well studied in multiple regression analysis with a sufficient sample size. In particular, relative weight analysis is known to be effective in measuring the relative importance of correlated features. However, relative weight analysis requires a full-column-rank feature matrix and is therefore infeasible for high dimensional problems. In this research, a comprehensive relative importance analysis is proposed and proven valid without restrictions on sample size or matrix rank. Simulations and real cases are used to show the effectiveness of the proposed method in selecting relevant features, especially for high dimensional data.
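To make the full-column-rank limitation concrete, the following is a minimal sketch of classical relative weight analysis in the form commonly attributed to Johnson (2000). It is not the comprehensive method proposed in the paper. The function name `relative_weights` and the tolerance `tol` are assumptions made for illustration. The sketch fails by design when the predictor correlation matrix is singular, which is always the case when there are more features than samples.

```python
import numpy as np

def relative_weights(X, y, tol=1e-10):
    """Classical relative weights (illustrative sketch, not the paper's method).

    Returns one non-negative weight per feature; the weights sum to the
    R^2 of the ordinary least squares fit of y on X.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Standardize so that cross-products become correlations.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    ys = (y - y.mean()) / y.std(ddof=1)
    Rxx = Xs.T @ Xs / (n - 1)   # predictor correlation matrix (p x p)
    rxy = Xs.T @ ys / (n - 1)   # predictor-response correlations (p,)

    # Symmetric square root of Rxx via its eigendecomposition.
    evals, evecs = np.linalg.eigh(Rxx)
    if evals.min() < tol:
        # Rank-deficient case (for example p > n): the square root
        # cannot be inverted, so the classical weights are undefined.
        raise ValueError("feature matrix is not of full column rank")
    Lam = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

    # Regress y on the orthogonalized predictors, then map the squared
    # coefficients back to the original predictors.
    beta = np.linalg.solve(Lam, rxy)
    return (Lam ** 2) @ (beta ** 2)
```

Calling `relative_weights(X, y)` on a matrix with more columns than rows raises `ValueError` in this sketch, which is the restriction the abstract says the proposed analysis removes.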

Keywords: Collinearity, Feature ranking, High dimensional, Small sample size, Relative importance, Singularity

Review history: Received 21 February 2020, Revised 8 May 2020, Accepted 6 June 2020, Available online 8 June 2020, Version of Record 10 June 2020.

Paper URL: https://doi.org/10.1016/j.knosys.2020.106120