Principal Components Analysis Random Discretization Ensemble for Big Data

Authors:

Highlights:

Abstract

The enormous growth of data has created major challenges for data computation and analysis, and classic data mining techniques were not designed for these new space and time requirements. Discretization and dimensionality reduction are two of the main data reduction tasks in knowledge discovery. Random Projection Random Discretization, an ensemble method proposed by Ahmad and Brown in 2014, performs both discretization and dimensionality reduction to create more informative data. Although random projections are efficient for dimensionality reduction, more robust methods such as Principal Components Analysis (PCA) can improve performance. We propose a new ensemble method, named Principal Components Analysis Random Discretization Ensemble, that overcomes this drawback by using PCA for dimension reduction on the Apache Spark platform. Experimental results on five large-scale datasets show that our solution outperforms both the original algorithm and Random Forest in terms of prediction performance. The results also show that high-dimensional data can affect the runtime of the algorithm.
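The paper itself provides no code here; as a rough illustration of the dimension-reduction step the abstract describes, the sketch below runs PCA over a distributed DataFrame with Spark MLlib. The toy feature vectors and the choice of k = 3 components are assumptions for illustration only, not the authors' actual ensemble implementation.

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object PcaReductionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pca-reduction-sketch")
      .master("local[*]") // local mode for this toy example only
      .getOrCreate()

    // Toy feature vectors standing in for a large-scale dataset.
    val data = Seq(
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
      Vectors.dense(6.0, 1.0, 5.0, 2.0, 1.0)
    ).map(Tuple1.apply)
    val df = spark.createDataFrame(data).toDF("features")

    // Project the 5-dimensional features onto k = 3 principal components;
    // an ensemble method would train a base learner on the reduced data.
    val pcaModel = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(3)
      .fit(df)

    pcaModel.transform(df).select("pcaFeatures").show(truncate = false)
    spark.stop()
  }
}
```

In a full ensemble along the lines the abstract sketches, each member would pair a reduction such as this with a random discretization of the inputs before training its base classifier.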

Keywords: Big Data, Discretization, Spark, Decision tree, PCA, Data reduction.

Article history: Received 4 September 2017, Revised 6 February 2018, Accepted 8 March 2018, Available online 9 March 2018, Version of Record 26 May 2018.

Paper URL: https://doi.org/10.1016/j.knosys.2018.03.012