Fast, scalable and geo-distributed PCA for big data analytics

Authors:

Highlights:

• An efficient block-division approach for PCA on arbitrarily large dimensional data

• Highly scalable algorithm which avoids memory-overflow error for big data

• Fast and communication-efficient accumulation scheme in geo-distributed environment

• An optimized Spark implementation that is 10× more scalable and 1.1–42× faster

• 1.3–2.9× improvement in running time in a geo-distributed environment
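
The block-division and accumulation ideas named in the highlights can be illustrated with a minimal sketch. This is not the paper's algorithm or its Spark implementation: it is a generic pure-Python illustration, assuming the data is split into row blocks, each block (or site) computes small partial statistics, and only those statistics are merged before a single eigenvector computation. The function names (block_stats, merge, covariance, top_component) are hypothetical, and power iteration stands in for a full eigendecomposition.

```python
from math import sqrt

def block_stats(block):
    """Partial statistics for one block of rows: row count, column sums, Gram matrix."""
    d = len(block[0])
    sums = [0.0] * d
    gram = [[0.0] * d for _ in range(d)]
    for row in block:
        for i in range(d):
            sums[i] += row[i]
            for j in range(d):
                gram[i][j] += row[i] * row[j]
    return len(block), sums, gram

def merge(a, b):
    """Combine two partial results; only O(d^2) numbers cross between blocks or sites."""
    n = a[0] + b[0]
    sums = [x + y for x, y in zip(a[1], b[1])]
    gram = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[2], b[2])]
    return n, sums, gram

def covariance(stats):
    """Population covariance (divides by n) recovered from the merged statistics."""
    n, sums, gram = stats
    d = len(sums)
    mean = [s / n for s in sums]
    return [[gram[i][j] / n - mean[i] * mean[j] for j in range(d)] for i in range(d)]

def top_component(cov, iters=200):
    """Leading eigenvector by power iteration (assumes the start vector is not orthogonal to it)."""
    d = len(cov)
    v = [1.0 / sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero covariance: no direction to find
            return v
        v = [x / norm for x in w]
    return v

# Toy data lying on the line y = 2x, split into two row blocks.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
merged = merge(block_stats(data[:2]), block_stats(data[2:]))
component = top_component(covariance(merged))
print(component)
```

Because the merge step is a plain element-wise sum, blocks can be processed one at a time (so the full data set never has to fit in memory) and partial results from different sites can be combined in any order.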

Keywords: Big data, PCA, Dimensionality reduction, Geo-distributed algorithm

Article history: Received 26 May 2019, Revised 19 November 2020, Accepted 28 December 2020, Available online 6 January 2021, Version of Record 15 January 2021.

Official article page: https://doi.org/10.1016/j.is.2020.101710