A fast parallel attribute reduction algorithm using Apache Spark

Authors:

Highlights:

Abstract:

An effective and fast attribute reduction algorithm for high-dimensional datasets is one of the most important issues in big data, and several parallel attribute reduction algorithms have been implemented using MapReduce. However, MapReduce is not suitable for iterative computing, which causes low computational efficiency in many cases. In this paper, we propose a novel parallel attribute reduction algorithm based on the new-generation distributed computing framework Apache Spark. First, a core attribute decision strategy is proposed to replace the traditional attribute significance calculation, reducing the number of iterations from C·|R| − |R|²∕2 + |R|∕2 to C (where C is the number of condition attributes and |R| is the number of attributes in the reduct result). Furthermore, for high-dimensional datasets, we design a batch processing strategy that reduces the number of iterations exponentially. Second, the proposed algorithm is accelerated with three techniques: (1) network data transmission is minimized through localized operations; (2) a single-cache iteration method is suggested to reduce disk I/O cost; and (3) some calculations are skipped by an interruption strategy. In the experimental analysis, we evaluated the algorithm on various real big datasets and random datasets in a real distributed computing environment and compared it with the classic MapReduce-based parallel attribute reduction algorithm PAAR_PR in various aspects. The experimental results show that the computational efficiency of our algorithm improves on the classic parallel attribute reduction algorithm PAAR_PR by more than 98%.
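To make the core attribute decision strategy concrete, the sketch below shows a minimal, non-parallel rough-set computation in plain Python: an attribute is a core attribute if removing it from the condition set shrinks the positive region. Testing each condition attribute once in this way takes exactly C iterations, in contrast to the repeated significance evaluations of greedy reduction. This is an illustrative sketch of the underlying rough-set test only, not the paper's Spark implementation; the data layout (rows as lists, attributes as column indices) and function names are assumptions for the example.

```python
from collections import defaultdict

def partition(data, attrs):
    # Group row indices into equivalence classes by their values on attrs.
    blocks = defaultdict(list)
    for i, row in enumerate(data):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def positive_region(data, attrs, decision):
    # Rows whose equivalence class (under attrs) is decision-consistent.
    pos = set()
    for block in partition(data, attrs):
        if len({data[i][decision] for i in block}) == 1:
            pos.update(block)
    return pos

def core_attributes(data, cond, decision):
    # One positive-region check per condition attribute: C iterations total.
    full = positive_region(data, cond, decision)
    return [a for a in cond
            if positive_region(data, [b for b in cond if b != a], decision) != full]

# Tiny hypothetical decision table: columns 0-2 are condition attributes,
# column 3 is the decision; columns 0 and 2 are duplicates of each other.
table = [
    [0, 0, 0, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
]
print(core_attributes(table, [0, 1, 2], 3))  # → [1]
```

Here only attribute 1 is core: attributes 0 and 2 carry identical information, so dropping either one alone leaves the positive region unchanged.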

Keywords: Rough sets, Big data, Parallel algorithm, Attribute reduction, Apache Spark

Article history: Received 19 May 2020, Revised 20 September 2020, Accepted 29 October 2020, Available online 11 November 2020, Version of Record 24 December 2020.

DOI: https://doi.org/10.1016/j.knosys.2020.106582