A fast parallel attribute reduction algorithm using Apache Spark

Authors:

Highlights:

Abstract:

An effective and fast attribute reduction algorithm for high-dimensional datasets is one of the most important issues in big data, and several parallel attribute reduction algorithms have been implemented using MapReduce. However, MapReduce is not suitable for iterative computing, which causes low computational efficiency in many cases. In this paper, we propose a novel parallel attribute reduction algorithm based on the new-generation distributed computing framework Apache Spark. First, a core attribute decision strategy is proposed to replace the traditional attribute significance calculation, reducing the number of iterations from C·|R| − |R|²∕2 + |R|∕2 to C (where C is the number of condition attributes and |R| is the number of attributes in the reduct result). Furthermore, for high-dimensional datasets, we design a batch processing strategy that reduces the number of iterations exponentially. Second, the proposed algorithm is accelerated with three techniques: (1) network data transmission is minimized through localized operations; (2) a single-cache iteration method is suggested to reduce disk I/O cost; and (3) some calculations are skipped by an interruption strategy. In the experimental analysis, we evaluated the algorithm on various real big datasets and random datasets in a real distributed computing environment and compared it with the classic MapReduce-based parallel attribute reduction algorithm PAAR_PR in various aspects. The experimental results show that the computational efficiency of our algorithm improves on the classic parallel attribute reduction algorithm PAAR_PR by more than 98%.
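To make the core attribute decision strategy concrete, the sketch below shows a minimal, non-parallel rough-set computation in plain Python: an attribute is a core attribute if removing it from the condition set shrinks the positive region. Testing each condition attribute once in this way takes exactly C iterations, in contrast to the repeated significance evaluations of greedy reduction. This is an illustrative sketch of the underlying rough-set test only, not the paper's Spark implementation; the data layout (rows as lists, attributes as column indices) and function names are assumptions for the example.

```python
from collections import defaultdict

def partition(data, attrs):
    # Group row indices into equivalence classes by their values on attrs.
    blocks = defaultdict(list)
    for i, row in enumerate(data):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def positive_region(data, attrs, decision):
    # Rows whose equivalence class (under attrs) is decision-consistent.
    pos = set()
    for block in partition(data, attrs):
        if len({data[i][decision] for i in block}) == 1:
            pos.update(block)
    return pos

def core_attributes(data, cond, decision):
    # One positive-region check per condition attribute: C iterations total.
    full = positive_region(data, cond, decision)
    return [a for a in cond
            if positive_region(data, [b for b in cond if b != a], decision) != full]

# Tiny hypothetical decision table: columns 0-2 are condition attributes,
# column 3 is the decision; columns 0 and 2 are duplicates of each other.
table = [
    [0, 0, 0, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
]
print(core_attributes(table, [0, 1, 2], 3))  # → [1]
```

Here only attribute 1 is core: attributes 0 and 2 carry identical information, so dropping either one alone leaves the positive region unchanged.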

Keywords: Rough sets, Big data, Parallel algorithm, Attribute reduction, Apache Spark

Article history: Received 19 May 2020, Revised 20 September 2020, Accepted 29 October 2020, Available online 11 November 2020, Version of Record 24 December 2020.

DOI: https://doi.org/10.1016/j.knosys.2020.106582