MEFASD-BD: Multi-objective evolutionary fuzzy algorithm for subgroup discovery in big data environments - A MapReduce solution

Authors:

Highlights:

Abstract

Data volumes around the world are growing rapidly, with the Internet as one of the main drivers and a growth rate above 30 GB/s. Traditional data mining algorithms cannot process this amount of information efficiently, so new algorithms must be designed for, or adapted to, distributed paradigms such as MapReduce. This challenge is investigated by the community under the widely known term big data. This paper presents a new algorithm for the subgroup discovery task called MEFASD-BD. The algorithm is developed in Apache Spark, based on the MapReduce paradigm, and is able to tackle high-dimensional datasets efficiently. In fact, it is the first approach to big data within evolutionary fuzzy systems for subgroup discovery. MEFASD-BD implements novel MapReduce functions that evaluate the quality of the subgroups obtained in each map with respect to the original dataset in order to improve the quality of these subgroups. In addition, the final reduce function employs the token competition operator to select the best rules extracted in the different maps. An experimental study with high-dimensional datasets is performed to show the advantages of the algorithm on this type of problem. Specifically, the results show a significant reduction in runtime while maintaining the values of the standard quality measures for subgroup discovery.
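The reduce step described in the abstract can be illustrated with a minimal sketch of token competition in plain Python. This is not the authors' Spark implementation: the rule representation, the crisp (non-fuzzy) coverage predicates, the hard-coded quality values, and the names `token_competition` and `rules_from_maps` are all assumptions made for illustration. The sketch assumes each rule already carries a quality value computed against the whole dataset.

```python
from itertools import chain
from typing import Callable, Dict, List, Set, Tuple

Example = Dict[str, str]
# Hypothetical rule format: (name, coverage predicate, quality on the full dataset).
Rule = Tuple[str, Callable[[Example], bool], float]


def token_competition(rules: List[Rule], examples: List[Example]) -> List[Rule]:
    """Keep only rules that seize at least one example no better rule has seized."""
    # Rules compete in descending order of quality.
    ranked = sorted(rules, key=lambda rule: rule[2], reverse=True)
    seized: Set[int] = set()  # indices of examples whose token is already taken
    survivors: List[Rule] = []
    for rule in ranked:
        _, covers, _ = rule
        new_tokens = {
            i for i, ex in enumerate(examples) if covers(ex) and i not in seized
        }
        if new_tokens:  # a rule that wins no token is discarded
            seized |= new_tokens
            survivors.append(rule)
    return survivors


# Toy dataset (assumed, not from the paper).
examples: List[Example] = [
    {"age": "young", "income": "high"},
    {"age": "young", "income": "low"},
    {"age": "old", "income": "low"},
]

# Rules as they might arrive from two separate maps (assumed values).
rules_from_maps: List[List[Rule]] = [
    [
        ("young", lambda ex: ex["age"] == "young", 0.9),
        ("young_high", lambda ex: ex["age"] == "young" and ex["income"] == "high", 0.7),
    ],
    [("old", lambda ex: ex["age"] == "old", 0.5)],
]

# Reduce: merge the rules of all maps, then apply token competition.
survivors = token_competition(list(chain.from_iterable(rules_from_maps)), examples)
print([name for name, _, _ in survivors])
```

In this toy run the rule `young_high` is dropped because every example it covers has already been seized by the higher-quality rule `young`, which is the redundancy-removal effect token competition is meant to provide.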

Keywords: Subgroup discovery, Big data, Multi-objective evolutionary fuzzy systems, Apache Spark, MapReduce

Review history: Received 30 March 2016, Revised 17 August 2016, Accepted 24 August 2016, Available online 24 August 2016, Version of Record 20 December 2016.

Official paper URL: https://doi.org/10.1016/j.knosys.2016.08.021