Big data time series forecasting based on nearest neighbours distributed computing with Spark

Authors:

Abstract

This work introduces a new approach to big data forecasting based on the k-weighted nearest neighbours algorithm. The algorithm has been developed for distributed computing under the Apache Spark framework. Every phase of the algorithm is explained, along with how the optimal values of its input parameters are obtained. The algorithm is tested on a big data time series of Spanish energy consumption. The accuracy of the predictions has been assessed, with remarkable results. Additionally, the optimal configuration of a Spark cluster is discussed. Finally, a scalability analysis leads to the conclusion that the proposed algorithm is highly suitable for big data environments.
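
The abstract does not give the algorithm's details, so the following is only a minimal single-machine sketch of the general idea behind weighted nearest neighbours forecasting, not the authors' Spark implementation. The function name `knn_forecast`, the parameters `w` (window length), `h` (forecast horizon) and `k` (number of neighbours), the Euclidean distance, and the inverse squared distance weighting are all assumptions made for illustration.

```python
import math

def knn_forecast(series, w, h, k):
    """Forecast the next h values of `series` from its k nearest past windows.

    Assumptions (not taken from the paper): Euclidean distance between
    windows of length w, and weights proportional to 1 / distance**2.
    """
    if len(series) < w + h:
        raise ValueError("series too short to build a candidate window")
    query = series[-w:]  # the most recent w values
    candidates = []
    # Each candidate is a past window followed by h already-known values.
    for start in range(len(series) - w - h + 1):
        window = series[start:start + w]
        future = series[start + w:start + w + h]
        candidates.append((math.dist(window, query), future))
    candidates.sort(key=lambda c: c[0])
    neighbours = candidates[:k]
    # An exact match would get an infinite weight, so return its future as is.
    if neighbours[0][0] == 0.0:
        return list(neighbours[0][1])
    weights = [1.0 / d ** 2 for d, _ in neighbours]
    total = sum(weights)
    # Weighted average of the neighbours' futures, one value per horizon step.
    return [
        sum(wt * fut[i] for wt, (_, fut) in zip(weights, neighbours)) / total
        for i in range(h)
    ]

if __name__ == "__main__":
    print(knn_forecast([1, 2, 3] * 3, w=3, h=2, k=2))
```

In this sketch, closer windows dominate the forecast because their weights grow as the distance shrinks. A distributed version would have to split the candidate search across workers and merge the per-partition neighbours, which is not attempted here.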

Keywords: Big data, Spark, Time series forecasting

Review history: Received 20 October 2017, Revised 13 July 2018, Accepted 15 July 2018, Available online 17 July 2018, Version of Record 31 October 2018.

Official paper URL: https://doi.org/10.1016/j.knosys.2018.07.026