Complete tolerance relation based parallel filling for incomplete energy big data

Abstract

As the era of cloud and big data computing approaches, renewable energy such as solar power is increasingly integrated into data center power provisioning systems. However, because renewable energy supply is intermittent and time-varying (e.g., shortages or failures), power statistics cannot always be collected, which results in missing data. In this paper, we propose a filling algorithm based on the complete tolerance class to address missing values in energy big data. Traditional rough-set-based methods are likely to fail when data are severely incomplete, and their computation of the tolerance relation and tolerance classes is complex, which makes them unsuitable for large-scale, time-varying energy big data. Our algorithm extends the tolerance relation to the complete tolerance relation, which is used to partition the data into complete tolerance classes. It then fills the missing attribute values of the data center's energy big data, which ensures data integrity and improves classification accuracy. We further parallelize and optimize the algorithm on the state-of-the-art Spark cluster computing framework. In addition, we propose an adaptive management architecture that handles incomplete energy big data in green data centers. The architecture integrates techniques for preprocessing energy data, filling incomplete energy data, and building a decision model. It increases the efficiency of power assignment between solar power and utility power while enhancing load performance and service availability, and can therefore provide better service for green data centers. Comprehensive experiments on an energy data set show that the Completing Incomplete Big Data (CIBD) algorithm guarantees data completeness while improving filling accuracy by 10% compared with general filling algorithms such as MEAN and ERS. The benefit of the proposed algorithm and architecture grows as the data missing rate increases. We further use the filled data to build a random forest model, which yields desirable results. Compared with the Hadoop-based filling algorithm, the CIBD algorithm improves processing speed by 50% on a 4 GB data set.
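
To make the idea of tolerance-class-based filling concrete, the following is a minimal single-machine sketch in Python. It is not the CIBD algorithm: the abstract does not define the complete tolerance relation, so the sketch assumes the classical tolerance relation from rough set theory for incomplete information systems, under which two records are tolerant if every attribute either matches or is missing in at least one of them. The function names, the `MISSING` marker, the majority-vote fill rule, and the sample records are all assumptions made for illustration.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing attribute value


def tolerant(x, y):
    """True if every attribute pair is equal or missing in at least one record."""
    return all(a == b or a is MISSING or b is MISSING for a, b in zip(x, y))


def tolerance_class(records, i):
    """Indices of all other records that are tolerant with records[i]."""
    return [j for j, r in enumerate(records) if j != i and tolerant(records[i], r)]


def fill_missing(records):
    """Fill each missing value with the most common known value
    found for that attribute within the record's tolerance class."""
    filled = [list(r) for r in records]
    for i, rec in enumerate(records):
        neighbours = tolerance_class(records, i)
        for k, value in enumerate(rec):
            if value is not MISSING:
                continue
            known = [records[j][k] for j in neighbours if records[j][k] is not MISSING]
            if known:  # leave the gap if no tolerant record knows this attribute
                filled[i][k] = Counter(known).most_common(1)[0][0]
    return filled


# Hypothetical energy records: (load level, weather, power source)
records = [
    ("high", "sunny", MISSING),
    ("high", "sunny", "solar"),
    ("low", "cloudy", "utility"),
    ("high", MISSING, "solar"),
]
filled = fill_missing(records)
```

In this sketch a value stays missing when no tolerant record supplies it, which is the kind of failure under severe missingness that the abstract attributes to traditional rough-set methods. The pairwise comparison is quadratic in the number of records, which suggests why a parallel implementation matters at the data sizes discussed here.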