Efficient machine learning on data science languages with parallel data summarization

Authors:

Highlights:

Abstract

Nowadays, data science analysts prefer "easy" high-level languages such as R and Python for machine learning computation, but these languages suffer from memory and speed limitations, and scalability becomes a further issue as the data set grows. On the other hand, machine learning algorithms can be accelerated with data summarization, a fundamental technique in data mining. With these motivations in mind, we present an efficient way to compute statistical and machine learning models with parallel data summarization that works with popular data science languages. Our summarization produces one or multiple summaries, accelerates a broad class of statistical and machine learning models, and requires a small amount of RAM. We present an algorithm that works in three phases and is capable of handling data sets larger than main memory. Our solution evaluates a vector–vector outer product in C++ code to escape the bottleneck of the high-level programming language. We present an experimental evaluation with a prototype in the R language where the summarization is programmed in C++. Our experiments show that our solution can work on both data subsets and the full data set without any performance penalty. We also compare our solution (R combined with C++) with other parallel big data systems: Spark (with the Spark-MLlib library) and a parallel DBMS (a similar approach implemented with UDFs and SQL queries). Our solution is simpler and generally faster than Spark, depending on how the data set is stored, and it is much faster than a parallel DBMS regardless of how the data set is stored.
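The abstract does not give the summary's exact contents, but summarization by accumulating vector–vector outer products is commonly realized as a matrix of sufficient statistics (count, linear sum, and sum of outer products). The sketch below is a minimal, hypothetical C++ illustration of that idea, including the merge step that makes the computation parallelizable over data partitions; the `Summary` type and its members are assumptions, not the authors' actual code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: one-pass summarization of a d-dimensional data set.
// Accumulates n (row count), L (sum of vectors), and Q (sum of outer
// products x * x^T), from which several statistical models (e.g., linear
// regression) can later be computed without rescanning the data.
struct Summary {
    std::size_t n = 0;
    std::vector<double> L;   // length d
    std::vector<double> Q;   // length d*d, row-major

    explicit Summary(std::size_t d) : L(d, 0.0), Q(d * d, 0.0) {}

    // Fold one data point x into the summary.
    void update(const std::vector<double>& x) {
        ++n;
        const std::size_t d = L.size();
        for (std::size_t i = 0; i < d; ++i) {
            L[i] += x[i];
            for (std::size_t j = 0; j < d; ++j)
                Q[i * d + j] += x[i] * x[j];   // outer-product accumulation
        }
    }

    // Summaries of disjoint partitions add element-wise, so partitions can
    // be summarized in parallel and merged at the end.
    void merge(const Summary& other) {
        n += other.n;
        for (std::size_t i = 0; i < L.size(); ++i) L[i] += other.L[i];
        for (std::size_t i = 0; i < Q.size(); ++i) Q[i] += other.Q[i];
    }
};
```

Because `merge` is associative and commutative, each worker can summarize one block of a data set larger than RAM, and only the small d-by-d summaries need to be combined, which matches the parallel, memory-frugal design the abstract describes.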

Keywords: Machine learning, Parallel computation, Statistics, Data summarization

Article history: Available online 17 September 2021; Version of Record 2 October 2021.

Paper URL: https://doi.org/10.1016/j.datak.2021.101930