Automatically classifying source code using tree-based approaches

作者:

Highlights:

摘要

Analyzing source code to solve software engineering problems such as fault prediction, cost, and effort estimation always receives attention of researchers as well as companies. The traditional approaches are based on machine learning, and software metrics obtained by computing standard measures of software projects. However, these methods have faced many challenges due to limitations of using software metrics which were not enough to capture the complexity of programs.To overcome the limitations, this paper aims to solve software engineering problems by exploring information of programs' abstract syntax trees (ASTs) instead of software metrics. We propose two combination models between a tree-based convolutional neural network (TBCNN) and k-Nearest Neighbors (kNN), support vector machines (SVMs) to exploit both structural and semantic ASTs' information. In addition, to deal with high-dimensional data of ASTs, we present several pruning tree techniques which not only reduce the complexity of data but also enhance the performance of classifiers in terms of computational time and accuracy.We survey many machine learning algorithms on different types of program representations including software metrics, sequences, and tree structures. The approaches are evaluated based on classifying 52000 programs written in C language into 104 target labels. The experiments show that the tree-based classifiers dramatically achieve high performance in comparison with those of metrics-based or sequences-based; and two proposed models TBCNN + SVM and TBCNN + kNN rank as the top and the second classifiers. Pruning redundant AST branches leads to not only a substantial reduction in execution time but also an increase in accuracy.

论文关键词:Abtract Syntax Tree (AST),Tree-based convolutional neural networks(TBCNN),Support Vector Machines (SVMs),K-Nearest Neighbors (kNN)

论文评审过程:Received 20 January 2017, Revised 11 July 2017, Accepted 17 July 2017, Available online 27 July 2017, Version of Record 31 March 2018.

论文官网地址:https://doi.org/10.1016/j.datak.2017.07.003