Transformer-based networks over tree structures for code classification
Authors: Wei Hua, Guangzhong Liu
Abstract
In software engineering (SE), code classification and related tasks, such as code clone detection, remain challenging problems. Because of the elusive syntax and complicated semantics of software programs, existing traditional SE approaches still struggle to differentiate the functionality of code snippets at the semantic level with high accuracy. As artificial intelligence (AI) techniques have advanced in recent years, exploring machine/deep learning techniques for code classification has become increasingly important. Most existing machine/deep learning-based approaches process code text with convolutional neural networks (CNNs) or recurrent neural networks (RNNs). However, both networks suffer from vanishing gradients and fail to capture long-distance dependencies between code statements, resulting in poor performance on downstream tasks. In this paper, we propose TBCC (Transformer-Based Code Classifier), a novel transformer-based neural network for programming language processing that avoids these two problems. Moreover, to capture important syntactic features of programming languages, we split deep abstract syntax trees (ASTs) into smaller subtrees, aiming to exploit the syntactic information in code statements. We have applied TBCC to two common program comprehension tasks to verify its effectiveness: code classification for C programs and code clone detection for Java programs. The experimental results show that TBCC achieves state-of-the-art performance, outperforming the baseline methods in terms of accuracy, recall, and F1 score. To support subsequent research, the code of TBCC has been released.
Keywords: Code clone detection, Code classification, Code representation, Deep neural network, NLP
Paper URL: https://doi.org/10.1007/s10489-021-02894-2
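
The abstract's central technical idea, splitting a deep AST into smaller subtrees before feeding them to a Transformer, can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the paper parses C and Java programs, whereas this sketch uses Python's built-in ast module as a stand-in parser, and the choice of subtree roots (functions, branches, loops) is an assumption made here for illustration.

import ast

def split_into_subtrees(source: str):
    # Parse the whole program into one AST, then collect smaller subtrees
    # rooted at functions, branches, and loops. This mirrors the idea of
    # breaking a deep AST into shallower pieces; the actual split rule in
    # TBCC may differ (assumption for illustration).
    tree = ast.parse(source)
    subtrees = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.If, ast.For, ast.While)):
            subtrees.append(node)
    return subtrees

def linearize(node: ast.AST):
    # Pre-order traversal producing a sequence of node-type tokens, which a
    # Transformer encoder could consume as one input sequence.
    tokens = [type(node).__name__]
    for child in ast.iter_child_nodes(node):
        tokens.extend(linearize(child))
    return tokens

if __name__ == "__main__":
    code = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
    for subtree in split_into_subtrees(code):
        print(linearize(subtree))

Each extracted subtree is linearized into a short token sequence rather than one very deep tree, which is the property the abstract attributes to the subtree split; how TBCC actually encodes and batches these sequences is described in the full paper, not in this sketch.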