T-norms or t-conorms? How to aggregate similarity degrees for plagiarism detection

Abstract

Deciding correctly whether two code chunks should be considered similar is increasingly important in software design and education: it can improve the quality of computer programs and help assure the integrity of student assessments. In this paper we test numerous source code similarity detection tools on pairs of code fragments written in R, a data science-oriented functional programming language. Contrary to mainstream approaches, instead of considering symmetric measures of “how much code chunks A and B are similar to each other”, we propose and study the nonsymmetric degrees of inclusion “to what extent A is a subset of B” and “to what degree B is included in A”, which must then be aggregated into a single decision. Overall, t-norms yield better precision (how many suspicious pairs are actually similar), t-conorms maximise recall (how many similar pairs are successfully retrieved), and custom aggregation functions fitted to training data provide a good balance between the two. We also find that program dependence graph-based methods tend to outperform those relying on normalised source code text, tokens, and names of functions invoked.
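To make the precision/recall trade-off concrete, the following minimal Python sketch aggregates two inclusion degrees with standard fuzzy connectives. It is an illustration only, not the authors' implementation: the function names, the example degrees (0.9 and 0.2), and the decision threshold of 0.5 are all assumptions chosen for the example.

```python
# Standard fuzzy logic connectives on [0, 1].
def t_norm_min(x, y):
    """Minimum t-norm: high only if BOTH degrees are high."""
    return min(x, y)

def t_norm_product(x, y):
    """Product t-norm."""
    return x * y

def t_conorm_max(x, y):
    """Maximum t-conorm: high if EITHER degree is high."""
    return max(x, y)

def t_conorm_prob_sum(x, y):
    """Probabilistic sum t-conorm."""
    return x + y - x * y

def is_suspicious(a_in_b, b_in_a, aggregate, threshold=0.5):
    """Flag a pair when the aggregated inclusion degree reaches the threshold.

    The threshold value is a hypothetical choice for this sketch.
    """
    return aggregate(a_in_b, b_in_a) >= threshold

# Hypothetical pair: A is almost entirely contained in B,
# but B contains much more than A.
a_in_b, b_in_a = 0.9, 0.2

flag_tnorm = is_suspicious(a_in_b, b_in_a, t_norm_min)      # strict: not flagged
flag_tconorm = is_suspicious(a_in_b, b_in_a, t_conorm_max)  # permissive: flagged
```

Because every t-norm is bounded above by the minimum and every t-conorm is bounded below by the maximum, any pair flagged under a t-norm is also flagged under a t-conorm at the same threshold. The t-norm therefore reports fewer pairs, and the t-conorm reports more, which is consistent with the precision and recall behaviour described in the abstract.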

Keywords: Fuzzy logic connectives, Similarity aggregation, Decision making, Data-driven optimisation, R language

Review history: Received 15 December 2020, Revised 19 August 2021, Accepted 19 August 2021, Available online 24 August 2021, Version of Record 27 August 2021.

Official URL: https://doi.org/10.1016/j.knosys.2021.107427