Efficient clustering-based source code plagiarism detection using PIY

Authors: Tony Ohmann, Imad Rahal

Abstract

Vast amounts of information available online make plagiarism increasingly easy to commit, and this is particularly true of source code. The traditional approach to detecting copied work in a course setting is manual inspection. This is not only tedious but also typically misses code plagiarized from outside sources or even from an earlier offering of the course. Systems to automatically detect source code plagiarism exist but tend to focus on small submission sets. One such system that has become the standard in automated source code plagiarism detection is Measure of Software Similarity (MOSS) (Schleimer et al., in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, ACM, San Diego, 2003). In this work, we present an approach called Program It Yourself (PIY), which is empirically shown to outperform MOSS in detection accuracy. By utilizing parallel processing and data clustering, PIY is also capable of maintaining detection accuracy and reasonable runtimes even when using extremely large data repositories.

Keywords: Plagiarism detection, Data clustering, \(k\)-Grams, Parallel computing, NUMA

Official paper URL: https://doi.org/10.1007/s10115-014-0742-2