Handling Class-Imbalance with KNN (Neighbourhood) Under-Sampling for Software Defect Prediction

摘要

Software Defect Prediction (SDP) is highly crucial task in software development process to forecast about which modules are more prone to errors and faults before the instigation of the testing phase. It aims to reduce the development cost of the software by focusing the testing efforts to those predicted faulty modules. Though, it ensures in-time delivery of good quality end-product, but class-imbalance of dataset is a major hinderance to SDP. This paper proposes a novel Neighbourhood based Under-Sampling (N-US) algorithm to handle class imbalance issue. This work is dedicated to demonstrating the effectiveness of proposed Neighbourhood based Under-Sampling (N-US) approach to attain high accuracy while predicting the defective modules. The algorithm N-US under samples the dataset to maximize the visibility of minority data points while restricting the excessive elimination of majority data points to avoid information loss. To assess the applicability of N-US, it is compared with three standard under-sampling techniques. Further, this study investigates the performance of N-US as a trusted ally for SDP classifiers. Extensive experiments are conducted using benchmark datasets from NASA repository which are CM1, JM1, KC1, KC2 and PC1. The proposed SDP classifier with N-US technique is compared with baseline models statistically to assess the effectiveness of N-US algorithm for SDP. The proposed model outperforms the rest of the candidate SDP models with the highest AUC score (= 95.6%), the maximum Accuracy value (= 96.9%) and the closest ROC curve to the top left corner. It shows up with the best prediction power statistically with confidence level of 95%.