An improved and random synthetic minority oversampling technique for imbalanced data

Authors:

Highlights:

Abstract

Learning from imbalanced data has become a major challenge in data mining and machine learning. Oversampling is an effective way to restore class balance by generating new minority-class samples. However, most oversampling methods perform poorly in the presence of noise and complicated distribution structures, and readily generate redundant, unsafe, or outlier samples. To address this problem, we propose a novel oversampling method, the Improved and Random Synthetic Minority Oversampling Technique (IR-SMOTE). The core idea of IR-SMOTE is three-fold: (1) after k-means clustering of the minority class, noise samples in each cluster are removed by applying an ascending sort to the majority class samples; (2) the number of synthetic samples is adaptively assigned to each minority-class cluster by means of kernel density estimation; and (3) based on the attributes of the temporary synthetic samples obtained with random-SMOTE, a new synthesizing method is developed to generate new samples with guaranteed diversity. Comparison experiments on 18 well-known data sets illustrate the effectiveness and general applicability of the proposed IR-SMOTE method for imbalanced data classification.
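For orientation, the sketch below illustrates the random-SMOTE interpolation that step (3) builds on, as that technique is commonly described: a temporary sample is drawn on the segment between two randomly chosen minority samples, and the new sample is then drawn on the segment between the seed sample and that temporary sample. It is not the IR-SMOTE synthesizing method, which the abstract does not specify. The function name `random_smote`, its parameters, and the toy data are assumptions made for illustration only.

```python
import random


def random_smote(minority, n_new, seed=None):
    """Random-SMOTE style sketch (hypothetical helper, not the paper's code).

    minority: list of feature vectors (lists of floats), at least 3 samples.
    n_new:    number of synthetic samples to generate per minority sample.
    """
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # Candidates are all minority samples except the seed sample x.
        others = minority[:i] + minority[i + 1:]
        for _ in range(n_new):
            # Pick two distinct minority samples at random.
            y1, y2 = rng.sample(others, 2)
            # Temporary sample: random point on the segment y1 -> y2.
            a = rng.random()
            temp = [u + a * (v - u) for u, v in zip(y1, y2)]
            # Synthetic sample: random point on the segment x -> temp.
            b = rng.random()
            synthetic.append([xi + b * (ti - xi) for xi, ti in zip(x, temp)])
    return synthetic


# Toy usage: three minority points forming a triangle in 2-D.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = random_smote(minority, n_new=2, seed=0)
```

Because every synthetic point is a convex combination of three minority samples, it falls inside the triangle they span rather than only on a line between two neighbors, which is the source of the added diversity relative to plain SMOTE.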

Keywords: Imbalanced data, Improved and random SMOTE, K-means algorithm, Synthesis strategy, Kernel density estimation

Review history: Received 7 September 2021, Revised 15 March 2022, Accepted 14 April 2022, Available online 26 April 2022, Version of Record 4 May 2022.

Paper URL: https://doi.org/10.1016/j.knosys.2022.108839