Binary PSO with mutation operator for feature selection using decision tree applied to spam detection

Authors:

Abstract

In this paper, we propose a novel spam detection method that focuses on reducing the false positive error, i.e., mislabeling nonspam as spam. First, we use a wrapper-based feature selection method to extract crucial features. Second, a decision tree is chosen as the classifier model, with C4.5 as the training algorithm. Third, a cost matrix is introduced to give different weights to the two error types, i.e., the false positive and the false negative errors. We define a weight parameter α to adjust the relative importance of the two error types. Fourth, K-fold cross validation is employed to reduce out-of-sample error. Finally, binary PSO with a mutation operator (MBPSO) is used as the subset search strategy. Our experimental dataset contains 6000 emails, which were collected during 2012. We conducted a Kolmogorov–Smirnov hypothesis test on the capital-run-length related features and found that all p values were less than 0.001. We then found that α = 7 was the most appropriate value in our model. Among seven meta-heuristic algorithms, we demonstrated that MBPSO is superior to GA, RSA, PSO, and BPSO in terms of classification performance. The sensitivity, specificity, and accuracy of the decision tree with feature selection by MBPSO were 91.02%, 97.51%, and 94.27%, respectively. We also compared MBPSO with conventional feature selection methods such as SFS and SBS; the results showed that MBPSO performs better than both. We also demonstrated that wrappers are more effective than filters with regard to classification performance indexes. The results show that the proposed method is effective and that it can reduce the false positive error without compromising sensitivity or accuracy.
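The pipeline described in the abstract (a cost-weighted error with parameter α, searched over feature subsets by a binary PSO with mutation) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation. The abstract does not give the update equations, so the sigmoid transfer function, the velocity clamp, the bit-flip mutation rate `p_mut`, and all function and parameter names here are assumptions. `toy_fitness` is a hypothetical stand-in: in a real wrapper it would train a C4.5-style decision tree on the selected features under K-fold cross validation and return the weighted cost.

```python
import math
import random

def confusion(y_true, y_pred):
    """Return (tp, fp, tn, fn), with 1 = spam (positive) and 0 = nonspam."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def weighted_cost(fp, fn, alpha=7.0):
    """Assumed cost-matrix loss: a false positive costs alpha, a false negative costs 1."""
    return alpha * fp + fn

# Hypothetical stand-in for the wrapper fitness (lower is better):
# Hamming distance between the candidate mask and a fixed "ideal" mask.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]

def toy_fitness(mask):
    return sum(1 for m, t in zip(mask, TARGET) if m != t)

def mbpso(fitness, n_features, n_particles=20, n_iters=50,
          w=0.7, c1=1.5, c2=1.5, p_mut=0.02, seed=0):
    """Binary PSO with a bit-flip mutation step; returns (best_mask, best_fitness)."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                # Standard PSO velocity update, clamped to keep the sigmoid from saturating.
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                v = max(-6.0, min(6.0, v))
                vel[i][d] = v
                # Sigmoid transfer: the velocity sets the probability that the bit is 1.
                prob_one = 1.0 / (1.0 + math.exp(-v))
                pos[i][d] = 1 if rng.random() < prob_one else 0
                # Mutation: occasionally flip the bit to counter premature convergence.
                if rng.random() < p_mut:
                    pos[i][d] = 1 - pos[i][d]
            f = fitness(pos[i])
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f < gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

best_mask, best_fit = mbpso(toy_fitness, n_features=len(TARGET))
```

With α = 7, as reported in the abstract, one nonspam email flagged as spam weighs as much as seven missed spam emails in this assumed loss, which is what pushes the search toward feature subsets with high specificity.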

Keywords: Spam detection, Binary Particle Swarm Optimization, Mutation operator, Feature selection, Wrapper, Premature convergence, Decision tree, Cost matrix

Article history: Received 25 November 2013, Revised 17 March 2014, Accepted 22 March 2014, Available online 1 April 2014.

Paper URL: https://doi.org/10.1016/j.knosys.2014.03.015