Sample cutting method for imbalanced text sentiment classification based on BRC

Authors:

Highlights:

Abstract

The vast volume of subjective text spread across the Internet has increased the demand for text sentiment classification technology. A well-known factor that often weakens classifier performance is the imbalanced distribution of review texts between the positive and negative classes. In this paper, we address the sentiment classification problem for imbalanced text sets. To this end, we propose the algorithm BRC, which clarifies the disordered boundary by cutting majority-class samples in the dense boundary region. The classifier is built on a Support Vector Machine. To find the better feature weight scheme, combination strategy for sample cutting, and parameters in BRC, three groups of experiments are designed on six text sets covering five domains. The experimental results show that the feature weight scheme Presence performs best, and that the combination strategy BRC + RS gives a tradeoff between the evaluation measures Precision and Recall on the two categories and yields a larger increase in the overall evaluation measure Accuracy. It should be noted that the method of determining the parameters α and β in BRC is empirical. Although the boundary region cutting algorithm BRC is aimed at text sentiment classification, we believe that it is also suitable for any two-category classification problem with imbalanced sample data.
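The idea of cutting majority-class samples near the class boundary can be illustrated with a minimal Python sketch. It is not the paper's BRC algorithm: the abstract does not define the boundary region or the parameters α and β, so the closeness criterion (Jaccard similarity to the nearest minority sample), the single `threshold` parameter, and all function names below are assumptions made for illustration. "RS" is read here as random sampling, which is also an assumption, and the Support Vector Machine training step is omitted.

```python
import random


def presence_features(text):
    """Presence weighting: a term counts once, however often it occurs."""
    return frozenset(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cut_boundary_majority(majority, minority, threshold):
    """Drop majority samples that lie close to any minority sample.

    Hypothetical boundary criterion: a majority sample is treated as lying
    in the boundary region if its similarity to the nearest minority
    sample is at least `threshold`. This is NOT the paper's alpha/beta rule.
    """
    kept = []
    for sample in majority:
        nearest = max((jaccard(sample, m) for m in minority), default=0.0)
        if nearest < threshold:
            kept.append(sample)
    return kept


def random_undersample(samples, target_size, seed=0):
    """Randomly reduce the remaining majority samples to `target_size`."""
    if len(samples) <= target_size:
        return list(samples)
    return random.Random(seed).sample(samples, target_size)


# Toy imbalanced set: five positive (majority) and two negative (minority) reviews.
positive = [presence_features(t) for t in [
    "great phone great battery",
    "great screen",
    "love the camera",
    "battery is fine but screen is bad",
    "fast delivery",
]]
negative = [presence_features(t) for t in [
    "battery is bad and screen is bad",
    "terrible support",
]]

# Step 1: cut majority samples near the boundary; step 2: random undersampling.
cut = cut_boundary_majority(positive, negative, threshold=0.5)
balanced_positive = random_undersample(cut, target_size=len(negative))
```

In this toy data, the positive review "battery is fine but screen is bad" shares four of seven distinct terms with a negative review, so it is removed in the cutting step. The random undersampling step then reduces the remaining positive samples to the size of the negative class before a classifier would be trained.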

Keywords: Imbalanced text set, Text sentiment classification, Sample cutting algorithm, Boundary region, Feature weight

Review history: Received 17 October 2011, Revised 25 July 2012, Accepted 11 September 2012, Available online 3 October 2012.

Official paper URL: https://doi.org/10.1016/j.knosys.2012.09.003