Employing unlabeled data to improve the classification performance of SVM, and its application in audio event classification

Abstract

In many classification tasks, labeled samples are difficult to acquire, whereas unlabeled samples are easy to obtain. Active learning (AL) can be used to address this labeling problem. Among the many AL algorithms, those that label the unlabeled samples lying within the margin band of an SVM are an effective way to reduce the manual labeling workload. However, AL requires human involvement, and the time and effort that annotators can provide are often limited, which severely restricts how many samples can be labeled with AL. This work therefore studies what can be done after the AL process ends. For an AL algorithm that explores the unlabeled samples within the margin band of an SVM, we investigate whether, once it stops, such samples can continue to be explored by semi-supervised learning (SSL). One of the challenges in designing such an SSL algorithm is how to estimate the confidence of unlabeled samples and then select those with high confidence. We propose three criteria for determining confidence: 1) the smoothness assumption; 2) the explored positive and negative samples should be as similar as possible to the labeled positive and negative samples, respectively; 3) the explored positive and negative samples should be as different as possible from the labeled negative and positive samples, respectively. Based on these three criteria, we propose an SSL algorithm, SSL_3C. We apply SSL_3C to audio event classification and conduct experiments on two public datasets. Experimental results demonstrate that SSL_3C effectively improves classification performance after the AL process. The selected unlabeled samples are not only of high confidence but also highly informative.
Moreover, SSL_3C is not sensitive to the sizes of the labeled and unlabeled training sets. The contributions of this work are twofold: first, we propose an effective SSL algorithm for exploring the unlabeled samples within the margin band of an SVM; second, we propose three novel criteria for determining the confidence of unlabeled samples, under which the explored samples are not only of high confidence but also highly informative. Because the labeling problem exists in many classification fields and SSL_3C can effectively reduce the manual labeling workload, SSL_3C should find applications in many other fields.
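The selection idea described above can be illustrated with a minimal sketch. This is not the SSL_3C algorithm itself, whose exact formulation is not given in this abstract. It assumes that the margin band is the region where the SVM decision value satisfies |f(x)| < 1, that similarity is measured with an RBF kernel, that the smoothness assumption can be approximated by requiring the nearest labeled sample to share the predicted label, and that confidence is the mean similarity to same-class labeled samples minus the mean similarity to opposite-class labeled samples. All function names, the `gamma` parameter, and the `threshold` value are hypothetical.

```python
import math

def rbf(a, b, gamma=1.0):
    """RBF similarity between two feature vectors (assumed similarity measure)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_sim(x, group, gamma=1.0):
    """Average similarity between x and every sample in a labeled group."""
    return sum(rbf(x, g, gamma) for g in group) / len(group)

def in_margin_band(decision_value):
    """True if the sample lies strictly inside the band |f(x)| < 1."""
    return abs(decision_value) < 1.0

def nearest_label(x, pos, neg, gamma=1.0):
    """Label (+1 or -1) of the most similar labeled sample (smoothness proxy)."""
    best_pos = max(rbf(x, p, gamma) for p in pos)
    best_neg = max(rbf(x, n, gamma) for n in neg)
    return 1 if best_pos >= best_neg else -1

def confidence(x, pred, pos, neg, gamma=1.0):
    """Illustrative score: similarity to the predicted class (criterion 2)
    minus similarity to the opposite class (criterion 3)."""
    same, other = (pos, neg) if pred > 0 else (neg, pos)
    return mean_sim(x, same, gamma) - mean_sim(x, other, gamma)

def select_confident(unlabeled, decision_values, pos, neg,
                     threshold=0.3, gamma=1.0):
    """Return (sample, pseudo_label, score) for confident margin-band samples."""
    selected = []
    for x, f in zip(unlabeled, decision_values):
        if not in_margin_band(f):
            continue  # only samples inside the margin band are explored
        pred = 1 if f >= 0 else -1
        if nearest_label(x, pos, neg, gamma) != pred:
            continue  # smoothness proxy: nearest labeled sample must agree
        score = confidence(x, pred, pos, neg, gamma)
        if score >= threshold:
            selected.append((x, pred, score))
    return selected
```

In a full pipeline, the decision values would come from an SVM trained on the labeled set after AL stops, and the selected samples with their pseudo-labels would be added to the training set before retraining.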