Analysis of training data using clustering to improve semi-supervised self-training

Authors:

Highlights:

• The analysis of the sufficiency of labeled data for semi-supervised self-training is important but has not received much attention.

• We propose methods to analyze and improve the sufficiency of the labeled data.

• We apply cluster analysis to measure and improve the labeled dataset.

• Two methods, i.e., active labeling and co-labeling, are proposed to ensure a sufficient number of class labels for self-training.

• Extensive experiments show that the proposed methods significantly improve the accuracy of semi-supervised classification.
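
The idea of using cluster analysis to check whether the labeled set is sufficient can be illustrated with a small sketch. It is not the paper's algorithm: it assumes plain k-means with hand-picked initial centroids, treats a cluster as "covered" when it contains at least one labeled point, and suggests the member nearest to each uncovered centroid as a candidate for active labeling. The function names (`kmeans`, `uncovered_clusters`, `suggest_queries`) and the toy data are hypothetical.

```python
import math


def kmeans(points, init, iters=20):
    """Plain Lloyd's k-means from given initial centroids (illustrative only)."""
    centroids = [list(c) for c in init]
    k = len(centroids)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        assign = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def uncovered_clusters(assign, labeled_idx, k):
    """Return the clusters that contain no labeled point."""
    covered = {assign[i] for i in labeled_idx}
    return [c for c in range(k) if c not in covered]


def suggest_queries(points, centroids, assign, clusters):
    """For each given cluster, pick the member closest to its centroid."""
    queries = []
    for c in clusters:
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            queries.append(
                min(members, key=lambda i: math.dist(points[i], centroids[c]))
            )
    return queries


# Toy data: three well-separated groups of three points each.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]
# Assumption: only points 0 and 4 carry class labels.
labeled_idx = [0, 4]

centroids, assign = kmeans(points, init=[points[0], points[3], points[6]])
uncovered = uncovered_clusters(assign, labeled_idx, k=len(centroids))
queries = suggest_queries(points, centroids, assign, uncovered)

print("clusters without labels:", uncovered)  # [2]
print("points to label next:", queries)       # [6]
```

Here the third group has no labeled member, so the labeled set would be judged insufficient and point 6 would be proposed for labeling before self-training starts.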

Keywords: Semi-supervised classification, Self-training, Cluster analysis, Semi-supervised clustering, Active learning

Article history: Received 27 June 2017, Revised 2 December 2017, Accepted 7 December 2017, Available online 8 December 2017, Version of Record 3 February 2018.

Paper URL: https://doi.org/10.1016/j.knosys.2017.12.006