An ensemble of cluster-based classifiers for semi-supervised classification of non-stationary data streams
作者:Mohammad Javad Hosseini, Ameneh Gholipour, Hamid Beigy
摘要
Recent advances in storage and processing have provided the possibility of automatic gathering of information, which in turn leads to fast and continuous flows of data. The data which are produced and stored in this way are called data streams. Data streams are produced in large size, and much dynamism and have some unique properties which make them applicable to model many real data mining applications. The main challenge of streaming data is the occurrence of concept drift. In addition, regarding the costs of labeling of instances, it is often assumed that only a small fraction of instances are labeled. In this paper, we propose an ensemble algorithm to classify instances of non-stationary data streams in a semi-supervised environment. Furthermore, this method is intended to recognize recurring concept drifts of data streams. In the proposed algorithm, a pool of classifiers is maintained by the algorithm with each classifier being representative of one single concept. At first, a batch of instances is classified by the algorithm. Thereafter, some of these instances are labeled and this partially labeled batch is used to update the classifiers in the pool. This process repeats for consecutive batches of the streams. The main advantage of the algorithm is that it uses unlabeled instances as well as labeled ones in the learning task. Experimental results show the effectiveness of the proposed algorithm over the state-of-the-art methods, in different aspects.
论文关键词:Semi-supervised learning, Ensemble learning, Data streams, Concept drift, Cluster assumption
论文评审过程:
论文官网地址:https://doi.org/10.1007/s10115-015-0837-4