Self-taught dimensionality reduction on the high-dimensional small-sized data

摘要

To build an effective dimensionality reduction model usually requires sufficient data. Otherwise, traditional dimensionality reduction methods might be less effective. However, sufficient data cannot always be guaranteed in real applications. In this paper we focus on performing unsupervised dimensionality reduction on the high-dimensional and small-sized data, in which the dimensionality of target data is high and the number of target data is small. To handle the problem, we propose a novel Self-taught Dimensionality Reduction (STDR) approach, which is able to transfer external knowledge (or information) from freely available external (or auxiliary) data to the high-dimensional and small-sized target data. The proposed STDR consists of three steps: First, the bases are learnt from sufficient external data, which might come from the same “type” or “modality” of target data. The bases are the common part between external data and target data, i.e., the external knowledge (or information). Second, target data are reconstructed by the learnt bases by proposing a novel joint graph sparse coding model, which not only provides robust reconstruction ability but also preserves the local structures amongst target data in the original space. This process transfers the external knowledge (i.e., the learnt bases) to target data. Moreover, the proposed solver to the proposed model is theoretically guaranteed that the objective function of the proposed model converges to the global optimum. After this, target data are mapped into the learnt basis space, and are sparsely represented by the bases, i.e., represented by parts of the bases. Third, the sparse features (that is, the rows with zero (or small) values) of the new representations of target data are deleted for achieving the effectiveness and the efficiency. That is, this step performs feature selection on the new representations of target data. Finally, experimental results at various types of datasets show the proposed STDR outperforms the state-of-the-art algorithms in terms of k-means clustering performance.