Similarity-based second chance autoencoders for textual data

Authors: Saria Goudarzvand, Gharib Gharibi, Yugyung Lee

Abstract

Applying conventional autoencoders to textual data often results in learning trivial and redundant representations, owing to the high dimensionality and sparsity of text and its power-law word distribution. To address these challenges, we propose two novel autoencoders, SCAT (Second Chance Autoencoder for Text) and SSCAT (Similarity-based SCAT). Our autoencoders employ competitive learning among the k winner neurons in the bottleneck layer, which become specialized in recognizing specific patterns, leading to more semantically meaningful representations of textual data. In addition, SSCAT introduces a novel similarity-based competition that eliminates redundant features. Our experiments show that SCAT and SSCAT achieve strong performance on several tasks, including classification, topic modeling, and document visualization, compared to LDA, k-Sparse, KATE, ProdLDA, NVCTM, and ZeroShotTM. The experiments were conducted on the 20 Newsgroups, Wiki10+, and Reuters datasets.
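The abstract describes competitive learning among the k winner neurons of the bottleneck layer. The following is a minimal, hypothetical PyTorch sketch of that general k-winners-take-all mechanism only; the class name, layer sizes, value of k, and loss are illustrative assumptions, and the "second chance" step of SCAT and the similarity-based competition of SSCAT are not shown.

```python
import torch
import torch.nn as nn

class KWinnerBottleneckAE(nn.Module):
    """Toy autoencoder with a k-winners-take-all bottleneck: only the k most
    active bottleneck neurons per document pass their activations to the
    decoder; the rest are zeroed, so winning neurons specialize on patterns.
    (Illustrative sketch, not the authors' SCAT/SSCAT implementation.)"""

    def __init__(self, vocab_size: int, hidden_dim: int = 128, k: int = 20):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(vocab_size, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        h = torch.tanh(self.encoder(x))          # bottleneck activations
        # Competition: keep only the k largest activations per example.
        _, topk_idx = h.topk(self.k, dim=1)
        mask = torch.zeros_like(h).scatter_(1, topk_idx, 1.0)
        h_sparse = h * mask
        logits = self.decoder(h_sparse)          # reconstruct bag-of-words input
        return logits, h_sparse

# Usage on a toy bag-of-words batch (hypothetical sizes).
if __name__ == "__main__":
    model = KWinnerBottleneckAE(vocab_size=2000, hidden_dim=128, k=20)
    docs = torch.rand(8, 2000)                   # 8 fake documents in [0, 1]
    logits, codes = model(docs)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, docs)
    print(loss.item(), codes.shape)
```

In this sketch the masked activations act as the document's sparse code; the paper's contribution lies in how the winners are selected and specialized, which the abstract only summarizes.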

Keywords: Autoencoder, Topic modeling, Competitive learning, Representation learning


Paper URL: https://doi.org/10.1007/s10489-021-03100-z