Similarity-based second chance autoencoders for textual data

Authors: Saria Goudarzvand, Gharib Gharibi, Yugyung Lee

Abstract

Applying conventional autoencoders to textual data often results in learning trivial and redundant representations, owing to the high dimensionality and sparsity of text and its power-law word distribution. To address these challenges, we propose two novel autoencoders, SCAT (Second Chance Autoencoder for Text) and SSCAT (Similarity-based SCAT). Our autoencoders employ competitive learning among the k winner neurons in the bottleneck layer, which become specialized in recognizing specific patterns, leading to more semantically meaningful representations of textual data. In addition, SSCAT introduces a novel similarity-based competition that eliminates redundant features. Our experiments show that SCAT and SSCAT achieve strong performance on several tasks, including classification, topic modeling, and document visualization, compared to LDA, k-Sparse, KATE, ProdLDA, NVCTM, and ZeroShotTM. The experiments were conducted on the 20 Newsgroups, Wiki10+, and Reuters datasets.
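The abstract describes competitive learning among the k winner neurons of the bottleneck layer. The following is a minimal, hypothetical PyTorch sketch of that general k-winners-take-all mechanism only; the class name, layer sizes, value of k, and loss are illustrative assumptions, and the "second chance" step of SCAT and the similarity-based competition of SSCAT are not shown.

```python
import torch
import torch.nn as nn

class KWinnerBottleneckAE(nn.Module):
    """Toy autoencoder with a k-winners-take-all bottleneck: only the k most
    active bottleneck neurons per document pass their activations to the
    decoder; the rest are zeroed, so winning neurons specialize on patterns.
    (Illustrative sketch, not the authors' SCAT/SSCAT implementation.)"""

    def __init__(self, vocab_size: int, hidden_dim: int = 128, k: int = 20):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(vocab_size, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        h = torch.tanh(self.encoder(x))          # bottleneck activations
        # Competition: keep only the k largest activations per example.
        _, topk_idx = h.topk(self.k, dim=1)
        mask = torch.zeros_like(h).scatter_(1, topk_idx, 1.0)
        h_sparse = h * mask
        logits = self.decoder(h_sparse)          # reconstruct bag-of-words input
        return logits, h_sparse

# Usage on a toy bag-of-words batch (hypothetical sizes).
if __name__ == "__main__":
    model = KWinnerBottleneckAE(vocab_size=2000, hidden_dim=128, k=20)
    docs = torch.rand(8, 2000)                   # 8 fake documents in [0, 1]
    logits, codes = model(docs)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, docs)
    print(loss.item(), codes.shape)
```

In this sketch the masked activations act as the document's sparse code; the paper's contribution lies in how the winners are selected and specialized, which the abstract only summarizes.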

Keywords: Autoencoder, Topic modeling, Competitive learning, Representation learning


Paper URL: https://doi.org/10.1007/s10489-021-03100-z