Learning robust word representation over a semantic manifold

Authors:

Highlights:

Abstract

The performance of the traditional Word2Vec model depends heavily on the quality and quantity of the training corpus, which diverges from how humans learn. To understand word meaning, humans prefer a two-stage learning process: reading a linguist-compiled dictionary and doing reading comprehension, and these two stages complement each other. Traditional Word2Vec is an analogue of reading comprehension, while the first stage, learning semantic rules from a language dictionary, such as knowledge of thesauri and etymology, is usually ignored by existing methods. In this work, we propose a robust word-embedding learning framework that imitates the two-stage human learning process. In particular, we construct a semantic manifold based on the thesaurus and etymology to approximate the first stage, and then regularize the second stage (the Word2Vec model) with this semantic manifold. We train the proposed model on three corpora (Wikipedia, enwik9, and text8). The experimental results demonstrate that the proposed method learns much smoother vector representations, and its performance remains robust even when the model is trained on a very simple corpus.
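The manifold regularization described above can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes the common formulation in which thesaurus links form a graph over the vocabulary and the regularizer is the graph-Laplacian penalty tr(EᵀLE), which pulls the embeddings of linked (synonymous) words toward each other. The toy vocabulary and synonym links are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; in the paper the graph would come from a thesaurus
# and etymology resources (assumption: simple synonym adjacency here).
vocab = ["happy", "glad", "joyful", "sad"]
dim = 8
E = rng.normal(size=(len(vocab), dim))  # word-embedding matrix

# Adjacency: 1 if two words are linked in the (assumed) thesaurus.
A = np.zeros((4, 4))
for i, j in [(0, 1), (0, 2), (1, 2)]:  # happy ~ glad ~ joyful
    A[i, j] = A[j, i] = 1.0

D = np.diag(A.sum(axis=1))
L = D - A  # unnormalized graph Laplacian

def manifold_penalty(E, L):
    # tr(E^T L E) = 0.5 * sum_ij A_ij * ||e_i - e_j||^2
    return float(np.trace(E.T @ L @ E))

# Gradient of the penalty w.r.t. E is 2 L E; in the full model this
# gradient would be added to the Word2Vec (skip-gram) update.
for _ in range(200):
    E -= 0.05 * (2 * L @ E)

# Linked words drift together; "sad" (an isolated node) is untouched.
print(manifold_penalty(E, L))
```

In the full framework this penalty would be one term in the training objective alongside the usual skip-gram loss, so corpus statistics and dictionary structure jointly shape the embeddings; here only the penalty is minimized to make its smoothing effect visible.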

Keywords: Distributed word embedding, Natural language processing, Manifold assumption

Publication history: Received 16 September 2018, Revised 3 December 2019, Accepted 6 December 2019, Available online 13 December 2019, Version of Record 24 February 2020.

DOI: https://doi.org/10.1016/j.knosys.2019.105358