GeSe: Generalized static embedding

Authors: Ning Gong, Nianmin Yao

Abstract

In natural language processing, most text representation methods fall into two paradigms: static and dynamic. Each has distinctive advantages, reflected in the cost of training resources, the scale of input data, and the interpretability of the representation model. Dynamic representation methods, such as BERT, have achieved excellent results on many tasks at the cost of expensive pre-training. However, this representation paradigm is a black box, and its intrinsic properties cannot be measured by standard word similarity and analogy benchmarks. Most importantly, adequate computational resources and unlimited data are not available in all cases. Static methods are solid alternatives in these scenarios because they can be trained efficiently with limited resources while retaining straightforward interpretability and verifiable intrinsic properties. Although many static embedding methods have been proposed, few attempts have been made to investigate the connections between these algorithms. It is therefore natural to ask which implementation is more efficient, and whether there is a way to combine the merits of these algorithms into a generalized framework. In this paper, we explore answers to these questions by focusing on two popular static embedding models, Continuous Bag-of-Words (CBOW) and Skip-gram (SG), with a detailed analysis of their merits and drawbacks under both Negative Sampling (NS) and Hierarchical Softmax (HS) settings. We then propose a novel learning framework that trains generalized static embeddings in a unified architecture. The proposed method is estimator-agnostic, so it can be optimized with NS, HS, or any other equivalent estimator. Experiments show that embeddings learned with the proposed framework outperform strong baselines on standard intrinsic evaluations. We also test the proposed method on three extrinsic tasks; empirical results show considerable improvements across all of them.
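For context, the two model architectures (CBOW/SG) and the two estimators (NS/HS) discussed in the abstract are independent switches in standard word2vec implementations. The sketch below uses gensim's Word2Vec purely for illustration; gensim is not mentioned in the paper, and this is a minimal example of the baseline setting the authors analyze, not their proposed framework:

    from gensim.models import Word2Vec

    # Toy corpus: a list of tokenized sentences (real training needs far more data).
    sentences = [
        ["static", "embeddings", "are", "cheap", "to", "train"],
        ["skip", "gram", "predicts", "context", "from", "a", "target", "word"],
        ["cbow", "predicts", "a", "target", "word", "from", "its", "context"],
    ]

    # Skip-gram (sg=1) optimized with Negative Sampling (hs=0, negative=5).
    sg_ns = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                     sg=1, hs=0, negative=5)

    # CBOW (sg=0) optimized with Hierarchical Softmax (hs=1, negative=0).
    cbow_hs = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                       sg=0, hs=1, negative=0)

    # Intrinsic properties of static embeddings can be probed directly,
    # e.g. via nearest neighbors in the learned vector space.
    print(sg_ns.wv.most_similar("word", topn=3))

Because the model choice and the estimator choice are orthogonal, four baseline configurations exist (CBOW/SG x NS/HS); an estimator-agnostic framework, as described in the abstract, is one that can be trained under any of these estimators without changing the architecture.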

Keywords: Language representation, Text embeddings

Paper URL: https://doi.org/10.1007/s10489-021-03001-1