The bootstrapping of the Yarowsky algorithm in real corpora

Authors:

Highlights:

Abstract

The Yarowsky bootstrapping algorithm addresses word sense disambiguation (WSD) at the homograph level, the sense granularity required for real natural language processing (NLP) applications. It also avoids the knowledge acquisition bottleneck that affects most WSD algorithms, and it can be easily applied to foreign-language corpora. However, this paper shows that the Yarowsky algorithm is significantly less accurate when applied to real, domain-fluctuating corpora. The paper also introduces a new bootstrapping methodology that performs much better on these corpora. Even so, the new methodology does not reach the accuracy achieved on non-domain-fluctuating corpora, because of ambiguities inherent to domain fluctuation.

Keywords: Word sense disambiguation, Polysemy, Homograph, Knowledge acquisition bottleneck, Domain fluctuating corpora, Bootstrapping, Semi-supervised learning

Article history: Received 14 January 2008, Revised 29 May 2008, Accepted 16 July 2008, Available online 30 August 2008.

Official URL: https://doi.org/10.1016/j.ipm.2008.07.002