Word Sense Disambiguation by Learning Decision Trees from Unlabeled Data

Authors: Seong-Bae Park, Byoung-Tak Zhang, Yung Taek Kim

Abstract

This paper describes a machine learning approach to word sense disambiguation that uses unlabeled data. The method is based on selective sampling with committees of decision trees. The committee members are trained on a small set of labeled examples, which is then augmented by a large number of unlabeled examples. Using unlabeled examples matters because labeled data is expensive and time-consuming to obtain, whereas large numbers of unlabeled examples are easy and inexpensive to collect. The idea behind the approach is that the labels of unlabeled examples can be estimated by the committees. Using additional unlabeled examples therefore improves the performance of word sense disambiguation and minimizes the cost of manual labeling. The effectiveness of the approach was examined on a raw corpus of one million words. Using unlabeled data, we achieved an accuracy improvement of up to 20.2%.
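The committee idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tiny ID3-style tree, the bootstrap resampling used to diversify committee members, the unanimous-vote acceptance rule, and the toy "bank" data with `money` and `river` context features are all assumptions made for illustration. The names `build_tree`, `predict`, and `label_with_committee` are hypothetical.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of sense labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def build_tree(rows, labels, depth=0, max_depth=3):
    """Tiny ID3-style tree over categorical features (assumed, not the paper's learner).
    A leaf is a label string; an internal node is (feature, children, majority)."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    best = None
    for feat in rows[0]:
        groups = {}
        for row, lab in zip(rows, labels):
            groups.setdefault(row.get(feat), []).append(lab)
        if len(groups) < 2:
            continue  # this feature does not split the data
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        if best is None or remainder < best[0]:
            best = (remainder, feat)
    if best is None:
        return majority
    feat = best[1]
    children = {}
    for value in {row.get(feat) for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r.get(feat) == value]
        children[value] = build_tree([r for r, _ in subset], [l for _, l in subset],
                                     depth + 1, max_depth)
    return (feat, children, majority)

def predict(tree, row):
    """Walk the tree; fall back to the node's majority label for unseen values."""
    while isinstance(tree, tuple):
        feat, children, majority = tree
        tree = children.get(row.get(feat), majority)
    return tree

def label_with_committee(labeled, unlabeled, n_members=5, seed=0):
    """Train each member on a bootstrap resample of the labeled set (assumption),
    then accept an estimated label only when every member agrees (assumption)."""
    rng = random.Random(seed)
    committee = []
    for _ in range(n_members):
        sample = [rng.choice(labeled) for _ in labeled]
        committee.append(build_tree([r for r, _ in sample], [l for _, l in sample]))
    accepted, uncertain = [], []
    for row in unlabeled:
        votes = Counter(predict(tree, row) for tree in committee)
        label, count = votes.most_common(1)[0]
        if count == n_members:
            accepted.append((row, label))   # committee's estimated label
        else:
            uncertain.append(row)           # left for manual labeling
    return accepted, uncertain

# Toy data (hypothetical): senses of "bank" with two binary context features.
LABELED = [
    ({"money": 1, "river": 0}, "finance"),
    ({"money": 1, "river": 0}, "finance"),
    ({"money": 0, "river": 1}, "shore"),
    ({"money": 0, "river": 1}, "shore"),
]
UNLABELED = [{"money": 1, "river": 0}, {"money": 0, "river": 1}, {"money": 1, "river": 1}]

accepted, uncertain = label_with_committee(LABELED, UNLABELED)
print(len(accepted), "auto-labeled;", len(uncertain), "left for manual labeling")
```

In this sketch, examples on which the committee agrees would be added to the training set with their estimated labels, while the remaining ones are the candidates for manual labeling. How the paper itself measures disagreement and selects examples is not stated in the abstract.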

Keywords: word sense disambiguation, learning from unlabeled examples, selective sampling, committee learning, decision tree

Paper URL: https://doi.org/10.1023/A:1023812606045