A language model based on semantically clustered words in a Chinese character recognition system

Authors:

Abstract

This paper presents a new method for clustering the words in a dictionary into word groups. A Chinese character recognition system can then use these groups in a language model to improve recognition accuracy, while the number of parameters that must be trained beforehand stays at a reasonable value. The Chinese synonym dictionary Tong2yi4ci2 ci2lin2 supplies the semantic features used to calculate the weights of the semantic attributes of the character-based word classes. These weights are then updated according to the words of the Behavior dictionary, which has a rather complete word set. Next, the word classes are clustered into m groups by a greedy method based on a semantic measure. Finally, the words in the Behavior dictionary are assigned to the m groups. The parameter space for the bigram contextual information of the character recognition system is therefore m^2. In the experiments, the recognition system using the proposed model performed better than one using a character-based bigram language model.
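The clustering step can be illustrated with a small sketch. The abstract does not specify the semantic measure or the exact greedy procedure, so everything below is an assumption made for illustration: each word class is represented by a dictionary of semantic-attribute weights, similarity is taken to be cosine similarity, and the greedy rule is to merge the two most similar groups until m remain. The names `cosine`, `merge`, `greedy_cluster`, `bigram_table_size`, and the toy data are hypothetical and do not come from the paper.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse attribute-weight vectors (assumed measure)."""
    dot = sum(w * v.get(attr, 0.0) for attr, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge(u, v):
    """Sum two attribute-weight vectors to represent a merged group."""
    out = dict(u)
    for attr, w in v.items():
        out[attr] = out.get(attr, 0.0) + w
    return out


def greedy_cluster(class_weights, m):
    """Greedily merge the most similar pair of groups until only m groups remain."""
    groups = [([name], dict(vec)) for name, vec in class_weights.items()]
    while len(groups) > m:
        # Pick the pair of groups with the highest similarity.
        i, j = max(
            combinations(range(len(groups)), 2),
            key=lambda p: cosine(groups[p[0]][1], groups[p[1]][1]),
        )
        merged = (groups[i][0] + groups[j][0], merge(groups[i][1], groups[j][1]))
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
    return [sorted(names) for names, _ in groups]


def bigram_table_size(m):
    """Number of group-to-group bigram parameters for m groups."""
    return m * m


# Toy word classes with made-up semantic-attribute weights.
toy_classes = {
    "A": {"animal": 1.0},
    "B": {"animal": 0.9, "plant": 0.1},
    "C": {"action": 1.0},
    "D": {"action": 0.8, "animal": 0.2},
}
groups = greedy_cluster(toy_classes, 2)
```

With group-level bigrams, the table size depends only on m rather than on the number of distinct characters or words, which is how the parameter count is kept small.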

Keywords: Contextual postprocessing, Language model, Semantics, Word group

Article history: Received 16 July 1996; available online 7 June 2001.

Official URL: https://doi.org/10.1016/S0031-3203(96)00154-9