An unsupervised approach for learning a Chinese IS-A taxonomy from an unstructured corpus

Abstract

Taxonomies play an important role in various Natural Language Processing (NLP) tasks (e.g., text classification, information extraction and knowledge inference). However, because Chinese is complex and flexible, it is challenging to learn an accurate Chinese IS-A (C-IS-A) taxonomy from an unstructured corpus. In this paper, we propose an unsupervised approach that learns a C-IS-A taxonomy from a given unstructured corpus in three main steps. First, it extracts high-quality C-IS-A seed relations via semantic iterative pattern-based matching and syntactic methods. Second, it uses an unsupervised taxonomic semantic-clique-based method to increase the coverage of the taxonomy. In this step, the core component of our approach, we use the extracted seed relations to construct taxonomic semantic cliques, and then use the context of the cliques and multi-concept co-occurrence information to infer potential novel C-IS-A relations. Last, a two-step relation detection strategy removes potentially incorrect C-IS-A relations, which can substantially improve the accuracy of the learned taxonomy. We implement our approach on four Chinese unstructured corpora and evaluate it in terms of precision, coverage, time cost and the effects of its subcomponents. The evaluation results show that our approach is effective and outperforms the state-of-the-art approaches it is compared against.
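The three steps can be illustrated with a toy Python sketch. It is not the paper's method: the single surface pattern, the function names, the `min_support` threshold and the pre-segmented concept groups are all assumptions made for illustration, and the voting rule over co-occurring concepts is a much simpler stand-in for the taxonomic semantic cliques described above.

```python
import re
from collections import defaultdict

# Assumed surface pattern "X是一种Y" ("X is a kind of Y"); the abstract does
# not list the patterns actually used, so this one is purely illustrative.
PATTERNS = [
    re.compile(r"(?P<hypo>[\u4e00-\u9fff]+?)是一种(?P<hyper>[\u4e00-\u9fff]+)"),
]


def extract_seeds(sentences):
    """Step 1 (simplified): collect (hyponym, hypernym) seed pairs by pattern matching."""
    seeds = set()
    for sentence in sentences:
        for pattern in PATTERNS:
            for match in pattern.finditer(sentence):
                seeds.add((match.group("hypo"), match.group("hyper")))
    return seeds


def infer_relations(seeds, groups, min_support=2):
    """Step 2 (simplified): if at least `min_support` concepts in a co-occurring
    group share a known hypernym, propose that hypernym for the rest of the group."""
    hypernyms = defaultdict(set)
    for hypo, hyper in seeds:
        hypernyms[hypo].add(hyper)

    inferred = set()
    for group in groups:
        members = set(group)
        votes = defaultdict(int)
        for concept in members:
            for hyper in hypernyms.get(concept, ()):
                votes[hyper] += 1
        for hyper, count in votes.items():
            if count >= min_support:
                for concept in members:
                    if concept != hyper and (concept, hyper) not in seeds:
                        inferred.add((concept, hyper))
    return inferred


def filter_relations(relations):
    """Step 3 (simplified): drop self-loops and pairs asserted in both directions."""
    return {(a, b) for a, b in relations if a != b and (b, a) not in relations}


# Toy usage: the concept groups are assumed to come from an upstream word segmenter.
seeds = extract_seeds(["苹果是一种水果。", "香蕉是一种水果。"])
candidates = infer_relations(seeds, [["苹果", "香蕉", "芒果"]])
taxonomy = filter_relations(seeds | candidates)
```

In this toy run, two concepts in the group already have the same hypernym in the seed set, so that hypernym is proposed for the third concept. A real system would need word segmentation, many more patterns and a stronger error check than the symmetric-pair filter used here.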

Keywords: Chinese taxonomy learning, Taxonomic semantic clique, Relation inference, Error detection

Review history: Received 21 September 2018, Revised 17 July 2019, Accepted 19 July 2019, Available online 29 July 2019, Version of Record 9 September 2019.

Paper URL: https://doi.org/10.1016/j.knosys.2019.07.032