Term extraction from sparse, ungrammatical domain-specific documents

作者:

Highlights:

摘要

Existing term extraction systems have predominantly targeted large and well-written document collections, which provide reliable statistical and linguistic evidence to support term extraction. In this article, we address the term extraction challenges posed by sparse, ungrammatical texts with domain-specific contents, such as customer complaint emails and engineers’ repair notes. To this aim, we present ExtTerm, a novel term extraction system. Specifically, as our core innovations, we accurately detect rare (low frequency) terms, overcoming the issue of data sparsity. These rare terms may denote critical events, but they are often missed by extant TE systems. ExtTerm also precisely detects multi-word terms of arbitrarily lengths, e.g. with more than 2 words. This is achieved by exploiting fundamental theoretical notions underlying term formation, and by developing a technique to compute the collocation strength between any number of words. Thus, we address the limitation of existing TE systems, which are primarily designed to identify terms with 2 words. Furthermore, we show that open-domain (general) resources, such as Wikipedia, can be exploited to support domain-specific term extraction. Thus, they can be used to compensate for the unavailability of domain-specific knowledge resources. Our experimental evaluations reveal that ExtTerm outperforms a state-of-the-art baseline in extracting terms from a domain-specific, sparse and ungrammatical real-life text collection.

论文关键词:Term extraction,Natural language processing,Text mining,Business intelligence,Product development-customer service

论文评审过程:Available online 22 November 2012.

论文官网地址:https://doi.org/10.1016/j.eswa.2012.10.067