Estimating term domain relevance through term frequency, disjoint corpora frequency - tf-dcf

Authors:

Highlights:

Abstract

This paper proposes a new relevance index for terms extracted from domain corpora. The index, called term frequency, disjoint corpora frequency (tf-dcf), is based on the absolute frequency of each term in the domain corpus, tempered by its frequency in other (contrasting) corpora. The conceptual differences and the mathematical computation of the proposed index are discussed with respect to similar approaches that also take contrasting corpora into account. To illustrate the effectiveness of the index, the paper evaluates tf-dcf against these approaches. Finally, further experiments analyze how tf-dcf behaves depending on the characteristics of the contrasting corpora.
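The abstract does not give the formula itself, so the following is only a minimal sketch of the idea it describes: a term's absolute frequency in the domain corpus is divided by a penalty that grows with its frequency in each contrasting corpus. The log-damped product used as the divisor, the natural logarithm, and the names `tf_dcf`, `domain_counts`, and `contrasting_counts` are all assumptions made for illustration, not details confirmed by this text.

```python
import math
from collections import Counter


def tf_dcf(term, domain_counts, contrasting_counts):
    """Illustrative tf-dcf-style score (assumed form, not taken from the abstract).

    domain_counts: Counter of term frequencies in the domain corpus.
    contrasting_counts: list of Counters, one per contrasting corpus.
    """
    tf = domain_counts.get(term, 0)
    divisor = 1.0
    for counts in contrasting_counts:
        # Assumed penalty: 1 + log(1 + frequency in the contrasting corpus).
        # A term absent from a contrasting corpus contributes a factor of 1.
        divisor *= 1.0 + math.log(1.0 + counts.get(term, 0))
    return tf / divisor


# Toy corpora, invented purely for this example.
domain = Counter("gene protein gene cell the the the".split())
contrast = [
    Counter("the market the stock".split()),
    Counter("the court the law cell".split()),
]

for term in ("gene", "cell", "the"):
    print(term, round(tf_dcf(term, domain, contrast), 3))
```

In this toy setup, a term that never occurs in the contrasting corpora keeps its raw frequency as its score, while a term that is frequent everywhere (such as "the") is pushed down even though its raw frequency in the domain corpus is the highest.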

Keywords: Term weighting, Information retrieval, Automatic term extraction, Natural language processing

Review history: Received 24 February 2015, Revised 19 December 2015, Accepted 23 December 2015, Available online 11 January 2016, Version of Record 20 February 2016.

Paper URL: https://doi.org/10.1016/j.knosys.2015.12.015