An information-theoretic perspective of tf–idf measures

Authors:

Highlights:

Abstract

This paper presents a mathematical definition of the “probability-weighted amount of information” (PWI), a measure of specificity of terms in documents that is based on an information-theoretic view of retrieval events. The proposed PWI is expressed as a product of the occurrence probabilities of terms and their amounts of information, and corresponds well with the conventional term frequency–inverse document frequency measures that are commonly used in today’s information retrieval systems. The mathematical definition of the PWI is shown, together with some illustrative examples of the calculation.
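To make the "probability times amount of information" idea concrete, the following is a minimal sketch, not the paper's exact definition. It assumes the occurrence probability of a term in a document is estimated as its count divided by the total number of tokens in the collection, and that the amount of information is the idf-like quantity log2(N / df). The function name `pwi_sketch` and the toy documents are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def pwi_sketch(docs):
    """Return {(doc_index, term): weight} for a list of tokenized documents.

    Assumption (not taken from the paper): weight = p * info, where
      p    = count of term in the document / total tokens in the collection
      info = log2(number of documents / document frequency of the term)
    """
    n_docs = len(docs)
    total_tokens = sum(len(d) for d in docs)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))

    weights = {}
    for j, d in enumerate(docs):
        for term, count in Counter(d).items():
            p = count / total_tokens             # estimated occurrence probability
            info = math.log2(n_docs / df[term])  # amount of information, in bits
            weights[(j, term)] = p * info
    return weights


# Hypothetical toy collection.
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["banana", "apple", "date"],
]
w = pwi_sketch(docs)
```

Under these assumptions, a term that occurs in every document (here "banana") carries zero information and gets weight 0, while a term confined to one document (here "cherry") is weighted more heavily per occurrence, which mirrors the behaviour of conventional tf–idf.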

Keywords: tf–idf, Term weighting theories, Information theory, Text categorization

Article history: Received 4 August 2001, Accepted 4 January 2002, Available online 4 September 2002.

Official URL: https://doi.org/10.1016/S0306-4573(02)00021-3