Incorporation of the age of a document into the retrieval process

Abstract

A full treatment of the significance of a document for an enquirer should include a joint description of the similarity between the document and the enquiry in a linguistic sense, and the age of the document at the time of the enquiry. The basic variables are identified in terms of a signal detection model. The age variable is related to the phenomenon of obsolescence, which is treated as a perceived, signed attribute of relevant documents. Two retrieval methods that use both index terms and document age are described: one in which a set of documents, first selected by a term-intersection process, is reduced by applying a date-of-publication criterion (the “subset method”); and one in which a bivariate function attaches a single number to each document, and the retrieved set is defined by a single threshold value (the “bivariate weight method”). In the latter method, discriminant analysis is a useful aid. A model of the retrieval process, based on continuous variables, is described, and the effectiveness of each method is predicted, both in terms of the Precision-Recall graph and language measures. The model suggests that either method can improve retrieval performance, but that incorrect usage will depress it. The better choice of method will depend on the Recall/Precision mix required by the user, as well as on the actual parameters of the distributions. A relationship is hypothesised between the growth rate of a data base and the underlying distributions defined by relevance judgements.
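
The contrast between the two methods can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Document` record, the term-overlap `similarity` measure, the linear form of the bivariate function, and the weights `w_sim` and `w_age` (which, in the spirit of the abstract, might be estimated by discriminant analysis on relevance judgements).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical record: an identifier, index terms, and publication year."""
    doc_id: str
    terms: frozenset
    year: int


def subset_method(docs, query_terms, enquiry_year, max_age):
    """Term intersection first, then a date-of-publication cut-off."""
    query = frozenset(query_terms)
    # Step 1: keep documents indexed by every query term.
    matched = [d for d in docs if query <= d.terms]
    # Step 2: reduce that set with an age criterion (assumed: age <= max_age).
    return [d.doc_id for d in matched if enquiry_year - d.year <= max_age]


def similarity(doc, query_terms):
    """Assumed similarity: fraction of query terms present in the document."""
    query = frozenset(query_terms)
    if not query:
        return 0.0
    return len(query & doc.terms) / len(query)


def bivariate_weight_method(docs, query_terms, enquiry_year,
                            w_sim, w_age, threshold):
    """Attach one number to each document and apply a single threshold."""
    retrieved = []
    for d in docs:
        age = enquiry_year - d.year
        # Assumed linear bivariate function; a negative w_age penalises age.
        weight = w_sim * similarity(d, query_terms) + w_age * age
        if weight >= threshold:
            retrieved.append(d.doc_id)
    return retrieved


# Illustrative data only; not drawn from the paper.
DOCS = [
    Document("d1", frozenset({"retrieval", "obsolescence", "age"}), 1975),
    Document("d2", frozenset({"retrieval", "obsolescence"}), 1960),
    Document("d3", frozenset({"retrieval", "indexing"}), 1976),
]
QUERY = ["retrieval", "obsolescence"]

subset_result = subset_method(DOCS, QUERY, enquiry_year=1977, max_age=5)
weight_result = bivariate_weight_method(
    DOCS, QUERY, enquiry_year=1977, w_sim=1.0, w_age=-0.05, threshold=0.4
)
```

In this toy data the subset method retrieves only `d1`, while the bivariate weight method also retrieves `d3`, a recent document that matches only one of the two query terms. This is the kind of trade-off between the methods that the Recall/Precision mix would govern.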