A distance measure for automatic document classification by sequential analysis

Authors:

Abstract

This research investigated the feasibility of using a distance measure, called the Bayesian distance, for automatic sequential document classification. Observing how this distance varies as keywords are extracted sequentially from a document makes it possible to detect the occurrence of noisy keywords. This property is used to design a sequential classification algorithm that works in two phases. In the first phase, the keywords extracted from a document are partitioned into two groups: the good keyword group and the noisy keyword group. In the second phase, the two groups are analyzed separately to assign primary and secondary classes to the document. The algorithm has been applied to several document databases, with very encouraging results.
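
The two-phase idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: the Bayesian distance is taken to be the sum of squared class posteriors, posteriors are updated with a naive-Bayes step, a keyword is flagged as noisy when accepting it would lower the distance, and the primary and secondary classes are read off the good and noisy groups respectively. The names `partition_keywords`, `two_phase_classify`, `priors`, and `likelihood` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Optional, Tuple

Posterior = Dict[str, float]
Likelihood = Dict[str, Dict[str, float]]  # keyword -> class -> P(keyword | class)


def update(post: Posterior, kw_probs: Dict[str, float]) -> Posterior:
    """One naive-Bayes step (assumed): multiply by P(keyword | class), renormalize."""
    unnorm = {c: p * kw_probs.get(c, 1e-6) for c, p in post.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


def bayesian_distance(post: Posterior) -> float:
    """Assumed definition: sum of squared posteriors (1/K when uniform, 1 when certain)."""
    return sum(p * p for p in post.values())


def partition_keywords(
    keywords: List[str], priors: Posterior, likelihood: Likelihood
) -> Tuple[List[str], List[str]]:
    """Phase 1 (assumed rule): a keyword that would lower the distance is noisy."""
    post = dict(priors)
    dist = bayesian_distance(post)
    good: List[str] = []
    noisy: List[str] = []
    for kw in keywords:
        trial = update(post, likelihood.get(kw, {}))
        trial_dist = bayesian_distance(trial)
        if trial_dist >= dist:
            good.append(kw)
            post, dist = trial, trial_dist  # accept the keyword
        else:
            noisy.append(kw)  # reject it; keep the previous posterior
    return good, noisy


def most_likely_class(
    keywords: List[str], priors: Posterior, likelihood: Likelihood
) -> Optional[str]:
    """Return the class with the highest posterior, or None if there are no keywords."""
    if not keywords:
        return None
    post = dict(priors)
    for kw in keywords:
        post = update(post, likelihood.get(kw, {}))
    return max(post, key=lambda c: post[c])


def two_phase_classify(
    keywords: List[str], priors: Posterior, likelihood: Likelihood
) -> Tuple[Optional[str], Optional[str]]:
    """Phase 2 (assumed): primary class from good keywords, secondary from noisy ones."""
    good, noisy = partition_keywords(keywords, priors, likelihood)
    primary = most_likely_class(good, priors, likelihood)
    secondary = most_likely_class(noisy, priors, likelihood)
    return primary, secondary
```

With made-up probabilities for two classes, a call such as `two_phase_classify(["quantum", "energy", "cell"], priors, likelihood)` would place a keyword that pulls the posterior back toward uniform into the noisy group. The acceptance rule and the mapping from groups to classes are the parts most likely to differ from the paper's actual procedure.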

Keywords:

Review history: Received 5 April 1977, Revised 3 November 1977, Available online 13 July 2002.

Official URL: https://doi.org/10.1016/0306-4573(78)90063-8