String alignment for automated document versioning

作者:Wei Lee Woon, Kuok-Shoong Daniel Wong

摘要

The automated analysis of documents is an important task given the rapid increase in availability of digital texts. Automatic text processing systems often encode documents as vectors of term occurrence frequencies, a representation which facilitates the classification and clustering of documents. Historically, this approach derives from the related field of data mining, where database entries are commonly represented as points in a vector space. While this lineage has certainly contributed to the development of text processing, there are situations where document collections do not conform to this clustered structure, and where the vector representation may be unsuitable for text analysis. As a proof-of-concept, we had previously presented a framework where the optimal alignments of documents could be used for visualising the relationships within small sets of documents. In this paper we develop this approach further by using it to automatically generate the version histories of various document collections. For comparison, version histories generated using conventional methods of document representation are also produced. To facilitate this comparison, a simple procedure for evaluating the accuracy of the version histories thus generated is proposed.

论文关键词:String matching, Text processing, Data mining, Versioning, Information retrieval

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10115-008-0130-x