Hidden Markov Models for Text Categorization in Multi-Page Documents

作者：Paolo Frasconi, Giovanni Soda, Alessandro Vullo

摘要

In the traditional setting, text categorization is formulated as a concept learning problem where each instance is a single isolated document. However, this perspective is not appropriate in the case of many digital libraries that offer as contents scanned and optically read books or magazines. In this paper, we propose a more general formulation of text categorization, allowing documents to be organized as sequences of pages. We introduce a novel hybrid system specifically designed for multi-page text documents. The architecture relies on hidden Markov models whose emissions are bag-of-words resulting from a multinomial word event model, as in the generative portion of the Naive Bayes classifier. The rationale behind our proposal is that taking into account contextual information provided by the whole page sequence can help disambiguation and improves single page classification accuracy. Our results on two datasets of scanned journals from the Making of America collection confirm the importance of using whole page sequences. The empirical evaluation indicates that the error rate (as obtained by running the Naive Bayes classifier on isolated pages) can be significantly reduced if contextual information is incorporated.

论文关键词：text categorization, multi-page documents, hidden Markov models, Naive Bayes, digital libraries

论文评审过程：

论文官网地址：https://doi.org/10.1023/A:1013681528748