Duplicate document detection by template matching

Authors:

Abstract

We discuss operational issues in detecting duplicates in databases of bitmapped binary document images, and argue that duplicate detection that is both efficient and effective probably requires combining an efficient primary detector with an effective subordinate detector. We propose an algorithm that performs binary pattern template matching by cross-correlation as a duplicate document detection method. The template matching operation is amenable to pixel-parallel computation on serial architecture computers by means of bitwise integer operations. The description of the algorithm is accompanied by a discussion of issues that arise in its practical implementation. Duplicate detection by template matching is especially well suited to facsimile (fax) databases, in particular for detecting the single-feed, multiple-transmission duplicates that often dominate the occurrence of duplicates in such databases. Detailed experimental results for fax documents demonstrate that template matching is suitable both as a primary detector, when conducted with small template and search area sizes, and as a subordinate detector, when conducted with moderate template and search area sizes.
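To illustrate what pixel-parallel matching by bitwise integer operations can look like, the following is a minimal sketch, not the authors' implementation. It assumes each image row is packed into one integer and that the correlation score is the fraction of pixels at which template and image window agree (an XNOR followed by a bit count); the abstract does not specify the exact score, and the function names `pack_rows`, `match_score`, and `best_match` are hypothetical.

```python
def pack_rows(image):
    """Pack each row of 0/1 pixels into one integer (leftmost pixel = most significant bit)."""
    return [int("".join(str(p) for p in row), 2) for row in image]


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def match_score(t_rows, t_width, i_rows, i_width, dx, dy):
    """Count the pixels at which the template agrees with the image window at offset (dx, dy)."""
    mask = (1 << t_width) - 1
    shift = i_width - t_width - dx  # aligns the window with the low bits
    score = 0
    for r, t in enumerate(t_rows):
        window = (i_rows[dy + r] >> shift) & mask
        # XNOR compares all pixels of the row in one integer operation.
        score += popcount(~(t ^ window) & mask)
    return score


def best_match(template, image):
    """Exhaustively search all offsets; return (normalized score, dx, dy) of the best one."""
    t_h, t_w = len(template), len(template[0])
    i_h, i_w = len(image), len(image[0])
    t_rows, i_rows = pack_rows(template), pack_rows(image)
    best = (-1.0, 0, 0)
    for dy in range(i_h - t_h + 1):
        for dx in range(i_w - t_w + 1):
            s = match_score(t_rows, t_w, i_rows, i_w, dx, dy) / (t_h * t_w)
            if s > best[0]:
                best = (s, dx, dy)
    return best


# Toy data (assumed for illustration): a 2x3 template embedded in a 4x6 image.
template = [[1, 0, 1],
            [0, 1, 1]]
image = [[0, 0, 0, 0, 0, 0],
         [0, 0, 1, 0, 1, 0],
         [0, 0, 0, 1, 1, 0],
         [0, 0, 0, 0, 0, 0]]
result = best_match(template, image)
```

In this sketch the search area is the whole image. A smaller template and search area would lower the cost per comparison, which is the trade-off the abstract draws between the primary and subordinate detector roles; a threshold on the normalized score would then decide whether two documents count as duplicates.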

Keywords: Template matching, Correlation, Literal similarity, Binary pattern recognition, Duplicate document detection, Facsimile document analysis

Article history: Received 27 January 1999, Revised 2 December 1999, Accepted 2 December 1999, Available online 8 March 2000.

Official paper URL: https://doi.org/10.1016/S0262-8856(99)00086-4