PNRank: Unsupervised ranking of person name entities from noisy OCR text

Highlights:

• An automated process for ranking person name entities from noisy OCR data.

• The process involves developing a people gazetteer, generating theory-driven linguistic features, and applying a novel unsupervised ranking model based on Kernel Density Estimation (KDE).

Abstract

Text databases have grown tremendously in number and size over the last few decades. Optical Character Recognition (OCR) software is used to scan text and make it available in online repositories. The OCR transcription process is often inaccurate, resulting in large volumes of garbled text in the repositories. Spell correction and other post-processing of OCR text often prove to be very expensive and time-consuming. While it is possible to rely on the OCR model to assess the quality of text in a corpus, many natural language processing and information retrieval tasks prefer an extrinsic evaluation of the effect of noise on the task at hand. This paper examines the effect of noise on the unsupervised ranking of person name entities by first populating a list of person names using out-of-the-box Named Entity Recognition (NER) software, then extracting content-based features for the identified entities, and finally ranking them using a novel unsupervised ranking algorithm based on Kernel Density Estimation (KDE). This generative model learns rankings from the data distribution and therefore requires limited manual intervention. Empirical results are presented on a carefully curated parallel corpus of OCR and clean text, and “in the wild” on a large real-world corpus. Experiments on the parallel corpus reveal that, even with a reasonable degree of noise in the dataset, it is possible to generate ranked lists using the KDE algorithm with a high degree of precision and recall. Furthermore, since the KDE algorithm performs comparably to state-of-the-art unsupervised rankers, using it on real-world corpora is feasible. The paper concludes by reflecting on other methods for enhancing the performance of the unsupervised algorithm on OCR text, such as cleaning entity names, disambiguating names concatenated to one another, and correcting OCR errors that are statistically significant in the corpus.
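To make the idea of KDE-based unsupervised ranking concrete, the following is a minimal illustrative sketch and not the PNRank algorithm itself, whose details are not given in this section. It assumes each extracted name carries a few numeric features where larger values indicate greater salience; the feature choice (mention count, number of articles), the toy names, the rule-of-thumb bandwidth, and the scoring rule (summing the KDE-smoothed cumulative probability of each feature) are all assumptions made for illustration.

```python
import math
from statistics import stdev


def rule_of_thumb_bandwidth(values):
    """Normal-reference bandwidth for a 1-D Gaussian KDE (assumed choice)."""
    n = len(values)
    sd = stdev(values) if n > 1 else 0.0
    # Fall back to 1.0 when the feature has no spread, to avoid dividing by zero.
    return 1.06 * sd * n ** (-1 / 5) if sd > 0 else 1.0


def kde_cdf(x, values, h):
    """Smoothed CDF at x: the mean of Gaussian CDFs centred on each sample."""
    return sum(
        0.5 * (1.0 + math.erf((x - v) / (h * math.sqrt(2.0)))) for v in values
    ) / len(values)


def rank_entities(features):
    """Rank names by the sum of per-feature KDE-smoothed percentiles.

    features: dict mapping a name to a list of numeric feature values,
    where a higher value is assumed to mean a more salient entity.
    """
    names = list(features)
    n_feats = len(next(iter(features.values())))
    scores = dict.fromkeys(names, 0.0)
    for j in range(n_feats):
        column = [features[name][j] for name in names]
        h = rule_of_thumb_bandwidth(column)
        for name in names:
            scores[name] += kde_cdf(features[name][j], column, h)
    return sorted(names, key=lambda name: scores[name], reverse=True)


# Hypothetical toy data: [mention count, number of articles] per extracted name.
toy = {
    "John Smith": [12, 5],
    "J0hn Srnith": [1, 1],  # an OCR-garbled variant of the same name
    "Mary Jones": [30, 9],
}
ranking = rank_entities(toy)
print(ranking)
```

Because the scores are derived from the feature distributions alone, no labelled rankings are needed. The garbled variant is scored as a separate, low-evidence entity, which illustrates why cleaning and disambiguating entity names could improve rankings on OCR text.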