Multilingual novelty detection

Authors:

Abstract

Novelty detection aims to remove redundant information from a chronologically ordered list of documents or sentences. Previous studies of novelty detection have been conducted on English, but few papers have addressed the problem of multilingual novelty detection. Likewise, research in multilingual information retrieval has rarely been applied to novelty detection. This paper attempts to bridge the two disciplines by first describing the preprocessing steps for English, Malay and Chinese, then applying document-level and sentence-level novelty detection to the three languages on APWSJ and TREC 2004 Novelty Track data. Experiments on sentence-level novelty detection show similar results for all three languages, which indicates that our algorithm is suitable for multilingual novelty detection at the sentence level. However, results for document-level novelty detection show a disparity across the languages, with English and Malay outperforming Chinese. When sentence-level novelty detection is instead used to detect novel documents, we observe substantial improvements in all three languages. This demonstrates that segmenting documents into sentences improves document-level novelty detection in multiple languages, and has practical benefits for a real-time multilingual novelty detection system.
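
As an illustration only, the idea of flagging novel sentences in a chronologically ordered stream, and then using those flags to judge a whole document, can be sketched as follows. The abstract does not specify the algorithm, so everything here is an assumption: the cosine-similarity metric over raw term counts, the `threshold` and `min_ratio` values, and the function names `tokenize`, `novel_flags` and `document_is_novel` are all hypothetical. In particular, `tokenize` is only a placeholder for the language-specific preprocessing (stemming, POS tagging, Chinese word segmentation) that the paper describes.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Placeholder preprocessing (assumption): lowercase word tokens.
    # Real Malay stemming or Chinese word segmentation would replace this.
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    # Cosine similarity between two term-count vectors.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def max_similarity(vec, history):
    # Highest similarity between a sentence and anything seen before it.
    return max((cosine(vec, h) for h in history), default=0.0)


def novel_flags(sentences, threshold=0.6):
    # A sentence counts as novel if it is not too similar to any
    # earlier sentence. The threshold value is an assumed example.
    history, flags = [], []
    for sentence in sentences:
        vec = Counter(tokenize(sentence))
        flags.append(max_similarity(vec, history) < threshold)
        history.append(vec)
    return flags


def document_is_novel(doc_sentences, seen_sentences,
                      threshold=0.6, min_ratio=0.5):
    # Document-level decision built from sentence-level decisions:
    # the document is novel if enough of its sentences are novel
    # relative to previously seen sentences. min_ratio is assumed.
    if not doc_sentences:
        return False
    history = [Counter(tokenize(s)) for s in seen_sentences]
    novel = sum(
        max_similarity(Counter(tokenize(s)), history) < threshold
        for s in doc_sentences
    )
    return novel / len(doc_sentences) >= min_ratio
```

For example, `novel_flags(["The cat sat on the mat.", "The cat sat on the mat.", "Stocks fell sharply today."])` marks the repeated second sentence as redundant and the other two as novel. Because only `tokenize` depends on the language, the rest of such a pipeline could in principle be shared across English, Malay and Chinese.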

Keywords: Novelty detection, Multilingual, Stemming, POS tagging, Malay, Chinese

Review history: Available online 16 July 2010.

Official paper URL: https://doi.org/10.1016/j.eswa.2010.07.016