Empirical studies on the impact of lexical resources on CLIR performance

Abstract

In this paper, we compile and review several experiments measuring cross-lingual information retrieval (CLIR) performance as a function of the following resources: bilingual term lists, parallel corpora, machine translation (MT), and stemmers. Our CLIR system uses a simple probabilistic language model; the studies used TREC test corpora in Chinese, Spanish, and Arabic. Our findings include:

• Acceptable CLIR performance can be achieved using only a bilingual term list (70–80% on Chinese and Arabic corpora).
• If both a bilingual term list and parallel corpora are available, CLIR performance can rival monolingual performance.
• If no parallel corpus is available, pseudo-parallel texts produced by an MT system can partially compensate for the lack of parallel text.
• Although stemming is normally useful, it hurt performance in our empirical studies with Arabic, a highly inflected language, when a very large Arabic–English parallel corpus was available.

Keywords: Cross-lingual retrieval, Parallel texts, Stemming, Machine translation, Bilingual lexicons

Article history: Received 10 June 2004, Accepted 14 June 2004, Available online 20 August 2004.

Official paper URL: https://doi.org/10.1016/j.ipm.2004.06.009