Hamshahri: A standard Persian text collection

Abstract

Persian is one of the dominant languages in the Middle East, so a significant number of Persian documents are available on the Web. Because Persian differs in nature from languages such as English, the design of information retrieval systems for Persian requires special consideration. However, there are relatively few studies on the retrieval of Persian documents in the literature, and one of the main reasons is the lack of a standard test collection. In this paper, we introduce a standard Persian text collection, named Hamshahri, which is built from a large number of newspaper articles according to TREC specifications. We also present statistical information about the documents, the queries, and their relevance judgments. We believe that this collection is the largest Persian text collection to date.

Keywords: Persian test collection, Farsi text retrieval, Persian information retrieval

Article history: Received 12 February 2008, Revised 5 March 2009, Accepted 3 May 2009, Available online 10 May 2009.

Paper URL: https://doi.org/10.1016/j.knosys.2009.05.002