Using TREC for developing semantic information retrieval benchmark for Urdu

Highlights:

• A large corpus of over 2,887,169 Urdu documents in the TREC-defined SGML format.

• A collection of 35 Urdu queries from 14 domains for assessment.

• A human benchmark of candidate relevant documents, built using a pooling-based approach.

• Non-binary relevance judgements at four levels.
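
To illustrate how pooling and graded judgements typically fit together in a TREC-style evaluation, here is a minimal Python sketch. It is not the authors' pipeline. The run names, document IDs, pool depth, and the 0 to 3 grade scale are hypothetical, and nDCG is used only as one common metric for graded relevance; the highlights do not say which metric the benchmark uses.

```python
import math


def build_pool(runs, depth):
    """Union of the top-`depth` documents from every run for one query.

    `runs` maps a (hypothetical) system name to its ranked list of doc IDs.
    Only the pooled documents are handed to human assessors.
    """
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


def dcg(grades):
    """Discounted cumulative gain of a list of grades in rank order."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades))


def ndcg(ranked_docs, judgments, k):
    """nDCG@k; documents without a judgment are treated as grade 0."""
    gains = [judgments.get(doc, 0) for doc in ranked_docs[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Two hypothetical retrieval runs for a single query.
runs = {
    "run_a": ["d1", "d2", "d3"],
    "run_b": ["d2", "d4", "d1"],
}

# Pool the top 2 documents of each run for human assessment.
pool = build_pool(runs, depth=2)

# Assumed four-level scale: 0 = not relevant ... 3 = highly relevant.
judgments = {"d1": 3, "d2": 1, "d4": 0}

score_a = ndcg(runs["run_a"], judgments, k=3)
score_b = ndcg(runs["run_b"], judgments, k=3)
```

In this toy example `run_a` ranks the judged documents in ideal order, so it scores higher than `run_b`, which places the most relevant document last.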

Keywords: Information Retrieval, Benchmark dataset, Urdu news documents, Non-binary ranking, Urdu language processing, Information retrieval queries, Text REtrieval Conference (TREC)

Article history: Received 20 December 2021, Revised 27 March 2022, Accepted 3 April 2022, Available online 30 April 2022, Version of Record 30 April 2022.

Official paper URL: https://doi.org/10.1016/j.ipm.2022.102939