An effective High Recall Retrieval method

Authors:

Highlights:

Abstract

High Recall Retrieval (HRR) is a fundamental task in many applications, such as patent retrieval, legal search, medical search, marketing research, tax assessment and collection, and literature review. Given the data set returned by a user's query, the HRR problem is to find the full set of relevant documents with as little review effort as possible. Reviewing many documents is very expensive because most reviewers are experts in specific fields, such as patent attorneys, lawyers, marketing specialists, and medical professionals. However, existing HRR methods are still far from being able to enumerate all relevant documents. There are two reasons: the sheer volume of documents inevitably includes noise (non-relevant documents), and threshold measurements have been inadequately adopted. To address these problems, we propose a novel solution that efficiently finds all relevant documents within a large result set. It consists of two steps: (a) effectively classifying the entire document set and (b) selecting representative documents in each class. We formalize the problem and theoretically verify the upper bound of our method. In the experiments, our method is more efficient than state-of-the-art query expansion methods.
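
The abstract does not say how representative documents are chosen, but the keywords name the Independent Dominating Set problem. As a rough illustration of that concept only, and not the authors' algorithm, the sketch below greedily builds a maximal independent set over a document similarity graph; any maximal independent set is also a dominating set, so every unselected document is adjacent to some selected representative. The function name, the highest-degree-first ordering, and the toy graph are all assumptions made for this example.

```python
from typing import Dict, Hashable, List, Set


def greedy_independent_dominating_set(
    graph: Dict[Hashable, Set[Hashable]]
) -> List[Hashable]:
    """Greedily pick a maximal independent set of an undirected graph.

    `graph` maps each document id to the set of documents it is similar to
    (edges must be symmetric). The result is independent (no two picks are
    neighbors) and dominating (every other node neighbors a pick).
    """
    selected: List[Hashable] = []
    covered: Set[Hashable] = set()
    # Assumed heuristic: visit well-connected documents first,
    # breaking ties by id so the output is deterministic.
    for node in sorted(graph, key=lambda n: (-len(graph[n]), str(n))):
        if node in covered:
            continue  # already selected or adjacent to a selected node
        selected.append(node)
        covered.add(node)
        covered.update(graph[node])
    return selected


# Hypothetical similarity graph: d1 resembles d2 and d3; d4 resembles d5.
similarity = {
    "d1": {"d2", "d3"},
    "d2": {"d1"},
    "d3": {"d1"},
    "d4": {"d5"},
    "d5": {"d4"},
}
representatives = greedy_independent_dominating_set(similarity)
```

In this toy graph a reviewer would inspect only the selected representatives first, since each remaining document is similar to at least one of them. Finding a minimum independent dominating set is NP-hard in general, so a greedy pass like this yields a valid but not necessarily smallest set.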

Keywords: High Recall Retrieval problem, Patent retrieval, Dynamic retrieval, Independent Dominating Set problem

Publication history: Available online 22 July 2017, Version of Record 8 November 2019.

Official paper URL: https://doi.org/10.1016/j.datak.2017.07.006