Sectional MinHash for near-duplicate detection

Authors:

Highlights:

• Sectional MinHash enhances the MinHash data structure with locational information.

• The mean squared error (MSE) of the proposed method is one eighth that of MinHash.

• Near-duplicate detection with the proposed method achieved higher accuracy.

• Setting the number of sections s to 2 gave the best results for the tested dataset.

Keywords: Near-duplicate detection, MinHash, Locality sensitive hashing, Set similarity

Article history: Received 8 August 2017, Revised 18 December 2017, Accepted 11 January 2018, Available online 11 January 2018, Version of Record 6 February 2018.

Official paper URL: https://doi.org/10.1016/j.eswa.2018.01.014