BitHash: An efficient bitwise Locality Sensitive Hashing method with applications

Authors:

Highlights:

Abstract

Locality Sensitive Hashing has been applied to detecting near-duplicate images, videos and web documents. In this paper we present a bitwise Locality Sensitive Hashing method, BitHash, that uses only one bit per hash value, so the space needed to store hash values is significantly reduced and the estimator can be computed much faster. The method provides an unbiased estimate of pairwise Jaccard similarity, and the estimator is a simple linear function of Hamming distance. We rigorously analyze the variance of One-Bit Min-Hash (BitHash), showing that BitHash can provide accurate estimates when the Jaccard similarity is high, and that the variance ratio of BitHash over the original min-hash decreases as the pairwise Jaccard similarity increases. Furthermore, BitHash compresses each data sample into a compact binary hash code while preserving the pairwise similarity of the original data. The binary code can replace the original data as a compressed and informative representation for subsequent processing; for example, it can be naturally integrated with a classifier such as an SVM. We apply BitHash to two typical applications, near-duplicate image detection and sentiment analysis. Experiments on a real user's photo collection and a popular sentiment analysis data set show that the classification accuracy of the proposed method in both applications can approach that of the state-of-the-art method, while BitHash requires significantly less storage space.
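
The abstract does not spell out the construction, so the following is only a minimal sketch of a one-bit min-hash under stated assumptions, not the authors' implementation. It assumes each bit is the lowest bit of a min-hash value, so that two sets agree on a bit with probability (1 + J) / 2, where J is their Jaccard similarity. Under that assumption, J can be estimated as 1 - 2 d / k, a linear function of the Hamming distance d between two k-bit codes. The names `bithash`, `estimate_jaccard` and `_seeded_hash` are hypothetical, and BLAKE2b merely stands in for a family of random hash functions.

```python
import hashlib
from typing import Iterable, List


def _seeded_hash(seed: int, item: str) -> int:
    """64-bit pseudo-random hash of `item`; `seed` selects the hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def bithash(items: Iterable[str], k: int) -> List[int]:
    """Return a k-bit code: one bit per min-hash function (assumed construction).

    `items` must be non-empty, otherwise min() raises ValueError.
    """
    elements = list(items)
    bits = []
    for seed in range(k):
        min_value = min(_seeded_hash(seed, x) for x in elements)
        bits.append(min_value & 1)  # keep only the lowest bit of the min-hash
    return bits


def estimate_jaccard(code_a: List[int], code_b: List[int]) -> float:
    """Estimate Jaccard similarity as a linear function of Hamming distance."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    hamming = sum(a != b for a, b in zip(code_a, code_b))
    # Assumes Pr[bits agree] = (1 + J) / 2, hence J = 1 - 2 * hamming / k.
    return 1.0 - 2.0 * hamming / len(code_a)
```

Calling `estimate_jaccard(bithash(a, k), bithash(b, k))` on two sets of strings returns an estimate whose spread shrinks as `k` grows. For sets with low similarity the estimate can come out negative, since it is unbiased rather than clipped to [0, 1].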

Keywords: Locality Sensitive Hashing, BitHash, Near-duplicate detection, Machine learning, Sentiment analysis, Storage efficiency

Review history: Received 26 September 2015, Revised 11 January 2016, Accepted 18 January 2016, Available online 23 January 2016, Version of Record 20 February 2016.

Official paper URL: https://doi.org/10.1016/j.knosys.2016.01.022