Multi-label modality enhanced attention based self-supervised deep cross-modal hashing

Authors:

Highlights:

Abstract:

Deep cross-modal hashing (DCMH) has recently achieved superior performance in effective and efficient cross-modal retrieval and has thus drawn increasing attention. Nevertheless, most existing DCMH methods still have two limitations: (1) single labels are usually leveraged to measure the semantic similarity of cross-modal pairwise instances, neglecting that many cross-modal datasets contain abundant semantic information among multi-labels; (2) several DCMH methods utilize multi-labels to supervise the learning of hash functions, but the multi-label feature space suffers from sparsity, resulting in sub-optimal hash function learning. Thus, this paper proposes a multi-label modality enhanced attention-based self-supervised deep cross-modal hashing (MMACH) framework. Specifically, a multi-label modality enhanced attention module is designed to integrate the significant features from cross-modal data into the multi-label feature representations, aiming to improve their completeness. Moreover, a multi-label cross-modal triplet loss is defined based on the criterion that the feature representations of cross-modal pairwise instances with more common categories should preserve higher semantic similarity than other instances. To the best of our knowledge, this is the first time a multi-label cross-modal triplet loss has been designed for cross-modal retrieval. Extensive experiments on four multi-label cross-modal datasets demonstrate the effectiveness and efficiency of the proposed MMACH, which also outperforms several state-of-the-art methods on the task of cross-modal retrieval. The source code of MMACH is available at https://github.com/SWU-CS-MediaLab/MMACH.
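The triplet criterion described in the abstract — pairs sharing more categories should be closer in the learned space than pairs sharing fewer — can be illustrated with a minimal sketch. This is not the paper's actual loss; the function names (`multilabel_triplet_loss`, `mine_triplets`), the squared-Euclidean distance, and the margin value are illustrative assumptions. The paper's implementation is in the linked repository.

```python
import numpy as np

def multilabel_triplet_loss(anchor, pos, neg, margin=0.5):
    # Margin-based triplet loss in the shared feature/hash space:
    # push the anchor closer to the positive than to the negative.
    d_pos = np.sum((anchor - pos) ** 2)   # squared Euclidean distance
    d_neg = np.sum((anchor - neg) ** 2)
    return max(0.0, d_pos - d_neg + margin)

def mine_triplets(labels):
    # labels: (N, C) binary multi-label matrix.
    # A triplet (a, p, n) is valid when instance p shares strictly more
    # categories with anchor a than instance n does -- the criterion from
    # the abstract (more common categories => higher required similarity).
    common = labels @ labels.T            # pairwise common-category counts
    n_inst = len(labels)
    triplets = []
    for a in range(n_inst):
        for p in range(n_inst):
            for n in range(n_inst):
                if a != p and common[a, p] > common[a, n]:
                    triplets.append((a, p, n))
    return triplets
```

For example, with labels `[[1,1,0],[1,1,0],[0,0,1]]`, instances 0 and 1 share two categories while 0 and 2 share none, so `(0, 1, 2)` is mined as a triplet and the loss then penalizes embeddings that place instance 2 closer to instance 0 than instance 1 is.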

Keywords: Deep cross-modal hashing, Attention mechanism, Multi-label semantic learning

Article history: Received 30 August 2021, Revised 12 November 2021, Accepted 8 December 2021, Available online 28 December 2021, Version of Record 6 January 2022.

DOI: https://doi.org/10.1016/j.knosys.2021.107927