From content to links: Social image embedding with deep multimodal model

Authors:

Highlights:

Abstract

With the popularity of social networks, social media data embedding has attracted extensive research interest and enabled many applications, such as image classification and cross-modal retrieval. In this paper, we examine the scenario of social images that contain multimodal content (e.g., visual content and textual tags) and are connected with each other (e.g., two images submitted to the same group). In such a case, both the multimodal content and the link information provide useful clues for representation learning, so learning the embedding from the network structure or the data content alone results in sub-optimal social image representations. We propose Deep Multimodal Attention Networks (DMAN) to combine multimodal content and link information for social image embedding. Specifically, to effectively incorporate the multimodal content, a visual-textual attention model is proposed to encode the fine-grained correlation between modalities, i.e., the alignment between image regions and textual words. To incorporate the network structure into embedding learning, a novel Siamese-Triplet neural network is proposed to model the first-order and second-order proximity among images. The two modules are then integrated into a joint deep model for social image embedding. Once the representation has been learned, a wide variety of data mining problems can be solved with task-specific algorithms designed for vector representations. Extensive experiments on multi-label classification and cross-modal search demonstrate the effectiveness of our approach.
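To make the two components of the abstract concrete, the following is a minimal sketch (not the authors' released code) of a visual-textual attention module that aligns image regions with textual words, together with a triplet-style loss over linked images for first-order proximity. All module names, feature dimensions, and the margin value are illustrative assumptions.

```python
# Minimal sketch, assuming pre-extracted region features (e.g., 2048-d CNN features)
# and word/tag vectors (e.g., 300-d embeddings). Not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualTextualAttention(nn.Module):
    """Attend over image regions conditioned on words and fuse the two modalities."""
    def __init__(self, region_dim=2048, word_dim=300, embed_dim=256):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, embed_dim)
        self.word_proj = nn.Linear(word_dim, embed_dim)
        self.out = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, regions, words):
        # regions: (B, R, region_dim), words: (B, W, word_dim)
        r = self.region_proj(regions)                          # (B, R, D)
        w = self.word_proj(words)                              # (B, W, D)
        attn = torch.softmax(r @ w.transpose(1, 2), dim=-1)    # region-word alignment (B, R, W)
        attended = attn @ w                                    # word context per region (B, R, D)
        fused = torch.cat([r, attended], dim=-1).mean(dim=1)   # pool over regions (B, 2D)
        return F.normalize(self.out(fused), dim=-1)            # image embedding (B, D)

def triplet_link_loss(anchor, positive, negative, margin=0.2):
    """First-order proximity: pull linked image pairs together, push unlinked apart."""
    d_pos = 1 - F.cosine_similarity(anchor, positive)
    d_neg = 1 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Usage with random placeholder features; real training would sample positives from
# linked images (e.g., same group) and negatives from unlinked ones.
model = VisualTextualAttention()
regions = torch.randn(8, 36, 2048)   # 36 region features per image (assumed)
words = torch.randn(8, 10, 300)      # 10 tag vectors per image (assumed)
emb = model(regions, words)
loss = triplet_link_loss(emb, emb.roll(1, 0), emb.roll(2, 0))
loss.backward()
```

Second-order proximity (images sharing many neighbors) would be handled analogously in the joint model, with an additional term that compares neighborhood-level embeddings; the sketch above covers only the first-order pairwise case.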

Keywords: Social image embedding, Network embedding, Attention model, Siamese-Triplet network

Article history: Received 12 April 2018, Revised 6 July 2018, Accepted 8 July 2018, Available online 18 July 2018, Version of Record 12 September 2018.

DOI: https://doi.org/10.1016/j.knosys.2018.07.020