Enhancing Intra-modal Similarity in a Cross-Modal Triplet Loss
Abstract
Cross-modal retrieval requires building a common latent space that captures and correlates information from different data modalities, usually images and texts. Cross-modal training based on the triplet loss with hard negative mining is a state-of-the-art technique to address this problem. This paper shows that such an approach is not always effective in handling intra-modal similarities. Specifically, we found that this method can lead to inconsistent similarity orderings in the latent space, where intra-modal pairs with unknown ground-truth similarity are ranked higher than cross-modal pairs representing the same concept. To address this problem, we propose two novel loss functions that leverage intra-modal similarity constraints that are available in a training triplet but not used by the original formulation. Additionally, this paper explores the application of this framework to unsupervised image retrieval problems, where cross-modal training can provide the supervisory signals that are otherwise missing in the absence of category labels. To the best of our knowledge, we are the first to evaluate cross-modal training for intra-modal retrieval without labels. We present comprehensive experiments on MS-COCO and Flickr30K, demonstrating the advantages and limitations of the proposed methods in cross-modal and intra-modal retrieval tasks in terms of performance and novelty measures. Our code is publicly available on GitHub: https://github.com/MariodotR/FullHN.git.
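For context, the baseline the abstract refers to is the cross-modal triplet loss with in-batch hard negative mining (as popularized by VSE++). The following is a minimal sketch of that baseline, assuming PyTorch, L2-normalized embeddings, and an illustrative margin of 0.2; the function and variable names are ours, and this is not the paper's proposed intra-modal loss, which is described in the full text.

    # Minimal sketch of a cross-modal triplet loss with in-batch hardest negatives.
    # Assumes PyTorch; names and the margin value are illustrative, not from the paper.
    import torch

    def triplet_loss_hard_negatives(img_emb, txt_emb, margin=0.2):
        """img_emb, txt_emb: (batch, dim) L2-normalized embeddings, where row i
        of each matrix encodes the same concept (a matching image-text pair)."""
        scores = img_emb @ txt_emb.t()          # cosine similarities (batch x batch)
        diag = scores.diag().view(-1, 1)        # similarity of each positive pair

        # Mask the positives so they cannot be selected as negatives.
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        neg_scores = scores.masked_fill(mask, float('-inf'))

        # Hardest text negative per image (row-wise max) and
        # hardest image negative per text (column-wise max).
        hardest_txt = neg_scores.max(dim=1).values.view(-1, 1)
        hardest_img = neg_scores.max(dim=0).values.view(-1, 1)

        cost_txt = (margin + hardest_txt - diag).clamp(min=0)
        cost_img = (margin + hardest_img - diag).clamp(min=0)
        return (cost_txt + cost_img).mean()

Note that this objective only constrains cross-modal (image-text) similarities; it places no constraint on image-image or text-text pairs, which is the gap the abstract's proposed intra-modal loss functions aim to close.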
More information
Title according to SCOPUS: not available (SCOPUS_ID: 85174249848)
Journal: Lecture Notes in Computer Science
Volume: 14276 LNAI
Publisher: Springer, Cham
Publication date: 2023
First page: 249
Last page: 264
DOI: 10.1007/978-3-031-45275-8_17
Notes: SCOPUS