Enhancing Intra-modal Similarity in a Cross-Modal Triplet Loss

Mallea, Mario; Ñanculef, Ricardo; Araya-Lopez, Mauricio Alejandro

Abstract

Cross-modal retrieval requires building a common latent space that captures and correlates information from different data modalities, usually images and texts. Cross-modal training based on the triplet loss with hard negative mining is a state-of-the-art technique to address this problem. This paper shows that such an approach is not always effective in handling intra-modal similarities. Specifically, we found that this method can lead to inconsistent similarity orderings in the latent space, where intra-modal pairs with unknown ground-truth similarity are ranked higher than cross-modal pairs representing the same concept. To address this problem, we propose two novel loss functions that leverage intra-modal similarity constraints available in a training triplet but not used by the original formulation. Additionally, this paper explores the application of this framework to unsupervised image retrieval problems, where cross-modal training can provide the supervisory signals that are otherwise missing in the absence of category labels. To the best of our knowledge, we are the first to evaluate cross-modal training for intra-modal retrieval without labels. We present comprehensive experiments on MS-COCO and Flickr30K, demonstrating the advantages and limitations of the proposed methods in cross-modal and intra-modal retrieval tasks in terms of performance and novelty measures. Our code is publicly available on GitHub: https://github.com/MariodotR/FullHN.git.
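As context for the abstract, the baseline it builds on (the triplet loss with in-batch hardest-negative mining, as popularized by VSE++) can be sketched as follows. This is a minimal NumPy illustration of that standard cross-modal formulation, not the intra-modal loss functions proposed in the paper; the function name and the choice of margin are our own.

```python
import numpy as np

def triplet_loss_hard_negative(img_emb, txt_emb, margin=0.2):
    """Hinge triplet loss with in-batch hardest-negative mining.

    img_emb, txt_emb: (n, d) L2-normalized embeddings, where row i of each
    matrix is a matched image-text pair. For each anchor, only the single
    hardest in-batch negative contributes to the loss.
    """
    sim = img_emb @ txt_emb.T           # (n, n) cosine similarity matrix
    pos = np.diag(sim)                  # similarities of the matched pairs
    mask = np.eye(sim.shape[0], dtype=bool)

    # Image anchors: hinge against every caption negative in the batch.
    cost_i2t = np.clip(margin - pos[:, None] + sim, 0.0, None)
    cost_i2t[mask] = 0.0                # exclude the positive pair itself
    # Text anchors: hinge against every image negative in the batch.
    cost_t2i = np.clip(margin - pos[None, :] + sim, 0.0, None)
    cost_t2i[mask] = 0.0

    # Hardest-negative mining: keep only the max-violation negative per anchor.
    return cost_i2t.max(axis=1).mean() + cost_t2i.max(axis=0).mean()
```

With orthogonal negatives the hinge is inactive and the loss is zero; when a negative is as similar as the positive, each anchor pays the full margin.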

More information

SCOPUS ID: SCOPUS_ID:85174249848
Journal: Lecture Notes in Computer Science
Volume: 14276 LNAI
Publisher: Springer, Cham
Publication date: 2023
First page: 249
Last page: 264
DOI: 10.1007/978-3-031-45275-8_17

Notes: SCOPUS