VLM-as-a-Judge Approaches for Evaluating Visual Narrative Coherence in Historical Photographical Records
Keywords: narrative extraction, coherence metrics, visual narrative evaluation, VLM-as-a-judge, multimodal evaluation
Abstract
Evaluating the coherence of visual narrative sequences extracted from image collections remains a challenge in digital humanities and computational journalism. While mathematical coherence metrics based on visual embeddings provide objective measures, they require computational resources and technical expertise to interpret. We propose using vision-language models (VLMs) as judges to evaluate visual narrative coherence, comparing two approaches: caption-based evaluation, which converts images to text descriptions, and direct vision evaluation, which processes images without intermediate text generation. Through experiments on 126 narratives from historical photographs, we show that both approaches achieve weak-to-moderate correlations with mathematical coherence metrics (r = 0.28–0.36) while differing in reliability and efficiency. Direct VLM evaluation achieves higher inter-rater reliability ((Formula presented.) vs. (Formula presented.)) but requires (Formula presented.) more computation time after initial caption generation. Both methods successfully discriminate between human-curated, algorithmically extracted, and random narratives, with all pairwise comparisons achieving statistical significance ((Formula presented.), with five of six comparisons at (Formula presented.)). Human sequences consistently score highest, followed by algorithmic extractions, then random sequences. Our findings indicate that the choice between approaches depends on application requirements: caption-based for efficient large-scale screening versus direct vision for consistent curatorial assessment. © 2025 by the authors.
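The correlation between judge scores and a mathematical coherence metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes Pearson's r is the statistic meant by "r"; the function name `pearson_r`, the variable names, and the toy values are hypothetical, chosen only for illustration, and are not taken from the paper's data or code.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values (NOT the paper's data): one VLM judge score and one
# embedding-based coherence value per narrative.
judge_scores = [7.0, 5.5, 3.0, 8.0, 4.5, 2.0]
coherence_metric = [0.62, 0.48, 0.40, 0.70, 0.35, 0.30]

r = pearson_r(judge_scores, coherence_metric)
print(f"Pearson r = {r:.2f}")
```

In a setting like the one described, the same function would be applied once per approach (caption-based and direct vision) over all narratives, so the two correlations can be compared side by side.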
More information
| Title according to WOS: | VLM-as-a-Judge Approaches for Evaluating Visual Narrative Coherence in Historical Photographical Records |
| Title according to SCOPUS: | VLM-as-a-Judge Approaches for Evaluating Visual Narrative Coherence in Historical Photographical Records |
| Journal title: | Electronics (Switzerland) |
| Volume: | 14 |
| Issue: | 21 |
| Publisher: | Multidisciplinary Digital Publishing Institute (MDPI) |
| Publication date: | 2025 |
| Language: | English |
| DOI: | 10.3390/electronics14214199 |
| Notes: | ISI, SCOPUS |