Assessing GPT as a Weak Oracle for Annotating Radiological Studies

De Ferrari, J.; Ñanculef, R.; Benoit, D.; Araya, M.; Solar, M.; Bellazzi, R.

Abstract

The development of robust deep learning systems for radiology requires large annotated datasets, which are costly and time-consuming to produce manually. Recent advances in large language models (LLMs) suggest that these models could serve as automated annotators for radiological studies. However, deploying LLMs as surrogates for human annotators raises concerns about scalability, data quality, and privacy. Additionally, the interpretability of annotations from black-box LLMs remains limited without downstream validation. This paper demonstrates that lightweight language models trained on LLM-generated annotations can match the performance of larger black-box LLMs while providing a more efficient and auditable solution. We present a dataset of 42,605 CT radiology reports in Spanish with LLM-generated labels for three pulmonary abnormalities (consolidation, cysts, and nodules), including a human-validated subset for evaluation. Our experiments with models spanning different specialization domains show that training on LLM-generated annotations significantly outperforms traditional supervised learning on limited expert-labeled data. Despite GPT acting as a noisy oracle, models trained with these weak supervision signals achieved strong performance (micro F1 score of 0.88) on radiologist-verified test samples. Notably, while self-supervised pre-training improved the performance of supervised learning on small datasets, the extensive LLM-generated annotations proved superior by a wide margin.
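The weak-oracle annotation step the abstract describes can be pictured with a minimal sketch. This is an illustration, not the authors' pipeline: the prompt wording, the `gpt-4o-mini` model choice, and the JSON output schema are assumptions; the paper only establishes that a GPT model produced noisy labels for the three pulmonary findings.

```python
# Minimal sketch of LLM-as-weak-oracle annotation (hypothetical prompt and
# model name; the paper does not publish its exact configuration).
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["consolidation", "cysts", "nodules"]

PROMPT = (
    "You are annotating Spanish CT radiology reports. For the report below, "
    "answer with a JSON object mapping each of these findings to true or "
    "false: " + ", ".join(LABELS) + ".\n\nReport:\n{report}"
)

def annotate(report_text: str) -> dict:
    """Ask the LLM for weak (noisy) multi-label annotations of one report."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any chat-completions model works
        messages=[{"role": "user", "content": PROMPT.format(report=report_text)}],
        response_format={"type": "json_object"},  # force parseable output
        temperature=0,  # deterministic labels across runs
    )
    raw = json.loads(response.choices[0].message.content)
    return {label: bool(raw.get(label, False)) for label in LABELS}
```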
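Downstream, those weak labels can fine-tune a compact multi-label classifier, scored with micro F1 against the radiologist-verified split. Again a hedged sketch: the `dccuchile/bert-base-spanish-wwm-cased` checkpoint, the 0.5 decision threshold, the toy data, and the hyperparameters are illustrative assumptions, not the paper's reported configuration.

```python
# Sketch: fine-tune a lightweight encoder on LLM-generated weak labels and
# report micro F1 on a verified split. All names and settings are assumptions.
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["consolidation", "cysts", "nodules"]
CHECKPOINT = "dccuchile/bert-base-spanish-wwm-cased"  # compact Spanish BERT

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT,
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # BCE-with-logits loss
)

# Toy stand-ins: in practice train_ds holds the ~42k LLM-annotated reports
# and eval_ds the human-validated subset.
train_ds = Dataset.from_dict({
    "text": ["Nódulo pulmonar de 6 mm en el lóbulo superior derecho.",
             "Sin consolidaciones. Quiste simple, hallazgo incidental."],
    "weak_labels": [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
})
eval_ds = train_ds

def encode(batch):
    enc = tokenizer(batch["text"], truncation=True,
                    padding="max_length", max_length=256)
    enc["labels"] = batch["weak_labels"]  # floats, as the BCE loss expects
    return enc

def micro_f1(eval_pred):
    logits, labels = eval_pred
    preds = (logits > 0).astype(int)  # sigmoid(x) > 0.5  <=>  x > 0
    return {"micro_f1": f1_score(labels.astype(int), preds, average="micro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="weak-oracle-clf", num_train_epochs=3),
    train_dataset=train_ds.map(encode, batched=True),
    eval_dataset=eval_ds.map(encode, batched=True),
    compute_metrics=micro_f1,
)
trainer.train()
print(trainer.evaluate())  # the paper reports micro F1 = 0.88 at full scale
```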

More information

Publisher: Springer, Cham
Publication date: 2025
Start/end year: 2025
First page: 98
Last page: 109
Language: English
URL: https://link.springer.com/chapter/10.1007/978-3-031-95838-0_10
DOI: https://doi.org/10.1007/978-3-031-95838-0_10