Assessing GPT as a Weak Oracle for Annotating Radiological Studies
Abstract
The development of robust deep learning systems for radiology requires large annotated datasets, which are costly and time-consuming to produce manually. Recent advances in large language models (LLMs) suggest these models could serve as automated annotators for radiological studies. However, deploying LLMs as surrogates for human annotators raises concerns about scalability, data quality, and privacy. Additionally, the interpretability of annotations from black-box LLMs remains limited without downstream validation. This paper demonstrates that lightweight language models trained on LLM-generated annotations can match the performance of larger black-box LLMs while providing a more efficient and auditable solution. We present a dataset of 42,605 CT radiology reports in Spanish with LLM-generated labels for three pulmonary abnormalities (consolidation, cysts, and nodules), including a human-validated subset for evaluation. Our experiments with models spanning different specialization domains show that training on LLM-generated annotations significantly outperforms traditional supervised learning on limited expert-labeled data. Despite GPT acting as a noisy oracle, models trained with these weak supervision signals achieved strong performance (micro-F1 score of 0.88) on radiologist-verified test samples. Notably, while self-supervised pre-training improved the performance of supervised learning on small datasets, the extensive LLM-generated annotations proved superior by a wide margin.
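To make the weak-supervision recipe concrete, the sketch below fine-tunes a lightweight encoder on GPT-generated labels for the three findings and reports micro-F1 against expert labels. This is a minimal illustration, not the authors' pipeline: the checkpoint (PlanTL-GOB-ES/roberta-base-biomedical-clinical-es), the toy reports, column names, and hyperparameters are assumptions made for the example.

```python
# Minimal sketch of weak supervision with an LLM as a noisy oracle:
# fine-tune a small multi-label classifier on GPT-generated labels,
# then evaluate micro-F1 on a human-validated subset.
import numpy as np
import torch
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["consolidation", "cyst", "nodule"]  # the three target findings
# Assumed Spanish biomedical encoder; any lightweight checkpoint would do.
MODEL = "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"

# Toy stand-ins for the real corpora: GPT-annotated reports for training,
# a radiologist-verified subset for evaluation.
train_reports = ["Se observa consolidación en el lóbulo inferior derecho.",
                 "Nódulo pulmonar de 6 mm en el lóbulo superior izquierdo."]
train_weak_labels = [[1, 0, 0], [0, 0, 1]]          # GPT-generated (noisy) labels
test_reports = ["Sin hallazgos pulmonares significativos."]
test_expert_labels = [[0, 0, 0]]                     # radiologist-verified labels

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=len(LABELS), problem_type="multi_label_classification")

def encode(batch):
    # Tokenize reports and attach float multi-hot labels for BCE loss.
    enc = tokenizer(batch["report"], truncation=True,
                    padding="max_length", max_length=256)
    enc["labels"] = [[float(x) for x in row] for row in batch["label"]]
    return enc

train_ds = Dataset.from_dict(
    {"report": train_reports, "label": train_weak_labels}).map(encode, batched=True)
eval_ds = Dataset.from_dict(
    {"report": test_reports, "label": test_expert_labels}).map(encode, batched=True)

def compute_metrics(pred):
    # Sigmoid + 0.5 threshold per label, then micro-averaged F1.
    probs = torch.sigmoid(torch.tensor(pred.predictions)).numpy()
    y_true = np.asarray(pred.label_ids).astype(int)
    return {"micro_f1": f1_score(y_true, (probs >= 0.5).astype(int),
                                 average="micro", zero_division=0)}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="weak-oracle-ct", num_train_epochs=3,
                           per_device_train_batch_size=8, logging_steps=10),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # micro-F1 on the expert-labeled subset
```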
More information
Publisher: Springer, Cham
Publication date: 2025
Start/end year: 2025
First page: 98
Last page: 109
Language: English
URL: https://link.springer.com/chapter/10.1007/978-3-031-95838-0_10#citeas
DOI: https://doi.org/10.1007/978-3-031-95838-0_10