Evaluating GPT-4o in high-stakes medical assessments: performance and error analysis on a Chilean anesthesiology exam
Keywords: anesthesiology, large language models, clinical decision support, GPT-4o, medical AI evaluation, Spanish-language exam, diagnostic reasoning
Abstract
Background: Large language models (LLMs) such as GPT-4o have the potential to transform clinical decision-making, patient education, and medical research. Despite impressive performance in generating patient-friendly educational materials and assisting in clinical documentation, concerns remain regarding the reliability, subtle errors, and biases that can undermine their use in high-stakes medical settings.
Methods: A multi-phase experimental design was employed to assess the performance of GPT-4o on the Chilean anesthesiology exam (CONACEM), which comprised 183 questions covering four cognitive domains (Understanding, Recall, Application, and Analysis) based on Bloom's taxonomy. Thirty independent simulation runs were conducted with systematic variation of the model's temperature parameter to gauge the balance between deterministic and creative responses. The generated responses underwent qualitative error analysis using a refined taxonomy that categorized errors such as Unsupported Medical Claim, Hallucination of Information, Sticking with Wrong Diagnosis, Non-medical Factual Error, Incorrect Understanding of Task, Reasonable Response, Ignore Missing Information, and Incorrect or Vague Conclusion. Two board-certified anesthesiologists performed independent annotations, with disagreements resolved by a third expert. Statistical evaluations, including one-way ANOVA, non-parametric tests, chi-square, and linear mixed-effects modeling, were used to compare performance across domains and analyze error frequency.
Results: GPT-4o achieved an overall accuracy of 83.69%. Performance varied significantly by cognitive domain, with the highest accuracy observed in the Understanding (90.10%) and Recall (84.38%) domains, and lower accuracy in Application (76.83%) and Analysis (76.54%). Among the 120 incorrect responses, unsupported medical claims were the most common error (40.69%), followed by vague or incorrect conclusions (22.07%). Co-occurrence analyses revealed that unsupported claims often appeared alongside imprecise conclusions, highlighting a trend of compounded errors, particularly in tasks requiring complex reasoning. Inter-rater reliability for error annotation was robust, with a mean Cohen's kappa of 0.73.
Conclusions: While GPT-4o exhibits strengths in factual recall and comprehension, its limitations in handling higher-order reasoning and diagnostic judgment are evident through frequent unsupported medical claims and vague conclusions. These findings underscore the need for improved domain-specific fine-tuning, enhanced error mitigation strategies, and integrated knowledge verification mechanisms prior to clinical deployment. © The Author(s) 2025.
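The abstract reports inter-rater reliability as a mean Cohen's kappa without describing how it was computed. As a minimal sketch of the statistic itself, the Python function below computes kappa for two annotators labelling the same items. The function name `cohens_kappa` and the binary labels are hypothetical illustrations, not data or code from the study; how the reported mean was formed (for example, by averaging per-category values) is an assumption the abstract does not confirm.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: derived from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations (NOT study data): 1 = error category present, 0 = absent.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

In this toy example the two annotators agree on 8 of 10 items, but half of that agreement would be expected by chance given their label frequencies, so kappa is lower than the raw agreement rate.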
More information
| Title according to WOS: | Evaluating GPT-4o in high-stakes medical assessments: performance and error analysis on a Chilean anesthesiology exam |
| Title according to SCOPUS: | Evaluating GPT-4o in high-stakes medical assessments: performance and error analysis on a Chilean anesthesiology exam |
| Journal title: | BMC Medical Education |
| Volume: | 25 |
| Issue: | 1 |
| Publisher: | BIOMED CENTRAL LTD |
| Publication date: | 2025 |
| Language: | English |
| DOI: | 10.1186/s12909-025-08084-9 |
| Notes: | ISI, SCOPUS |