Evaluating the Performance of Large Language Models on the CONACEM Anesthesiology Certification Exam: A Comparison with Human Participants
Keywords: psychometric analysis, anesthesiology certification, clinical reasoning assessment, language model benchmarking, medical AI evaluation, non-English medical exams, Spanish-language healthcare, zero-shot prompting
Abstract
Large Language Models (LLMs) have demonstrated strong performance on English-language medical exams, but their effectiveness in non-English, high-stakes settings is less well understood. This study benchmarks nine LLMs against human examinees on the Chilean Anesthesiology Certification Exam (CONACEM), a Spanish-language board examination. A curated set of 63 multiple-choice questions was used, categorized by Bloom's taxonomy into four cognitive levels. Model responses were assessed using Item Response Theory and Classical Test Theory, complemented by an error analysis that categorized errors as reasoning-based, knowledge-based, or comprehension-related. Closed-source models outperformed open-source models, with GPT-o1 achieving the highest accuracy (88.7%). Deepseek-R1 was a strong performer among the open-source options. Item difficulty significantly predicted model accuracy, while item discrimination did not. Most errors occurred in application and understanding tasks and were linked to flawed reasoning or knowledge misapplication. These results underscore LLMs' potential for factual recall in Spanish-language medical exams but also their limitations in complex reasoning. Incorporating cognitive classification and an error taxonomy provides deeper insight into model behavior and supports the cautious use of LLMs as educational aids in clinical settings. © 2025 by the authors.
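As an illustration of the psychometric quantities the abstract refers to, the following minimal Python sketch computes Classical Test Theory item difficulty and a corrected point-biserial discrimination index, and evaluates a two-parameter logistic (2PL) Item Response Theory curve. It is not the authors' analysis pipeline: the abstract does not state which IRT model was fitted, so the 2PL form is an assumption, and the function names and the toy response matrix are hypothetical.

```python
import math
from statistics import mean, pstdev


def item_difficulty(responses, item):
    """CTT difficulty (p-value): proportion of examinees who answered the item correctly."""
    return mean([row[item] for row in responses])


def item_discrimination(responses, item):
    """Corrected point-biserial: correlation of the item score with the rest-of-test score."""
    x = [row[item] for row in responses]
    y = [sum(row) - row[item] for row in responses]  # total score excluding this item
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0  # undefined when either score has no variance; reported as 0 here
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def irt_2pl(theta, a, b):
    """Assumed 2PL model: P(correct) for ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical toy data: rows are examinees, columns are items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]

for i in range(3):
    print(i, item_difficulty(responses, i), round(item_discrimination(responses, i), 3))
```

In this framing, a high `item_difficulty` value means an easy item, while in the 2PL curve a larger `b` means a harder item; the two scales run in opposite directions.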
More information
| Title according to WOS: | Evaluating the Performance of Large Language Models on the CONACEM Anesthesiology Certification Exam: A Comparison with Human Participants |
| Title according to SCOPUS: | Evaluating the Performance of Large Language Models on the CONACEM Anesthesiology Certification Exam: A Comparison with Human Participants |
| Journal title: | Applied Sciences (Switzerland) |
| Volume: | 15 |
| Issue: | 11 |
| Publisher: | Multidisciplinary Digital Publishing Institute (MDPI) |
| Publication date: | 2025 |
| Language: | English |
| DOI: | 10.3390/app15116245 |
| Notes: | ISI, SCOPUS |