@Article{info:doi/10.2196/64284, author="Wei, Boxiong", title="Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis", journal="JMIR Med Educ", year="2025", month="Jan", day="16", volume="11", pages="e64284", keywords="large language models; LLM; artificial intelligence; AI; GPT-4; radiology exams; medical education; diagnostics; medical training; radiology; ultrasound", abstract="Background: Artificial intelligence advancements have enabled large language models to significantly impact radiology education and diagnostic accuracy. Objective: This study evaluates the performance of mainstream large language models, including GPT-4, Claude, Bard, Tongyi Qianwen, and Gemini Pro, in radiology board exams. Methods: A comparative analysis of 150 multiple-choice questions from radiology board exams without images was conducted. Models were assessed on their accuracy for text-based questions and were categorized by cognitive levels and medical specialties using $\chi^2$ tests and ANOVA. Results: GPT-4 achieved the highest accuracy (83.3{\%}, 125/150), significantly outperforming all other models. Specifically, Claude achieved an accuracy of 62{\%} (93/150; P<.001), Bard 54.7{\%} (82/150; P<.001), Tongyi Qianwen 70.7{\%} (106/150; P=.009), and Gemini Pro 55.3{\%} (83/150; P<.001). The odds ratios compared to GPT-4 were 0.33 (95{\%} CI 0.18-0.60) for Claude, 0.24 (95{\%} CI 0.13-0.44) for Bard, and 0.25 (95{\%} CI 0.14-0.45) for Gemini Pro; Tongyi Qianwen had an odds ratio of 0.48 (95{\%} CI 0.27-0.87) compared to GPT-4. Performance varied across question types and specialties, with GPT-4 excelling in both lower-order and higher-order questions, while Claude and Bard struggled with complex diagnostic questions. Conclusions: GPT-4 and Tongyi Qianwen show promise in medical education and training.
The study emphasizes the need for domain-specific training datasets to enhance large language models' effectiveness in specialized fields like radiology.", issn="2369-3762", doi="10.2196/64284", url="https://mededu.jmir.org/2025/1/e64284" }