Variability in Large Language Models’ Responses to Medical Licensing and Certification Examinations. Comment on “How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment”

doi:10.2196/48305

Published on 13.Jul.2023 in Vol 9 (2023)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/48305, first published 18.Apr.2023.


Richard H Epstein¹; Franklin Dexter²

Cited by (18)

Journals

Velásquez-Henao J, Franco-Cardona C, Cadavid-Higuita L. Prompt Engineering: a methodology for optimizing interactions with AI-Language Models in the field of engineering. DYNA 2023;90(230):9
Gilson A, Safranek C, Huang T, Socrates V, Chi L, Taylor R, Chartash D. Authors’ Reply to: Variability in Large Language Models’ Responses to Medical Licensing and Certification Examinations. JMIR Medical Education 2023;9:e50336
Guillen-Grima F, Guillen-Aguinaga S, Guillen-Aguinaga L, Alas-Brun R, Onambele L, Ortega W, Montejo R, Aguinaga-Ontoso E, Barach P, Aguinaga-Ontoso I. Evaluating the Efficacy of ChatGPT in Navigating the Spanish Medical Residency Entrance Examination (MIR): Promising Horizons for AI in Clinical Medicine. Clinics and Practice 2023;13(6):1460
Meyer A, Riese J, Streichert T. Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study. JMIR Medical Education 2024;10:e50965
Gordon M, Daniel M, Ajiboye A, Uraiby H, Xu N, Bartlett R, Hanson J, Haas M, Spadafore M, Grafton-Clarke C, Gasiea R, Michie C, Corral J, Kwan B, Dolmans D, Thammasitboon S. A scoping review of artificial intelligence in medical education: BEME Guide No. 84. Medical Teacher 2024;46(4):446
Duggan R, Tsuruda K. ChatGPT performance on radiation technologist and therapist entry to practice exams. Journal of Medical Imaging and Radiation Sciences 2024;55(4):101426
Pohl N, Derector E, Rivlin M, Bachoura A, Tosti R, Kachooei A, Beredjiklian P, Fletcher D. A quality and readability comparison of artificial intelligence and popular health website education materials for common hand surgery procedures. Hand Surgery and Rehabilitation 2024;43(3):101723
Niset A, El Hadwe S, Englebert A, Barrit S. AI in emergency medicine: Building literacy or castles in the air. The American Journal of Emergency Medicine 2025;87:145
Chen C, Bilolikar V, VanNest D, Raphael J, Shaffer G. Artificial intelligence in orthopaedic education: A comparative analysis of ChatGPT and Bing AI's Orthopaedic In‐Training Examination performance. Medicine Advances 2024;2(3):284
Aster A, Laupichler M, Rockwell-Kollmann T, Masala G, Bala E, Raupach T. ChatGPT and Other Large Language Models in Medical Education — Scoping Literature Review. Medical Science Educator 2024;35(1):555
Kanzawa J, Kurokawa R, Kaiume M, Nakamura Y, Kurokawa M, Sonoda Y, Gonoi W, Abe O. Evaluating the Role of GPT-4 and GPT-4o in the Detectability of Chest Radiography Reports Requiring Further Assessment. Cureus 2024
Buhl L. The answer may vary: large language model response patterns challenge their use in test item analysis. Medical Teacher 2025;47(11):1761
Shaikh Y, Jeelani-Shaikh Z, Jeelani M, Javaid A, Mahmud T, Gaglani S, Gibbons M, Cheema M, Cross A, Livingston D, Cheatham M, Nezami E, Dixon R, Niranjan-Azadi A, Zafar S, Siddiqui Z, Villanueva C. Collaborative intelligence in AI: Evaluating the performance of a council of AIs on the USMLE. PLOS Digital Health 2025;4(10):e0000787
Kopka M, Feufel M. Increasing Large Language Model Accuracy for Care-Seeking Advice Using Prompts Reflecting Human Reasoning Strategies in the Real World: Validation Study. JMIR Biomedical Engineering 2026;11:e88053
Zhan X, Yu W, Cai J, Chen J, Amankwaa I. From knowledge to judgment: A three-year longitudinal analysis of artificial intelligence large language model performance on the Chinese national nurse licensing examination. PLOS One 2026;21(7):e0353059

Books/Policy Documents

Burbano G. D, Ibarra C. J. Telematics and Computing.

Conference Proceedings

Hutt S, DePiro A, Wang J, Rhodes S, Baker R, Hieb G, Sethuraman S, Ocumpaugh J, Mills C. Proceedings of the 14th Learning Analytics and Knowledge Conference. Feedback on Feedback: Comparing Classic Natural Language Processing and Generative AI to Evaluate Peer Feedback
Sallou J, Durieux T, Panichella A. Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results. Breaking the Silence: the Threats of Using LLMs in Software Engineering

