ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis

Citation

Please cite as:

Bicknell BT, Butler D, Whalen S, Ricks J, Dixon CJ, Clark AB, Spaedy O, Skelton A, Edupuganti N, Dzubinski L, Tate H, Dyess G, Lindeman B, Lehmann LS
ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis
JMIR Med Educ 2024;10:e63430
doi: 10.2196/63430 PMID: 39504445 PMCID: 11611793

This paper is in the following e-collection/theme issue:

Artificial Intelligence (AI) in Medical Education
Testing and Assessment in Medical Education
New Methods and Approaches in Medical Education
Quality of Medical Educational and Instructional Material
Generative Language Models Including ChatGPT
