Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis

Funk P, Hoch C, Knoedler S, Knoedler L, Cotofana S, Sofo G, Bashiri Dezfouli A, Wollenberg B, Guntinas-Lichius O, Alfertshofer M. ChatGPT’s Response Consistency: A Study on Repeated Queries of Medical Examination Questions. European Journal of Investigation in Health, Psychology and Education 2024;14(3):657
Knoedler L, Knoedler S, Hoch C, Prantl L, Frank K, Soiderer L, Cotofana S, Dorafshar A, Schenck T, Vollbach F, Sofo G, Alfertshofer M. In-depth analysis of ChatGPT’s performance based on specific signaling words and phrases in the question stem of 2377 USMLE step 1 style questions. Scientific Reports 2024;14(1)
Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, Kiuchi T. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. Journal of Medical Internet Research 2024;26:e60807
Hibino M, Gillinov M. “Pseudo” Intelligence or Misguided or Mis-sourced Intelligence? The Annals of Thoracic Surgery 2024;118(1):281
Hirosawa T, Harada Y, Mizuta K, Sakamoto T, Tokumasu K, Shimizu T. Diagnostic performance of generative artificial intelligences for a series of complex case reports. DIGITAL HEALTH 2024;10
Davis N, El-Said E, Fortune P, Shen A, Succi M. Transforming Health Care Landscapes: The Lever of Radiology Research and Innovation on Emerging Markets Poised for Aggressive Growth. Journal of the American College of Radiology 2024;21(10):1552
Jin H, Lee H, Kim E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC Medical Education 2024;24(1)
Alfertshofer M, Knoedler S, Hoch C, Cotofana S, Panayi A, Kauke-Navarro M, Tullius S, Orgill D, Austen W, Pomahac B, Knoedler L. Analyzing Question Characteristics Influencing ChatGPT’s Performance in 3000 USMLE®-Style Questions. Medical Science Educator 2024;35(1):257
Künzle P, Paris S. Performance of large language artificial intelligence models on solving restorative dentistry and endodontics student assessments. Clinical Oral Investigations 2024;28(11)
Ramgopal S, Varma S, Gorski J, Kester K, Shieh A, Suresh S. Evaluation of a Large Language Model on the American Academy of Pediatrics' PREP Emergency Medicine Question Bank. Pediatric Emergency Care 2024;40(12):871
Cooperman S, Brandão R. Integrating domain-specific resources: Advancing AI for foot and ankle surgery. Foot & Ankle Surgery: Techniques, Reports & Cases 2025;5(1):100445
Hofmann H, Vairavamurthy J. Large language model doctor: assessing the ability of ChatGPT-4 to deliver interventional radiology procedural information to patients during the consent process. CVIR Endovascular 2024;7(1)
Jin H, Kim E. Performance of GPT-3.5 and GPT-4 on the Korean Pharmacist Licensing Examination: Comparison Study. JMIR Medical Education 2024;10:e57451
Avidan Y, Tabachnikov V, Court O, Khoury R, Aker A. In the face of confounders: Atrial fibrillation detection – Practitioners vs. ChatGPT. Journal of Electrocardiology 2025;88:153851
Zong H, Wu R, Cha J, Wang J, Wu E, Li J, Zhou Y, Zhang C, Feng W, Shen B. Large Language Models in Worldwide Medical Exams: Platform Development and Comprehensive Analysis. Journal of Medical Internet Research 2024;26:e66114
Schnapp B, Sehdev M, Schrepel C, Bord S, Pelletier‐Bui A, Alvarez A, Dubosh N, Park Y, Shappell E. ChatG‐PD? Comparing large language model artificial intelligence and faculty rankings of the competitiveness of standardized letters of evaluation. AEM Education and Training 2024;8(6)
Kuerbanjiang W, Peng S, Jiamaliding Y, Yi Y. Performance Evaluation of Large Language Models in Cervical Cancer Management Based on a Standardized Questionnaire: Comparative Study. Journal of Medical Internet Research 2025;27:e63626
Smollin K, Smollin C. Will Artificial Intelligence Replace the Medical Toxicologist: Pediatric Referral Thresholds Generated by GPT-4. Journal of Medical Toxicology 2025;21(1):85
Ferraz-Costa G, Griné M, Oliveira-Santos M, Teixeira R. Performance of ChatGPT in the Portuguese National Residency Access Examination. Acta Médica Portuguesa 2024;38(3):170
Penny P, Bane R, Riddle V. Advancements in AI Medical Education: Assessing ChatGPT’s Performance on USMLE-Style Questions Across Topics and Difficulty Levels. Cureus 2024
Hallquist E, Gupta I, Montalbano M, Loukas M. Applications of Artificial Intelligence in Medical Education: A Systematic Review. Cureus 2025
Avidan Y, Naoum I, Khoury R, Zahra S, Dov N, Schliamser J, Danon A, Aker A. Can ChatGPT accurately detect atrial fibrillation using smartwatch ECG? Heart & Lung 2025;73:90
Wang L, Li J, Zhuang B, Huang S, Fang M, Wang C, Li W, Zhang M, Gong S. Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis. Journal of Medical Internet Research 2025;27:e64486
Guo F, Li T, Cunningham C. One year in the classroom with ChatGPT: empirical insights and transformative impacts. Frontiers in Education 2025;10
Wu H, Zerner T, Lee D, Court-Kowalski S, Devitt P, Palmer E. GPT-4 versus human authors in clinically complex MCQ creation: A blinded analysis of item quality. Medical Teacher 2025;47(12):1961
Zhang Y, Xie X, Xu Q. ChatGPT in Medical Education: Bibliometric and Visual Analysis. JMIR Medical Education 2025;11:e72356
Jaleel A, Aziz U, Farid G, Zahid Bashir M, Mirza T, Khizar Abbas S, Aslam S, Sikander R. Evaluating the Potential and Accuracy of ChatGPT-3.5 and 4.0 in Medical Licensing and In-Training Examinations: Systematic Review and Meta-Analysis. JMIR Medical Education 2025;11:e68070
Lin Y, Luo Z, Ye Z, Zhong N, Zhao L, Zhang L, Li X, Chen Z, Chen Y. Applications, Challenges, and Prospects of Generative Artificial Intelligence Empowering Medical Education: Scoping Review. JMIR Medical Education 2025;11:e71125
Aydinalp M, Doğan B, Bal A. GDD Generation for Hyper-Casual Games Using Large Language Models: A Comparative Evaluation. Bitlis Eren Üniversitesi Fen Bilimleri Dergisi 2025;14(3):1469
Brochu B, Cobler-Lichter M, Arcieri T, Shah N, Delamater J, Reyes A, Sussman M, Lineen E, Sands L, Hui V, Rodgers S, Thorson C. Potential and pitfalls: accuracy versus adequacy of ChatGPT’s performance on surgery shelf examination. Global Surgical Education - Journal of the Association for Surgical Education 2025;5(1)
Rios-Garcia W, Silva-Jiménez S, Gálvez-Rodríguez E, Alberca-Naira Y, Via-y-Rada-Torres A, Rios-Garcia A. Assessment of ChatGPT-5 as an Artificial Intelligence Tool for Exploring Emerging Dimensions of Clinical Simulation: A Proof-of-concept Study. Journal of Medical Systems 2026;50(1)
Benito P, Isla-Jover M, González-Castro P, Fernández Esparcia P, Carpio M, Blay-Simón I, Gutiérrez-Bedia P, Lapastora M, Carratalá B, Carazo-Casas C. GPT-4o and OpenAI o1 Performance on the 2024 Spanish Competitive Medical Specialty Access Examination: Cross-Sectional Quantitative Evaluation Study. JMIR Medical Education 2026;12:e75452
Foster A, Price N, Brown V, Reed S. Artificial Intelligence in Health Professional Licensing: Performance of ChatGPT-3.5 and GPT-4; Systematic Review and Meta-Analysis. Annals of Pharmacy Education, Safety, and Public Health Advocacy 2022;2(1):176
Liu M, Okuhara T, Dai Z, Zhao M, Yin W, Okada H, Furukawa E, Kiuchi T. Textbook-level medical knowledge in large language models: comparative evaluation using Japanese National Medical Examination. BMC Medical Informatics and Decision Making 2026;26(1)
Bokan R, Malhotra R, Vatsa M, Bisht K, Singla M, Choudhary R. Artificial Intelligence in Undergraduate Medical Education: A Cross-Sectional Study of Utilization Patterns and Perceptions Among Medical Students. Cureus 2026
Li J, Li Y, Yan J, Yan L, Li Q. Artificial Intelligence in Non-Surgical Cosmetic Procedures: A Multi-Stakeholder Revolution. Aesthetic Plastic Surgery 2026;50(11):4413
Kurt Ş, Bahadırlı S. Performance Evaluation of Large Language Models in Emergency Medicine Specialty Examination Questions: A Cross-Sectional Study. Istanbul Medical Journal 2026
Güler I, Grieb G, Kraus A, Moog P, Cambaz U, Yavasca E, Stelling H. Artificial Intelligence in Medical Assessment: Reliability and Performance of Multimodal Large Language Models in a High-Stakes Licensing Examination. Behavioral Sciences 2026;16(5):822

Conference Proceedings

Arslanoğlu K, Karaköse M. 2025 29th International Conference on Information Technology (IT). A Trustworthy Analysis Approach for Chatbots on Health Data: ChatGPT-4 Example
