Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study

doi:10.2196/50965

Journals

Miao J, Thongprayoon C, Suppadungsuk S, Garcia Valencia O, Cheungpasitporn W. Integrating Retrieval-Augmented Generation with Large Language Models in Nephrology: Advancing Practical Applications. Medicina 2024;60(3):445 View
Lucas F, Mackie I, d'Onofrio G, Frater J. Responsible use of chatbots to advance the laboratory hematology scientific literature: Challenges and opportunities. International Journal of Laboratory Hematology 2024;46(S1):9 View
Zhu L, Mou W, Hong C, Yang T, Lai Y, Qi C, Lin A, Zhang J, Luo P. The Evaluation of Generative AI Should Include Repetition to Assess Stability. JMIR mHealth and uHealth 2024;12:e57978 View
Meyer A, Ruthard J, Streichert T. Dear ChatGPT – can you teach me how to program an app for laboratory medicine?. Journal of Laboratory Medicine 2024;48(5):197 View
Kaneda Y, Tayuinosho A, Tomoyose R, Takita M, Hamaki T, Tanimoto T, Ozaki A. Evaluating ChatGPT's effectiveness and tendencies in Japanese internal medicine. Journal of Evaluation in Clinical Practice 2024;30(6):1017 View
Lee T, Rao A, Campbell D, Radfar N, Dayal M, Khrais A. Evaluating ChatGPT-3.5 and ChatGPT-4.0 Responses on Hyperlipidemia for Patient Education. Cureus 2024 View
Meyer A, Soleman A, Riese J, Streichert T. Comparison of ChatGPT, Gemini, and Le Chat with physician interpretations of medical laboratory questions from an online health forum. Clinical Chemistry and Laboratory Medicine (CCLM) 2024;62(12):2425 View
Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, Kiuchi T. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. Journal of Medical Internet Research 2024;26:e60807 View
Suwała S, Szulc P, Guzowski C, Kamińska B, Dorobiała J, Wojciechowska K, Berska M, Kubicka O, Kosturkiewicz O, Kosztulska B, Rajewska A, Junik R. ChatGPT-3.5 passes Poland’s medical final examination—Is it possible for ChatGPT to become a doctor in Poland?. SAGE Open Medicine 2024;12 View
Nicikowski J, Szczepański M, Miedziaszczyk M, Kudliński B. The potential of ChatGPT in medicine: an example analysis of nephrology specialty exams in Poland. Clinical Kidney Journal 2024;17(8) View
Brandtzaeg P, Skjuve M, Følstad A. Understanding model power in social AI. AI & SOCIETY 2025;40(4):2839 View
Kipp M. From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance. Information 2024;15(9):543 View
Brin D, Sorin V, Konen E, Nadkarni G, Glicksberg B, Klang E. How GPT models perform on the United States medical licensing examination: a systematic review. Discover Applied Sciences 2024;6(10) View
Fan K, Fan K. Dermatological Knowledge and Image Analysis Performance of Large Language Models Based on Specialty Certificate Examination in Dermatology. Dermato 2024;4(4):124 View
Pillai J, Pillai K. ChatGPT as a medical education resource in cardiology: Mitigating replicability challenges and optimizing model performance. Current Problems in Cardiology 2024;49(12):102879 View
Omar M, Nadkarni G, Klang E, Glicksberg B, Silva J. Large language models in medicine: A review of current clinical trials across healthcare applications. PLOS Digital Health 2024;3(11):e0000662 View
Maraqa N, Samargandi R, Poichotte A, Berhouet J, Benhenneda R. Comparing performances of french orthopaedic surgery residents with the artificial intelligence ChatGPT-4/4o in the French diploma exams of orthopaedic and trauma surgery. Orthopaedics & Traumatology: Surgery & Research 2025;111(8):104080 View
Rhyu K. The Surge of Artificial Intelligence (AI) in Scientific Writing: Who Will Hold the Rudder, You or AI?. Hip & Pelvis 2024;36(4):231 View
Syed S, Ahmed R, Iqbal A, Ahmad N, Alshara M. MediScan: A Framework of U-Health and Prognostic AI Assessment on Medical Imaging. Journal of Imaging 2024;10(12):322 View
Lukac S, Griewing S, Leinert E, Dayan D, Heitmeir B, Wallwiener M, Janni W, Fink V, Ebner F. ChatGPT, Google, or PINK? Who Provides the Most Reliable Information on Side Effects of Systemic Therapy for Early Breast Cancer?. Clinics and Practice 2024;15(1):8 View
Maraqa N, Samargandi R, Poichotte A, Berhouet J, Benhenneda R. Comparaison des performances des internes français de chirurgie orthopédique et de l’intelligence artificielle ChatGPT-4/4o aux examens du diplôme d’études spécialisées de chirurgie orthopédique et traumatologique. Revue de Chirurgie Orthopédique et Traumatologique 2025 View
Qiu Y, Liu C. Capable exam-taker and question-generator: the dual role of generative AI in medical education assessment. Global Medical Education 2025;2(1):135 View
Kim J, Vajravelu B. Assessing the Current Limitations of Large Language Models in Advancing Health Care Education. JMIR Formative Research 2025;9:e51319 View
Mustuloğlu Ş, Deniz B. Evaluation of Chatbots in the Emergency Management of Avulsion Injuries. Dental Traumatology 2025;41(4):437 View
Bany Abdelnabi A, Soykan B, Bhatti D, Rabadi G. Usefulness of Large Language Models (LLMs) for Student Feedback on H&P During Clerkship: Artificial Intelligence for Personalized Learning. ACM Transactions on Computing for Healthcare 2025 View
Gehring D, Titus S, George R. The Perceived Concerns of Nurse Educators' Use of GenAI in Nursing Education: Protocol for a Scoping Review. Health Science Reports 2025;8(2) View
Barr A, Quan J, Guo E, Sezgin E. Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data. Frontiers in Artificial Intelligence 2025;8 View
Meyer A, Wetsch W, Steinbicker A, Streichert T. Through ChatGPT’s Eyes: The Large Language Model’s Stereotypes and what They Reveal About Healthcare. Journal of Medical Systems 2025;49(1) View
Altalla’ B, Ahmad A, Bitar L, Al-Bssol M, Al Omari A, Sultan I, Sarkar S. Radiology Report Annotation Using Generative Large Language Models: Comparative Analysis. International Journal of Biomedical Imaging 2025;2025(1) View
Fajt B, Schiller E. ChatGPT in Academia: University Students’ Attitudes Towards the use of ChatGPT and Plagiarism. Journal of Academic Ethics 2025;23(3):1363 View
Hallquist E, Gupta I, Montalbano M, Loukas M. Applications of Artificial Intelligence in Medical Education: A Systematic Review. Cureus 2025 View
Dobbins N. Generalizable and scalable multistage biomedical concept normalization leveraging large language models. Research Synthesis Methods 2025;16(3):479 View
Jongkind R, Elings E, Joukes E, Broens T, Leopold H, Wiesman F, Meinema J. Is your curriculum GenAI-proof? A method for GenAI impact assessment and a case study. MedEdPublish 2025;15:11 View
Kaster L, Hillis E, Oh I, Aravamuthan B, Lanzotti V, Vickstrom C, Wasserstein M, Chopra M, Sahin M, Wangler M, Schultz B, Izumi K, Bergner S, Gropman A, Smith-Hicks C, Abbeduto L, Hazlett H, Doherty D, German K, DaWalt L, Neul J, Constantino J, Baldridge D, Srivastava S, Molholm S, Walkley S, Storch E, Samaco R, Cohen J, Shankar S, Piven J, Mahida S, Sveden A, Dies K, Riggs E, Savatt J, Minor B, Gurnett C, Payne P, Gupta A. Automated extraction of functional biomarkers of verbal and ambulatory ability from multi-institutional clinical notes using large language models. Journal of Neurodevelopmental Disorders 2025;17(1) View
Nakaura T, Takamure H, Kobayashi N, Shiraishi K, Yoshida N, Nagayama Y, Uetani H, Kidoh M, Funama Y, Hirai T. Evaluating the Performance of Reasoning Large Language Models on Japanese Radiology Board Examination Questions. Academic Radiology 2025;32(8):4347 View
Wang L, Li J, Zhuang B, Huang S, Fang M, Wang C, Li W, Zhang M, Gong S. Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis. Journal of Medical Internet Research 2025;27:e64486 View
Göçer Gürok N, Öztürk S. The Performance of AI in Dermatology Exams: The Exam Success and Limits of ChatGPT. Journal of Cosmetic Dermatology 2025;24(5) View
Wu H, Zerner T, Lee D, Court-Kowalski S, Devitt P, Palmer E. GPT-4 versus human authors in clinically complex MCQ creation: A blinded analysis of item quality. Medical Teacher 2025;47(12):1961 View
Wu Y, Wu Y, Chang Y, Yu C, Wu C, Sung W, Atoum I. Advancing medical AI: GPT-4 and GPT-4o surpass GPT-3.5 in Taiwanese medical licensing exams. PLOS One 2025;20(6):e0324841 View
Hirosawa T, Yokose M, Sakamoto T, Harada Y, Tokumasu K, Mizuta K, Shimizu T. Utility of Generative Artificial Intelligence for Japanese Medical Interview Training: Randomized Crossover Pilot Study. JMIR Medical Education 2025;11:e77332 View
Meyer B, Kfuri‐Rubens R, Schmidt G, Tariq M, Riedel C, Recker F, Riedel F, Kiechle M, Riedel M. Exploring the potential of AI‐powered applications for clinical decision‐making in gynecologic oncology. International Journal of Gynecology & Obstetrics 2025;171(2):698 View
Amini M, Chang P, Davis R, Nguyen D, Dodge J, Phan J, Buxbaum J, Sahakian A. Comparing ChatGPT3.5 and Bard recommendations for colonoscopy intervals: Bridging the gap in healthcare settings. Endoscopy International Open 2025;13(CP) View
Ługowski F, Babińska J, Ludwin A, Stanirowski P. Comparative analysis of ChatGPT 3.5 and ChatGPT 4 obstetric and gynecological knowledge. Scientific Reports 2025;15(1) View
Stenseke J. Counter-productivity and suspicion: two arguments against talking about the AGI control problem. Philosophical Studies 2025 View
Feitosa Filho H, Furtado J, Eulálio E, Ribeiro P, Paiva L, Correia M, Silva Júnior G. ChatGPT performance in answering medical residency questions in nephrology: a pilot study in Brazil. Brazilian Journal of Nephrology 2025;47(4) View
Feitosa Filho H, Furtado J, Eulálio E, Ribeiro P, Paiva L, Correia M, Silva Júnior G. Desempenho do ChatGPT na resposta a questões de residência médica em Nefrologia: um estudo piloto no Brasil. Brazilian Journal of Nephrology 2025;47(4) View
Mavrych V, Yousef E, Yaqinuddin A, Bolgova O. Large language models in medical education: a comparative cross-platform evaluation in answering histological questions. Medical Education Online 2025;30(1) View
George R, Titus S, Gehring D. Nurse Educators' Concerns of GenAI in Education: Scoping Review of Technical Factors. Journal of Nursing Education 2025;64(8):503 View
Nakaura T, Uetani H, Yoshida N, Kobayashi N, Nagayama Y, Kidoh M, Kuroda J, Mukasa A, Hirai T. Intra-axial primary brain tumor differentiation: comparing large language models on structured MRI reports vs. radiologists on images. European Radiology 2025 View
Polat M, Odabaşı O. Scientific Creativity of Artificial Intelligence: Evaluation of Novel Research Ideas in Oral and Maxillofacial Surgery. HRU International Journal of Dentistry and Oral Research 2025;5(2):94 View
Jaleel A, Aziz U, Farid G, Zahid Bashir M, Mirza T, Khizar Abbas S, Aslam S, Sikander R. Evaluating the Potential and Accuracy of ChatGPT-3.5 and 4.0 in Medical Licensing and In-Training Examinations: Systematic Review and Meta-Analysis. JMIR Medical Education 2025;11:e68070 View
Saowaprut P, Wabina R, Yang J, Siriwat L. Performance of large language models on Thailand’s national medical licensing examination: a cross-sectional study. Journal of Educational Evaluation for Health Professions 2025;22:16 View
Zubaer A, Granitzer M, Geschwind S, Graf Lambsdorff J, Voss D. GPT-4 shows comparable performance to human examiners in ranking open-text answers. Scientific Reports 2025;15(1) View
Fan K, Gan J, Zou I, Kaladjiska M, Inguanez M, Garden G. Poor Performance of Large Language Models Based on the Diabetes and Endocrinology Specialty Certificate Examination of the United Kingdom. Cureus 2025 View
Gaddis G. Artificial Intelligence and the practice of emergency medicine. Emergency Medical Service 2025;12(3):121 View
Kasagga A, Sapkota A, Changaramkumarath G, Abucha J, Wollel M, Somannagari N, Husami M, Hailu K, Kasagga E. Performance of ChatGPT and Large Language Models on Medical Licensing Exams Worldwide: A Systematic Review and Network Meta-Analysis With Meta-Regression. Cureus 2025 View
Nakaura T, Kobayashi N, Shiraishi K, Yoshida N, Nagayama Y, Uetani H, Kidoh M, Oda S, Funama Y, Hirai T. Large Language Model Cost and Performance: A Comprehensive Analysis in the Context of the Japan Radiology Board Examination. Journal of Computer Assisted Tomography 2025 View
Chen Y, Wen B, Zulkernine F. A Multiagent Summarization and Auto-Evaluation Framework for Medical Text: Development and Evaluation Study. JMIR AI 2025;4:e75932 View
Simoni J, Urtubia-Fernandez J, Mengual E, Simoni D, Royo M, Egaña-Yin D, Hertog O, López-Ortiz L, Muñoz-Tomás A, Santiago-Martínez P, Vahamaki A, Pereira J. Artificial intelligence in undergraduate medical education: an updated scoping review. BMC Medical Education 2025;25(1) View
Punnen T, Shan K, Patel M, McCreary M, Tran D, Santoyo J, Burgess K, Moog T, Smith A, Okuda D. Diagnostic accuracy and bias in open access and subscription-based large language models for multiple sclerosis and neuromyelitis optica spectrum disorder. Intelligence-Based Medicine 2025;12:100314 View
Li X, Li G, Zhao Y, Liang Y, Dong Y, Zhang J. Exploring and Comparing the Use of Large Language Models in Supporting Osteoporosis Health Consultations. Clinical Interventions in Aging 2025;Volume 20:2133 View
Inojosa H, Ramezanzadeh A, Gasparovic-Curtini I, Wiest I, Kather J, Gilbert S, Ziemssen T. Education Research: Can Large Language Models Match MS Specialist Training?. Neurology Education 2025;4(4) View
Santana Rizzi J, Silva Requena L, Bicudo A, Hamamoto Filho P, Ferretti R. Chatbot Underperformance in Biology and Image-Based Questions in Medical Education. Journal of CME 2025;14(1) View
Meyer A, Schömig E, Streichert T. ChatGPT and reference intervals: a comparative analysis of repeatability in GPT-3.5 Turbo, GPT-4, and GPT-4o. Frontiers in Artificial Intelligence 2025;8 View
Böke A, Hacker H, Chakraborty M, Baumeister-Lingens L, Vöckel J, Koenig J, Vogel D, Lichtenstein T, Vogeley K, Kambeitz-Ilankovic L, Kambeitz J. Observer-Independent Assessment of Content Overlap in Mental Health Questionnaires: Large Language Model–Based Study. JMIR AI 2025;4:e79868 View
Mateen A, Kumar V, Singh A, Yadav B, Mahto M, Hassan A, Nasir N. Impact of generative AI in medical education in India: a systematic review. Frontiers in Artificial Intelligence 2025;8 View
Meyer A, Karay Y, Steinbicker A, Streichert T, Overbeek R. Performance of DeepSeek-R1, ChatGPT (GPT-o3-mini), and Gemini 2.0 Flash on German Medical Multiple-Choice Questions: Comparative Evaluation. JMIR Formative Research 2025;9:e77357 View
Ignjatović A, Apostolović M, Stevanović L, Radovanović P, Sidharth, Topalović M, Filipović T. Exploring Medical Students’ Perceptions Regarding ChatGPT and AI Studying at the University of Niš: A Study on Usage, Attitudes, and Linguistic Influence—Single-Centered Study in Serbia—A Paradoxical Ally?. Journal of Medical Education and Curricular Development 2025;12 View
Wang B, Zhang M, Wang Z, Yao K, Hao M, Wang J, Peng S, Zhu Y. Supporting postgraduate exam preparation with large language models: implications for traditional Chinese medicine education. Frontiers in Medicine 2026;12 View
Benito P, Isla-Jover M, González-Castro P, Fernández Esparcia P, Carpio M, Blay-Simón I, Gutiérrez-Bedia P, Lapastora M, Carratalá B, Carazo-Casas C. GPT-4o and OpenAI o1 Performance on the 2024 Spanish Competitive Medical Specialty Access Examination: Cross-Sectional Quantitative Evaluation Study. JMIR Medical Education 2026;12:e75452 View
Dundar Sari M, Sezer B. Comparative performance evaluation of ChatGPT-4 Omni and Gemini Advanced in the Turkish Dentistry Specialization Exam. BMC Medical Education 2026;26(1) View
Lian L, Luo X, Chipusu K, Ashraf M, Wong K, Zhang W. Large Language Models Evaluation of Medical Licensing Examination Using GPT-4.0, ERNIE Bot 4.0, and GPT-4o. Bioengineering 2026;13(1):113 View
He Q, Tan Z, Niu W, Chen D, Zhang X, Qin F, Yuan J. From algorithms to operating room: can large language models master China’s attending anesthesiology exam? A cross-sectional evaluation. International Journal of Surgery 2026;112(1):190 View
Olszewski R, Brzeziński J, Watros K, Rysz J. Quantifying Readability in Chatbot-Generated Medical Texts Using Classical Linguistic Indices: A Review. Applied Sciences 2026;16(3):1423 View
Stelling H, Kraus A, Grieb G, Güler I. Performance of Large Language Models on Cognitive Aptitude Testing: A Multi-Run Evaluation on the German Medical School Admission Test (TMS). European Journal of Investigation in Health, Psychology and Education 2026;16(2):23 View
Kako T, Kato D, Iguchi T, Qin S, Ando M, Koseki S, Shibahara H, Motoi H, Isaka R, Ikeda N, Toyoda H, Nakagawa T. Performance evaluation of generative pre-trained transformer on the National Veterinary Licensing Examination in Japan. Scientific Reports 2026;16(1) View
