Published in Vol 10 (2024)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/50965.
Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study

Authors of this article:

Annika Meyer; Janik Riese; Thomas Streichert

Journals

  1. Miao J, Thongprayoon C, Suppadungsuk S, Garcia Valencia O, Cheungpasitporn W. Integrating Retrieval-Augmented Generation with Large Language Models in Nephrology: Advancing Practical Applications. Medicina 2024;60(3):445
  2. Lucas F, Mackie I, d'Onofrio G, Frater J. Responsible use of chatbots to advance the laboratory hematology scientific literature: Challenges and opportunities. International Journal of Laboratory Hematology 2024;46(S1):9
  3. Zhu L, Mou W, Hong C, Yang T, Lai Y, Qi C, Lin A, Zhang J, Luo P. The Evaluation of Generative AI Should Include Repetition to Assess Stability. JMIR mHealth and uHealth 2024;12:e57978
  4. Meyer A, Ruthard J, Streichert T. Dear ChatGPT – can you teach me how to program an app for laboratory medicine? Journal of Laboratory Medicine 2024;48(5):197
  5. Kaneda Y, Tayuinosho A, Tomoyose R, Takita M, Hamaki T, Tanimoto T, Ozaki A. Evaluating ChatGPT's effectiveness and tendencies in Japanese internal medicine. Journal of Evaluation in Clinical Practice 2024;30(6):1017
  6. Lee T, Rao A, Campbell D, Radfar N, Dayal M, Khrais A. Evaluating ChatGPT-3.5 and ChatGPT-4.0 Responses on Hyperlipidemia for Patient Education. Cureus 2024
  7. Meyer A, Soleman A, Riese J, Streichert T. Comparison of ChatGPT, Gemini, and Le Chat with physician interpretations of medical laboratory questions from an online health forum. Clinical Chemistry and Laboratory Medicine (CCLM) 2024;62(12):2425
  8. Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, Kiuchi T. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. Journal of Medical Internet Research 2024;26:e60807
  9. Suwała S, Szulc P, Guzowski C, Kamińska B, Dorobiała J, Wojciechowska K, Berska M, Kubicka O, Kosturkiewicz O, Kosztulska B, Rajewska A, Junik R. ChatGPT-3.5 passes Poland’s medical final examination—Is it possible for ChatGPT to become a doctor in Poland? SAGE Open Medicine 2024;12
  10. Nicikowski J, Szczepański M, Miedziaszczyk M, Kudliński B. The potential of ChatGPT in medicine: an example analysis of nephrology specialty exams in Poland. Clinical Kidney Journal 2024;17(8)
  11. Brandtzaeg P, Skjuve M, Følstad A. Understanding model power in social AI. AI & SOCIETY 2025;40(4):2839
  12. Kipp M. From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance. Information 2024;15(9):543
  13. Brin D, Sorin V, Konen E, Nadkarni G, Glicksberg B, Klang E. How GPT models perform on the United States medical licensing examination: a systematic review. Discover Applied Sciences 2024;6(10)
  14. Fan K, Fan K. Dermatological Knowledge and Image Analysis Performance of Large Language Models Based on Specialty Certificate Examination in Dermatology. Dermato 2024;4(4):124
  15. Pillai J, Pillai K. ChatGPT as a medical education resource in cardiology: Mitigating replicability challenges and optimizing model performance. Current Problems in Cardiology 2024;49(12):102879
  16. Omar M, Nadkarni G, Klang E, Glicksberg B, Silva J. Large language models in medicine: A review of current clinical trials across healthcare applications. PLOS Digital Health 2024;3(11):e0000662
  17. Maraqa N, Samargandi R, Poichotte A, Berhouet J, Benhenneda R. Comparing performances of french orthopaedic surgery residents with the artificial intelligence ChatGPT-4/4o in the French diploma exams of orthopaedic and trauma surgery. Orthopaedics & Traumatology: Surgery & Research 2025;111(8):104080
  18. Rhyu K. The Surge of Artificial Intelligence (AI) in Scientific Writing: Who Will Hold the Rudder, You or AI? Hip & Pelvis 2024;36(4):231
  19. Syed S, Ahmed R, Iqbal A, Ahmad N, Alshara M. MediScan: A Framework of U-Health and Prognostic AI Assessment on Medical Imaging. Journal of Imaging 2024;10(12):322
  20. Lukac S, Griewing S, Leinert E, Dayan D, Heitmeir B, Wallwiener M, Janni W, Fink V, Ebner F. ChatGPT, Google, or PINK? Who Provides the Most Reliable Information on Side Effects of Systemic Therapy for Early Breast Cancer? Clinics and Practice 2024;15(1):8
  21. Maraqa N, Samargandi R, Poichotte A, Berhouet J, Benhenneda R. Comparaison des performances des internes français de chirurgie orthopédique et de l’intelligence artificielle ChatGPT-4/4o aux examens du diplôme d’études spécialisées de chirurgie orthopédique et traumatologique. Revue de Chirurgie Orthopédique et Traumatologique 2025
  22. Qiu Y, Liu C. Capable exam-taker and question-generator: the dual role of generative AI in medical education assessment. Global Medical Education 2025
  23. Kim J, Vajravelu B. Assessing the Current Limitations of Large Language Models in Advancing Health Care Education. JMIR Formative Research 2025;9:e51319
  24. Mustuloğlu Ş, Deniz B. Evaluation of Chatbots in the Emergency Management of Avulsion Injuries. Dental Traumatology 2025;41(4):437
  25. Bany Abdelnabi A, Soykan B, Bhatti D, Rabadi G. Usefulness of Large Language Models (LLMs) for Student Feedback on H&P During Clerkship: Artificial Intelligence for Personalized Learning. ACM Transactions on Computing for Healthcare 2025
  26. Gehring D, Titus S, George R. The Perceived Concerns of Nurse Educators' Use of GenAI in Nursing Education: Protocol for a Scoping Review. Health Science Reports 2025;8(2)
  27. Barr A, Quan J, Guo E, Sezgin E. Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data. Frontiers in Artificial Intelligence 2025;8
  28. Meyer A, Wetsch W, Steinbicker A, Streichert T. Through ChatGPT’s Eyes: The Large Language Model’s Stereotypes and what They Reveal About Healthcare. Journal of Medical Systems 2025;49(1)
  29. Altalla’ B, Ahmad A, Bitar L, Al-Bssol M, Al Omari A, Sultan I, Sarkar S. Radiology Report Annotation Using Generative Large Language Models: Comparative Analysis. International Journal of Biomedical Imaging 2025;2025(1)
  30. Fajt B, Schiller E. ChatGPT in Academia: University Students’ Attitudes Towards the use of ChatGPT and Plagiarism. Journal of Academic Ethics 2025;23(3):1363
  31. Hallquist E, Gupta I, Montalbano M, Loukas M. Applications of Artificial Intelligence in Medical Education: A Systematic Review. Cureus 2025
  32. Dobbins N. Generalizable and scalable multistage biomedical concept normalization leveraging large language models. Research Synthesis Methods 2025;16(3):479
  33. Jongkind R, Elings E, Joukes E, Broens T, Leopold H, Wiesman F, Meinema J. Is your curriculum GenAI-proof? A method for GenAI impact assessment and a case study. MedEdPublish 2025;15:11
  34. Kaster L, Hillis E, Oh I, Aravamuthan B, Lanzotti V, Vickstrom C, Wasserstein M, Chopra M, Sahin M, Wangler M, Schultz B, Izumi K, Bergner S, Gropman A, Smith-Hicks C, Abbeduto L, Hazlett H, Doherty D, German K, DaWalt L, Neul J, Constantino J, Baldridge D, Srivastava S, Molholm S, Walkley S, Storch E, Samaco R, Cohen J, Shankar S, Piven J, Mahida S, Sveden A, Dies K, Riggs E, Savatt J, Minor B, Gurnett C, Payne P, Gupta A. Automated extraction of functional biomarkers of verbal and ambulatory ability from multi-institutional clinical notes using large language models. Journal of Neurodevelopmental Disorders 2025;17(1)
  35. Nakaura T, Takamure H, Kobayashi N, Shiraishi K, Yoshida N, Nagayama Y, Uetani H, Kidoh M, Funama Y, Hirai T. Evaluating the Performance of Reasoning Large Language Models on Japanese Radiology Board Examination Questions. Academic Radiology 2025;32(8):4347
  36. Wang L, Li J, Zhuang B, Huang S, Fang M, Wang C, Li W, Zhang M, Gong S. Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis. Journal of Medical Internet Research 2025;27:e64486
  37. Göçer Gürok N, Öztürk S. The Performance of AI in Dermatology Exams: The Exam Success and Limits of ChatGPT. Journal of Cosmetic Dermatology 2025;24(5)
  38. Wu H, Zerner T, Lee D, Court-Kowalski S, Devitt P, Palmer E. GPT-4 versus human authors in clinically complex MCQ creation: A blinded analysis of item quality. Medical Teacher 2025:1
  39. Wu Y, Wu Y, Chang Y, Yu C, Wu C, Sung W, Atoum I. Advancing medical AI: GPT-4 and GPT-4o surpass GPT-3.5 in Taiwanese medical licensing exams. PLOS One 2025;20(6):e0324841
  40. Hirosawa T, Yokose M, Sakamoto T, Harada Y, Tokumasu K, Mizuta K, Shimizu T. Utility of Generative Artificial Intelligence for Japanese Medical Interview Training: Randomized Crossover Pilot Study. JMIR Medical Education 2025;11:e77332
  41. Meyer B, Kfuri‐Rubens R, Schmidt G, Tariq M, Riedel C, Recker F, Riedel F, Kiechle M, Riedel M. Exploring the potential of AI‐powered applications for clinical decision‐making in gynecologic oncology. International Journal of Gynecology & Obstetrics 2025;171(2):698
  42. Amini M, Chang P, Davis R, Nguyen D, Dodge J, Phan J, Buxbaum J, Sahakian A. Comparing ChatGPT3.5 and Bard recommendations for colonoscopy intervals: Bridging the gap in healthcare settings. Endoscopy International Open 2025;13(CP)
  43. Ługowski F, Babińska J, Ludwin A, Stanirowski P. Comparative analysis of ChatGPT 3.5 and ChatGPT 4 obstetric and gynecological knowledge. Scientific Reports 2025;15(1)
  44. Stenseke J. Counter-productivity and suspicion: two arguments against talking about the AGI control problem. Philosophical Studies 2025
  45. Feitosa Filho H, Furtado J, Eulálio E, Ribeiro P, Paiva L, Correia M, Silva Júnior G. ChatGPT performance in answering medical residency questions in nephrology: a pilot study in Brazil. Brazilian Journal of Nephrology 2025;47(4)
  46. Feitosa Filho H, Furtado J, Eulálio E, Ribeiro P, Paiva L, Correia M, Silva Júnior G. Desempenho do ChatGPT na resposta a questões de residência médica em Nefrologia: um estudo piloto no Brasil. Brazilian Journal of Nephrology 2025;47(4)
  47. Mavrych V, Yousef E, Yaqinuddin A, Bolgova O. Large language models in medical education: a comparative cross-platform evaluation in answering histological questions. Medical Education Online 2025;30(1)
  48. George R, Titus S, Gehring D. Nurse Educators' Concerns of GenAI in Education: Scoping Review of Technical Factors. Journal of Nursing Education 2025;64(8):503
  49. Nakaura T, Uetani H, Yoshida N, Kobayashi N, Nagayama Y, Kidoh M, Kuroda J, Mukasa A, Hirai T. Intra-axial primary brain tumor differentiation: comparing large language models on structured MRI reports vs. radiologists on images. European Radiology 2025
  50. Polat M, Odabaşı O. Scientific Creativity of Artificial Intelligence: Evaluation of Novel Research Ideas in Oral and Maxillofacial Surgery. HRU International Journal of Dentistry and Oral Research 2025;5(2):94
  51. Jaleel A, Aziz U, Farid G, Zahid Bashir M, Mirza T, Khizar Abbas S, Aslam S, Sikander R. Evaluating the Potential and Accuracy of ChatGPT-3.5 and 4.0 in Medical Licensing and In-Training Examinations: Systematic Review and Meta-Analysis. JMIR Medical Education 2025;11:e68070
  52. Saowaprut P, Wabina R, Yang J, Siriwat L. Performance of large language models on Thailand’s national medical licensing examination: a cross-sectional study. Journal of Educational Evaluation for Health Professions 2025;22:16
  53. Zubaer A, Granitzer M, Geschwind S, Graf Lambsdorff J, Voss D. GPT-4 shows comparable performance to human examiners in ranking open-text answers. Scientific Reports 2025;15(1)
  54. Fan K, Gan J, Zou I, Kaladjiska M, Inguanez M, Garden G. Poor Performance of Large Language Models Based on the Diabetes and Endocrinology Specialty Certificate Examination of the United Kingdom. Cureus 2025
  55. Gaddis G. Artificial Intelligence and the practice of emergency medicine. Emergency Medical Service 2025;12(3):121
  56. Kasagga A, Sapkota A, Changaramkumarath G, Abucha J, Wollel M, Somannagari N, Husami M, Hailu K, Kasagga E. Performance of ChatGPT and Large Language Models on Medical Licensing Exams Worldwide: A Systematic Review and Network Meta-Analysis With Meta-Regression. Cureus 2025
  57. Nakaura T, Kobayashi N, Shiraishi K, Yoshida N, Nagayama Y, Uetani H, Kidoh M, Oda S, Funama Y, Hirai T. Large Language Model Cost and Performance: A Comprehensive Analysis in the Context of the Japan Radiology Board Examination. Journal of Computer Assisted Tomography 2025
  58. Chen Y, Wen B, Zulkernine F. A Multi-agent Summarization and Auto-evaluation (MASA) Framework for Medical Text: Development and Evaluation Study (Preprint). JMIR AI 2025
  59. Simoni J, Urtubia-Fernandez J, Mengual E, Simoni D, Royo M, Egaña-Yin D, Hertog O, López-Ortiz L, Muñoz-Tomás A, Santiago-Martínez P, Vahamaki A, Pereira J. Artificial intelligence in undergraduate medical education: an updated scoping review. BMC Medical Education 2025;25(1)
  60. Punnen T, Shan K, Patel M, McCreary M, Tran D, Santoyo J, Burgess K, Moog T, Smith A, Okuda D. Diagnostic accuracy and bias in open access and subscription-based large language models for multiple sclerosis and neuromyelitis optica spectrum disorder. Intelligence-Based Medicine 2025;12:100314
  61. Li X, Li G, Zhao Y, Liang Y, Dong Y, Zhang J. Exploring and Comparing the Use of Large Language Models in Supporting Osteoporosis Health Consultations. Clinical Interventions in Aging 2025;20:2133
  62. Inojosa H, Ramezanzadeh A, Gasparovic-Curtini I, Wiest I, Kather J, Gilbert S, Ziemssen T. Education Research: Can Large Language Models Match MS Specialist Training? Neurology Education 2025;4(4)