Performance Comparison of ChatGPT-4 and Japanese Medical Residents in the General Medicine In-Training Examination: Comparison Study

doi:10.2196/52202

Published on 06.Dec.2023 in Vol 9 (2023)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/52202, first published 25.Aug.2023.

Takashi Watari¹,²,³; Soshi Takagi⁴; Kota Sakaguchi¹; Yuji Nishizaki⁵; Taro Shimizu⁶; Yu Yamamoto⁷; Yasuharu Tokuda⁸

Cited by (45)

Journals

Noda M, Ueno T, Koshu R, Takaso Y, Shimada M, Saito C, Sugimoto H, Fushiki H, Ito M, Nomura A, Yoshizaki T. Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study. JMIR Medical Education 2024;10:e57054
Gravina A, Pellegrino R, Palladino G, Imperio G, Ventura A, Federico A. Charting new AI education in gastroenterology: Cross-sectional evaluation of ChatGPT and perplexity AI in medical residency exam. Digestive and Liver Disease 2024;56(8):1304
Wang S, Mo C, Chen Y, Dai X, Wang H, Shen X. Exploring the Performance of ChatGPT-4 in the Taiwan Audiologist Qualification Examination: Preliminary Observational Study Highlighting the Potential of AI Chatbots in Hearing Care. JMIR Medical Education 2024;10:e55595
Gurbuz D, Varis E. Is ChatGPT knowledgeable of acute coronary syndromes and pertinent European Society of Cardiology Guidelines? Minerva Cardiology and Angiology 2024;72(3)
Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, Kiuchi T. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. Journal of Medical Internet Research 2024;26:e60807
Takahashi H, Shikino K, Kondo T, Komori A, Yamada Y, Saita M, Naito T. Educational Utility of Clinical Vignettes Generated in Japanese by ChatGPT-4: Mixed Methods Study. JMIR Medical Education 2024;10:e59133
Sallam M, Al-Mahzoum K, Alshuaib O, Alhajri H, Alotaibi F, Alkhurainej D, Al-Balwah M, Barakat M, Egger J. Language discrepancies in the performance of generative artificial intelligence models: an examination of infectious disease queries in English and Arabic. BMC Infectious Diseases 2024;24(1)
Liu M, Okuhara T, Dai Z, Huang W, Gu L, Okada H, Furukawa E, Kiuchi T. Evaluating the Effectiveness of advanced large language models in medical Knowledge: A Comparative study using Japanese national medical examination. International Journal of Medical Informatics 2025;193:105673
Eoh K, Kwon G, Lee E, Lee J, Lee I, Kim Y, Nam E. Efficacy of large language models and their potential in Obstetrics and Gynecology education. Obstetrics & Gynecology Science 2024;67(6):550
Ho C, Tian T, Ayers A, Aaron R, Phillips V, Wolf R, Mathioudakis N, Dai T, Klonoff D. Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review. BMC Medical Informatics and Decision Making 2024;24(1)
Huang T, Hsieh P, Chang Y. Performance Comparison of Junior Residents and ChatGPT in the Objective Structured Clinical Examination (OSCE) for Medical History Taking and Documentation of Medical Records: Development and Usability Study. JMIR Medical Education 2024;10:e59902
Burisch C, Bellary A, Breuckmann F, Ehlers J, Thal S, Sellmann T, Gödde D. ChatGPT-4 Performance on German Continuing Medical Education—Friend or Foe (Trick or Treat)? Protocol for a Randomized Controlled Trial. JMIR Research Protocols 2025;14:e63887
Fukushima T, Manabe M, Yada S, Wakamiya S, Yoshida A, Urakawa Y, Maeda A, Kan S, Takahashi M, Aramaki E. Evaluating and Enhancing Japanese Large Language Models for Genetic Counseling Support: Comparative Study of Domain Adaptation and the Development of an Expert-Evaluated Dataset. JMIR Medical Informatics 2025;13:e65047
Xiao J, Li M, Cai R, Huang H, Yu H, Huang L, Li J, Yu T, Zhang J, Cheng S. Smart Pharmaceutical Monitoring System With Personalized Medication Schedules and Self-Management Programs for Patients With Diabetes: Development and Evaluation Study. Journal of Medical Internet Research 2025;27:e56737
Gungor N, Esen F, Tasci T, Gungor K, Cil K. Navigating Gynecological Oncology with Different Versions of ChatGPT: A Transformative Breakthrough or the Next Black Box Challenge? Oncology Research and Treatment 2024;48(3):102
Ye H, Xu J, Huang D, Xie M, Guo J, Yang J, Bao H, Zhang M, Zheng C. Assessment of large language models’ performances and hallucinations for Chinese postgraduate medical entrance examination. Discover Education 2025;4(1)
Tseng L, Lu Y, Tseng L, Chen Y, Chen H. Performance of ChatGPT-4 on Taiwanese Traditional Chinese Medicine Licensing Examinations: Cross-Sectional Study. JMIR Medical Education 2025;11:e58897
Matsutomo N, Fukami M, Yamamoto T. Can interactive artificial intelligence be used for patient explanations of nuclear medicine examinations in Japanese? Annals of Nuclear Medicine 2025;39(8):774
Aydın A, Reis D. ChatGPT 3.5, ChatGPT 4.0 ve Hemşirelik Öğrencilerinin Çocuk Acillerde Hemşirelik Yaklaşımı Dersi Sınavındaki Performans Karşılaştırmaları. Bandırma Onyedi Eylül Üniversitesi Sağlık Bilimleri ve Araştırmaları Dergisi 2025;7(1):73
Wang L, Li J, Zhuang B, Huang S, Fang M, Wang C, Li W, Zhang M, Gong S. Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis. Journal of Medical Internet Research 2025;27:e64486
Fukushima M, Eshita S, Fukuhara H. Advancements and limitations of LLMs in replicating human color-word associations. Discover Artificial Intelligence 2025;5(1)
Liu H, Chen S, Wang W, Lee C, Hsu H, Shen S, Chiou H, Lee W. Evaluating Large Language Models for Enhancing Radiology Specialty Examination: A Comparative Study with Human Performance. Academic Radiology 2025;32(9):4974
Meyer B, Kfuri‐Rubens R, Schmidt G, Tariq M, Riedel C, Recker F, Riedel F, Kiechle M, Riedel M. Exploring the potential of AI‐powered applications for clinical decision‐making in gynecologic oncology. International Journal of Gynecology & Obstetrics 2025;171(2):698
Silveira J. Comments From the Editor: Generative AI in Research. Update: Applications of Research in Music Education 2025;43(3):3
Forster P, Käsbohrer A, Cramer H, Frass M, Maeschli A, Martin D, Panhofer P, Stetina B, Wolf U, Zentek J, Weiermayer P, Thiyagarajan K. CIMUVET-survey: Complementary and Integrative Medicine (CIM) use in veterinary practice in Austria and CIM education at universities in Austria, Germany and Switzerland. PLOS One 2025;20(7):e0327599
Feitosa Filho H, Furtado J, Eulálio E, Ribeiro P, Paiva L, Correia M, Silva Júnior G. ChatGPT performance in answering medical residency questions in nephrology: a pilot study in Brazil. Brazilian Journal of Nephrology 2025;47(4)
Feitosa Filho H, Furtado J, Eulálio E, Ribeiro P, Paiva L, Correia M, Silva Júnior G. Desempenho do ChatGPT na resposta a questões de residência médica em Nefrologia: um estudo piloto no Brasil. Brazilian Journal of Nephrology 2025;47(4)
Stimmer L, Kuiper R, Polledo L, Ressel L, Rodriguez J, Veiga I, Williams J, Herder V. Natural language processing in veterinary pathology: A review. Veterinary Pathology 2025;62(6):829
Gilardi N, Ballabio M, Ravera F, Ferrando L, Stabile M, Bellodi A, Talerico G, Cigolini B, Genova C, Carbone F, Montecucco F, Bracco C, Ballestrero A, Zoppoli G. Influence of medical educational background on the diagnostic quality of ChatGPT‐4 responses in internal medicine: A pilot study. European Journal of Clinical Investigation 2025;55(11)
Jaleel A, Aziz U, Farid G, Zahid Bashir M, Mirza T, Khizar Abbas S, Aslam S, Sikander R. Evaluating the Potential and Accuracy of ChatGPT-3.5 and 4.0 in Medical Licensing and In-Training Examinations: Systematic Review and Meta-Analysis. JMIR Medical Education 2025;11:e68070
Lin Y, Luo Z, Ye Z, Zhong N, Zhao L, Zhang L, Li X, Chen Z, Chen Y. Applications, Challenges, and Prospects of Generative Artificial Intelligence Empowering Medical Education: Scoping Review. JMIR Medical Education 2025;11:e71125
Chan M, Tjio C, Chan T, Tan Y, Chua A, Loh S, Leow G, Gan M, Lim X, Choo A, Liu Y, Tan J, Teo E, Yap Q, Yonghan T, Makmur A, Kumar N, Tan J, Hallinan J. Large Language Model (LLM)-Predicted and LLM-Assisted Calculation of the Spinal Instability Neoplastic Score (SINS) Improves Clinician Accuracy and Efficiency. Cancers 2025;17(19):3198
Shaikh Y, Jeelani-Shaikh Z, Jeelani M, Javaid A, Mahmud T, Gaglani S, Gibbons M, Cheema M, Cross A, Livingston D, Cheatham M, Nezami E, Dixon R, Niranjan-Azadi A, Zafar S, Siddiqui Z, Villanueva C. Collaborative intelligence in AI: Evaluating the performance of a council of AIs on the USMLE. PLOS Digital Health 2025;4(10):e0000787
Sun R, Hu X, Shao Y, Luo Z, Liu B, Cheng Y. Using Large Language Models to Analyze Interviews for Driver Psychological Assessment: A Performance Comparison of ChatGPT and Google-Gemini. Symmetry 2025;17(10):1713
Warlick A, Clifton C, Trinh T, Kaur R, Weinberg A, Collins J. Integrating a chatbot into simulation-based perfusion training: A pilot randomized controlled trial. Perfusion 2025
Aphale P, Shekhar H, Dokania S. From Accuracy to Applicability: Rethinking Large Language Model Integration in Radiology Exam Design. Academic Radiology 2026;33(2):359
Qi B, Zheng Y, Wang Y, Xu L. Comparison of ChatGPT and DeepSeek on a Standardized Audiologist Qualification Examination in Chinese: Observational Study. JMIR Formative Research 2025;9:e79534
Kaleci A, Şahinbaş B, Ağadayı E, Çelikkaya S, Altun A, Kardan E. Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students. Tıp Eğitimi Dünyası 2025;24(74):135
Liu M, Okuhara T, Dai Z, Zhao M, Yin W, Okada H, Furukawa E, Kiuchi T. Textbook-level medical knowledge in large language models: comparative evaluation using Japanese National Medical Examination. BMC Medical Informatics and Decision Making 2026;26(1)
Lokadjaja M, Kho J, Schulz P, Goh W. Large Language Models and Their Applications in Mental Health: Scoping Review. JMIR Mental Health 2026;13:e88057
Hornback A, Sathu H, Kim K, Wang Y, Zhu Y, Isgut M, Avula P, Khimani A, Wang M. Large language models in healthcare and biomedical informatics: A comprehensive review. Innovation and Emerging Technologies 2026;13
Ren K, Weng Q, Chen Q, Li H, Xie D, Zeng C, Wei J, Lei G, Wang Y. The application of large language models in orthopedic postgraduate education: potentials, challenges, and future prospects. Journal of Orthopaedic Surgery and Research 2026;21(1)
Galindo A, Saadi R, Kerman T, Schwartzman D, Pasternak D, Kinori M, Loewenstein A. A large comprehensive comparison of large language models on ophthalmology board exams. Graefe's Archive for Clinical and Experimental Ophthalmology 2026
Özmen L, Burisch C, Gödde D, Breuckmann F, Ehlers J, Sellmann T. Large Language Models in German Continuing Medical Education Assessments: Protocol for a Fully Crossed Experimental Study. JMIR Research Protocols 2026;15:e91675
Acar A, Tekirdaş E. Evaluation of GPT-4o and GPT o1 pro in answering Turkish Neurosurgical Society proficiency board exam questions: A comparative study. Pamukkale Medical Journal 2026;19(3):488

Citation

Please cite as:

Watari T, Takagi S, Sakaguchi K, Nishizaki Y, Shimizu T, Yamamoto Y, Tokuda Y
Performance Comparison of ChatGPT-4 and Japanese Medical Residents in the General Medicine In-Training Examination: Comparison Study
JMIR Med Educ 2023;9:e52202
doi: 10.2196/52202 PMID: 38055323 PMCID: 10733815

This paper is in the following e-collection/theme issue:

Artificial Intelligence (AI) in Medical Education (709); e-Learning and Digital Medical Education (1555); Research Instruments, Questionnaires, and Tools (1179); New Methods and Approaches in Medical Education (621); Chatbots and Conversational Agents (1150); Generative Language Models Including ChatGPT (1457)