Influence of Model Evolution and System Roles on ChatGPT’s Performance in Chinese Medical Licensing Exams: Comparative Study

doi:10.2196/52784

Published on 13.Aug.2024 in Vol 10 (2024)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/52784, first published 15.Sep.2023.


Shuai Ming^{1, 2, 3}; Qingge Guo^{1, 2, 3}; Wenjun Cheng^{4}; Bo Lei^{1, 2, 3}

Cited by (19)

Journals

Kipp M. From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance. Information 2024;15(9):543
Brin D, Sorin V, Konen E, Nadkarni G, Glicksberg B, Klang E. How GPT models perform on the United States medical licensing examination: a systematic review. Discover Applied Sciences 2024;6(10)
Wang L, Li J, Zhuang B, Huang S, Fang M, Wang C, Li W, Zhang M, Gong S. Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis. Journal of Medical Internet Research 2025;27:e64486
Wang W, Fu J, Zhang Y, Hu K. A Comparative Analysis of GPT-4o and ERNIE Bot in a Chinese Radiation Oncology Exam. Journal of Cancer Education 2026;41(2):256
Wu J, Wang Z, Qin Y. Performance of DeepSeek-R1 and ChatGPT-4o on the Chinese National Medical Licensing Examination: A Comparative Study. Journal of Medical Systems 2025;49(1)
Cheng Y, Zhu L. A review of ChatGPT in medical education: exploring advantages and limitations. International Journal of Surgery 2025;111(7):4586
Schwarzkopf S, Bereuter J, Geissler M, Weitz J, Distler M, Kolbinger F, Berens P. Postoperative complication management: How do large language models measure up to human expertise? PLOS Digital Health 2025;4(8):e0000933
Zhang Y, Xie X, Xu Q. ChatGPT in Medical Education: Bibliometric and Visual Analysis. JMIR Medical Education 2025;11:e72356
Ming S, Yao X, Guo Q, Chen D, Guo X, Xie K, Lei B. Evaluation of DeepSeek-R1 for Ophthalmic Diagnosis and Reasoning: A Comparison with OpenAI o1 and o3. Journal of Medical Systems 2025;49(1)
Kasagga A, Sapkota A, Changaramkumarath G, Abucha J, Wollel M, Somannagari N, Husami M, Hailu K, Kasagga E. Performance of ChatGPT and Large Language Models on Medical Licensing Exams Worldwide: A Systematic Review and Network Meta-Analysis With Meta-Regression. Cureus 2025
Inojosa H, Ramezanzadeh A, Gasparovic-Curtini I, Wiest I, Kather J, Gilbert S, Ziemssen T. Education Research: Can Large Language Models Match MS Specialist Training? Neurology Education 2025;4(4)
Zhang J, Ubuzima P, Huang G, Lee E, Wang Y, Xu H, Xia L, Wu T. From accuracy to robustness: a comparative study of five advanced large language models on the Chinese dental licensing examination. BMC Oral Health 2025;25(1)
Li X, Hu X, Xu H, Sun Z, Yu P, Ju H, Zhang Z. Performance of DeepSeek and ChatGPT on the Chinese Health Professional and Technical Examination: A comparative study. PLOS One 2026;21(1):e0338328
He Q, Tan Z, Niu W, Chen D, Zhang X, Qin F, Yuan J. From algorithms to operating room: can large language models master China’s attending anesthesiology exam? A cross-sectional evaluation. International Journal of Surgery 2026;112(1):190
Tang Y, Chen J, Wang S, Karobari M. Performance benchmarking of LLMs on Chinese national medical licensing education: Cross-lingual and question-type effects. PLOS One 2026;21(4):e0346518
Zeng Y, Hu X, Liu W, Deng K, Zhou M, Wang Y, Ma L, Liu Q, Meng H. Large language models as data-driven engines for benchmarking preventive and clinical knowledge in Chinese dental examinations. Frontiers in Oral Health 2026;7
Niu Z, Tang D, Chen J, Zhang P, Zhu C. Performance of deepseek-R1 and ChatGPT-5.4 thinking in the medical laboratory professional title examination: accuracy, stability, and comparison with interns. Frontiers in Digital Health 2026;8
Gülhan Güner S, Tan Z, Gülpınar S. Comparative performance of artificial intelligence models in intensive care nursing questions: an evaluation of ChatGPT, DeepSeek, and Google Gemini. BMC Nursing 2026;25(1)
Wang Z, Qin Y, Wu J. Performance stability despite iteration: evaluating DeepSeek and ChatGPT on Chinese medical licensing examinations. Frontiers in Medicine 2026;13

