Evaluating the Performance of DeepSeek-R1 and DeepSeek-V3 Versus OpenAI Models in the Chinese National Medical Licensing Examination: Cross-Sectional Comparative Study

doi:10.2196/73469

Published on 14.Nov.2025 in Vol 11 (2025)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/73469, first published 05.Mar.2025.

Weiping Wang¹; Yuchen Zhou¹,²; Jingxuan Fu³; Ke Hu¹

Cited by (10)

Stoyanov A, Nedelcheva A. Identifying Features of LLM-Resistant Exam Questions: Insights from Artificial Intelligence (AI)–Student Performance Comparisons. Sci 2025;7(4):183
Sudo H, Noborimoto Y, Takahashi J. Evaluation of Few-Shot AI-Generated Feedback on Case Reports in Physical Therapy Education: Mixed Methods Study. JMIR Medical Education 2025;11:e85614
Liang C, Ghassemiazghandi M. Error analysis of large language model-generated film subtitles using the FAR model. Cogent Arts & Humanities 2026;13(1)
Cheng X, Chen J, Yang D, Peng Y, Gong S. Comparative performance of ChatGPT and DeepSeek in interpreting the 2025 ESICM guidelines on sepsis fluid therapy. DIGITAL HEALTH 2026;12
Lu Z, Cao H, Ma C, Zheng J, Ma X. Mapping the Reliability–Readability Gap in AMD Patient Education Across Six Large Language Models (Preprint). JMIR Medical Informatics 2026
Niu Z, Tang D, Chen J, Zhang P, Zhu C. Performance of deepseek-R1 and ChatGPT-5.4 thinking in the medical laboratory professional title examination: accuracy, stability, and comparison with interns. Frontiers in Digital Health 2026;8
Ma L, Mao K, Shi J, Bateni S, Guo Z, Yuan Z. An optimization AI model based on MoE for retrieving land surface temperature and emissivity. International Journal of Applied Earth Observation and Geoinformation 2026;152:105438
Wang Z, Qin Y, Wu J. Performance stability despite iteration: evaluating DeepSeek and ChatGPT on Chinese medical licensing examinations. Frontiers in Medicine 2026;13
Li M, Yu Y, Li G, Zhang X, Shi Y, Su R. Large language models for breast cancer treatment planning: a blinded real-world evaluation of DeepSeek, ChatGPT, and oncologist recommendations. Frontiers in Digital Health 2026;8
Huang C, Sun Y, Liu W. A comparative study of the performance of different large language models in the Chinese National Pharmacist Licensing Examination. Frontiers in Medicine 2026;13

Citation

Please cite as:

Wang W, Zhou Y, Fu J, Hu K
Evaluating the Performance of DeepSeek-R1 and DeepSeek-V3 Versus OpenAI Models in the Chinese National Medical Licensing Examination: Cross-Sectional Comparative Study
JMIR Med Educ 2025;11:e73469
doi: 10.2196/73469 PMID: 41237388 PMCID: 12663704

This paper is in the following e-collection/theme issue:

New Resources for Medical Education (199)
Testing and Assessment in Medical Education (201)
Artificial Intelligence (AI) in Medical Education (669)
Generative Language Models Including ChatGPT (1419)
