Published in Vol 11 (2025)
Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/58898.

Journals
- Agarwal M, Sharma P, Wani P. Evaluating the Accuracy and Reliability of Large Language Models (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in Answering Item-Analyzed Multiple-Choice Questions on Blood Physiology. Cureus 2025
- Shean R, Shah T, Pandiarajan A, Tang A, Bolo K, Nguyen V, Xu B. A comparative analysis of DeepSeek R1, DeepSeek-R1-Lite, OpenAi o1 Pro, and Grok 3 performance on ophthalmology board-style questions. Scientific Reports 2025;15(1)
- Sudo H, Noborimoto Y, Takahashi J. Evaluation of Few-Shot AI-Generated Feedback on Case Reports in Physical Therapy Education: Mixed Methods Study. JMIR Medical Education 2025;11:e85614
- Reis F, Agha-Mir-Salim L, Hickstein R, Reis M, Piper S, Balzer F, Boie S. Disclaimers and Referral Patterns for Medical Advice Across Urgency Levels: A Large Language Model Evaluation Study (Preprint). Journal of Medical Internet Research 2025
- Chen K, Rogers K, Haberkorn W, Lew M, Kanegan J, Nam H, Chantra J, Asch S, Lee G. AI-driven analysis of patient safety reports using large language models: an exploratory multiple methods study. BMJ Quality & Safety 2026:bmjqs-2025-019495
- El Natour D, Abou Alfa M, Chaaban A, Assi R, Dally T, Bou Dargham B. Performance of Five AI Models on USMLE Step 1 Questions: A Comparative Observational Study (Preprint). JMIR AI 2025
