%0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e54393 %T Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study %A Nakao,Takahiro %A Miki,Soichiro %A Nakamura,Yuta %A Kikuchi,Tomohiro %A Nomura,Yukihiro %A Hanaoka,Shouhei %A Yoshikawa,Takeharu %A Abe,Osamu %+ Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan, 81 358008666, tanakao-tky@umin.ac.jp %K AI %K artificial intelligence %K LLM %K large language model %K language model %K language models %K ChatGPT %K GPT-4 %K GPT-4V %K generative pretrained transformer %K image %K images %K imaging %K response %K responses %K exam %K examination %K exams %K examinations %K answer %K answers %K NLP %K natural language processing %K chatbot %K chatbots %K conversational agent %K conversational agents %K medical education %D 2024 %7 12.3.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Previous research applying large language models (LLMs) to medicine has focused on text-based information. Recently, multimodal variants of LLMs have acquired the capability to recognize images. Objective: We aim to evaluate the image recognition capability of generative pretrained transformer (GPT)-4V, a recent multimodal LLM developed by OpenAI, in the medical field by testing how visual information affects its performance in answering questions from the 117th Japanese National Medical Licensing Examination. Methods: We focused on 108 questions that included 1 or more images and presented GPT-4V with the same questions under two conditions: (1) with both the question text and associated images and (2) with the question text only. We then compared the difference in accuracy between the 2 conditions using the exact McNemar test.
Results: Among the 108 questions with images, GPT-4V’s accuracy was 68% (73/108) when presented with images and 72% (78/108) when presented without images (P=.36). For the 2 question categories, clinical and general, the accuracies with and without images were 71% (70/98) versus 78% (76/98; P=.21) and 30% (3/10) versus 20% (2/10; P≥.99), respectively. Conclusions: The additional information from the images did not significantly improve the performance of GPT-4V in the Japanese National Medical Licensing Examination. %M 38470459 %R 10.2196/54393 %U https://mededu.jmir.org/2024/1/e54393 %U https://doi.org/10.2196/54393 %U http://www.ncbi.nlm.nih.gov/pubmed/38470459