%0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e54393 %T Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study %A Nakao,Takahiro %A Miki,Soichiro %A Nakamura,Yuta %A Kikuchi,Tomohiro %A Nomura,Yukihiro %A Hanaoka,Shouhei %A Yoshikawa,Takeharu %A Abe,Osamu %+ Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan, 81 358008666, tanakao-tky@umin.ac.jp %K AI %K artificial intelligence %K LLM %K large language model %K language model %K language models %K ChatGPT %K GPT-4 %K GPT-4V %K generative pretrained transformer %K image %K images %K imaging %K response %K responses %K exam %K examination %K exams %K examinations %K answer %K answers %K NLP %K natural language processing %K chatbot %K chatbots %K conversational agent %K conversational agents %K medical education %D 2024 %7 12.3.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Previous research applying large language models (LLMs) to medicine has focused on text-based information. Recently, multimodal variants of LLMs have acquired the capability to recognize images. Objective: We aim to evaluate the image recognition capability of generative pretrained transformer (GPT)-4V, a recent multimodal LLM developed by OpenAI, in the medical field by testing how visual information affects its performance in answering questions from the 117th Japanese National Medical Licensing Examination. Methods: We focused on 108 questions that included 1 or more images and presented GPT-4V with the same questions under two conditions: (1) with both the question text and associated images and (2) with the question text only. We then compared the difference in accuracy between the 2 conditions using the exact McNemar test.
Results: Among the 108 questions with images, GPT-4V’s accuracy was 68% (73/108) when presented with images and 72% (78/108) when presented without images (P=.36). For the 2 question categories, clinical and general, the accuracies with and without images were 71% (70/98) versus 78% (76/98; P=.21) and 30% (3/10) versus 20% (2/10; P≥.99), respectively. Conclusions: The additional information from the images did not significantly improve the performance of GPT-4V in the Japanese National Medical Licensing Examination. %M 38470459 %R 10.2196/54393 %U https://mededu.jmir.org/2024/1/e54393 %U https://doi.org/10.2196/54393 %U http://www.ncbi.nlm.nih.gov/pubmed/38470459