%0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e70420 %T Enhancing AI-Driven Medical Translations: Considerations for Language Concordance %A Quon,Stephanie %A Zhou,Sarah %K letter to the editor %K ChatGPT %K AI %K artificial intelligence %K language %K translation %K health care disparity %K natural language model %K survey %K patient education %K accessibility %K preference %K human language %K communication %K language-concordant care %D 2025 %7 11.4.2025 %9 %J JMIR Med Educ %G English %X %R 10.2196/70420 %U https://mededu.jmir.org/2025/1/e70420 %U https://doi.org/10.2196/70420 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e71721 %T Authors’ Reply: Enhancing AI-Driven Medical Translations: Considerations for Language Concordance %A Teng,Joyce %A Novoa,Roberto Andres %A Aleshin,Maria Alexandrovna %A Lester,Jenna %A Seiger,Kira %A Dzuali,Fiatsogbe %A Daneshjou,Roxana %K ChatGPT %K artificial intelligence %K language %K translation %K health care disparity %K natural language model %K survey %K patient education %K accessibility %K preference %K human language %K communication %K language-concordant care %D 2025 %7 11.4.2025 %9 %J JMIR Med Educ %G English %X %R 10.2196/71721 %U https://mededu.jmir.org/2025/1/e71721 %U https://doi.org/10.2196/71721 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e67244 %T Large Language Models in Biochemistry Education: Comparative Evaluation of Performance %A Bolgova,Olena %A Shypilova,Inna %A Mavrych,Volodymyr %K ChatGPT %K Claude %K Gemini %K Copilot %K biochemistry %K LLM %K medical education %K artificial intelligence %K NLP %K natural language processing %K machine learning %K large language model %K AI %K ML %K comprehensive analysis %K medical students %K GPT-4 %K questionnaire %K medical course %K bioenergetics %D 2025 %7 10.4.2025 %9 %J JMIR Med Educ %G English %X Background: Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs), have started a new era of innovation across various fields, with medicine at the forefront of this technological revolution. Many studies indicated that at the current level of development, LLMs can pass different board exams. However, the ability to answer specific subject-related questions requires validation. Objective: The objective of this study was to conduct a comprehensive analysis comparing the performance of advanced LLM chatbots—Claude (Anthropic), GPT-4 (OpenAI), Gemini (Google), and Copilot (Microsoft)—against the academic results of medical students in the medical biochemistry course. Methods: We used 200 USMLE (United States Medical Licensing Examination)–style multiple-choice questions (MCQs) selected from the course exam database. They encompassed various complexity levels and were distributed across 23 distinctive topics. The questions with tables and images were not included in the study. The results of 5 successive attempts by Claude 3.5 Sonnet, GPT-4‐1106, Gemini 1.5 Flash, and Copilot to answer this questionnaire set were evaluated based on accuracy in August 2024. Statistica 13.5.0.17 (TIBCO Software Inc) was used to analyze the data’s basic statistics. Considering the binary nature of the data, the chi-square test was used to compare results among the different chatbots, with a statistical significance level of P<.05. Results: On average, the selected chatbots correctly answered 81.1% (SD 12.8%) of the questions, surpassing the students’ performance by 8.3% (P=.02). 
In this study, Claude showed the best performance in biochemistry MCQs, correctly answering 92.5% (185/200) of questions, followed by GPT-4 (170/200, 85%), Gemini (157/200, 78.5%), and Copilot (128/200, 64%). The chatbots demonstrated the best results in the following 4 topics: eicosanoids (mean 100%, SD 0%), bioenergetics and electron transport chain (mean 96.4%, SD 7.2%), hexose monophosphate pathway (mean 91.7%, SD 16.7%), and ketone bodies (mean 93.8%, SD 12.5%). The Pearson chi-square test indicated a statistically significant association between the answers of all 4 chatbots (P<.001 to P<.04). Conclusions: Our study suggests that different AI models may have unique strengths in specific medical fields, which could be leveraged for targeted support in biochemistry courses. This performance highlights the potential of AI in medical education and assessment. %R 10.2196/67244 %U https://mededu.jmir.org/2025/1/e67244 %U https://doi.org/10.2196/67244 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e67883 %T Evaluating the Effectiveness of Large Language Models in Providing Patient Education for Chinese Patients With Ocular Myasthenia Gravis: Mixed Methods Study %A Wei,Bin %A Yao,Lili %A Hu,Xin %A Hu,Yuxiang %A Rao,Jie %A Ji,Yu %A Dong,Zhuoer %A Duan,Yichong %A Wu,Xiaorong %+ Jiangxi Medical College, The First Affiliated Hospital, Nanchang University, No.17 Yongwai Zheng Street, Donghu District, Jiangxi Province, Nanchang, 330000, China, 86 136117093259, wxr98021@126.com %K LLM %K large language models %K ocular myasthenia gravis %K patient education %K China %K effectiveness %K deep learning %K artificial intelligence %K health care %K accuracy %K applicability %K neuromuscular disorder %K extraocular muscles %K ptosis %K diplopia %K ophthalmology %K ChatGPT %K clinical practice %K digital health %D 2025 %7 10.4.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Ocular myasthenia gravis (OMG) is a neuromuscular disorder primarily affecting the extraocular muscles, leading to ptosis and diplopia. Effective patient education is crucial for disease management; however, in China, limited health care resources often restrict patients’ access to personalized medical guidance. Large language models (LLMs) have emerged as potential tools to bridge this gap by providing instant, AI-driven health information. However, their accuracy and readability in educating patients with OMG remain uncertain. Objective: The purpose of this study was to systematically evaluate the effectiveness of multiple LLMs in the education of Chinese patients with OMG. Specifically, the validity of these models in answering OMG-related patient questions was assessed through accuracy, completeness, readability, usefulness, and safety, and patients’ ratings of their usability and readability were analyzed. Methods: The study was conducted in two phases: 130 multiple-choice ophthalmology examination questions were input into 5 different LLMs. Their performance was compared with that of undergraduates, master’s students, and ophthalmology residents. In addition, 23 common OMG-related patient questions were posed to 4 LLMs, and their responses were evaluated by ophthalmologists across 5 domains. In the second phase, 20 patients with OMG interacted with the 2 LLMs from the first phase, each asking 3 questions. Patients assessed the responses for satisfaction and readability, while ophthalmologists evaluated the responses again using the 5 domains. 
Results: ChatGPT o1-preview achieved the highest accuracy rate of 73% on 130 ophthalmology examination questions, outperforming other LLMs and professional groups like undergraduates and master’s students. For the 23 common OMG-related patient questions, ChatGPT o1-preview scored highest in correctness (4.44), completeness (4.44), helpfulness (4.47), and safety (4.6). GEMINI (Google DeepMind) provided the easiest-to-understand responses in readability assessments, while GPT-4o had the most complex responses, suitable for readers with higher education levels. In the second phase with 20 patients with OMG, ChatGPT o1-preview received higher satisfaction scores than Ernie 3.5 (Baidu; 4.40 vs 3.89, P=.002), although Ernie 3.5’s responses were slightly more readable (4.31 vs 4.03, P=.01). Conclusions: LLMs such as ChatGPT o1-preview may have the potential to enhance patient education. Addressing challenges such as misinformation risk, readability issues, and ethical considerations is crucial for their effective and safe integration into clinical practice. %M 40209226 %R 10.2196/67883 %U https://www.jmir.org/2025/1/e67883 %U https://doi.org/10.2196/67883 %U http://www.ncbi.nlm.nih.gov/pubmed/40209226 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e72998 %T Citation Accuracy Challenges Posed by Large Language Models %A Zhang,Manlin %A Zhao,Tianyu %K chatGPT %K medical education %K Saudi Arabia %K perceptions %K knowledge %K medical students %K faculty %K chatbot %K qualitative study %K artificial intelligence %K AI %K AI-based tools %K universities %K thematic analysis %K learning %K satisfaction %K LLM %K large language model %D 2025 %7 2.4.2025 %9 %J JMIR Med Educ %G English %X %R 10.2196/72998 %U https://mededu.jmir.org/2025/1/e72998 %U https://doi.org/10.2196/72998 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e73698 %T Authors’ Reply: Citation Accuracy Challenges Posed by Large Language Models %A Temsah,Mohamad-Hani %A Al-Eyadhy,Ayman %A Jamal,Amr %A Alhasan,Khalid %A Malki,Khalid H %K ChatGPT %K Gemini %K DeepSeek %K medical education %K AI %K artificial intelligence %K Saudi Arabia %K perceptions %K medical students %K faculty %K LLM %K chatbot %K qualitative study %K thematic analysis %K satisfaction %K RAG retrieval-augmented generation %D 2025 %7 2.4.2025 %9 %J JMIR Med Educ %G English %X %R 10.2196/73698 %U https://mededu.jmir.org/2025/1/e73698 %U https://doi.org/10.2196/73698 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e55709 %T Impact of Clinical Decision Support Systems on Medical Students’ Case-Solving Performance: Comparison Study with a Focus Group %A Montagna,Marco %A Chiabrando,Filippo %A De Lorenzo,Rebecca %A Rovere Querini,Patrizia %A , %K chatGPT %K chatbot %K machine learning %K ML %K artificial intelligence %K AI %K algorithm %K predictive model %K predictive analytics %K predictive system %K practical model %K deep learning %K large language models %K LLMs %K medical education %K medical teaching %K teaching environment %K clinical decision support systems %K CDSS %K decision support %K decision support tool %K clinical decision-making %K innovative teaching %D 2025 %7 18.3.2025 %9 %J JMIR Med Educ %G English %X Background: Health care practitioners use clinical decision support systems (CDSS) as an aid in the crucial task of clinical reasoning and decision-making. Traditional CDSS are online repositories (ORs) and clinical practice guidelines (CPG). 
Recently, large language models (LLMs) such as ChatGPT have emerged as potential alternatives. They have proven to be powerful, innovative tools, yet they are not devoid of worrisome risks. Objective: This study aims to explore how medical students perform in an evaluated clinical case through the use of different CDSS tools. Methods: The authors randomly divided medical students into 3 groups: CPG, n=6 (38%); OR, n=5 (31%); and ChatGPT, n=5 (31%); and assigned each group a different type of CDSS for guidance in answering prespecified questions, assessing how students’ speed and ability at resolving the same clinical case varied accordingly. External reviewers evaluated all answers based on accuracy and completeness metrics (score: 1‐5). The authors analyzed and categorized group scores according to the skill investigated: differential diagnosis, diagnostic workup, and clinical decision-making. Results: Answering time showed a trend for the ChatGPT group to be the fastest. The mean scores for completeness were as follows: CPG 4.0, OR 3.7, and ChatGPT 3.8 (P=.49). The mean scores for accuracy were as follows: CPG 4.0, OR 3.3, and ChatGPT 3.7 (P=.02). Aggregating scores according to the 3 student skill domains, trends in differences among the groups emerge more clearly, with the CPG group performing best in nearly all domains and maintaining almost perfect alignment between its completeness and accuracy scores. Conclusions: This hands-on session provided valuable insights into the potential perks and associated pitfalls of LLMs in medical education and practice. It suggested the critical need to include teaching in medical degree courses on how to properly take advantage of LLMs, as the potential for misuse is evident and real. %R 10.2196/55709 %U https://mededu.jmir.org/2025/1/e55709 %U https://doi.org/10.2196/55709 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e59210 %T Leveraging Generative Artificial Intelligence to Improve Motivation and Retrieval in Higher Education Learners %A Monzon,Noahlana %A Hays,Franklin Alan %K educational technology %K retrieval practice %K flipped classroom %K cognitive engagement %K personalized learning %K generative artificial intelligence %K higher education %K university education %K learners %K instructors %K curriculum structure %K learning %K technologies %K innovation %K academic misconduct %K gamification %K self-directed %K socio-economic disparities %K interactive approach %K medical education %K chatGPT %K machine learning %K AI %K large language models %D 2025 %7 11.3.2025 %9 %J JMIR Med Educ %G English %X Generative artificial intelligence (GenAI) presents novel approaches to enhance motivation, curriculum structure and development, and learning and retrieval processes for both learners and instructors. Though a focus for this emerging technology is academic misconduct, we sought to leverage GenAI in curriculum structure to facilitate educational outcomes. For instructors, GenAI offers new opportunities in course design and management while reducing time requirements to evaluate outcomes and personalizing learner feedback. These include innovative instructional designs such as flipped classrooms and gamification, enriching teaching methodologies with focused and interactive approaches, and team-based exercise development, among others. 
For learners, GenAI offers unprecedented self-directed learning opportunities, improved cognitive engagement, and effective retrieval practices, leading to enhanced autonomy, motivation, and knowledge retention. Though empowering, this evolving landscape has integration challenges and ethical considerations, including accuracy, technological evolution, loss of learner’s voice, and socioeconomic disparities. Our experience demonstrates that the responsible application of GenAI in educational settings will revolutionize learning practices, making education more accessible and tailored, and producing positive motivational outcomes for both learners and instructors. Thus, we argue that leveraging GenAI in educational settings will improve outcomes with implications extending from primary through higher and continuing education paradigms. %R 10.2196/59210 %U https://mededu.jmir.org/2025/1/e59210 %U https://doi.org/10.2196/59210 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e62779 %T Detecting Artificial Intelligence–Generated Versus Human-Written Medical Student Essays: Semirandomized Controlled Study %A Doru,Berin %A Maier,Christoph %A Busse,Johanna Sophie %A Lücke,Thomas %A Schönhoff,Judith %A Enax-Krumova,Elena %A Hessler,Steffen %A Berger,Maria %A Tokic,Marianne %+ University Hospital of Paediatrics and Adolescent Medicine, St. Josef-Hospital, Ruhr University Bochum, Alexandrinenstraße 5, Bochum, 44791, Germany, 49 234 509 2611, Berin.Doru@rub.de %K artificial intelligence %K ChatGPT %K large language models %K textual analysis %K writing style %K AI %K chatbot %K LLMs %K detection %K authorship %K medical student %K linguistic quality %K decision-making %K logical coherence %D 2025 %7 3.3.2025 %9 Original Paper %J JMIR Med Educ %G English %X Background: Large language models, exemplified by ChatGPT, have reached a level of sophistication that makes distinguishing between human- and artificial intelligence (AI)–generated texts increasingly challenging. This has raised concerns in academia, particularly in medicine, where the accuracy and authenticity of written work are paramount. Objective: This semirandomized controlled study aims to examine the ability of 2 blinded expert groups with different levels of content familiarity—medical professionals and humanities scholars with expertise in textual analysis—to distinguish between longer scientific texts in German written by medical students and those generated by ChatGPT. Additionally, the study sought to analyze the reasoning behind their identification choices, particularly the role of content familiarity and linguistic features. Methods: Between May and August 2023, a total of 35 experts (medical: n=22; humanities: n=13) were each presented with 2 pairs of texts on different medical topics. Each pair had similar content and structure: 1 text was written by a medical student, and the other was generated by ChatGPT (version 3.5, March 2023). Experts were asked to identify the AI-generated text and justify their choice. These justifications were analyzed through a multistage, interdisciplinary qualitative analysis to identify relevant textual features. Before unblinding, experts rated each text on 6 characteristics: linguistic fluency and spelling/grammatical accuracy, scientific quality, logical coherence, expression of knowledge limitations, formulation of future research questions, and citation quality. 
Univariate tests and multivariate logistic regression analyses were used to examine associations between participants’ characteristics, their stated reasons for author identification, and the likelihood of correctly determining a text’s authorship. Results: Overall, in 48 out of 69 (70%) decision rounds, participants accurately identified the AI-generated texts, with minimal difference between groups (medical: 31/43, 72%; humanities: 17/26, 65%; odds ratio [OR] 1.37, 95% CI 0.5-3.9). While content errors had little impact on identification accuracy, stylistic features—particularly redundancy (OR 6.90, 95% CI 1.01-47.1), repetition (OR 8.05, 95% CI 1.25-51.7), and thread/coherence (OR 6.62, 95% CI 1.25-35.2)—played a crucial role in participants’ decisions to identify a text as AI-generated. Conclusions: The findings suggest that both medical and humanities experts were able to identify ChatGPT-generated texts in medical contexts, with their decisions largely based on linguistic attributes. The accuracy of identification appears to be independent of experts’ familiarity with the text content. As the decision-making process primarily relies on linguistic attributes—such as stylistic features and text coherence—further quasi-experimental studies using texts from other academic disciplines should be conducted to determine whether instructions based on these features can enhance lecturers’ ability to distinguish between student-authored and AI-generated work. %M 40053752 %R 10.2196/62779 %U https://mededu.jmir.org/2025/1/e62779 %U https://doi.org/10.2196/62779 %U http://www.ncbi.nlm.nih.gov/pubmed/40053752 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e63400 %T Perceptions and Earliest Experiences of Medical Students and Faculty With ChatGPT in Medical Education: Qualitative Study %A Abouammoh,Noura %A Alhasan,Khalid %A Aljamaan,Fadi %A Raina,Rupesh %A Malki,Khalid H %A Altamimi,Ibraheem %A Muaygil,Ruaim %A Wahabi,Hayfaa %A Jamal,Amr %A Alhaboob,Ali %A Assiri,Rasha Assad %A Al-Tawfiq,Jaffar A %A Al-Eyadhy,Ayman %A Soliman,Mona %A Temsah,Mohamad-Hani %+ Pediatric Department, King Saud University Medical City, King Saud University, King Abdullah Road, Riyadh, 11424, Saudi Arabia, 966 114692002, mtemsah@ksu.edu.sa %K ChatGPT %K medical education %K Saudi Arabia %K perceptions %K knowledge %K medical students %K faculty %K chatbot %K qualitative study %K artificial intelligence %K AI %K AI-based tools %K universities %K thematic analysis %K learning %K satisfaction %D 2025 %7 20.2.2025 %9 Original Paper %J JMIR Med Educ %G English %X Background: With the rapid development of artificial intelligence technologies, there is a growing interest in the potential use of artificial intelligence–based tools like ChatGPT in medical education. However, there is limited research on the initial perceptions and experiences of faculty and students with ChatGPT, particularly in Saudi Arabia. Objective: This study aimed to explore the earliest knowledge, perceived benefits, concerns, and limitations of using ChatGPT in medical education among faculty and students at a leading Saudi Arabian university. Methods: A qualitative exploratory study was conducted in April 2023, involving focused meetings with medical faculty and students with varying levels of ChatGPT experience. A thematic analysis was used to identify key themes and subthemes emerging from the discussions. Results: Participants demonstrated good knowledge of ChatGPT and its functions. 
The main themes were perceptions of ChatGPT use, potential benefits, and concerns about ChatGPT in research and medical education. The perceived benefits included collecting and summarizing information and saving time and effort. However, concerns and limitations centered around the potential lack of critical thinking in the information provided, the ambiguity of references, limitations of access, trust in the output of ChatGPT, and ethical concerns. Conclusions: This study provides valuable insights into the perceptions and experiences of medical faculty and students regarding the use of newly introduced large language models like ChatGPT in medical education. While the benefits of ChatGPT were recognized, participants also expressed concerns and noted limitations, indicating that further studies are needed before effective integration into medical education, including studies exploring the impact of ChatGPT on learning outcomes, student and faculty satisfaction, and the development of critical thinking skills. %M 39977012 %R 10.2196/63400 %U https://mededu.jmir.org/2025/1/e63400 %U https://doi.org/10.2196/63400 %U http://www.ncbi.nlm.nih.gov/pubmed/39977012 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e58766 %T Generative Artificial Intelligence in Medical Education—Policies and Training at US Osteopathic Medical Schools: Descriptive Cross-Sectional Survey %A Ichikawa,Tsunagu %A Olsen,Elizabeth %A Vinod,Arathi %A Glenn,Noah %A Hanna,Karim %A Lund,Gregg C %A Pierce-Talsma,Stacey %K artificial intelligence %K medical education %K faculty development %K policy %K AI %K training %K United States %K school %K university %K college %K institution %K osteopathic %K osteopathy %K curriculum %K student %K faculty %K administrator %K survey %K cross-sectional %D 2025 %7 11.2.2025 %9 %J JMIR Med Educ %G English %X Background: Interest has recently increased in generative artificial intelligence (GenAI), a subset of artificial intelligence that can create new content. Although the publicly available GenAI tools are not specifically trained in the medical domain, they have demonstrated proficiency in a wide range of medical assessments. The future integration of GenAI in medicine remains unknown. However, the rapid availability of GenAI with a chat interface and the potential risks and benefits are the focus of great interest. As with any significant medical advancement or change, medical schools must adapt their curricula to equip students with the skills necessary to become successful physicians. Furthermore, medical schools must ensure that faculty members have the skills to harness these new opportunities to increase their effectiveness as educators. How medical schools currently fulfill their responsibilities is unclear. Colleges of Osteopathic Medicine (COMs) in the United States currently train a significant proportion of the total number of medical students. These COMs are in academic settings ranging from large public research universities to small private institutions. Therefore, studying COMs will offer a representative sample of the current GenAI integration in medical education. Objective: This study aims to describe the policies and training regarding the specific aspect of GenAI in US COMs, targeting students, faculty, and administrators. Methods: Web-based surveys were sent to deans and Student Government Association (SGA) presidents of the main campuses of fully accredited US COMs. 
The dean survey included questions regarding current and planned policies and training related to GenAI for students, faculty, and administrators. The SGA president survey included only those questions related to current student policies and training. Results: Responses were received from 81% (26/32) of COMs surveyed. This included 47% (15/32) of the deans and 50% (16/32) of the SGA presidents (with 5 COMs represented by both the deans and the SGA presidents). Most COMs did not have a policy on the student use of GenAI, as reported by the dean (14/15, 93%) and the SGA president (14/16, 88%). Of the COMs with no policy, 79% (11/14) had no formal plans for policy development. Only 1 COM had training for students, which focused entirely on the ethics of using GenAI. Most COMs had no formal plans to provide mandatory (11/14, 79%) or elective (11/15, 73%) training. No COM had GenAI policies for faculty or administrators. Eighty percent had no formal plans for policy development. Furthermore, 33.3% (5/15) of COMs had faculty or administrator GenAI training. Except for examination question development, there was no training to increase faculty or administrator capabilities and efficiency or to decrease their workload. Conclusions: The survey revealed that most COMs lack GenAI policies and training for students, faculty, and administrators. The few institutions with policies or training were extremely limited in scope. Most institutions without current training or policies had no formal plans for development. The lack of current policies and training initiatives suggests inadequate preparedness for integrating GenAI into the medical school environment, therefore, relegating the responsibility for ethical guidance and training to the individual COM member. %R 10.2196/58766 %U https://mededu.jmir.org/2025/1/e58766 %U https://doi.org/10.2196/58766 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e63065 %T Assessing Familiarity, Usage Patterns, and Attitudes of Medical Students Toward ChatGPT and Other Chat-Based AI Apps in Medical Education: Cross-Sectional Questionnaire Study %A Elhassan,Safia Elwaleed %A Sajid,Muhammad Raihan %A Syed,Amina Mariam %A Fathima,Sidrah Afreen %A Khan,Bushra Shehroz %A Tamim,Hala %K ChatGPT %K artificial intelligence %K large language model %K medical students %K ethics %K chat-based %K AI apps %K medical education %K social media %K attitude %K AI %D 2025 %7 30.1.2025 %9 %J JMIR Med Educ %G English %X Background: There has been a rise in the popularity of ChatGPT and other chat-based artificial intelligence (AI) apps in medical education. Despite data being available from other parts of the world, there is a significant lack of information on this topic in medical education and research, particularly in Saudi Arabia. Objective: The primary objective of the study was to examine the familiarity, usage patterns, and attitudes of Alfaisal University medical students toward ChatGPT and other chat-based AI apps in medical education. Methods: This was a cross-sectional study conducted from October 8, 2023, through November 22, 2023. A questionnaire was distributed through social media channels to medical students at Alfaisal University who were 18 years or older. Current Alfaisal University medical students in years 1 through 6, of both genders, were exclusively targeted by the questionnaire. The study was approved by Alfaisal University Institutional Review Board. 
A χ2 test was conducted to assess the relationships between gender, year of study, familiarity, and reasons for usage. Results: A total of 293 responses were received, of which 95 (32.4%) were from men and 198 (67.6%) were from women. There were 236 (80.5%) responses from preclinical students and 57 (19.5%) from clinical students. Overall, males (n=93, 97.9%) showed more familiarity with ChatGPT compared to females (n=180, 90.9%; P=.03). Additionally, males also used Google Bard and Microsoft Bing ChatGPT more than females (P<.001). Clinical-year students used ChatGPT significantly more for general writing purposes compared to preclinical students (P=.005). Additionally, 136 (46.4%) students believed that using ChatGPT and other chat-based AI apps for coursework was ethical, 86 (29.4%) were neutral, and 71 (24.2%) considered it unethical (all Ps>.05). Conclusions: Familiarity with and usage of ChatGPT and other chat-based AI apps were common among the students of Alfaisal University. The usage patterns of these apps differ between males and females and between preclinical and clinical-year students. %R 10.2196/63065 %U https://mededu.jmir.org/2025/1/e63065 %U https://doi.org/10.2196/63065 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e58898 %T Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study %A Kaewboonlert,Naritsaret %A Poontananggul,Jiraphon %A Pongsuwan,Natthipong %A Bhakdisongkhram,Gun %K accuracy %K performance %K artificial intelligence %K AI %K ChatGPT %K large language model %K LLM %K difficulty index %K basic medical science examination %K cross-sectional study %K medical education %K datasets %K assessment %K medical science %K tool %K Google %D 2025 %7 13.1.2025 %9 %J JMIR Med Educ %G English %X Background: Artificial intelligence (AI) has become widely applied across many fields, including medical education. Content validation and its answers are based on training datasets and the optimization of each model. The accuracy of large language models (LLMs) in basic medical examinations and factors related to their accuracy have also been explored. Objective: We evaluated factors associated with the accuracy of LLMs (GPT-3.5, GPT-4, Google Bard, and Microsoft Bing) in answering multiple-choice questions from basic medical science examinations. Methods: We used questions that were closely aligned with the content and topic distribution of Thailand’s Step 1 National Medical Licensing Examination. Variables such as the difficulty index, discrimination index, and question characteristics were collected. These questions were then simultaneously input into ChatGPT (with GPT-3.5 and GPT-4), Microsoft Bing, and Google Bard, and their responses were recorded. The accuracy of these LLMs and the associated factors were analyzed using multivariable logistic regression. This analysis aimed to assess the effect of various factors on model accuracy, with results reported as odds ratios (ORs). Results: The study revealed that GPT-4 was the top-performing model, with an overall accuracy of 89.07% (95% CI 84.76%‐92.41%), significantly outperforming the others (P<.001). Microsoft Bing followed with an accuracy of 83.69% (95% CI 78.85%‐87.80%), GPT-3.5 at 67.02% (95% CI 61.20%‐72.48%), and Google Bard at 63.83% (95% CI 57.92%‐69.44%). 
The multivariable logistic regression analysis showed a correlation between question difficulty and model performance, with GPT-4 demonstrating the strongest association. Interestingly, no significant correlation was found between model accuracy and question length, negative wording, clinical scenarios, or the discrimination index for most models, except for Google Bard, which showed varying correlations. Conclusions: The GPT-4 and Microsoft Bing models demonstrated equal and superior accuracy compared to GPT-3.5 and Google Bard in the domain of basic medical science. The accuracy of these models was significantly influenced by the item’s difficulty index, indicating that the LLMs are more accurate when answering easier questions. This suggests that the more accurate models, such as GPT-4 and Bing, can be valuable tools for understanding and learning basic medical science concepts. %R 10.2196/58898 %U https://mededu.jmir.org/2025/1/e58898 %U https://doi.org/10.2196/58898 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51435 %T ChatGPT May Improve Access to Language-Concordant Care for Patients With Non–English Language Preferences %A Dzuali,Fiatsogbe %A Seiger,Kira %A Novoa,Roberto %A Aleshin,Maria %A Teng,Joyce %A Lester,Jenna %A Daneshjou,Roxana %K ChatGPT %K artificial intelligence %K language %K translation %K health care disparity %K natural language model %K survey %K patient education %K preference %K human language %K language-concordant care %D 2024 %7 10.12.2024 %9 %J JMIR Med Educ %G English %X This study evaluated the accuracy of ChatGPT in translating English patient education materials into Spanish, Mandarin, and Russian. While ChatGPT shows promise for translating Spanish and Russian medical information, Mandarin translations require further refinement, highlighting the need for careful review of AI-generated translations before clinical use. %R 10.2196/51435 %U https://mededu.jmir.org/2024/1/e51435 %U https://doi.org/10.2196/51435 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e59902 %T Performance Comparison of Junior Residents and ChatGPT in the Objective Structured Clinical Examination (OSCE) for Medical History Taking and Documentation of Medical Records: Development and Usability Study %A Huang,Ting-Yun %A Hsieh,Pei Hsing %A Chang,Yung-Chun %K large language model %K medical history taking %K clinical documentation %K simulation-based evaluation %K OSCE standards %K LLM %D 2024 %7 21.11.2024 %9 %J JMIR Med Educ %G English %X Background: This study explores the cutting-edge abilities of large language models (LLMs) such as ChatGPT in medical history taking and medical record documentation, with a focus on their practical effectiveness in clinical settings—an area vital for the progress of medical artificial intelligence. Objective: Our aim was to assess the capability of ChatGPT versions 3.5 and 4.0 in performing medical history taking and medical record documentation in simulated clinical environments. The study compared the performance of nonmedical individuals using ChatGPT with that of junior medical residents. Methods: A simulation involving standardized patients was designed to mimic authentic medical history–taking interactions. Five nonmedical participants used ChatGPT versions 3.5 and 4.0 to conduct medical histories and document medical records, mirroring the tasks performed by 5 junior residents in identical scenarios. A total of 10 diverse scenarios were examined. 
Results: Evaluation of the medical documentation created by laypersons with ChatGPT assistance and those created by junior residents was conducted by 2 senior emergency physicians using audio recordings and the final medical records. The assessment used the Objective Structured Clinical Examination benchmarks in Taiwan as a reference. ChatGPT-4.0 exhibited substantial enhancements over its predecessor and met or exceeded the performance of human counterparts in terms of both checklist and global assessment scores. Although the overall quality of human consultations remained higher, ChatGPT-4.0’s proficiency in medical documentation was notably promising. Conclusions: The performance of ChatGPT 4.0 was on par with that of human participants in Objective Structured Clinical Examination evaluations, signifying its potential in medical history and medical record documentation. Despite this, the superiority of human consultations in terms of quality was evident. The study underscores both the promise and the current limitations of LLMs in the realm of clinical practice. %R 10.2196/59902 %U https://mededu.jmir.org/2024/1/e59902 %U https://doi.org/10.2196/59902 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51433 %T Leveraging Open-Source Large Language Models for Data Augmentation in Hospital Staff Surveys: Mixed Methods Study %A Ehrett,Carl %A Hegde,Sudeep %A Andre,Kwame %A Liu,Dixizi %A Wilson,Timothy %K data augmentation %K large language models %K medical education %K natural language processing %K data security %K ethics %K AI %K artificial intelligence %K data privacy %K medical staff %D 2024 %7 19.11.2024 %9 %J JMIR Med Educ %G English %X Background: Generative large language models (LLMs) have the potential to revolutionize medical education by generating tailored learning materials, enhancing teaching efficiency, and improving learner engagement. However, the application of LLMs in health care settings, particularly for augmenting small datasets in text classification tasks, remains underexplored, particularly for cost- and privacy-conscious applications that do not permit the use of third-party services such as OpenAI’s ChatGPT. Objective: This study aims to explore the use of open-source LLMs, such as Large Language Model Meta AI (LLaMA) and Alpaca models, for data augmentation in a specific text classification task related to hospital staff surveys. Methods: The surveys were designed to elicit narratives of everyday adaptation by frontline radiology staff during the initial phase of the COVID-19 pandemic. A 2-step process of data augmentation and text classification was conducted. The study generated synthetic data similar to the survey reports using 4 generative LLMs for data augmentation. A different set of 3 classifier LLMs was then used to classify the augmented text for thematic categories. The study evaluated performance on the classification task. Results: The overall best-performing combination of LLMs, temperature, classifier, and number of synthetic data cases is via augmentation with LLaMA 7B at temperature 0.7 with 100 augments, using Robustly Optimized BERT Pretraining Approach (RoBERTa) for the classification task, achieving an average area under the receiver operating characteristic (AUC) curve of 0.87 (SD 0.02; ie, 1 SD). The results demonstrate that open-source LLMs can enhance text classifiers’ performance for small datasets in health care contexts, providing promising pathways for improving medical education processes and patient care practices. 
Conclusions: The study demonstrates the value of data augmentation with open-source LLMs, highlights the importance of privacy and ethical considerations when using LLMs, and suggests future directions for research in this field. %R 10.2196/51433 %U https://mededu.jmir.org/2024/1/e51433 %U https://doi.org/10.2196/51433 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e54297 %T Using ChatGPT in Nursing: Scoping Review of Current Opinions %A Zhou,You %A Li,Si-Jia %A Tang,Xing-Yi %A He,Yi-Chen %A Ma,Hao-Ming %A Wang,Ao-Qi %A Pei,Run-Yuan %A Piao,Mei-Hua %K ChatGPT %K large language model %K nursing %K artificial intelligence %K scoping review %K generative AI %K nursing education %D 2024 %7 19.11.2024 %9 %J JMIR Med Educ %G English %X Background: Since the release of ChatGPT in November 2022, this emerging technology has garnered a lot of attention in various fields, and nursing is no exception. However, to date, no study has comprehensively summarized the status and opinions of using ChatGPT across different nursing fields. Objective: We aim to synthesize the status and opinions of using ChatGPT according to different nursing fields, as well as assess ChatGPT’s strengths, weaknesses, and the potential impacts it may cause. Methods: This scoping review was conducted following the framework of Arksey and O’Malley and guided by the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews). A comprehensive literature search was conducted in 4 web-based databases (PubMed, Embase, Web of Science, and CINAHL) to identify studies reporting the opinions of using ChatGPT in nursing fields from 2022 to September 3, 2023. The references of the included studies were screened manually to further identify relevant studies. Two authors conducted study screening, eligibility assessments, and data extraction independently. Results: A total of 30 studies were included. The United States (7 studies), Canada (5 studies), and China (4 studies) were countries with the most publications. In terms of fields of concern, studies mainly focused on “ChatGPT and nursing education” (20 studies), “ChatGPT and nursing practice” (10 studies), and “ChatGPT and nursing research, writing, and examination” (6 studies). Six studies addressed the use of ChatGPT in multiple nursing fields. Conclusions: As an emerging artificial intelligence technology, ChatGPT has great potential to revolutionize nursing education, nursing practice, and nursing research. However, researchers, institutions, and administrations still need to critically examine its accuracy, safety, and privacy, as well as academic misconduct and potential ethical issues that it may lead to before applying ChatGPT to practice. %R 10.2196/54297 %U https://mededu.jmir.org/2024/1/e54297 %U https://doi.org/10.2196/54297 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e56762 %T Evaluating AI Competence in Specialized Medicine: Comparative Analysis of ChatGPT and Neurologists in a Neurology Specialist Examination in Spain %A Ros-Arlanzón,Pablo %A Perez-Sempere,Angel %K artificial intelligence %K ChatGPT %K clinical decision-making %K medical education %K medical knowledge assessment %K OpenAI %D 2024 %7 14.11.2024 %9 %J JMIR Med Educ %G English %X Background: With the rapid advancement of artificial intelligence (AI) in various fields, evaluating its application in specialized medical contexts becomes crucial. 
ChatGPT, a large language model developed by OpenAI, has shown potential in diverse applications, including medicine. Objective: This study aims to compare the performance of ChatGPT with that of attending neurologists in a real neurology specialist examination conducted in the Valencian Community, Spain, assessing the AI’s capabilities and limitations in medical knowledge. Methods: We conducted a comparative analysis using the 2022 neurology specialist examination results from 120 neurologists and responses generated by ChatGPT versions 3.5 and 4. The examination consisted of 80 multiple-choice questions, with a focus on clinical neurology and health legislation. Questions were classified according to Bloom’s Taxonomy. Statistical analysis of performance, including the κ coefficient for response consistency, was performed. Results: Human participants exhibited a median score of 5.91 (IQR: 4.93-6.76), with 32 neurologists failing to pass. ChatGPT-3.5 ranked 116th out of 122, answering 54.5% of questions correctly (score 3.94). ChatGPT-4 showed marked improvement, ranking 17th with 81.8% of correct answers (score 7.57), surpassing several human specialists. No significant variations were observed in the performance on lower-order questions versus higher-order questions. Additionally, ChatGPT-4 demonstrated increased interrater reliability, as reflected by a higher κ coefficient of 0.73, compared to ChatGPT-3.5’s coefficient of 0.69. Conclusions: This study underscores the evolving capabilities of AI in medical knowledge assessment, particularly in specialized fields. ChatGPT-4’s performance, outperforming the median score of human participants in a rigorous neurology examination, represents a significant milestone in AI development, suggesting its potential as an effective tool in specialized medical education and assessment. %R 10.2196/56762 %U https://mededu.jmir.org/2024/1/e56762 %U https://doi.org/10.2196/56762 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e56128 %T Assessment of ChatGPT-4 in Family Medicine Board Examinations Using Advanced AI Learning and Analytical Methods: Observational Study %A Goodings,Anthony James %A Kajitani,Sten %A Chhor,Allison %A Albakri,Ahmad %A Pastrak,Mila %A Kodancha,Megha %A Ives,Rowan %A Lee,Yoo Bin %A Kajitani,Kari %K ChatGPT-4 %K Family Medicine Board Examination %K artificial intelligence in medical education %K AI performance assessment %K prompt engineering %K ChatGPT %K artificial intelligence %K AI %K medical education %K assessment %K observational %K analytical method %K data analysis %K examination %D 2024 %7 8.10.2024 %9 %J JMIR Med Educ %G English %X Background: This research explores the capabilities of ChatGPT-4 in passing the American Board of Family Medicine (ABFM) Certification Examination. Addressing a gap in existing literature, where earlier artificial intelligence (AI) models showed limitations in medical board examinations, this study evaluates the enhanced features and potential of ChatGPT-4, especially in document analysis and information synthesis. Objective: The primary goal is to assess whether ChatGPT-4, when provided with extensive preparation resources and when using sophisticated data analysis, can achieve a score equal to or above the passing threshold for the Family Medicine Board Examinations. Methods: In this study, ChatGPT-4 was embedded in a specialized subenvironment, “AI Family Medicine Board Exam Taker,” designed to closely mimic the conditions of the ABFM Certification Examination. 
This subenvironment enabled the AI to access and analyze a range of relevant study materials, including a primary medical textbook and supplementary web-based resources. The AI was presented with a series of ABFM-type examination questions, reflecting the breadth and complexity typical of the examination. Emphasis was placed on assessing the AI’s ability to interpret and respond to these questions accurately, leveraging its advanced data processing and analysis capabilities within this controlled subenvironment. Results: In our study, ChatGPT-4’s performance was quantitatively assessed on 300 practice ABFM examination questions. The AI achieved a correct response rate of 88.67% (95% CI 85.08%-92.25%) for the Custom Robot version and 87.33% (95% CI 83.57%-91.10%) for the Regular version. Statistical analysis, including the McNemar test (P=.45), indicated no significant difference in accuracy between the 2 versions. In addition, the chi-square test for error-type distribution (P=.32) revealed no significant variation in the pattern of errors across versions. These results highlight ChatGPT-4’s capacity for high-level performance and consistency in responding to complex medical examination questions under controlled conditions. Conclusions: The study demonstrates that ChatGPT-4, particularly when equipped with specialized preparation and when operating in a tailored subenvironment, shows promising potential in handling the intricacies of medical board examinations. While its performance is comparable with the expected standards for passing the ABFM Certification Examination, further enhancements in AI technology and tailored training methods could push these capabilities to new heights. This exploration opens avenues for integrating AI tools such as ChatGPT-4 in medical education and assessment, emphasizing the importance of continuous advancement and specialized training in medical applications of AI. %R 10.2196/56128 %U https://mededu.jmir.org/2024/1/e56128 %U https://doi.org/10.2196/56128 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e52746 %T Performance of ChatGPT on Nursing Licensure Examinations in the United States and China: Cross-Sectional Study %A Wu,Zelin %A Gan,Wenyi %A Xue,Zhaowen %A Ni,Zhengxin %A Zheng,Xiaofei %A Zhang,Yiyi %K artificial intelligence %K ChatGPT %K nursing licensure examination %K nursing %K LLMs %K large language models %K nursing education %K AI %K nursing student %K large language model %K licensing %K observation %K observational study %K China %K USA %K United States of America %K auxiliary tool %K accuracy rate %K theoretical %D 2024 %7 3.10.2024 %9 %J JMIR Med Educ %G English %X Background: The creation of large language models (LLMs) such as ChatGPT is an important step in the development of artificial intelligence, which shows great potential in medical education due to its powerful language understanding and generative capabilities. The purpose of this study was to quantitatively evaluate and comprehensively analyze ChatGPT’s performance in handling questions for the National Nursing Licensure Examination (NNLE) in China and the United States, including the National Council Licensure Examination for Registered Nurses (NCLEX-RN) and the NNLE. Objective: This study aims to examine how well LLMs respond to the NCLEX-RN and the NNLE multiple-choice questions (MCQs) in various language inputs. 
To evaluate whether LLMs can be used as multilingual learning assistance for nursing, and to assess whether they possess a repository of professional knowledge applicable to clinical nursing practice. Methods: First, we compiled 150 NCLEX-RN Practical MCQs, 240 NNLE Theoretical MCQs, and 240 NNLE Practical MCQs. Then, the translation function of ChatGPT 3.5 was used to translate NCLEX-RN questions from English to Chinese and NNLE questions from Chinese to English. Finally, the original version and the translated version of the MCQs were inputted into ChatGPT 4.0, ChatGPT 3.5, and Google Bard. Different LLMs were compared according to the accuracy rate, and the differences between different language inputs were compared. Results: The accuracy rates of ChatGPT 4.0 for NCLEX-RN practical questions and Chinese-translated NCLEX-RN practical questions were 88.7% (133/150) and 79.3% (119/150), respectively. Despite the statistical significance of the difference (P=.03), the correct rate was generally satisfactory. Around 71.9% (169/235) of NNLE Theoretical MCQs and 69.1% (161/233) of NNLE Practical MCQs were correctly answered by ChatGPT 4.0. The accuracy of ChatGPT 4.0 in processing NNLE Theoretical MCQs and NNLE Practical MCQs translated into English was 71.5% (168/235; P=.92) and 67.8% (158/233; P=.77), respectively, and there was no statistically significant difference between the results of text input in different languages. ChatGPT 3.5 (NCLEX-RN P=.003, NNLE Theoretical P<.001, NNLE Practical P=.12) and Google Bard (NCLEX-RN P<.001, NNLE Theoretical P<.001, NNLE Practical P<.001) had lower accuracy rates for nursing-related MCQs than ChatGPT 4.0 in English input. English accuracy was higher when compared with ChatGPT 3.5’s Chinese input, and the difference was statistically significant (NCLEX-RN P=.02, NNLE Practical P=.02). Whether submitted in Chinese or English, the MCQs from the NCLEX-RN and NNLE demonstrated that ChatGPT 4.0 had the highest number of unique correct responses and the lowest number of unique incorrect responses among the 3 LLMs. Conclusions: This study, focusing on 618 nursing MCQs including NCLEX-RN and NNLE exams, found that ChatGPT 4.0 outperformed ChatGPT 3.5 and Google Bard in accuracy. It excelled in processing English and Chinese inputs, underscoring its potential as a valuable tool in nursing education and clinical decision-making. %R 10.2196/52746 %U https://mededu.jmir.org/2024/1/e52746 %U https://doi.org/10.2196/52746 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e52346 %T Artificial Intelligence in Dental Education: Opportunities and Challenges of Large Language Models and Multimodal Foundation Models %A Claman,Daniel %A Sezgin,Emre %K artificial intelligence %K large language models %K dental education %K GPT %K ChatGPT %K periodontal health %K AI %K LLM %K LLMs %K chatbot %K natural language %K generative pretrained transformer %K innovation %K technology %K large language model %D 2024 %7 27.9.2024 %9 %J JMIR Med Educ %G English %X Instructional and clinical technologies have been transforming dental education. With the emergence of artificial intelligence (AI), the opportunities of using AI in education has increased. With the recent advancement of generative AI, large language models (LLMs) and foundation models gained attention with their capabilities in natural language understanding and generation as well as combining multiple types of data, such as text, images, and audio. 
A common example has been ChatGPT, which is based on a powerful LLM—the GPT model. This paper discusses the potential benefits and challenges of incorporating LLMs in dental education, focusing on periodontal charting with a use case to outline capabilities of LLMs. LLMs can provide personalized feedback, generate case scenarios, and create educational content to contribute to the quality of dental education. However, challenges, limitations, and risks exist, including bias and inaccuracy in the content created, privacy and security concerns, and the risk of overreliance. With guidance and oversight, and by effectively and ethically integrating LLMs, dental education can incorporate engaging and personalized learning experiences for students toward readiness for real-life clinical practice. %R 10.2196/52346 %U https://mededu.jmir.org/2024/1/e52346 %U https://doi.org/10.2196/52346 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e56859 %T Performance of ChatGPT in the In-Training Examination for Anesthesiology and Pain Medicine Residents in South Korea: Observational Study %A Yoon,Soo-Hyuk %A Oh,Seok Kyeong %A Lim,Byung Gun %A Lee,Ho-Jin %+ Department of Anesthesiology and Pain Medicine, Seoul National University Hospital, Seoul National University College of Medicine, Daehak-ro 101, Jongno-gu, Seoul, 03080, Republic of Korea, 82 220720039, hjpainfree@snu.ac.kr %K AI tools %K problem solving %K anesthesiology %K artificial intelligence %K pain medicine %K ChatGPT %K health care %K medical education %K South Korea %D 2024 %7 16.9.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT has been tested in health care, including the US Medical Licensing Examination and specialty exams, showing near-passing results. Its performance in the field of anesthesiology has been assessed using English board examination questions; however, its effectiveness in Korea remains unexplored. Objective: This study investigated the problem-solving performance of ChatGPT in the fields of anesthesiology and pain medicine in the Korean language context, highlighted advancements in artificial intelligence (AI), and explored its potential applications in medical education. Methods: We investigated the performance (number of correct answers/number of questions) of GPT-4, GPT-3.5, and CLOVA X in the fields of anesthesiology and pain medicine, using in-training examinations that have been administered to Korean anesthesiology residents over the past 5 years, with an annual composition of 100 questions. Questions containing images, diagrams, or photographs were excluded from the analysis. Furthermore, to assess the performance differences of the GPT across different languages, we conducted a comparative analysis of the GPT-4’s problem-solving proficiency using both the original Korean texts and their English translations. Results: A total of 398 questions were analyzed. GPT-4 (67.8%) demonstrated a significantly better overall performance than GPT-3.5 (37.2%) and CLOVA-X (36.7%). However, GPT-3.5 and CLOVA X did not show significant differences in their overall performance. Additionally, the GPT-4 showed superior performance on questions translated into English, indicating a language processing discrepancy (English: 75.4% vs Korean: 67.8%; difference 7.5%; 95% CI 3.1%-11.9%; P=.001). 
Conclusions: This study underscores the potential of AI tools, such as ChatGPT, in medical education and practice but emphasizes the need for cautious application and further refinement, especially in non-English medical contexts. The findings suggest that although AI advancements are promising, they require careful evaluation and development to ensure acceptable performance across diverse linguistic and professional settings. %M 39284182 %R 10.2196/56859 %U https://mededu.jmir.org/2024/1/e56859 %U https://doi.org/10.2196/56859 %U http://www.ncbi.nlm.nih.gov/pubmed/39284182 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e60501 %T Prompt Engineering Paradigms for Medical Applications: Scoping Review %A Zaghir,Jamil %A Naguib,Marco %A Bjelogrlic,Mina %A Névéol,Aurélie %A Tannier,Xavier %A Lovis,Christian %+ Department of Radiology and Medical Informatics, University of Geneva, Chemin des Mines, 9, Geneva, 1202, Switzerland, 41 022 379 08 18, Jamil.Zaghir@unige.ch %K prompt engineering %K prompt design %K prompt learning %K prompt tuning %K large language models %K LLMs %K scoping review %K clinical natural language processing %K natural language processing %K NLP %K medical texts %K medical application %K medical applications %K clinical practice %K privacy %K medicine %K computer science %K medical informatics %D 2024 %7 10.9.2024 %9 Review %J J Med Internet Res %G English %X Background: Prompt engineering, focusing on crafting effective prompts to large language models (LLMs), has garnered attention for its capabilities at harnessing the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and language technicity. Clinical natural language processing applications must navigate complex language and ensure privacy compliance. Prompt engineering offers a novel approach by designing tailored prompts to guide models in exploiting clinically relevant information from complex medical texts. Despite its promise, the efficacy of prompt engineering in the medical domain remains to be fully explored. Objective: The aim of the study is to review research efforts and technical approaches in prompt engineering for medical applications as well as provide an overview of opportunities and challenges for clinical practice. Methods: Databases indexing the fields of medicine, computer science, and medical informatics were queried in order to identify relevant published papers. Since prompt engineering is an emerging field, preprint databases were also considered. Multiple data were extracted, such as the prompt paradigm, the involved LLMs, the languages of the study, the domain of the topic, the baselines, and several learning, design, and architecture strategies specific to prompt engineering. We include studies that apply prompt engineering–based methods to the medical domain, published between 2022 and 2024, and covering multiple prompt paradigms such as prompt learning (PL), prompt tuning (PT), and prompt design (PD). Results: We included 114 recent prompt engineering studies. Among the 3 prompt paradigms, we have observed that PD is the most prevalent (78 papers). In 12 papers, PD, PL, and PT terms were used interchangeably. While ChatGPT is the most commonly used LLM, we have identified 7 studies using this LLM on a sensitive clinical data set. Chain-of-thought, present in 17 studies, emerges as the most frequent PD technique. 
While PL and PT papers typically provide a baseline for evaluating prompt-based approaches, 61% (48/78) of the PD studies do not report any nonprompt-related baseline. Finally, we individually examine each of the key prompt engineering–specific information reported across papers and find that many studies neglect to explicitly mention them, posing a challenge for advancing prompt engineering research. Conclusions: In addition to reporting on trends and the scientific landscape of prompt engineering, we provide reporting guidelines for future studies to help advance research in the medical field. We also disclose tables and figures summarizing medical prompt engineering papers available and hope that future contributions will leverage these existing works to better advance the field. %M 39255030 %R 10.2196/60501 %U https://www.jmir.org/2024/1/e60501 %U https://doi.org/10.2196/60501 %U http://www.ncbi.nlm.nih.gov/pubmed/39255030 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e58478 %T Practical Applications of Large Language Models for Health Care Professionals and Scientists %A Reis,Florian %A Lenz,Christian %A Gossen,Manfred %A Volk,Hans-Dieter %A Drzeniek,Norman Michael %K artificial intelligence %K healthcare %K chatGPT %K large language model %K prompting %K LLM %K applications %K AI %K scientists %K physicians %K health care %D 2024 %7 5.9.2024 %9 %J JMIR Med Inform %G English %X With the popularization of large language models (LLMs), strategies for their effective and safe usage in health care and research have become increasingly pertinent. Despite the growing interest and eagerness among health care professionals and scientists to exploit the potential of LLMs, initial attempts may yield suboptimal results due to a lack of user experience, thus complicating the integration of artificial intelligence (AI) tools into workplace routine. Focusing on scientists and health care professionals with limited LLM experience, this viewpoint article highlights and discusses 6 easy-to-implement use cases of practical relevance. These encompass customizing translations, refining text and extracting information, generating comprehensive overviews and specialized insights, compiling ideas into cohesive narratives, crafting personalized educational materials, and facilitating intellectual sparring. Additionally, we discuss general prompting strategies and precautions for the implementation of AI tools in biomedicine. Despite various hurdles and challenges, the integration of LLMs into daily routines of physicians and researchers promises heightened workplace productivity and efficiency. 
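As a concrete illustration of the prompt design (PD) paradigm that the scoping review identifies as most prevalent, and of the kind of everyday prompting the viewpoint above describes, the following Python sketch assembles a chain-of-thought style prompt for a small clinical text task and sends it through the OpenAI Python client (openai>=1.0). The task wording and the model name are assumptions chosen for illustration, not details taken from either paper.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

note = "Patient reports a pruritic rash that began two days after starting amoxicillin."  # illustrative text
prompt = (
    "You are assisting with clinical text analysis.\n"
    f"Note: {note}\n"
    "Question: Does this note describe a possible adverse drug reaction?\n"
    "Let's think step by step, then give a final one-word answer (yes/no) on the last line."
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name; substitute whichever model is available
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)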
%R 10.2196/58478 %U https://medinform.jmir.org/2024/1/e58478 %U https://doi.org/10.2196/58478 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e57896 %T Current Status of ChatGPT Use in Medical Education: Potentials, Challenges, and Strategies %A Xu,Tianhui %A Weng,Huiting %A Liu,Fang %A Yang,Li %A Luo,Yuanyuan %A Ding,Ziwei %A Wang,Qin %+ Clinical Nursing Teaching and Research Section, The Second Xiangya Hospital of Central South University, 139 Middle Renmin Road, Changsha, 410011, China, 86 18774806226, wangqin3421@csu.edu.cn %K chat generative pretrained transformer %K ChatGPT %K artificial intelligence %K medical education %K natural language processing %K clinical practice %D 2024 %7 28.8.2024 %9 Viewpoint %J J Med Internet Res %G English %X ChatGPT, a generative pretrained transformer, has garnered global attention and sparked discussions since its introduction on November 30, 2022. However, it has generated controversy within the realms of medical education and scientific research. This paper examines the potential applications, limitations, and strategies for using ChatGPT. ChatGPT offers personalized learning support to medical students through its robust natural language generation capabilities, enabling it to furnish answers. Moreover, it has demonstrated significant use in simulating clinical scenarios, facilitating teaching and learning processes, and revitalizing medical education. Nonetheless, numerous challenges accompany these advancements. In the context of education, it is of paramount importance to prevent excessive reliance on ChatGPT and combat academic plagiarism. Likewise, in the field of medicine, it is vital to guarantee the timeliness, accuracy, and reliability of content generated by ChatGPT. Concurrently, ethical challenges and concerns regarding information security arise. In light of these challenges, this paper proposes targeted strategies for addressing them. First, the risk of overreliance on ChatGPT and academic plagiarism must be mitigated through ideological education, fostering comprehensive competencies, and implementing diverse evaluation criteria. The integration of contemporary pedagogical methodologies in conjunction with the use of ChatGPT serves to enhance the overall quality of medical education. To enhance the professionalism and reliability of the generated content, it is recommended to implement measures to optimize ChatGPT’s training data professionally and enhance the transparency of the generation process. This ensures that the generated content is aligned with the most recent standards of medical practice. Moreover, the enhancement of value alignment and the establishment of pertinent legislation or codes of practice address ethical concerns, including those pertaining to algorithmic discrimination, the allocation of medical responsibility, privacy, and security. In conclusion, while ChatGPT presents significant potential in medical education, it also encounters various challenges. Through comprehensive research and the implementation of suitable strategies, it is anticipated that ChatGPT’s positive impact on medical education will be harnessed, laying the groundwork for advancing the discipline and fostering the development of high-caliber medical professionals. 
%M 39196640 %R 10.2196/57896 %U https://www.jmir.org/2024/1/e57896 %U https://doi.org/10.2196/57896 %U http://www.ncbi.nlm.nih.gov/pubmed/39196640 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e50545 %T Integration of ChatGPT Into a Course for Medical Students: Explorative Study on Teaching Scenarios, Students’ Perception, and Applications %A Thomae,Anita V %A Witt,Claudia M %A Barth,Jürgen %K medical education %K ChatGPT %K artificial intelligence %K information for patients %K critical appraisal %K evaluation %K blended learning %K AI %K digital skills %K teaching %D 2024 %7 22.8.2024 %9 %J JMIR Med Educ %G English %X Background: Text-generating artificial intelligence (AI) such as ChatGPT offers many opportunities and challenges in medical education. Acquiring practical skills necessary for using AI in a clinical context is crucial, especially for medical education. Objective: This explorative study aimed to investigate the feasibility of integrating ChatGPT into teaching units and to evaluate the course and the importance of AI-related competencies for medical students. Since a possible application of ChatGPT in the medical field could be the generation of information for patients, we further investigated how such information is perceived by students in terms of persuasiveness and quality. Methods: ChatGPT was integrated into 3 different teaching units of a blended learning course for medical students. Using a mixed methods approach, quantitative and qualitative data were collected. As baseline data, we assessed students’ characteristics, including their openness to digital innovation. The students evaluated the integration of ChatGPT into the course and shared their thoughts regarding the future of text-generating AI in medical education. The course was evaluated based on the Kirkpatrick Model, with satisfaction, learning progress, and applicable knowledge considered as key assessment levels. In ChatGPT-integrating teaching units, students evaluated videos featuring information for patients regarding their persuasiveness on treatment expectations in a self-experience experiment and critically reviewed information for patients written using ChatGPT 3.5 based on different prompts. Results: A total of 52 medical students participated in the study. The comprehensive evaluation of the course revealed elevated levels of satisfaction, learning progress, and applicability specifically in relation to the ChatGPT-integrating teaching units. Furthermore, all evaluation levels demonstrated an association with each other. Higher openness to digital innovation was associated with higher satisfaction and, to a lesser extent, with higher applicability. AI-related competencies in other courses of the medical curriculum were perceived as highly important by medical students. Qualitative analysis highlighted potential use cases of ChatGPT in teaching and learning. In ChatGPT-integrating teaching units, students rated information for patients generated using a basic ChatGPT prompt as “moderate” in terms of comprehensibility, patient safety, and the correct application of communication rules taught during the course. The students’ ratings were considerably improved using an extended prompt. The same text, however, showed the smallest increase in treatment expectations when compared with information provided by humans (patient, clinician, and expert) via videos. 
Conclusions: This study offers valuable insights into integrating the development of AI competencies into a blended learning course. Integration of ChatGPT enhanced learning experiences for medical students. %R 10.2196/50545 %U https://mededu.jmir.org/2024/1/e50545 %U https://doi.org/10.2196/50545 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e59213 %T A Language Model–Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study %A Holderried,Friederike %A Stegemann-Philipps,Christian %A Herrmann-Werner,Anne %A Festl-Wietek,Teresa %A Holderried,Martin %A Eickhoff,Carsten %A Mahling,Moritz %+ Tübingen Institute for Medical Education (TIME), Medical Faculty, University of Tübingen, Elfriede-Aulhorn-Strasse 10, Tübingen, 72076, Germany, 49 707129 ext 73688, friederike.holderried@med.uni-tuebingen.de %K virtual patients communication %K communication skills %K technology enhanced education %K TEL %K medical education %K ChatGPT %K GPT: LLM %K LLMs %K NLP %K natural language processing %K machine learning %K artificial intelligence %K language model %K language models %K communication %K relationship %K relationships %K chatbot %K chatbots %K conversational agent %K conversational agents %K history %K histories %K simulated %K student %K students %K interaction %K interactions %D 2024 %7 16.8.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Although history taking is fundamental for diagnosing medical conditions, teaching and providing feedback on the skill can be challenging due to resource constraints. Virtual simulated patients and web-based chatbots have thus emerged as educational tools, with recent advancements in artificial intelligence (AI) such as large language models (LLMs) enhancing their realism and potential to provide feedback. Objective: In our study, we aimed to evaluate the effectiveness of a Generative Pretrained Transformer (GPT) 4 model to provide structured feedback on medical students’ performance in history taking with a simulated patient. Methods: We conducted a prospective study involving medical students performing history taking with a GPT-powered chatbot. To that end, we designed a chatbot to simulate patients’ responses and provide immediate feedback on the comprehensiveness of the students’ history taking. Students’ interactions with the chatbot were analyzed, and feedback from the chatbot was compared with feedback from a human rater. We measured interrater reliability and performed a descriptive analysis to assess the quality of feedback. Results: Most of the study’s participants were in their third year of medical school. A total of 1894 question-answer pairs from 106 conversations were included in our analysis. GPT-4’s role-play and responses were medically plausible in more than 99% of cases. Interrater reliability between GPT-4 and the human rater showed “almost perfect” agreement (Cohen κ=0.832). Less agreement (κ<0.6) detected for 8 out of 45 feedback categories highlighted topics about which the model’s assessments were overly specific or diverged from human judgement. Conclusions: The GPT model was effective in providing structured feedback on history-taking dialogs provided by medical students. Although we unraveled some limitations regarding the specificity of feedback for certain feedback categories, the overall high agreement with human raters suggests that LLMs can be a valuable tool for medical education. 
Our findings, thus, advocate the careful integration of AI-driven feedback mechanisms in medical training and highlight important aspects when LLMs are used in that context. %M 39150749 %R 10.2196/59213 %U https://mededu.jmir.org/2024/1/e59213 %U https://doi.org/10.2196/59213 %U http://www.ncbi.nlm.nih.gov/pubmed/39150749 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e52784 %T Influence of Model Evolution and System Roles on ChatGPT’s Performance in Chinese Medical Licensing Exams: Comparative Study %A Ming,Shuai %A Guo,Qingge %A Cheng,Wenjun %A Lei,Bo %K ChatGPT %K Chinese National Medical Licensing Examination %K large language models %K medical education %K system role %K LLM %K LLMs %K language model %K language models %K artificial intelligence %K chatbot %K chatbots %K conversational agent %K conversational agents %K exam %K exams %K examination %K examinations %K OpenAI %K answer %K answers %K response %K responses %K accuracy %K performance %K China %K Chinese %D 2024 %7 13.8.2024 %9 %J JMIR Med Educ %G English %X Background: With the increasing application of large language models like ChatGPT in various industries, its potential in the medical domain, especially in standardized examinations, has become a focal point of research. Objective: The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE). Methods: The CNMLE 2022 question set, consisting of 500 single-answer multiple choices questions, were reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the version of GPT-3.5 and 4.0, the prompt’s designation of system roles tailored to medical subspecialties, and repetition for coherence. A passing accuracy threshold was established as 60%. The χ2 tests and κ values were employed to evaluate the model’s accuracy and consistency. Results: GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%‐3.7%) and GPT-3.5 (1.3%‐4.5%), and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response. Conclusions: GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role insignificantly enhanced the model’s reliability and answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study. 
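A minimal sketch of the chi-square comparison described in the abstract above, assuming correct-answer counts reconstructed from the reported percentages (72.7% and 54% of 500 questions, rounded to whole numbers):

from scipy.stats import chi2_contingency

correct = {"GPT-4.0": 364, "GPT-3.5": 270}   # assumed counts: ~72.7% and ~54% of 500
table = [[correct["GPT-4.0"], 500 - correct["GPT-4.0"]],
         [correct["GPT-3.5"], 500 - correct["GPT-3.5"]]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.1f}, df = {dof}, P = {p:.2g}")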
%R 10.2196/52784 %U https://mededu.jmir.org/2024/1/e52784 %U https://doi.org/10.2196/52784 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51757 %T Understanding Health Care Students’ Perceptions, Beliefs, and Attitudes Toward AI-Powered Language Models: Cross-Sectional Study %A Cherrez-Ojeda,Ivan %A Gallardo-Bastidas,Juan C %A Robles-Velasco,Karla %A Osorio,María F %A Velez Leon,Eleonor Maria %A Leon Velastegui,Manuel %A Pauletto,Patrícia %A Aguilar-Díaz,F C %A Squassi,Aldo %A González Eras,Susana Patricia %A Cordero Carrasco,Erita %A Chavez Gonzalez,Karol Leonor %A Calderon,Juan C %A Bousquet,Jean %A Bedbrook,Anna %A Faytong-Haro,Marco %+ Universidad Espiritu Santo, Km. 2.5 via Samborondon, Samborondon, 0901952, Ecuador, 593 999981769, ivancherrez@gmail.com %K artificial intelligence %K ChatGPT %K education %K health care %K students %D 2024 %7 13.8.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT was not intended for use in health care, but it has potential benefits that depend on end-user understanding and acceptability, which is where health care students become crucial. There is still a limited amount of research in this area. Objective: The primary aim of our study was to assess the frequency of ChatGPT use, the perceived level of knowledge, the perceived risks associated with its use, and the ethical issues, as well as attitudes toward the use of ChatGPT in the context of education in the field of health. In addition, we aimed to examine whether there were differences across groups based on demographic variables. The second part of the study aimed to assess the association between the frequency of use, the level of perceived knowledge, the level of risk perception, and the level of perception of ethics as predictive factors for participants’ attitudes toward the use of ChatGPT. Methods: A cross-sectional survey was conducted from May to June 2023 encompassing students of medicine, nursing, dentistry, nutrition, and laboratory science across the Americas. The study used descriptive analysis, chi-square tests, and ANOVA to assess statistical significance across different categories. The study used several ordinal logistic regression models to analyze the impact of predictive factors (frequency of use, perception of knowledge, perception of risk, and ethics perception scores) on attitude as the dependent variable. The models were adjusted for gender, institution type, major, and country. Stata was used to conduct all the analyses. Results: Of 2661 health care students, 42.99% (n=1144) were unaware of ChatGPT. The median score of knowledge was “minimal” (median 2.00, IQR 1.00-3.00). Most respondents (median 2.61, IQR 2.11-3.11) regarded ChatGPT as neither ethical nor unethical. Most participants (median 3.89, IQR 3.44-4.34) “somewhat agreed” that ChatGPT (1) benefits health care settings, (2) provides trustworthy data, (3) is a helpful tool for clinical and educational medical information access, and (4) makes the work easier. In total, 70% (7/10) of people used it for homework. As the perceived knowledge of ChatGPT increased, there was a stronger tendency with regard to having a favorable attitude toward ChatGPT. 
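To show how odds ratios of the kind reported in this survey are estimated, the following Python sketch fits a deliberately simplified, fully simulated model: a binary 'favorable attitude' outcome instead of the study's ordinal outcome, a single predictor, and no covariate adjustment. Every number in it is an assumption, not study data.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 2661                                         # sample size borrowed from the abstract
knowledge = rng.integers(1, 6, size=n)           # simulated 1-5 perceived-knowledge scores
logit = -2.0 + 0.5 * knowledge                   # assumed positive association
favorable = rng.random(n) < 1 / (1 + np.exp(-logit))   # 1 = favorable attitude

X = sm.add_constant(knowledge.astype(float))
fit = sm.Logit(favorable.astype(int), X).fit(disp=False)
print("odds ratio per point of perceived knowledge:", round(float(np.exp(fit.params[1])), 2))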
Higher ethical consideration perception ratings increased the likelihood of considering ChatGPT as a source of trustworthy health care information (odds ratio [OR] 1.620, 95% CI 1.498-1.752), beneficial in medical issues (OR 1.495, 95% CI 1.452-1.539), and useful for medical literature (OR 1.494, 95% CI 1.426-1.564; P<.001 for all results). Conclusions: Over 40% of American health care students (1144/2661, 42.99%) were unaware of ChatGPT despite its extensive use in the health field. Our data revealed the positive attitudes toward ChatGPT and the desire to learn more about it. Medical educators must explore how chatbots may be included in undergraduate health care education programs. %M 39137029 %R 10.2196/51757 %U https://mededu.jmir.org/2024/1/e51757 %U https://doi.org/10.2196/51757 %U http://www.ncbi.nlm.nih.gov/pubmed/39137029 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e59133 %T Educational Utility of Clinical Vignettes Generated in Japanese by ChatGPT-4: Mixed Methods Study %A Takahashi,Hiromizu %A Shikino,Kiyoshi %A Kondo,Takeshi %A Komori,Akira %A Yamada,Yuji %A Saita,Mizue %A Naito,Toshio %+ Department of General Medicine, Juntendo University Faculty of Medicine, Bunkyo, 3-1-3 Hongo, Tokyo, 113-0033, Japan, 81 3 3813 3111, hrtakaha@juntendo.ac.jp %K generative AI %K ChatGPT-4 %K medical case generation %K medical education %K clinical vignettes %K AI %K artificial intelligence %K Japanese %K Japan %D 2024 %7 13.8.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Evaluating the accuracy and educational utility of artificial intelligence–generated medical cases, especially those produced by large language models such as ChatGPT-4 (developed by OpenAI), is crucial yet underexplored. Objective: This study aimed to assess the educational utility of ChatGPT-4–generated clinical vignettes and their applicability in educational settings. Methods: Using a convergent mixed methods design, a web-based survey was conducted from January 8 to 28, 2024, to evaluate 18 medical cases generated by ChatGPT-4 in Japanese. In the survey, 6 main question items were used to evaluate the quality of the generated clinical vignettes and their educational utility, which are information quality, information accuracy, educational usefulness, clinical match, terminology accuracy (TA), and diagnosis difficulty. Feedback was solicited from physicians specializing in general internal medicine or general medicine and experienced in medical education. Chi-square and Mann-Whitney U tests were performed to identify differences among cases, and linear regression was used to examine trends associated with physicians’ experience. Thematic analysis of qualitative feedback was performed to identify areas for improvement and confirm the educational utility of the cases. Results: Of the 73 invited participants, 71 (97%) responded. The respondents, primarily male (64/71, 90%), spanned a broad range of practice years (from 1976 to 2017) and represented diverse hospital sizes throughout Japan. The majority deemed the information quality (mean 0.77, 95% CI 0.75-0.79) and information accuracy (mean 0.68, 95% CI 0.65-0.71) to be satisfactory, with these responses being based on binary data. The average scores assigned were 3.55 (95% CI 3.49-3.60) for educational usefulness, 3.70 (95% CI 3.65-3.75) for clinical match, 3.49 (95% CI 3.44-3.55) for TA, and 2.34 (95% CI 2.28-2.40) for diagnosis difficulty, based on a 5-point Likert scale. 
Statistical analysis showed significant variability in content quality and relevance across the cases (P<.001 after Bonferroni correction). Participants suggested improvements in generating physical findings, using natural language, and enhancing medical TA. The thematic analysis highlighted the need for clearer documentation, clinical information consistency, content relevance, and patient-centered case presentations. Conclusions: ChatGPT-4–generated medical cases written in Japanese possess considerable potential as resources in medical education, with recognized adequacy in quality and accuracy. Nevertheless, there is a notable need for enhancements in the precision and realism of case details. This study emphasizes ChatGPT-4’s value as an adjunctive educational tool in the medical field, requiring expert oversight for optimal application. %M 39137031 %R 10.2196/59133 %U https://mededu.jmir.org/2024/1/e59133 %U https://doi.org/10.2196/59133 %U http://www.ncbi.nlm.nih.gov/pubmed/39137031 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e60083 %T Ethical Considerations and Fundamental Principles of Large Language Models in Medical Education: Viewpoint %A Zhui,Li %A Fenghe,Li %A Xuehu,Wang %A Qining,Fu %A Wei,Ren %+ Department of Vascular Surgery, The First Affiliated Hospital of Chongqing Medical University, No. 1 of Youyi Road, Yuzhong District, Chongqing, 400016, China, 86 13658339771, renwei_2301@yeah.net %K medical education %K artificial intelligence %K large language models %K medical ethics %K AI %K LLMs %K ethics %K academic integrity %K privacy and data risks %K data security %K data protection %K intellectual property rights %K educational research %D 2024 %7 1.8.2024 %9 Viewpoint %J J Med Internet Res %G English %X This viewpoint article first explores the ethical challenges associated with the future application of large language models (LLMs) in the context of medical education. These challenges include not only ethical concerns related to the development of LLMs, such as artificial intelligence (AI) hallucinations, information bias, privacy and data risks, and deficiencies in terms of transparency and interpretability but also issues concerning the application of LLMs, including deficiencies in emotional intelligence, educational inequities, problems with academic integrity, and questions of responsibility and copyright ownership. This paper then analyzes existing AI-related legal and ethical frameworks and highlights their limitations with regard to the application of LLMs in the context of medical education. To ensure that LLMs are integrated in a responsible and safe manner, the authors recommend the development of a unified ethical framework that is specifically tailored for LLMs in this field. This framework should be based on 8 fundamental principles: quality control and supervision mechanisms; privacy and data protection; transparency and interpretability; fairness and equal treatment; academic integrity and moral norms; accountability and traceability; protection and respect for intellectual property; and the promotion of educational research and innovation. The authors further discuss specific measures that can be taken to implement these principles, thereby laying a solid foundation for the development of a comprehensive and actionable ethical framework. Such a unified ethical framework based on these 8 fundamental principles can provide clear guidance and support for the application of LLMs in the context of medical education. 
This approach can help establish a balance between technological advancement and ethical safeguards, thereby ensuring that medical education can progress without compromising the principles of fairness, justice, or patient safety and establishing a more equitable, safer, and more efficient environment for medical education. %M 38971715 %R 10.2196/60083 %U https://www.jmir.org/2024/1/e60083 %U https://doi.org/10.2196/60083 %U http://www.ncbi.nlm.nih.gov/pubmed/38971715 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e56342 %T Assessing the Ability of a Large Language Model to Score Free-Text Medical Student Clinical Notes: Quantitative Study %A Burke,Harry B %A Hoang,Albert %A Lopreiato,Joseph O %A King,Heidi %A Hemmer,Paul %A Montgomery,Michael %A Gagarin,Viktoria %K medical education %K generative artificial intelligence %K natural language processing %K ChatGPT %K generative pretrained transformer %K standardized patients %K clinical notes %K free-text notes %K history and physical examination %K large language model %K LLM %K medical student %K medical students %K clinical information %K artificial intelligence %K AI %K patients %K patient %K medicine %D 2024 %7 25.7.2024 %9 %J JMIR Med Educ %G English %X Background: Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback regarding the quality of their free-text clinical notes. Objective: The goal of this study was to assess the ability of ChatGPT 3.5, a large language model, to score medical students’ free-text history and physical notes. Methods: This is a single-institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free-text history and physical note of their interaction. The students’ notes were scored independently by the standardized patients and ChatGPT using a prespecified scoring rubric that consisted of 85 case elements. The measure of accuracy was percent correct. Results: The study population consisted of 168 first-year medical students. There was a total of 14,280 scores. The ChatGPT incorrect scoring rate was 1.0%, and the standardized patient incorrect scoring rate was 7.2%. The ChatGPT error rate was 86% lower than the standardized patient error rate. The ChatGPT mean incorrect scoring rate of 12 (SD 11) was significantly lower than the standardized patient mean incorrect scoring rate of 85 (SD 74; P=.002). Conclusions: ChatGPT demonstrated a significantly lower error rate compared to standardized patients. This is the first study to assess the ability of a generative pretrained transformer (GPT) program to score medical students’ standardized patient-based free-text clinical notes. It is expected that, in the near future, large language models will provide real-time feedback to practicing physicians regarding their free-text notes. GPT artificial intelligence programs represent an important advance in medical education and medical practice.
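The incorrect scoring rates above can be made concrete with a toy reconstruction in Python; the rubric scores below are simulated (random binary elements with error probabilities set to roughly match the reported 1.0% and 7.2% rates), so nothing in the sketch is actual study data.

import numpy as np

rng = np.random.default_rng(0)
students, elements = 168, 85                     # cohort size and rubric length from the abstract
key = rng.integers(0, 2, size=(students, elements))                          # hypothetical gold-standard scores
chatgpt = np.where(rng.random((students, elements)) < 0.010, 1 - key, key)   # ~1.0% flipped scores
sp = np.where(rng.random((students, elements)) < 0.072, 1 - key, key)        # ~7.2% flipped scores
print(f"ChatGPT incorrect scoring rate: {(chatgpt != key).mean():.1%}")
print(f"Standardized patient incorrect scoring rate: {(sp != key).mean():.1%}")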
%R 10.2196/56342 %U https://mededu.jmir.org/2024/1/e56342 %U https://doi.org/10.2196/56342 %0 Journal Article %@ 2562-0959 %I JMIR Publications %V 7 %N %P e58396 %T NVIDIA’s “Chat with RTX” Custom Large Language Model and Personalized AI Chatbot Augments the Value of Electronic Dermatology Reference Material %A Kamel Boulos,Maged N %A Dellavalle,Robert %+ School of Medicine, University of Lisbon, Av Prof Egas Moniz MB, Lisbon, 1649-028, Portugal, 351 920531573, mnkboulos@ieee.org %K AI chatbots %K artificial intelligence %K AI %K generative AI %K large language models %K dermatology %K education %K self-study %K NVIDIA RTX %K retrieval-augmented generation %K RAG %D 2024 %7 24.7.2024 %9 Editorial %J JMIR Dermatol %G English %X This paper demonstrates a new, promising method using generative artificial intelligence (AI) to augment the educational value of electronic textbooks and research papers (locally stored on user’s machine) and maximize their potential for self-study, in a way that goes beyond the standard electronic search and indexing that is already available in all of these textbooks and files. The presented method runs fully locally on the user’s machine, is generally affordable, and does not require high technical expertise to set up and customize with the user’s own content. %M 39047285 %R 10.2196/58396 %U https://derma.jmir.org/2024/1/e58396 %U https://doi.org/10.2196/58396 %U http://www.ncbi.nlm.nih.gov/pubmed/39047285 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e52818 %T Appraisal of ChatGPT’s Aptitude for Medical Education: Comparative Analysis With Third-Year Medical Students in a Pulmonology Examination %A Cherif,Hela %A Moussa,Chirine %A Missaoui,Abdel Mouhaymen %A Salouage,Issam %A Mokaddem,Salma %A Dhahri,Besma %+ Faculté de Médecine de Tunis, Université de Tunis El Manar, 15, Rue Djebel Lakhdhar – Bab Saadoun, Tunis, 1007, Tunisia, 216 50424534, hela.cherif@fmt.utm.tn %K medical education %K ChatGPT %K GPT %K artificial intelligence %K natural language processing %K NLP %K pulmonary medicine %K pulmonary %K lung %K lungs %K respiratory %K respiration %K pneumology %K comparative analysis %K large language models %K LLMs %K LLM %K language model %K generative AI %K generative artificial intelligence %K generative %K exams %K exam %K examinations %K examination %D 2024 %7 23.7.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: The rapid evolution of ChatGPT has generated substantial interest and led to extensive discussions in both public and academic domains, particularly in the context of medical education. Objective: This study aimed to evaluate ChatGPT’s performance in a pulmonology examination through a comparative analysis with that of third-year medical students. Methods: In this cross-sectional study, we conducted a comparative analysis with 2 distinct groups. The first group comprised 244 third-year medical students who had previously taken our institution’s 2020 pulmonology examination, which was conducted in French. The second group involved ChatGPT-3.5 in 2 separate sets of conversations: without contextualization (V1) and with contextualization (V2). In both V1 and V2, ChatGPT received the same set of questions administered to the students. Results: V1 demonstrated exceptional proficiency in radiology, microbiology, and thoracic surgery, surpassing the majority of medical students in these domains. However, it faced challenges in pathology, pharmacology, and clinical pneumology. 
In contrast, V2 consistently delivered more accurate responses across various question categories, regardless of the specialization. ChatGPT exhibited suboptimal performance in multiple choice questions compared to medical students. V2 excelled in responding to structured open-ended questions. Both ChatGPT conversations, particularly V2, outperformed students in addressing questions of low and intermediate difficulty. Interestingly, students showcased enhanced proficiency when confronted with highly challenging questions. V1 fell short of passing the examination. Conversely, V2 successfully achieved examination success, outperforming 139 (62.1%) medical students. Conclusions: While ChatGPT has access to a comprehensive web-based data set, its performance closely mirrors that of an average medical student. Outcomes are influenced by question format, item complexity, and contextual nuances. The model faces challenges in medical contexts requiring information synthesis, advanced analytical aptitude, and clinical judgment, as well as in non-English language assessments and when confronted with data outside mainstream internet sources. %M 39042876 %R 10.2196/52818 %U https://mededu.jmir.org/2024/1/e52818 %U https://doi.org/10.2196/52818 %U http://www.ncbi.nlm.nih.gov/pubmed/39042876 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e51346 %T ChatGPT as a Tool for Medical Education and Clinical Decision-Making on the Wards: Case Study %A Skryd,Anthony %A Lawrence,Katharine %+ Department of Medicine, NYU Langone Health, 550 1st Avenue, New York City, NY, 10016, United States, 1 646 929 7800, anthony.skryd@nyulangone.org %K ChatGPT %K medical education %K large language models %K LLMs %K clinical decision-making %D 2024 %7 8.5.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: Large language models (LLMs) are computational artificial intelligence systems with advanced natural language processing capabilities that have recently been popularized among health care students and educators due to their ability to provide real-time access to a vast amount of medical knowledge. The adoption of LLM technology into medical education and training has varied, and little empirical evidence exists to support its use in clinical teaching environments. Objective: The aim of the study is to identify and qualitatively evaluate potential use cases and limitations of LLM technology for real-time ward-based educational contexts. Methods: A brief, single-site exploratory evaluation of the publicly available ChatGPT-3.5 (OpenAI) was conducted by implementing the tool into the daily attending rounds of a general internal medicine inpatient service at a large urban academic medical center. ChatGPT was integrated into rounds via both structured and organic use, using the web-based “chatbot” style interface to interact with the LLM through conversational free-text and discrete queries. A qualitative approach using phenomenological inquiry was used to identify key insights related to the use of ChatGPT through analysis of ChatGPT conversation logs and associated shorthand notes from the clinical sessions. Results: Identified use cases for ChatGPT integration included addressing medical knowledge gaps through discrete medical knowledge inquiries, building differential diagnoses and engaging dual-process thinking, challenging medical axioms, using cognitive aids to support acute care decision-making, and improving complex care management by facilitating conversations with subspecialties. 
Potential additional uses included engaging in difficult conversations with patients, exploring ethical challenges and general medical ethics teaching, personal continuing medical education resources, developing ward-based teaching tools, supporting and automating clinical documentation, and supporting productivity and task management. LLM biases, misinformation, ethics, and health equity were identified as areas of concern and potential limitations to clinical and training use. A code of conduct on ethical and appropriate use was also developed to guide team usage on the wards. Conclusions: Overall, ChatGPT offers a novel tool to enhance ward-based learning through rapid information querying, second-order content exploration, and engaged team discussion regarding generated responses. More research is needed to fully understand contexts for educational use, particularly regarding the risks and limitations of the tool in clinical settings and its impacts on trainee development. %M 38717811 %R 10.2196/51346 %U https://formative.jmir.org/2024/1/e51346 %U https://doi.org/10.2196/51346 %U http://www.ncbi.nlm.nih.gov/pubmed/38717811 %0 Journal Article %@ 2369-3762 %I %V 10 %N %P e55048 %T Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study %A Rojas,Marcos %A Rojas,Marcelo %A Burgess,Valentina %A Toro-Pérez,Javier %A Salehi,Shima %K artificial intelligence %K AI %K generative artificial intelligence %K medical education %K ChatGPT %K EUNACOM %K medical licensure %K medical license %K medical licensing exam %D 2024 %7 29.4.2024 %9 %J JMIR Med Educ %G English %X Background: The deployment of OpenAI’s ChatGPT-3.5 and its subsequent versions, ChatGPT-4 and ChatGPT-4 With Vision (4V; also known as “GPT-4 Turbo With Vision”), has notably influenced the medical field. Having demonstrated remarkable performance in medical examinations globally, these models show potential for educational applications. However, their effectiveness in non-English contexts, particularly in Chile’s medical licensing examinations—a critical step for medical practitioners in Chile—is less explored. This gap highlights the need to evaluate ChatGPT’s adaptability to diverse linguistic and cultural contexts. Objective: This study aims to evaluate the performance of ChatGPT versions 3.5, 4, and 4V in the EUNACOM (Examen Único Nacional de Conocimientos de Medicina), a major medical examination in Chile. Methods: Three official practice drills (540 questions) from the University of Chile, mirroring the EUNACOM’s structure and difficulty, were used to test ChatGPT versions 3.5, 4, and 4V. The 3 ChatGPT versions were provided 3 attempts for each drill. Responses to questions during each attempt were systematically categorized and analyzed to assess their accuracy rate. Results: All versions of ChatGPT passed the EUNACOM drills. Specifically, versions 4 and 4V outperformed version 3.5, achieving average accuracy rates of 79.32% and 78.83%, respectively, compared to 57.53% for version 3.5 (P<.001). Version 4V, however, did not outperform version 4 (P=.73), despite the additional visual capabilities. We also evaluated ChatGPT’s performance in different medical areas of the EUNACOM and found that versions 4 and 4V consistently outperformed version 3.5. Across the different medical areas, version 3.5 displayed the highest accuracy in psychiatry (69.84%), while versions 4 and 4V achieved the highest accuracy in surgery (90.00% and 86.11%, respectively). 
Versions 3.5 and 4 had the lowest performance in internal medicine (52.74% and 75.62%, respectively), while version 4V had the lowest performance in public health (74.07%). Conclusions: This study reveals ChatGPT’s ability to pass the EUNACOM, with distinct proficiencies across versions 3.5, 4, and 4V. Notably, advancements in artificial intelligence (AI) have not significantly led to enhancements in performance on image-based questions. The variations in proficiency across medical fields suggest the need for more nuanced AI training. Additionally, the study underscores the importance of exploring innovative approaches to using AI to augment human cognition and enhance the learning process. Such advancements have the potential to significantly influence medical education, fostering not only knowledge acquisition but also the development of critical thinking and problem-solving skills among health care professionals. %R 10.2196/55048 %U https://mededu.jmir.org/2024/1/e55048 %U https://doi.org/10.2196/55048 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e57054 %T Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study %A Noda,Masao %A Ueno,Takayoshi %A Koshu,Ryota %A Takaso,Yuji %A Shimada,Mari Dias %A Saito,Chizu %A Sugimoto,Hisashi %A Fushiki,Hiroaki %A Ito,Makoto %A Nomura,Akihiro %A Yoshizaki,Tomokazu %+ Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Yakushiji 3311-1, Shimotsuke, 329-0498, Japan, 1 0285442111, doforanabdosuc@gmail.com %K artificial intelligence %K GPT-4v %K large language model %K otolaryngology %K GPT %K ChatGPT %K LLM %K LLMs %K language model %K language models %K head %K respiratory %K ENT: ear %K nose %K throat %K neck %K NLP %K natural language processing %K image %K images %K exam %K exams %K examination %K examinations %K answer %K answers %K answering %K response %K responses %D 2024 %7 28.3.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Artificial intelligence models can learn from medical literature and clinical cases and generate answers that rival human experts. However, challenges remain in the analysis of complex data containing images and diagrams. Objective: This study aims to assess the answering capabilities and accuracy of ChatGPT-4 Vision (GPT-4V) for a set of 100 questions, including image-based questions, from the 2023 otolaryngology board certification examination. Methods: Answers to 100 questions from the 2023 otolaryngology board certification examination, including image-based questions, were generated using GPT-4V. The accuracy rate was evaluated using different prompts, and the presence of images, clinical area of the questions, and variations in the answer content were examined. Results: The accuracy rate for text-only input was, on average, 24.7% but improved to 47.3% with the addition of English translation and prompts (P<.001). The average nonresponse rate for text-only input was 46.3%; this decreased to 2.7% with the addition of English translation and prompts (P<.001). The accuracy rate was lower for image-based questions than for text-only questions across all types of input, with a relatively high nonresponse rate. General questions and questions from the fields of head and neck allergies and nasal allergies had relatively high accuracy rates, which increased with the addition of translation and prompts. In terms of content, questions related to anatomy had the highest accuracy rate. 
For all content types, the addition of translation and prompts increased the accuracy rate. As for the performance on image-based questions, the average correct answer rate with text-only input was 30.4%, and that with text-plus-image input was 41.3% (P=.02). Conclusions: Examination of artificial intelligence’s answering capabilities for the otolaryngology board certification examination improves our understanding of its potential and limitations in this field. Although improvement was noted with the addition of translation and prompts, the accuracy rate for image-based questions was lower than that for text-based questions, suggesting room for improvement in GPT-4V at this stage. Furthermore, text-plus-image input yielded a higher correct answer rate on image-based questions than text-only input. Our findings imply the usefulness and potential of GPT-4V in medicine; however, future consideration of safe use methods is needed. %M 38546736 %R 10.2196/57054 %U https://mededu.jmir.org/2024/1/e57054 %U https://doi.org/10.2196/57054 %U http://www.ncbi.nlm.nih.gov/pubmed/38546736 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e49964 %T Performance of ChatGPT on the India Undergraduate Community Medicine Examination: Cross-Sectional Study %A Gandhi,Aravind P %A Joesph,Felista Karen %A Rajagopal,Vineeth %A Aparnavi,P %A Katkuri,Sushma %A Dayama,Sonal %A Satapathy,Prakasini %A Khatib,Mahalaqua Nazli %A Gaidhane,Shilpa %A Zahiruddin,Quazi Syed %A Behera,Ashish %+ Department of Community Medicine, All India Institute of Medical Sciences, Room 420 Department of Community Medicine, Plot 2, Sector 20, MIHAN, Nagpur, Maharashtra, 441108, India, 91 9585395395, aravindsocialdoc@gmail.com %K artificial intelligence %K ChatGPT %K community medicine %K India %K large language model %K medical education %K digitalization %D 2024 %7 25.3.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: Medical students may increasingly use large language models (LLMs) in their learning. ChatGPT is an LLM at the forefront of this new development in medical education with the capacity to respond to multidisciplinary questions. Objective: The aim of this study was to evaluate the ability of ChatGPT 3.5 to complete the Indian undergraduate medical examination in the subject of community medicine. We further compared ChatGPT scores with the scores obtained by the students. Methods: The study was conducted at a publicly funded medical college in Hyderabad, India. The study was based on the internal assessment examination conducted in January 2023 for students in the Bachelor of Medicine and Bachelor of Surgery Final Year–Part I program; the examination of focus included 40 questions (divided between two papers) from the community medicine subject syllabus. Each paper had three sections with different weightage of marks for each section: section one had two long essay–type questions worth 15 marks each, section two had 8 short essay–type questions worth 5 marks each, and section three had 10 short-answer questions worth 3 marks each. The same questions were administered as prompts to ChatGPT 3.5 and the responses were recorded. Apart from scoring ChatGPT responses, two independent evaluators explored the responses to each question to further analyze their quality with regard to three subdomains: relevancy, coherence, and completeness. Each question was scored in these subdomains on a Likert scale of 1-5. The average of the two evaluators was taken as the subdomain score of the question.
The proportion of questions with a score 50% of the maximum score (5) in each subdomain was calculated. Results: ChatGPT 3.5 scored 72.3% on paper 1 and 61% on paper 2. The mean score of the 94 students was 43% on paper 1 and 45% on paper 2. The responses of ChatGPT 3.5 were also rated to be satisfactorily relevant, coherent, and complete for most of the questions (>80%). Conclusions: ChatGPT 3.5 appears to have substantial and sufficient knowledge to understand and answer the Indian medical undergraduate examination in the subject of community medicine. ChatGPT may be introduced to students to enable the self-directed learning of community medicine in pilot mode. However, faculty oversight will be required as ChatGPT is still in the initial stages of development, and thus its potential and reliability of medical content from the Indian context need to be further explored comprehensively. %M 38526538 %R 10.2196/49964 %U https://formative.jmir.org/2024/1/e49964 %U https://doi.org/10.2196/49964 %U http://www.ncbi.nlm.nih.gov/pubmed/38526538 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51151 %T Incorporating ChatGPT in Medical Informatics Education: Mixed Methods Study on Student Perceptions and Experiential Integration Proposals %A Magalhães Araujo,Sabrina %A Cruz-Correia,Ricardo %+ Center for Health Technology and Services Research, Faculty of Medicine, University of Porto, Rua Dr Plácido da Costa, s/n, Porto, 4200-450, Portugal, 351 220 426 91 ext 26911, saraujo@med.up.pt %K education %K medical informatics %K artificial intelligence %K AI %K generative language model %K ChatGPT %D 2024 %7 20.3.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: The integration of artificial intelligence (AI) technologies, such as ChatGPT, in the educational landscape has the potential to enhance the learning experience of medical informatics students and prepare them for using AI in professional settings. The incorporation of AI in classes aims to develop critical thinking by encouraging students to interact with ChatGPT and critically analyze the responses generated by the chatbot. This approach also helps students develop important skills in the field of biomedical and health informatics to enhance their interaction with AI tools. Objective: The aim of the study is to explore the perceptions of students regarding the use of ChatGPT as a learning tool in their educational context and provide professors with examples of prompts for incorporating ChatGPT into their teaching and learning activities, thereby enhancing the educational experience for students in medical informatics courses. Methods: This study used a mixed methods approach to gain insights from students regarding the use of ChatGPT in education. To accomplish this, a structured questionnaire was applied to evaluate students’ familiarity with ChatGPT, gauge their perceptions of its use, and understand their attitudes toward its use in academic and learning tasks. Learning outcomes of 2 courses were analyzed to propose ChatGPT’s incorporation in master’s programs in medicine and medical informatics. Results: The majority of students expressed satisfaction with the use of ChatGPT in education, finding it beneficial for various purposes, including generating academic content, brainstorming ideas, and rewriting text. While some participants raised concerns about potential biases and the need for informed use, the overall perception was positive. 
Additionally, the study proposed integrating ChatGPT into 2 specific courses in the master’s programs in medicine and medical informatics. The incorporation of ChatGPT was envisioned to enhance student learning experiences and assist in project planning, programming code generation, examination preparation, workflow exploration, and technical interview preparation, thus advancing medical informatics education. In medical teaching, it will be used as an assistant for simplifying the explanation of concepts and solving complex problems, as well as for generating clinical narratives and patient simulators. Conclusions: The study’s valuable insights into medical faculty students’ perspectives and integration proposals for ChatGPT serve as an informative guide for professors aiming to enhance medical informatics education. The research delves into the potential of ChatGPT, emphasizes the necessity of collaboration in academic environments, identifies subject areas with discernible benefits, and underscores its transformative role in fostering innovative and engaging learning experiences. The envisaged proposals hold promise in empowering future health care professionals to work in the rapidly evolving era of digital health care. %M 38506920 %R 10.2196/51151 %U https://mededu.jmir.org/2024/1/e51151 %U https://doi.org/10.2196/51151 %U http://www.ncbi.nlm.nih.gov/pubmed/38506920 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e54393 %T Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study %A Nakao,Takahiro %A Miki,Soichiro %A Nakamura,Yuta %A Kikuchi,Tomohiro %A Nomura,Yukihiro %A Hanaoka,Shouhei %A Yoshikawa,Takeharu %A Abe,Osamu %+ Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan, 81 358008666, tanakao-tky@umin.ac.jp %K AI %K artificial intelligence %K LLM %K large language model %K language model %K language models %K ChatGPT %K GPT-4 %K GPT-4V %K generative pretrained transformer %K image %K images %K imaging %K response %K responses %K exam %K examination %K exams %K examinations %K answer %K answers %K NLP %K natural language processing %K chatbot %K chatbots %K conversational agent %K conversational agents %K medical education %D 2024 %7 12.3.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Previous research applying large language models (LLMs) to medicine was focused on text-based information. Recently, multimodal variants of LLMs acquired the capability of recognizing images. Objective: We aim to evaluate the image recognition capability of generative pretrained transformer (GPT)-4V, a recent multimodal LLM developed by OpenAI, in the medical field by testing how visual information affects its performance to answer questions in the 117th Japanese National Medical Licensing Examination. Methods: We focused on 108 questions that had 1 or more images as part of a question and presented GPT-4V with the same questions under two conditions: (1) with both the question text and associated images and (2) with the question text only. We then compared the difference in accuracy between the 2 conditions using the exact McNemar test. Results: Among the 108 questions with images, GPT-4V’s accuracy was 68% (73/108) when presented with images and 72% (78/108) when presented without images (P=.36). 
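A short sketch of the exact McNemar test used above: only the marginal totals (73/108 correct with images, 78/108 without) come from the abstract, so the split of discordant pairs below is a hypothetical example chosen to be consistent with those totals.

from statsmodels.stats.contingency_tables import mcnemar

# Paired right/wrong outcomes for the 108 image questions (hypothetical cell counts).
table = [[66, 7],     # both correct | correct only when images were shown
         [12, 23]]    # correct only without images | both incorrect
result = mcnemar(table, exact=True)
print(f"exact McNemar P = {result.pvalue:.2f}")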
For the 2 question categories, clinical and general, the accuracies with and those without images were 71% (70/98) versus 78% (76/98; P=.21) and 30% (3/10) versus 20% (2/10; P≥.99), respectively. Conclusions: The additional information from the images did not significantly improve the performance of GPT-4V in the Japanese National Medical Licensing Examination. %M 38470459 %R 10.2196/54393 %U https://mededu.jmir.org/2024/1/e54393 %U https://doi.org/10.2196/54393 %U http://www.ncbi.nlm.nih.gov/pubmed/38470459 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51426 %T Exploring the Feasibility of Using ChatGPT to Create Just-in-Time Adaptive Physical Activity mHealth Intervention Content: Case Study %A Willms,Amanda %A Liu,Sam %+ School of Exercise Science, Physical and Health Education, University of Victoria, PO Box 3010 STN CSC, Victoria, BC, V8W 2Y2, Canada, 1 250 721 8392, awillms@uvic.ca %K ChatGPT %K digital health %K mobile health %K mHealth %K physical activity %K application %K mobile app %K mobile apps %K content creation %K behavior change %K app design %D 2024 %7 29.2.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Achieving physical activity (PA) guidelines’ recommendation of 150 minutes of moderate-to-vigorous PA per week has been shown to reduce the risk of many chronic conditions. Despite the overwhelming evidence in this field, PA levels remain low globally. By creating engaging mobile health (mHealth) interventions through strategies such as just-in-time adaptive interventions (JITAIs) that are tailored to an individual’s dynamic state, there is potential to increase PA levels. However, generating personalized content can take a long time due to various versions of content required for the personalization algorithms. ChatGPT presents an incredible opportunity to rapidly produce tailored content; however, there is a lack of studies exploring its feasibility. Objective: This study aimed to (1) explore the feasibility of using ChatGPT to create content for a PA JITAI mobile app and (2) describe lessons learned and future recommendations for using ChatGPT in the development of mHealth JITAI content. Methods: During phase 1, we used Pathverse, a no-code app builder, and ChatGPT to develop a JITAI app to help parents support their child’s PA levels. The intervention was developed based on the Multi-Process Action Control (M-PAC) framework, and the necessary behavior change techniques targeting the M-PAC constructs were implemented in the app design to help parents support their child’s PA. The acceptability of using ChatGPT for this purpose was discussed to determine its feasibility. In phase 2, we summarized the lessons we learned during the JITAI content development process using ChatGPT and generated recommendations to inform future similar use cases. Results: In phase 1, by using specific prompts, we efficiently generated content for 13 lessons relating to increasing parental support for their child’s PA following the M-PAC framework. It was determined that using ChatGPT for this case study to develop PA content for a JITAI was acceptable. 
In phase 2, we summarized our recommendations into the following six steps when using ChatGPT to create content for mHealth behavior interventions: (1) determine target behavior, (2) ground the intervention in behavior change theory, (3) design the intervention structure, (4) input intervention structure and behavior change constructs into ChatGPT, (5) revise the ChatGPT response, and (6) customize the response to be used in the intervention. Conclusions: ChatGPT offers a remarkable opportunity for rapid content creation in the context of an mHealth JITAI. Although our case study demonstrated that ChatGPT was acceptable, it is essential to approach its use, along with other language models, with caution. Before delivering content to population groups, expert review is crucial to ensure accuracy and relevancy. Future research and application of these guidelines are imperative as we deepen our understanding of ChatGPT and its interactions with human input. %M 38421689 %R 10.2196/51426 %U https://mededu.jmir.org/2024/1/e51426 %U https://doi.org/10.2196/51426 %U http://www.ncbi.nlm.nih.gov/pubmed/38421689 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e48989 %T Using ChatGPT-Like Solutions to Bridge the Communication Gap Between Patients With Rheumatoid Arthritis and Health Care Professionals %A Chen,Chih-Wei %A Walter,Paul %A Wei,James Cheng-Chung %+ National Applied Research Laboratories, 3F, No 106, Sector 2, Heping East Road, Taipei, 106214, Taiwan, 886 975303092, chihwei.chen@udm.global %K rheumatoid arthritis %K ChatGPT %K artificial intelligence %K communication gap %K privacy %K data management %D 2024 %7 27.2.2024 %9 Viewpoint %J JMIR Med Educ %G English %X The communication gap between patients and health care professionals has led to increased disputes and resource waste in the medical domain. The development of artificial intelligence and other technologies brings new possibilities to solve this problem. This viewpoint paper proposes a new relationship between patients and health care professionals—“shared decision-making”—allowing both sides to obtain a deeper understanding of the disease and reach a consensus during diagnosis and treatment. Then, this paper discusses the important impact of ChatGPT-like solutions in treating rheumatoid arthritis using methotrexate from clinical and patient perspectives. For clinical professionals, ChatGPT-like solutions could provide support in disease diagnosis, treatment, and clinical trials, but attention should be paid to privacy, confidentiality, and regulatory norms. For patients, ChatGPT-like solutions allow easy access to massive amounts of information; however, the information should be carefully managed to ensure safe and effective care. To ensure the effective application of ChatGPT-like solutions in improving the relationship between patients and health care professionals, it is essential to establish a comprehensive database and provide legal, ethical, and other support. Above all, ChatGPT-like solutions could benefit patients and health care professionals if they ensure evidence-based solutions and data protection and collaborate with regulatory authorities and regulatory evolution. 
%M 38412022 %R 10.2196/48989 %U https://mededu.jmir.org/2024/1/e48989 %U https://doi.org/10.2196/48989 %U http://www.ncbi.nlm.nih.gov/pubmed/38412022 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51523 %T Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard %A Farhat,Faiza %A Chaudhry,Beenish Moalla %A Nadeem,Mohammad %A Sohail,Shahab Saquib %A Madsen,Dag Øivind %+ School of Business, University of South-Eastern Norway, Bredalsveien 14, Hønefoss, 3511, Norway, 47 31008732, dag.oivind.madsen@usn.no %K accuracy %K AI model %K artificial intelligence %K Bard %K ChatGPT %K educational task %K GPT-4 %K Generative Pre-trained Transformers %K large language models %K medical education, medical exam %K natural language processing %K performance %K premedical exams %K suitability %D 2024 %7 21.2.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Large language models (LLMs) have revolutionized natural language processing with their ability to generate human-like text through extensive training on large data sets. These models, including Generative Pre-trained Transformers (GPT)-3.5 (OpenAI), GPT-4 (OpenAI), and Bard (Google LLC), find applications beyond natural language processing, attracting interest from academia and industry. Students are actively leveraging LLMs to enhance learning experiences and prepare for high-stakes exams, such as the National Eligibility cum Entrance Test (NEET) in India. Objective: This comparative analysis aims to evaluate the performance of GPT-3.5, GPT-4, and Bard in answering NEET-2023 questions. Methods: In this paper, we evaluated the performance of the 3 mainstream LLMs, namely GPT-3.5, GPT-4, and Google Bard, in answering questions related to the NEET-2023 exam. The questions of the NEET were provided to these artificial intelligence models, and the responses were recorded and compared against the correct answers from the official answer key. Consensus was used to evaluate the performance of all 3 models. Results: It was evident that GPT-4 passed the entrance test with flying colors (300/700, 42.9%), showcasing exceptional performance. On the other hand, GPT-3.5 managed to meet the qualifying criteria, but with a substantially lower score (145/700, 20.7%). However, Bard (115/700, 16.4%) failed to meet the qualifying criteria and did not pass the test. GPT-4 demonstrated consistent superiority over Bard and GPT-3.5 in all 3 subjects. Specifically, GPT-4 achieved accuracy rates of 73% (29/40) in physics, 44% (16/36) in chemistry, and 51% (50/99) in biology. Conversely, GPT-3.5 attained an accuracy rate of 45% (18/40) in physics, 33% (13/26) in chemistry, and 34% (34/99) in biology. The accuracy consensus metric showed that the matching responses between GPT-4 and Bard, as well as GPT-4 and GPT-3.5, had higher incidences of being correct, at 0.56 and 0.57, respectively, compared to the matching responses between Bard and GPT-3.5, which stood at 0.42. When all 3 models were considered together, their matching responses reached the highest accuracy consensus of 0.59. Conclusions: The study’s findings provide valuable insights into the performance of GPT-3.5, GPT-4, and Bard in answering NEET-2023 questions. GPT-4 emerged as the most accurate model, highlighting its potential for educational applications. 
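One plausible reading of the accuracy consensus figures reported above (0.56, 0.57, 0.42, and 0.59) is the proportion of matching (identical) responses between models that are also correct. Under that reading, a minimal sketch with hypothetical answer sets looks as follows.

```python
# Sketch of one plausible reading of the "accuracy consensus" metric: among
# questions where two models give the same answer, the fraction whose shared
# answer matches the official key. All answer data below are hypothetical.
def accuracy_consensus(answers_a, answers_b, key):
    """Fraction of agreed-upon answers that are also correct."""
    agreed = [(a, k) for a, b, k in zip(answers_a, answers_b, key) if a == b]
    if not agreed:
        return 0.0
    return sum(a == k for a, k in agreed) / len(agreed)

# Hypothetical toy answer sets (A-D options) for 6 questions
gpt4 = ["A", "B", "C", "D", "A", "B"]
bard = ["A", "C", "C", "D", "B", "B"]
key  = ["A", "B", "C", "A", "B", "B"]

print(accuracy_consensus(gpt4, bard, key))  # 0.75 here: 3 of the 4 agreements are correct
```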
Cross-checking responses across models may result in confusion as the compared models (as duos or a trio) tend to agree on only a little over half of the correct responses. Using GPT-4 as one of the compared models will result in higher accuracy consensus. The results underscore the suitability of LLMs for high-stakes exams and their positive impact on education. Additionally, the study establishes a benchmark for evaluating and enhancing LLMs’ performance in educational tasks, promoting responsible and informed use of these models in diverse learning environments. %M 38381486 %R 10.2196/51523 %U https://mededu.jmir.org/2024/1/e51523 %U https://doi.org/10.2196/51523 %U http://www.ncbi.nlm.nih.gov/pubmed/38381486 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51391 %T Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models %A Abdullahi,Tassallah %A Singh,Ritambhara %A Eickhoff,Carsten %+ School of Medicine, University of Tübingen, Schaffhausenstr, 77, Tübingen, 72072, Germany, 49 7071 29 843, carsten.eickhoff@uni-tuebingen.de %K clinical decision support %K rare diseases %K complex diseases %K prompt engineering %K reliability %K consistency %K natural language processing %K language model %K Bard %K ChatGPT 3.5 %K GPT-4 %K MedAlpaca %K medical education %K complex diagnosis %K artificial intelligence %K AI assistance %K medical training %K prediction model %D 2024 %7 13.2.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Patients with rare and complex diseases often experience delayed diagnoses and misdiagnoses because comprehensive knowledge about these diseases is limited to only a few medical experts. In this context, large language models (LLMs) have emerged as powerful knowledge aggregation tools with applications in clinical decision support and education domains. Objective: This study aims to explore the potential of 3 popular LLMs, namely Bard (Google LLC), ChatGPT-3.5 (OpenAI), and GPT-4 (OpenAI), in medical education to enhance the diagnosis of rare and complex diseases while investigating the impact of prompt engineering on their performance. Methods: We conducted experiments on publicly available complex and rare cases to achieve these objectives. We implemented various prompt strategies to evaluate the performance of these models using both open-ended and multiple-choice prompts. In addition, we used a majority voting strategy to leverage diverse reasoning paths within language models, aiming to enhance their reliability. Furthermore, we compared their performance with the performance of human respondents and MedAlpaca, a generative LLM specifically designed for medical tasks. Results: Notably, all LLMs outperformed the average human consensus and MedAlpaca, with a minimum margin of 5% and 13%, respectively, across all 30 cases from the diagnostic case challenge collection. On the frequently misdiagnosed cases category, Bard tied with MedAlpaca but surpassed the human average consensus by 14%, whereas GPT-4 and ChatGPT-3.5 outperformed MedAlpaca and the human respondents on the moderately often misdiagnosed cases category with minimum accuracy scores of 28% and 11%, respectively. The majority voting strategy, particularly with GPT-4, demonstrated the highest overall score across all cases from the diagnostic complex case collection, surpassing that of other LLMs. 
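The majority voting strategy mentioned in the rare-disease diagnosis study above aggregates repeated model outputs for the same case and keeps the most frequent answer. A minimal illustration follows, with hypothetical diagnoses standing in for repeated LLM reasoning paths.

```python
# Minimal sketch of majority voting over repeated model attempts: sample several
# responses for the same prompt and keep the most frequent answer. The votes
# below are hypothetical placeholders for repeated LLM reasoning paths.
from collections import Counter

def majority_vote(candidate_answers):
    """Return the most common answer among repeated model attempts."""
    counts = Counter(candidate_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical repeated diagnoses for one case
votes = ["sarcoidosis", "tuberculosis", "sarcoidosis", "sarcoidosis", "lymphoma"]
print(majority_vote(votes))  # -> "sarcoidosis"
```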
On the Medical Information Mart for Intensive Care-III data sets, Bard and GPT-4 achieved the highest diagnostic accuracy scores, with multiple-choice prompts scoring 93%, whereas ChatGPT-3.5 and MedAlpaca scored 73% and 47%, respectively. Furthermore, our results demonstrate that there is no one-size-fits-all prompting approach for improving the performance of LLMs and that a single strategy does not universally apply to all LLMs. Conclusions: Our findings shed light on the diagnostic capabilities of LLMs and the challenges associated with identifying an optimal prompting strategy that aligns with each language model’s characteristics and specific task requirements. The significance of prompt engineering is highlighted, providing valuable insights for researchers and practitioners who use these language models for medical training. Furthermore, this study represents a crucial step toward understanding how LLMs can enhance diagnostic reasoning in rare and complex medical cases, paving the way for developing effective educational tools and accurate diagnostic aids to improve patient care and outcomes. %M 38349725 %R 10.2196/51391 %U https://mededu.jmir.org/2024/1/e51391 %U https://doi.org/10.2196/51391 %U http://www.ncbi.nlm.nih.gov/pubmed/38349725 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e48949 %T Cocreating an Automated mHealth Apps Systematic Review Process With Generative AI: Design Science Research Approach %A Giunti,Guido %A Doherty,Colin P %+ Academic Unit of Neurology, School of Medicine, Trinity College Dublin, College Green, Dublin, D02, Ireland, 353 1 896 1000, drguidogiunti@gmail.com %K generative artificial intelligence %K mHealth %K ChatGPT %K evidence-base %K apps %K qualitative study %K design science research %K eHealth %K mobile device %K AI %K language model %K mHealth intervention %K generative AI %K AI tool %K software code %K systematic review %K language model %D 2024 %7 12.2.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: The use of mobile devices for delivering health-related services (mobile health [mHealth]) has rapidly increased, leading to a demand for summarizing the state of the art and practice through systematic reviews. However, the systematic review process is a resource-intensive and time-consuming process. Generative artificial intelligence (AI) has emerged as a potential solution to automate tedious tasks. Objective: This study aimed to explore the feasibility of using generative AI tools to automate time-consuming and resource-intensive tasks in a systematic review process and assess the scope and limitations of using such tools. Methods: We used the design science research methodology. The solution proposed is to use cocreation with a generative AI, such as ChatGPT, to produce software code that automates the process of conducting systematic reviews. Results: A triggering prompt was generated, and assistance from the generative AI was used to guide the steps toward developing, executing, and debugging a Python script. Errors in code were solved through conversational exchange with ChatGPT, and a tentative script was created. The code pulled the mHealth solutions from the Google Play Store and searched their descriptions for keywords that hinted toward evidence base. The results were exported to a CSV file, which was compared to the initial outputs of other similar systematic review processes. 
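The screening step described above for the mobile health app review, searching app descriptions for keywords that hint at an evidence base and exporting the results to a CSV file, can be illustrated roughly as follows. This is not the study's ChatGPT-cocreated script; the keyword list and app descriptions are hypothetical, and retrieval from the Google Play Store is deliberately left out of the sketch.

```python
# Rough sketch of the keyword-screening and CSV-export step (not the study's
# actual script). Descriptions are hypothetical; fetching them from the
# Google Play Store is outside this sketch.
import csv

EVIDENCE_KEYWORDS = ["clinical trial", "randomized", "evidence-based", "peer-reviewed", "validated"]

apps = [
    {"name": "StepCoach", "description": "An evidence-based walking program validated in a clinical trial."},
    {"name": "FitBuddy", "description": "Fun workouts and streaks to keep you moving."},
]

with open("mhealth_screening.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "evidence_hits"])
    writer.writeheader()
    for app in apps:
        hits = [kw for kw in EVIDENCE_KEYWORDS if kw in app["description"].lower()]
        writer.writerow({"name": app["name"], "evidence_hits": "; ".join(hits)})
```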
Conclusions: This study demonstrates the potential of using generative AI to automate the time-consuming process of conducting systematic reviews of mHealth apps. This approach could be particularly useful for researchers with limited coding skills. However, the study has limitations related to the design science research methodology, subjectivity bias, and the quality of the search results used to train the language model. %M 38345839 %R 10.2196/48949 %U https://mededu.jmir.org/2024/1/e48949 %U https://doi.org/10.2196/48949 %U http://www.ncbi.nlm.nih.gov/pubmed/38345839 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e48514 %T Performance of ChatGPT on the Chinese Postgraduate Examination for Clinical Medicine: Survey Study %A Yu,Peng %A Fang,Changchang %A Liu,Xiaolin %A Fu,Wanying %A Ling,Jitao %A Yan,Zhiwei %A Jiang,Yuan %A Cao,Zhengyu %A Wu,Maoxiong %A Chen,Zhiteng %A Zhu,Wengen %A Zhang,Yuling %A Abudukeremu,Ayiguli %A Wang,Yue %A Liu,Xiao %A Wang,Jingfeng %+ Department of Cardiology, Sun Yat-sen Memorial Hospital of Sun Yat-sen University, 107 Yanjiang West Road, Guangzhou, China, 86 15083827378, liux587@mail.sysu.edu.cn %K ChatGPT %K Chinese Postgraduate Examination for Clinical Medicine %K medical student %K performance %K artificial intelligence %K medical care %K qualitative feedback %K medical education %K clinical decision-making %D 2024 %7 9.2.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT, an artificial intelligence (AI) based on large-scale language models, has sparked interest in the field of health care. Nonetheless, the capabilities of AI in text comprehension and generation are constrained by the quality and volume of available training data for a specific language, and the performance of AI across different languages requires further investigation. While AI harbors substantial potential in medicine, it is imperative to tackle challenges such as the formulation of clinical care standards; facilitating cultural transitions in medical education and practice; and managing ethical issues including data privacy, consent, and bias. Objective: The study aimed to evaluate ChatGPT’s performance in processing Chinese Postgraduate Examination for Clinical Medicine questions, assess its clinical reasoning ability, investigate potential limitations with the Chinese language, and explore its potential as a valuable tool for medical professionals in the Chinese context. Methods: A data set of Chinese Postgraduate Examination for Clinical Medicine questions was used to assess the effectiveness of ChatGPT’s (version 3.5) medical knowledge in the Chinese language, which has a data set of 165 medical questions that were divided into three categories: (1) common questions (n=90) assessing basic medical knowledge, (2) case analysis questions (n=45) focusing on clinical decision-making through patient case evaluations, and (3) multichoice questions (n=30) requiring the selection of multiple correct answers. First of all, we assessed whether ChatGPT could meet the stringent cutoff score defined by the government agency, which requires a performance within the top 20% of candidates. Additionally, in our evaluation of ChatGPT’s performance on both original and encoded medical questions, 3 primary indicators were used: accuracy, concordance (which validates the answer), and the frequency of insights. 
Results: Our evaluation revealed that ChatGPT scored 153.5 out of 300 for original questions in Chinese, which signifies the minimum score set to ensure that at least 20% more candidates pass than the enrollment quota. However, ChatGPT had low accuracy in answering open-ended medical questions, with only 31.5% total accuracy. The accuracy for common questions, multichoice questions, and case analysis questions was 42%, 37%, and 17%, respectively. ChatGPT achieved a 90% concordance across all questions. Among correct responses, the concordance was 100%, significantly exceeding that of incorrect responses (n=57, 50%; P<.001). ChatGPT provided innovative insights for 80% (n=132) of all questions, with an average of 2.95 insights per accurate response. Conclusions: Although ChatGPT surpassed the passing threshold for the Chinese Postgraduate Examination for Clinical Medicine, its performance in answering open-ended medical questions was suboptimal. Nonetheless, ChatGPT exhibited high internal concordance and the ability to generate multiple insights in the Chinese language. Future research should investigate the language-based discrepancies in ChatGPT’s performance within the health care context. %M 38335017 %R 10.2196/48514 %U https://mededu.jmir.org/2024/1/e48514 %U https://doi.org/10.2196/48514 %U http://www.ncbi.nlm.nih.gov/pubmed/38335017 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e50965 %T Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study %A Meyer,Annika %A Riese,Janik %A Streichert,Thomas %+ Institute for Clinical Chemistry, University Hospital Cologne, Kerpener Str 62, Cologne, 50937, Germany, annika.meyer1@uk-koeln.de %K ChatGPT %K artificial intelligence %K large language model %K medical exams %K medical examinations %K medical education %K LLM %K public trust %K trust %K medical accuracy %K licensing exam %K licensing examination %K improvement %K patient care %K general population %K licensure examination %D 2024 %7 8.2.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: The potential of artificial intelligence (AI)–based large language models, such as ChatGPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility, but also by the prospect of democratizing medical knowledge and promoting equitable health care. However, the performance of ChatGPT is substantially influenced by the input language, and given the growing public trust in this AI tool compared to that in traditional sources of information, investigating its medical accuracy across different languages is of particular importance. Objective: This study aimed to compare the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination. Methods: To assess GPT-3.5’s and GPT-4's medical proficiency, we used 937 original multiple-choice questions from 3 written German medical licensing examinations in October 2021, April 2022, and October 2022. Results: GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same examinations in October 2021, April 2022, and October 2022, respectively. This represents a substantial improvement of 27% compared to GPT-3.5, which only passed 1 out of the 3 examinations. 
While GPT-3.5 performed well in psychiatry questions, GPT-4 exhibited strengths in internal medicine and surgery but showed weakness in academic research. Conclusions: The study results highlight ChatGPT’s remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While GPT-4’s predecessor (GPT-3.5) was imprecise and inconsistent, it demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its results. As the replacement of search engines by AI tools seems possible in the future, further studies with nonprofessional questions are needed to assess the safety and accuracy of ChatGPT for the general population. %M 38329802 %R 10.2196/50965 %U https://mededu.jmir.org/2024/1/e50965 %U https://doi.org/10.2196/50965 %U http://www.ncbi.nlm.nih.gov/pubmed/38329802 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e50705 %T Increasing Realism and Variety of Virtual Patient Dialogues for Prenatal Counseling Education Through a Novel Application of ChatGPT: Exploratory Observational Study %A Gray,Megan %A Baird,Austin %A Sawyer,Taylor %A James,Jasmine %A DeBroux,Thea %A Bartlett,Michelle %A Krick,Jeanne %A Umoren,Rachel %+ Division of Neonatology, University of Washington, M/S FA.2.113, 4800 Sand Point Way, Seattle, WA, 98105, United States, 1 206 919 5476, graym1@uw.edu %K prenatal counseling %K virtual health %K virtual patient %K simulation %K neonatology %K ChatGPT %K AI %K artificial intelligence %D 2024 %7 1.2.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Using virtual patients, facilitated by natural language processing, provides a valuable educational experience for learners. Generating a large, varied sample of realistic and appropriate responses for virtual patients is challenging. Artificial intelligence (AI) programs can be a viable source for these responses, but their utility for this purpose has not been explored. Objective: In this study, we explored the effectiveness of generative AI (ChatGPT) in developing realistic virtual standardized patient dialogues to teach prenatal counseling skills. Methods: ChatGPT was prompted to generate a list of common areas of concern and questions that families expecting preterm delivery at 24 weeks gestation might ask during prenatal counseling. ChatGPT was then prompted to generate 2 role-plays with dialogues between a parent expecting a potential preterm delivery at 24 weeks and their counseling physician using each of the example questions. The prompt was repeated for 2 unique role-plays: one parent was characterized as anxious and the other as having low trust in the medical system. Role-play scripts were exported verbatim and independently reviewed by 2 neonatologists with experience in prenatal counseling, using a scale of 1-5 on realism, appropriateness, and utility for virtual standardized patient responses. Results: ChatGPT generated 7 areas of concern, with 35 example questions used to generate role-plays. The 35 role-play transcripts generated 176 unique parent responses (median 5, IQR 4-6, per role-play) with 268 unique sentences. Expert review identified 117 (65%) of the 176 responses as indicating an emotion, either directly or indirectly. Approximately half (98/176, 56%) of the responses had 2 or more sentences, and half (88/176, 50%) included at least 1 question. 
More than half (104/176, 58%) of the responses from role-played parent characters described a feeling, such as being scared, worried, or concerned. The role-plays of parents with low trust in the medical system generated many unique sentences (n=50). Most of the sentences in the responses were found to be reasonably realistic (214/268, 80%), appropriate for variable prenatal counseling conversation paths (233/268, 87%), and usable without more than a minimal modification in a virtual patient program (169/268, 63%). Conclusions: Generative AI programs, such as ChatGPT, may provide a viable source of training materials to expand virtual patient programs, with careful attention to the concerns and questions of patients and families. Given the potential for unrealistic or inappropriate statements and questions, an expert should review AI chat outputs before deploying them in an educational program. %M 38300696 %R 10.2196/50705 %U https://mededu.jmir.org/2024/1/e50705 %U https://doi.org/10.2196/50705 %U http://www.ncbi.nlm.nih.gov/pubmed/38300696 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51344 %T Evaluation of ChatGPT’s Real-Life Implementation in Undergraduate Dental Education: Mixed Methods Study %A Kavadella,Argyro %A Dias da Silva,Marco Antonio %A Kaklamanos,Eleftherios G %A Stamatopoulos,Vasileios %A Giannakopoulos,Kostis %+ School of Dentistry, European University Cyprus, 6, Diogenes street, Engomi, Nicosia, 2404, Cyprus, 357 22559620, a.kavadella@euc.ac.cy %K ChatGPT %K large language models %K LLM %K natural language processing %K artificial Intelligence %K dental education %K higher education %K learning assignments %K dental students %K AI pedagogy %K dentistry %K university %D 2024 %7 31.1.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: The recent artificial intelligence tool ChatGPT seems to offer a range of benefits in academic education while also raising concerns. Relevant literature encompasses issues of plagiarism and academic dishonesty, as well as pedagogy and educational affordances; yet, no real-life implementation of ChatGPT in the educational process has been reported to our knowledge so far. Objective: This mixed methods study aimed to evaluate the implementation of ChatGPT in the educational process, both quantitatively and qualitatively. Methods: In March 2023, a total of 77 second-year dental students of the European University Cyprus were divided into 2 groups and asked to compose a learning assignment on “Radiation Biology and Radiation Protection in the Dental Office,” working collaboratively in small subgroups, as part of the educational semester program of the Dentomaxillofacial Radiology module. Careful planning ensured a seamless integration of ChatGPT, addressing potential challenges. One group searched the internet for scientific resources to perform the task and the other group used ChatGPT for this purpose. Both groups developed a PowerPoint (Microsoft Corp) presentation based on their research and presented it in class. The ChatGPT group students additionally registered all interactions with the language model during the prompting process and evaluated the final outcome; they also answered an open-ended evaluation questionnaire, including questions on their learning experience. Finally, all students undertook a knowledge examination on the topic, and the grades between the 2 groups were compared statistically, whereas the free-text comments of the questionnaires were thematically analyzed. 
Results: Out of the 77 students, 39 were assigned to the ChatGPT group and 38 to the literature research group. Seventy students undertook the multiple choice question knowledge examination, and examination grades ranged from 5 to 10 on the 0-10 grading scale. The Mann-Whitney U test showed that students of the ChatGPT group performed significantly better (P=.045) than students of the literature research group. The evaluation questionnaires revealed the benefits (human-like interface, immediate response, and wide knowledge base), the limitations (need for rephrasing the prompts to get a relevant answer, general content, false citations, and incapability to provide images or videos), and the prospects (in education, clinical practice, continuing education, and research) of ChatGPT. Conclusions: Students using ChatGPT for their learning assignments performed significantly better in the knowledge examination than their fellow students who used the literature research methodology. Students adapted quickly to the technological environment of the language model, recognized its opportunities and limitations, and used it creatively and efficiently. Implications for practice: the study underscores the adaptability of students to technological innovations including ChatGPT and its potential to enhance educational outcomes. Educators should consider integrating ChatGPT into curriculum design; awareness programs are warranted to educate both students and educators about the limitations of ChatGPT, encouraging critical engagement and responsible use. %M 38111256 %R 10.2196/51344 %U https://mededu.jmir.org/2024/1/e51344 %U https://doi.org/10.2196/51344 %U http://www.ncbi.nlm.nih.gov/pubmed/38111256 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e50842 %T Performance of ChatGPT on Ophthalmology-Related Questions Across Various Examination Levels: Observational Study %A Haddad,Firas %A Saade,Joanna S %+ Department of Ophthalmology, American University of Beirut Medical Center, Bliss Street, Beirut, 1107 2020, Lebanon, 961 1350000 ext 8031, js62@aub.edu.lb %K ChatGPT %K artificial intelligence %K AI %K board examinations %K ophthalmology %K testing %D 2024 %7 18.1.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT and language learning models have gained attention recently for their ability to answer questions on various examinations across various disciplines. The question of whether ChatGPT could be used to aid in medical education is yet to be answered, particularly in the field of ophthalmology. Objective: The aim of this study is to assess the ability of ChatGPT-3.5 (GPT-3.5) and ChatGPT-4.0 (GPT-4.0) to answer ophthalmology-related questions across different levels of ophthalmology training. Methods: Questions from the United States Medical Licensing Examination (USMLE) steps 1 (n=44), 2 (n=60), and 3 (n=28) were extracted from AMBOSS, and 248 questions (64 easy, 122 medium, and 62 difficult questions) were extracted from the book, Ophthalmology Board Review Q&A, for the Ophthalmic Knowledge Assessment Program and the Board of Ophthalmology (OB) Written Qualifying Examination (WQE). Questions were prompted identically and inputted to GPT-3.5 and GPT-4.0. Results: GPT-3.5 achieved a total of 55% (n=210) of correct answers, while GPT-4.0 achieved a total of 70% (n=270) of correct answers. GPT-3.5 answered 75% (n=33) of questions correctly in USMLE step 1, 73.33% (n=44) in USMLE step 2, 60.71% (n=17) in USMLE step 3, and 46.77% (n=116) in the OB-WQE. 
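The between-group grade comparison in the dental education study above relies on the Mann-Whitney U test. A minimal sketch is shown below; the two grade lists are hypothetical, with only the 0-10 grading scale and the direction of the reported difference taken from the abstract.

```python
# Sketch of a Mann-Whitney U comparison of examination grades between two
# groups. Grade lists are hypothetical; only the 0-10 scale and the direction
# of the reported difference come from the abstract.
from scipy.stats import mannwhitneyu

chatgpt_group = [8, 9, 7, 10, 8, 9, 7, 8, 9, 10]    # hypothetical grades
literature_group = [7, 6, 8, 7, 9, 6, 7, 8, 6, 7]   # hypothetical grades

stat, p_value = mannwhitneyu(chatgpt_group, literature_group, alternative="two-sided")
print(f"U={stat}, P={p_value:.3f}")
```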
GPT-4.0 answered 70.45% (n=31) of questions correctly in USMLE step 1, 90.32% (n=56) in USMLE step 2, 96.43% (n=27) in USMLE step 3, and 62.90% (n=156) in the OB-WQE. GPT-3.5 performed poorer as examination levels advanced (P<.001), while GPT-4.0 performed better on USMLE steps 2 and 3 and worse on USMLE step 1 and the OB-WQE (P<.001). The coefficient of correlation (r) between ChatGPT answering correctly and human users answering correctly was 0.21 (P=.01) for GPT-3.5 as compared to –0.31 (P<.001) for GPT-4.0. GPT-3.5 performed similarly across difficulty levels, while GPT-4.0 performed more poorly with an increase in the difficulty level. Both GPT models performed significantly better on certain topics than on others. Conclusions: ChatGPT is far from being considered a part of mainstream medical education. Future models with higher accuracy are needed for the platform to be effective in medical education. %M 38236632 %R 10.2196/50842 %U https://mededu.jmir.org/2024/1/e50842 %U https://doi.org/10.2196/50842 %U http://www.ncbi.nlm.nih.gov/pubmed/38236632 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e50174 %T ChatGPT in Medical Education: A Precursor for Automation Bias? %A Nguyen,Tina %+ The University of Texas Medical Branch, 301 University Blvd, Galveston, TX, 77551, United States, 1 4097721118, nguy.t921@gmail.com %K ChatGPT %K artificial intelligence %K AI %K medical students %K residents %K medical school curriculum %K medical education %K automation bias %K large language models %K LLMs %K bias %D 2024 %7 17.1.2024 %9 Editorial %J JMIR Med Educ %G English %X Artificial intelligence (AI) in health care has the promise of providing accurate and efficient results. However, AI can also be a black box, where the logic behind its results is nonrational. There are concerns if these questionable results are used in patient care. As physicians have the duty to provide care based on their clinical judgment in addition to their patients’ values and preferences, it is crucial that physicians validate the results from AI. Yet, there are some physicians who exhibit a phenomenon known as automation bias, where there is an assumption from the user that AI is always right. This is a dangerous mindset, as users exhibiting automation bias will not validate the results, given their trust in AI systems. Several factors impact a user’s susceptibility to automation bias, such as inexperience or being born in the digital age. In this editorial, I argue that these factors and a lack of AI education in the medical school curriculum cause automation bias. I also explore the harms of automation bias and why prospective physicians need to be vigilant when using AI. Furthermore, it is important to consider what attitudes are being taught to students when introducing ChatGPT, which could be some students’ first time using AI, prior to their use of AI in the clinical setting. Therefore, in attempts to avoid the problem of automation bias in the long-term, in addition to incorporating AI education into the curriculum, as is necessary, the use of ChatGPT in medical education should be limited to certain tasks. Otherwise, having no constraints on what ChatGPT should be used for could lead to automation bias. 
%M 38231545 %R 10.2196/50174 %U https://mededu.jmir.org/2024/1/e50174 %U https://doi.org/10.2196/50174 %U http://www.ncbi.nlm.nih.gov/pubmed/38231545 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e53961 %T A Generative Pretrained Transformer (GPT)–Powered Chatbot as a Simulated Patient to Practice History Taking: Prospective, Mixed Methods Study %A Holderried,Friederike %A Stegemann–Philipps,Christian %A Herschbach,Lea %A Moldt,Julia-Astrid %A Nevins,Andrew %A Griewatz,Jan %A Holderried,Martin %A Herrmann-Werner,Anne %A Festl-Wietek,Teresa %A Mahling,Moritz %+ Tübingen Institute for Medical Education, Eberhard Karls University, Elfriede-Aulhorn-Str 10, Tübingen, 72076, Germany, 49 7071 2973715, friederike.holderried@med.uni-tuebingen.de %K simulated patient %K GPT %K generative pretrained transformer %K ChatGPT %K history taking %K medical education %K documentation %K history %K simulated %K simulation %K simulations %K NLP %K natural language processing %K artificial intelligence %K interactive %K chatbot %K chatbots %K conversational agent %K conversational agents %K answer %K answers %K response %K responses %K human computer %K human machine %K usability %K satisfaction %D 2024 %7 16.1.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Communication is a core competency of medical professionals and of utmost importance for patient safety. Although medical curricula emphasize communication training, traditional formats, such as real or simulated patient interactions, can present psychological stress and are limited in repetition. The recent emergence of large language models (LLMs), such as generative pretrained transformer (GPT), offers an opportunity to overcome these restrictions. Objective: The aim of this study was to explore the feasibility of a GPT-driven chatbot to practice history taking, one of the core competencies of communication. Methods: We developed an interactive chatbot interface using GPT-3.5 and a specific prompt including a chatbot-optimized illness script and a behavioral component. Following a mixed methods approach, we invited medical students to voluntarily practice history taking. To determine whether GPT provides suitable answers as a simulated patient, the conversations were recorded and analyzed using quantitative and qualitative approaches. We analyzed the extent to which the questions and answers aligned with the provided script, as well as the medical plausibility of the answers. Finally, the students filled out the Chatbot Usability Questionnaire (CUQ). Results: A total of 28 students practiced with our chatbot (mean age 23.4, SD 2.9 years). We recorded a total of 826 question-answer pairs (QAPs), with a median of 27.5 QAPs per conversation and 94.7% (n=782) pertaining to history taking. When questions were explicitly covered by the script (n=502, 60.3%), the GPT-provided answers were mostly based on explicit script information (n=471, 94.4%). For questions not covered by the script (n=195, 23.4%), the GPT answers used 56.4% (n=110) fictitious information. Regarding plausibility, 842 (97.9%) of 860 QAPs were rated as plausible. Of the 14 (2.1%) implausible answers, GPT provided answers rated as socially desirable, leaving role identity, ignoring script information, illogical reasoning, and calculation error. Despite these results, the CUQ revealed an overall positive user experience (77/100 points).
Conclusions: Our data showed that LLMs, such as GPT, can provide a simulated patient experience and yield a good user experience and a majority of plausible answers. Our analysis revealed that GPT-provided answers use either explicit script information or are based on available information, which can be understood as abductive reasoning. Although rare, the GPT-based chatbot provides implausible information in some instances, with the major tendency being socially desirable instead of medically plausible information. %M 38227363 %R 10.2196/53961 %U https://mededu.jmir.org/2024/1/e53961 %U https://doi.org/10.2196/53961 %U http://www.ncbi.nlm.nih.gov/pubmed/38227363 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51388 %T Enriching Data Science and Health Care Education: Application and Impact of Synthetic Data Sets Through the Health Gym Project %A Kuo,Nicholas I-Hsien %A Perez-Concha,Oscar %A Hanly,Mark %A Mnatzaganian,Emmanuel %A Hao,Brandon %A Di Sipio,Marcus %A Yu,Guolin %A Vanjara,Jash %A Valerie,Ivy Cerelia %A de Oliveira Costa,Juliana %A Churches,Timothy %A Lujic,Sanja %A Hegarty,Jo %A Jorm,Louisa %A Barbieri,Sebastiano %+ Centre for Big Data Research in Health, The University of New South Wales, Level 2, AGSM Building (G27), Botany St, Kensington NSW, Sydney, 2052, Australia, 61 0293850645, n.kuo@unsw.edu.au %K medical education %K generative model %K generative adversarial networks %K privacy %K antiretroviral therapy (ART) %K human immunodeficiency virus (HIV) %K data science %K educational purposes %K accessibility %K data privacy %K data sets %K sepsis %K hypotension %K HIV %K science education %K health care AI %D 2024 %7 16.1.2024 %9 Viewpoint %J JMIR Med Educ %G English %X Large-scale medical data sets are vital for hands-on education in health data science but are often inaccessible due to privacy concerns. Addressing this gap, we developed the Health Gym project, a free and open-source platform designed to generate synthetic health data sets applicable to various areas of data science education, including machine learning, data visualization, and traditional statistical models. Initially, we generated 3 synthetic data sets for sepsis, acute hypotension, and antiretroviral therapy for HIV infection. This paper discusses the educational applications of Health Gym’s synthetic data sets. We illustrate this through their use in postgraduate health data science courses delivered by the University of New South Wales, Australia, and a Datathon event, involving academics, students, clinicians, and local health district professionals. We also include adaptable worked examples using our synthetic data sets, designed to enrich hands-on tutorial and workshop experiences. Although we highlight the potential of these data sets in advancing data science education and health care artificial intelligence, we also emphasize the need for continued research into the inherent limitations of synthetic data. 
%M 38227356 %R 10.2196/51388 %U https://mededu.jmir.org/2024/1/e51388 %U https://doi.org/10.2196/51388 %U http://www.ncbi.nlm.nih.gov/pubmed/38227356 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e49970 %T A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology–Head and Neck Surgery Certification Examinations: Performance Study %A Long,Cai %A Lowe,Kayle %A Zhang,Jessica %A Santos,André dos %A Alanazi,Alaa %A O'Brien,Daniel %A Wright,Erin D %A Cote,David %+ Division of Otolaryngology–Head and Neck Surgery, University of Alberta, 8440-112 Street, Edmonton, AB, T6G 2B7, Canada, 1 (780) 407 8822, cai.long.med@gmail.com %K medical licensing %K otolaryngology %K otology %K laryngology %K ear %K nose %K throat %K ENT %K surgery %K surgical %K exam %K exams %K response %K responses %K answer %K answers %K chatbot %K chatbots %K examination %K examinations %K medical education %K otolaryngology/head and neck surgery %K OHNS %K artificial intelligence %K AI %K ChatGPT %K medical examination %K large language models %K language model %K LLM %K LLMs %K wide range information %K patient safety %K clinical implementation %K safety %K machine learning %K NLP %K natural language processing %D 2024 %7 16.1.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT is among the most popular large language models (LLMs), exhibiting proficiency in various standardized tests, including multiple-choice medical board examinations. However, its performance on otolaryngology–head and neck surgery (OHNS) certification examinations and open-ended medical board certification examinations has not been reported. Objective: We aimed to evaluate the performance of ChatGPT on OHNS board examinations and propose a novel method to assess an AI model’s performance on open-ended medical board examination questions. Methods: Twenty-one open-ended questions were adopted from the Royal College of Physicians and Surgeons of Canada’s sample examination to query ChatGPT on April 11, 2023, with and without prompts. A new model, named Concordance, Validity, Safety, Competency (CVSC), was developed to evaluate its performance. Results: In an open-ended question assessment, ChatGPT achieved a passing mark (an average of 75% across 3 trials) in the attempts and demonstrated higher accuracy with prompts. The model demonstrated high concordance (92.06%) and satisfactory validity. While demonstrating considerable consistency in regenerating answers, it often provided only partially correct responses. Notably, concerning features such as hallucinations and self-conflicting answers were observed. Conclusions: ChatGPT achieved a passing score in the sample examination and demonstrated the potential to pass the OHNS certification examination of the Royal College of Physicians and Surgeons of Canada. Some concerns remain due to its hallucinations, which could pose risks to patient safety. Further adjustments are necessary to yield safer and more accurate answers for clinical implementation. 
%M 38227351 %R 10.2196/49970 %U https://mededu.jmir.org/2024/1/e49970 %U https://doi.org/10.2196/49970 %U http://www.ncbi.nlm.nih.gov/pubmed/38227351 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e47339 %T The Use of ChatGPT for Education Modules on Integrated Pharmacotherapy of Infectious Disease: Educators' Perspectives %A Al-Worafi,Yaser Mohammed %A Goh,Khang Wen %A Hermansyah,Andi %A Tan,Ching Siang %A Ming,Long Chiau %+ School of Pharmacy, KPJ Healthcare University, Lot PT 17010 Persiaran Seriemas, Kota Seriemas, Nilai, 71800, Malaysia, 60 67942692, tcsiang@kpju.edu.my %K innovation and technology %K quality education %K sustainable communities %K innovation and infrastructure %K partnerships for the goals %K sustainable education %K social justice %K ChatGPT %K artificial intelligence %K feasibility %D 2024 %7 12.1.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Artificial Intelligence (AI) plays an important role in many fields, including medical education, practice, and research. Many medical educators started using ChatGPT at the end of 2022 for many purposes. Objective: The aim of this study was to explore the potential uses, benefits, and risks of using ChatGPT in education modules on integrated pharmacotherapy of infectious disease. Methods: A content analysis was conducted to investigate the applications of ChatGPT in education modules on integrated pharmacotherapy of infectious disease. Questions pertaining to curriculum development, syllabus design, lecture note preparation, and examination construction were posed during data collection. Three experienced professors rated the appropriateness and precision of the answers provided by ChatGPT. The consensus rating was considered. The professors also discussed the prospective applications, benefits, and risks of ChatGPT in this educational setting. Results: ChatGPT demonstrated the ability to contribute to various aspects of curriculum design, with ratings ranging from 50% to 92% for appropriateness and accuracy. However, there were limitations and risks associated with its use, including incomplete syllabi, the absence of essential learning objectives, and the inability to design valid questionnaires and qualitative studies. It was suggested that educators use ChatGPT as a resource rather than relying primarily on its output. There are recommendations for effectively incorporating ChatGPT into the curriculum of the education modules on integrated pharmacotherapy of infectious disease. Conclusions: Medical and health sciences educators can use ChatGPT as a guide in many aspects related to the development of the curriculum of the education modules on integrated pharmacotherapy of infectious disease, syllabus design, lecture notes preparation, and examination preparation with caution. 
%M 38214967 %R 10.2196/47339 %U https://mededu.jmir.org/2024/1/e47339 %U https://doi.org/10.2196/47339 %U http://www.ncbi.nlm.nih.gov/pubmed/38214967 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51308 %T Comprehensiveness, Accuracy, and Readability of Exercise Recommendations Provided by an AI-Based Chatbot: Mixed Methods Study %A Zaleski,Amanda L %A Berkowsky,Rachel %A Craig,Kelly Jean Thomas %A Pescatello,Linda S %+ Clinical Evidence Development, Aetna Medical Affairs, CVS Health Corporation, 151 Farmington Avenue, Hartford, CT, 06156, United States, 1 8605385003, zaleskia@aetna.com %K exercise prescription %K health literacy %K large language model %K patient education %K artificial intelligence %K AI %K chatbot %D 2024 %7 11.1.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Regular physical activity is critical for health and disease prevention. Yet, health care providers and patients face barriers to implement evidence-based lifestyle recommendations. The potential to augment care with the increased availability of artificial intelligence (AI) technologies is limitless; however, the suitability of AI-generated exercise recommendations has yet to be explored. Objective: The purpose of this study was to assess the comprehensiveness, accuracy, and readability of individualized exercise recommendations generated by a novel AI chatbot. Methods: A coding scheme was developed to score AI-generated exercise recommendations across ten categories informed by gold-standard exercise recommendations, including (1) health condition–specific benefits of exercise, (2) exercise preparticipation health screening, (3) frequency, (4) intensity, (5) time, (6) type, (7) volume, (8) progression, (9) special considerations, and (10) references to the primary literature. The AI chatbot was prompted to provide individualized exercise recommendations for 26 clinical populations using an open-source application programming interface. Two independent reviewers coded AI-generated content for each category and calculated comprehensiveness (%) and factual accuracy (%) on a scale of 0%-100%. Readability was assessed using the Flesch-Kincaid formula. Qualitative analysis identified and categorized themes from AI-generated output. Results: AI-generated exercise recommendations were 41.2% (107/260) comprehensive and 90.7% (146/161) accurate, with the majority (8/15, 53%) of inaccuracy related to the need for exercise preparticipation medical clearance. Average readability level of AI-generated exercise recommendations was at the college level (mean 13.7, SD 1.7), with an average Flesch reading ease score of 31.1 (SD 7.7). Several recurring themes and observations of AI-generated output included concern for liability and safety, preference for aerobic exercise, and potential bias and direct discrimination against certain age-based populations and individuals with disabilities. Conclusions: There were notable gaps in the comprehensiveness, accuracy, and readability of AI-generated exercise recommendations. Exercise and health care professionals should be aware of these limitations when using and endorsing AI-based technologies as a tool to support lifestyle change involving exercise. 
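Readability in the exercise recommendation study above was assessed with the Flesch-Kincaid formula. The sketch below shows how such scores can be computed; the textstat library and the sample recommendation text are assumptions, as the abstract does not name the tool used.

```python
# Sketch of a readability check using the standard Flesch formulas. The
# textstat library and the sample text are assumptions; the abstract does not
# specify which implementation the study used.
import textstat

recommendation = (
    "Aim for at least 150 minutes of moderate-intensity aerobic exercise per week, "
    "spread over most days, and add muscle-strengthening activity twice weekly."
)

print("Flesch reading ease:", textstat.flesch_reading_ease(recommendation))
print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(recommendation))
```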
%M 38206661 %R 10.2196/51308 %U https://mededu.jmir.org/2024/1/e51308 %U https://doi.org/10.2196/51308 %U http://www.ncbi.nlm.nih.gov/pubmed/38206661 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51247 %T Artificial Intelligence in Medicine: Cross-Sectional Study Among Medical Students on Application, Education, and Ethical Aspects %A Weidener,Lukas %A Fischer,Michael %+ Research Unit for Quality and Ethics in Health Care, UMIT TIROL – Private University for Health Sciences and Health Technology, Eduard-Wallnöfer-Zentrum 1, Hall in Tirol, 6060, Austria, 43 17670491594, lukas.weidener@edu.umit-tirol.at %K artificial intelligence %K AI technology %K medicine %K medical education %K medical curriculum %K medical school %K AI ethics %K ethics %D 2024 %7 5.1.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: The use of artificial intelligence (AI) in medicine not only directly impacts the medical profession but is also increasingly associated with various potential ethical aspects. In addition, the expanding use of AI and AI-based applications such as ChatGPT demands a corresponding shift in medical education to adequately prepare future practitioners for the effective use of these tools and address the associated ethical challenges they present. Objective: This study aims to explore how medical students from Germany, Austria, and Switzerland perceive the use of AI in medicine and the teaching of AI and AI ethics in medical education in accordance with their use of AI-based chat applications, such as ChatGPT. Methods: This cross-sectional study, conducted from June 15 to July 15, 2023, surveyed medical students across Germany, Austria, and Switzerland using a web-based survey. This study aimed to assess students’ perceptions of AI in medicine and the integration of AI and AI ethics into medical education. The survey, which included 53 items across 6 sections, was developed and pretested. Data analysis used descriptive statistics (median, mode, IQR, total number, and percentages) and either the chi-square or Mann-Whitney U tests, as appropriate. Results: Surveying 487 medical students across Germany, Austria, and Switzerland revealed limited formal education on AI or AI ethics within medical curricula, although 38.8% (189/487) had prior experience with AI-based chat applications, such as ChatGPT. Despite varied prior exposures, 71.7% (349/487) anticipated a positive impact of AI on medicine. There was widespread consensus (385/487, 74.9%) on the need for AI and AI ethics instruction in medical education, although the current offerings were deemed inadequate. Regarding the AI ethics education content, all proposed topics were rated as highly relevant. Conclusions: This study revealed a pronounced discrepancy between the use of AI-based (chat) applications, such as ChatGPT, among medical students in Germany, Austria, and Switzerland and the teaching of AI in medical education. To adequately prepare future medical professionals, there is an urgent need to integrate the teaching of AI and AI ethics into the medical curricula. %M 38180787 %R 10.2196/51247 %U https://mededu.jmir.org/2024/1/e51247 %U https://doi.org/10.2196/51247 %U http://www.ncbi.nlm.nih.gov/pubmed/38180787 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51148 %T Pure Wisdom or Potemkin Villages? 
A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis %A Knoedler,Leonard %A Alfertshofer,Michael %A Knoedler,Samuel %A Hoch,Cosima C %A Funk,Paul F %A Cotofana,Sebastian %A Maheta,Bhagvat %A Frank,Konstantin %A Brébant,Vanessa %A Prantl,Lukas %A Lamby,Philipp %+ Department of Plastic, Hand and Reconstructive Surgery, University Hospital Regensburg, Franz-Josef-Strauß-Allee 11, Regensburg, 93053, Germany, 49 151 44824958, leonardknoedler@t-online.de %K ChatGPT %K United States Medical Licensing Examination %K artificial intelligence %K USMLE %K USMLE Step 1 %K OpenAI %K medical education %K clinical decision-making %D 2024 %7 5.1.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: The United States Medical Licensing Examination (USMLE) has been critical in medical education since 1992, testing various aspects of a medical student’s knowledge and skills through different steps, based on their training level. Artificial intelligence (AI) tools, including chatbots like ChatGPT, are emerging technologies with potential applications in medicine. However, comprehensive studies analyzing ChatGPT’s performance on USMLE Step 3 in large-scale scenarios and comparing different versions of ChatGPT are limited. Objective: This paper aimed to analyze ChatGPT’s performance on USMLE Step 3 practice test questions to better elucidate the strengths and weaknesses of AI use in medical education and deduce evidence-based strategies to counteract AI cheating. Methods: A total of 2069 USMLE Step 3 practice questions were extracted from the AMBOSS study platform. After including 229 image-based questions, a total of 1840 text-based questions were further categorized and entered into ChatGPT 3.5, while a subset of 229 questions were entered into ChatGPT 4. Responses were recorded, and the accuracy of ChatGPT answers as well as its performance in different test question categories and for different difficulty levels were compared between both versions. Results: Overall, ChatGPT 4 demonstrated a statistically significant superior performance compared to ChatGPT 3.5, achieving an accuracy of 84.7% (194/229) and 56.9% (1047/1840), respectively. A noteworthy correlation was observed between the length of test questions and the performance of ChatGPT 3.5 (ρ=–0.069; P=.003), which was absent in ChatGPT 4 (P=.87). Additionally, the difficulty of test questions, as categorized by AMBOSS hammer ratings, showed a statistically significant correlation with performance for both ChatGPT versions, with ρ=–0.289 for ChatGPT 3.5 and ρ=–0.344 for ChatGPT 4. ChatGPT 4 surpassed ChatGPT 3.5 in all levels of test question difficulty, except for the 2 highest difficulty tiers (4 and 5 hammers), where statistical significance was not reached. Conclusions: In this study, ChatGPT 4 demonstrated remarkable proficiency in taking the USMLE Step 3, with an accuracy rate of 84.7% (194/229), outshining ChatGPT 3.5 with an accuracy rate of 56.9% (1047/1840). Although ChatGPT 4 performed exceptionally, it encountered difficulties in questions requiring the application of theoretical concepts, particularly in cardiology and neurology. These insights are pivotal for the development of examination strategies that are resilient to AI and underline the promising role of AI in the realm of medical education and diagnostics. 
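The question-length analysis in the USMLE Step 3 study above reports Spearman correlations (ρ) between question length and whether ChatGPT answered correctly. A minimal sketch with hypothetical data follows.

```python
# Sketch of a Spearman correlation between question length and a binary
# correct/incorrect outcome. All data below are hypothetical; only the sign of
# the reported correlation is taken from the abstract.
from scipy.stats import spearmanr

question_lengths = [120, 340, 95, 410, 220, 180, 510, 260]  # characters, hypothetical
answered_correctly = [1, 0, 1, 0, 1, 1, 0, 1]               # 1 = correct, hypothetical

rho, p_value = spearmanr(question_lengths, answered_correctly)
print(f"rho={rho:.3f}, P={p_value:.3f}")
```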
%M 38180782 %R 10.2196/51148 %U https://mededu.jmir.org/2024/1/e51148 %U https://doi.org/10.2196/51148 %U http://www.ncbi.nlm.nih.gov/pubmed/38180782 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51183 %T Generative Language Models and Open Notes: Exploring the Promise and Limitations %A Blease,Charlotte %A Torous,John %A McMillan,Brian %A Hägglund,Maria %A Mandl,Kenneth D %+ Department of Women's and Children's Health, Uppsala University, Box 256, Uppsala, 751 05, Sweden, 46 18 471 00 0, charlotteblease@gmail.com %K ChatGPT %K generative language models %K large language models %K medical education %K Open Notes %K online record access %K patient-centered care %K empathy %K language model %K online record access %K documentation %K communication tool %K clinical documentation %D 2024 %7 4.1.2024 %9 Viewpoint %J JMIR Med Educ %G English %X Patients’ online record access (ORA) is growing worldwide. In some countries, including the United States and Sweden, access is advanced with patients obtaining rapid access to their full records on the web including laboratory and test results, lists of prescribed medications, vaccinations, and even the very narrative reports written by clinicians (the latter, commonly referred to as “open notes”). In the United States, patient’s ORA is also available in a downloadable form for use with other apps. While survey studies have shown that some patients report many benefits from ORA, there remain challenges with implementation around writing clinical documentation that patients may now read. With ORA, the functionality of the record is evolving; it is no longer only an aide memoire for doctors but also a communication tool for patients. Studies suggest that clinicians are changing how they write documentation, inviting worries about accuracy and completeness. Other concerns include work burdens; while few objective studies have examined the impact of ORA on workload, some research suggests that clinicians are spending more time writing notes and answering queries related to patients’ records. Aimed at addressing some of these concerns, clinician and patient education strategies have been proposed. In this viewpoint paper, we explore these approaches and suggest another longer-term strategy: the use of generative artificial intelligence (AI) to support clinicians in documenting narrative summaries that patients will find easier to understand. Applied to narrative clinical documentation, we suggest that such approaches may significantly help preserve the accuracy of notes, strengthen writing clarity and signals of empathy and patient-centered care, and serve as a buffer against documentation work burdens. However, we also consider the current risks associated with existing generative AI. We emphasize that for this innovation to play a key role in ORA, the cocreation of clinical notes will be imperative. We also caution that clinicians will need to be supported in how to work alongside generative AI to optimize its considerable potential. 
%M 38175688 %R 10.2196/51183 %U https://mededu.jmir.org/2024/1/e51183 %U https://doi.org/10.2196/51183 %U http://www.ncbi.nlm.nih.gov/pubmed/38175688 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e50869 %T Patients, Doctors, and Chatbots %A Erren,Thomas C %+ Institute and Policlinic for Occupational Medicine, Environmental Medicine and Prevention Research, University Hospital of Cologne, University of Cologne, Berlin-Kölnische Allee 4, Köln (Zollstock), 50937, Germany, 49 022147876780, tim.erren@uni-koeln.de %K chatbot %K ChatGPT %K medical advice %K ethics %K patients %K doctors %D 2024 %7 4.1.2024 %9 Viewpoint %J JMIR Med Educ %G English %X Medical advice is key to the relationship between doctor and patient. The question I will address is “how may chatbots affect the interaction between patients and doctors in regards to medical advice?” I describe what lies ahead when using chatbots and identify questions galore for the daily work of doctors. I conclude with a gloomy outlook, expectations for the urgently needed ethical discourse, and a hope in relation to humans and machines. %M 38175695 %R 10.2196/50869 %U https://mededu.jmir.org/2024/1/e50869 %U https://doi.org/10.2196/50869 %U http://www.ncbi.nlm.nih.gov/pubmed/38175695 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e51199 %T Empathy and Equity: Key Considerations for Large Language Model Adoption in Health Care %A Koranteng,Erica %A Rao,Arya %A Flores,Efren %A Lev,Michael %A Landman,Adam %A Dreyer,Keith %A Succi,Marc %+ Massachusetts General Hospital, 55 Fruit St, Boston, 02114, United States, 1 617 935 9144, msucci@mgh.harvard.edu %K ChatGPT %K AI %K artificial intelligence %K large language models %K LLMs %K ethics %K empathy %K equity %K bias %K language model %K health care application %K patient care %K care %K development %K framework %K model %K ethical implication %D 2023 %7 28.12.2023 %9 Viewpoint %J JMIR Med Educ %G English %X The growing presence of large language models (LLMs) in health care applications holds significant promise for innovative advancements in patient care. However, concerns about ethical implications and potential biases have been raised by various stakeholders. Here, we evaluate the ethics of LLMs in medicine along 2 key axes: empathy and equity. We outline the importance of these factors in novel models of care and develop frameworks for addressing these alongside LLM deployment. %M 38153778 %R 10.2196/51199 %U https://mededu.jmir.org/2023/1/e51199 %U https://doi.org/10.2196/51199 %U http://www.ncbi.nlm.nih.gov/pubmed/38153778 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48904 %T Differentiating ChatGPT-Generated and Human-Written Medical Texts: Quantitative Study %A Liao,Wenxiong %A Liu,Zhengliang %A Dai,Haixing %A Xu,Shaochen %A Wu,Zihao %A Zhang,Yiyang %A Huang,Xiaoke %A Zhu,Dajiang %A Cai,Hongmin %A Li,Quanzheng %A Liu,Tianming %A Li,Xiang %+ Department of Radiology, Massachusetts General Hospital, 55 Fruit St, Boston, MA, 02114, United States, 1 7062480264, xli60@mgh.harvard.edu %K ChatGPT %K medical ethics %K linguistic analysis %K text classification %K artificial intelligence %K medical texts %K machine learning %D 2023 %7 28.12.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Large language models, such as ChatGPT, are capable of generating grammatically perfect and human-like text content, and a large number of ChatGPT-generated texts have appeared on the internet. 
However, medical texts, such as clinical notes and diagnoses, require rigorous validation, and erroneous medical content generated by ChatGPT could potentially lead to disinformation that poses significant harm to health care and the general public. Objective: This study is among the first on responsible artificial intelligence–generated content in medicine. We focus on analyzing the differences between medical texts written by human experts and those generated by ChatGPT and designing machine learning workflows to effectively detect and differentiate medical texts generated by ChatGPT. Methods: We first constructed a suite of data sets containing medical texts written by human experts and generated by ChatGPT. We analyzed the linguistic features of these 2 types of content and uncovered differences in vocabulary, parts-of-speech, dependency, sentiment, perplexity, and other aspects. Finally, we designed and implemented machine learning methods to detect medical text generated by ChatGPT. The data and code used in this paper are published on GitHub. Results: Medical texts written by humans were more concrete, more diverse, and typically contained more useful information, while medical texts generated by ChatGPT paid more attention to fluency and logic and usually expressed general terminologies rather than effective information specific to the context of the problem. A bidirectional encoder representations from transformers–based model effectively detected medical texts generated by ChatGPT, and the F1 score exceeded 95%. Conclusions: Although text generated by ChatGPT is grammatically perfect and human-like, the linguistic characteristics of generated medical texts were different from those written by human experts. Medical text generated by ChatGPT could be effectively detected by the proposed machine learning algorithms. This study provides a pathway toward trustworthy and accountable use of large language models in medicine. %M 38153785 %R 10.2196/48904 %U https://mededu.jmir.org/2023/1/e48904 %U https://doi.org/10.2196/48904 %U http://www.ncbi.nlm.nih.gov/pubmed/38153785 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e50373 %T AI-Enabled Medical Education: Threads of Change, Promising Futures, and Risky Realities Across Four Potential Future Worlds %A Knopp,Michelle I %A Warm,Eric J %A Weber,Danielle %A Kelleher,Matthew %A Kinnear,Benjamin %A Schumacher,Daniel J %A Santen,Sally A %A Mendonça,Eneida %A Turner,Laurah %+ Department of Medical Education, College of Medicine, University of Cincinnati, Cincinnati, OH, United States, 1 5133303999, turnela@ucmail.uc.edu %K artificial intelligence %K medical education %K scenario planning %K future of healthcare %K ethics and AI %K future %K scenario %K ChatGPT %K generative %K GPT-4 %K ethic %K ethics %K ethical %K strategic planning %K Open-AI %K OpenAI %K privacy %K autonomy %K autonomous %D 2023 %7 25.12.2023 %9 Viewpoint %J JMIR Med Educ %G English %X Background: The rapid trajectory of artificial intelligence (AI) development and advancement is quickly outpacing society's ability to determine its future role. As AI continues to transform various aspects of our lives, one critical question arises for medical education: what will be the nature of education, teaching, and learning in a future world where the acquisition, retention, and application of knowledge in the traditional sense are fundamentally altered by AI? 
Objective: The purpose of this perspective is to plan for the intersection of health care and medical education in the future. Methods: We used GPT-4 and scenario-based strategic planning techniques to craft 4 hypothetical future worlds influenced by AI's integration into health care and medical education. This method, used by organizations such as Shell and the Accreditation Council for Graduate Medical Education, assesses readiness for alternative futures and effectively manages uncertainty, risk, and opportunity. The detailed scenarios provide insights into potential environments the medical profession may face and lay the foundation for hypothesis generation and idea-building regarding responsible AI implementation. Results: The following 4 worlds were created using OpenAI’s GPT model: AI Harmony, AI conflict, The world of Ecological Balance, and Existential Risk. Risks include disinformation and misinformation, loss of privacy, widening inequity, erosion of human autonomy, and ethical dilemmas. Benefits involve improved efficiency, personalized interventions, enhanced collaboration, early detection, and accelerated research. Conclusions: To ensure responsible AI use, the authors suggest focusing on 3 key areas: developing a robust ethical framework, fostering interdisciplinary collaboration, and investing in education and training. A strong ethical framework emphasizes patient safety, privacy, and autonomy while promoting equity and inclusivity. Interdisciplinary collaboration encourages cooperation among various experts in developing and implementing AI technologies, ensuring that they address the complex needs and challenges in health care and medical education. Investing in education and training prepares professionals and trainees with necessary skills and knowledge to effectively use and critically evaluate AI technologies. The integration of AI in health care and medical education presents a critical juncture between transformative advancements and significant risks. By working together to address both immediate and long-term risks and consequences, we can ensure that AI integration leads to a more equitable, sustainable, and prosperous future for both health care and medical education. As we engage with AI technologies, our collective actions will ultimately determine the state of the future of health care and medical education to harness AI's power while ensuring the safety and well-being of humanity. 
%M 38145471 %R 10.2196/50373 %U https://mededu.jmir.org/2023/1/e50373 %U https://doi.org/10.2196/50373 %U http://www.ncbi.nlm.nih.gov/pubmed/38145471 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e51302 %T Medical Student Experiences and Perceptions of ChatGPT and Artificial Intelligence: Cross-Sectional Study %A Alkhaaldi,Saif M I %A Kassab,Carl H %A Dimassi,Zakia %A Oyoun Alsoud,Leen %A Al Fahim,Maha %A Al Hageh,Cynthia %A Ibrahim,Halah %+ Department of Medical Science, Khalifa University College of Medicine and Health Sciences, PO Box 127788, Abu Dhabi, United Arab Emirates, 971 23125423, halah.ibrahim@ku.ac.ae %K medical education %K ChatGPT %K artificial intelligence %K large language models %K LLMs %K AI %K medical student %K medical students %K cross-sectional study %K training %K technology %K medicine %K health care professionals %K risk %K technology %K education %D 2023 %7 22.12.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Artificial intelligence (AI) has the potential to revolutionize the way medicine is learned, taught, and practiced, and medical education must prepare learners for these inevitable changes. Academic medicine has, however, been slow to embrace recent AI advances. Since its launch in November 2022, ChatGPT has emerged as a fast and user-friendly large language model that can assist health care professionals, medical educators, students, trainees, and patients. While many studies focus on the technology’s capabilities, potential, and risks, there is a gap in studying the perspective of end users. Objective: The aim of this study was to gauge the experiences and perspectives of graduating medical students on ChatGPT and AI in their training and future careers. Methods: A cross-sectional web-based survey of recently graduated medical students was conducted in an international academic medical center between May 5, 2023, and June 13, 2023. Descriptive statistics were used to tabulate variable frequencies. Results: Of 325 applicants to the residency programs, 265 completed the survey (an 81.5% response rate). The vast majority of respondents denied using ChatGPT in medical school, with 20.4% (n=54) using it to help complete written assessments and only 9.4% using the technology in their clinical work (n=25). More students planned to use it during residency, primarily for exploring new medical topics and research (n=168, 63.4%) and exam preparation (n=151, 57%). Male students were significantly more likely to believe that AI will improve diagnostic accuracy (n=47, 51.7% vs n=69, 39.7%; P=.001), reduce medical error (n=53, 58.2% vs n=71, 40.8%; P=.002), and improve patient care (n=60, 65.9% vs n=95, 54.6%; P=.007). Previous experience with AI was significantly associated with positive AI perception in terms of improving patient care, decreasing medical errors and misdiagnoses, and increasing the accuracy of diagnoses (P=.001, P<.001, P=.008, respectively). Conclusions: The surveyed medical students had minimal formal and informal experience with AI tools and limited perceptions of the potential uses of AI in health care but had overall positive views of ChatGPT and AI and were optimistic about the future of AI in medical education and health care. Structured curricula and formal policies and guidelines are needed to adequately prepare medical learners for the forthcoming integration of AI in medicine. 
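[Editor's note] A minimal sketch of the kind of group comparison reported in the survey abstract above (the share of respondents endorsing an item, compared between two groups). The counts are hypothetical placeholders rather than the study data, and the abstract does not state which test produced the reported P values; a chi-square test of independence is assumed here.

from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = respondent groups, columns = (agree, do not agree).
table = [
    [52, 48],   # group A
    [45, 85],   # group B
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square={chi2:.2f}, dof={dof}, P={p_value:.4f}")
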
%M 38133911 %R 10.2196/51302 %U https://mededu.jmir.org/2023/1/e51302 %U https://doi.org/10.2196/51302 %U http://www.ncbi.nlm.nih.gov/pubmed/38133911 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e50658 %T Using ChatGPT for Clinical Practice and Medical Education: Cross-Sectional Survey of Medical Students’ and Physicians’ Perceptions %A Tangadulrat,Pasin %A Sono,Supinya %A Tangtrakulwanich,Boonsin %+ Department of Orthopedics, Faculty of Medicine, Prince of Songkla University, Floor 9 Rattanacheewarak Building, 15 Kanchanavanich Rd, Hatyai, 90110, Thailand, 66 74451601, boonsin.b@psu.ac.th %K ChatGPT %K AI %K artificial intelligence %K medical education %K medical students %K student %K students %K intern %K interns %K resident %K residents %K knee osteoarthritis %K survey %K surveys %K questionnaire %K questionnaires %K chatbot %K chatbots %K conversational agent %K conversational agents %K attitude %K attitudes %K opinion %K opinions %K perception %K perceptions %K perspective %K perspectives %K acceptance %D 2023 %7 22.12.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT is a well-known large language model–based chatbot. It could be used in the medical field in many aspects. However, some physicians are still unfamiliar with ChatGPT and are concerned about its benefits and risks. Objective: We aim to evaluate the perception of physicians and medical students toward using ChatGPT in the medical field. Methods: A web-based questionnaire was sent to medical students, interns, residents, and attending staff with questions regarding their perception toward using ChatGPT in clinical practice and medical education. Participants were also asked to rate their perception of ChatGPT’s generated response about knee osteoarthritis. Results: Participants included 124 medical students, 46 interns, 37 residents, and 32 attending staff. After reading ChatGPT’s response, 132 of the 239 (55.2%) participants had a positive rating about using ChatGPT for clinical practice. The proportion of positive answers was significantly lower in graduated physicians (48/115, 42%) compared with medical students (84/124, 68%; P<.001). Participants listed a lack of a patient-specific treatment plan, updated evidence, and a language barrier as ChatGPT’s pitfalls. Regarding using ChatGPT for medical education, the proportion of positive responses was also significantly lower in graduate physicians (71/115, 62%) compared to medical students (103/124, 83.1%; P<.001). Participants were concerned that ChatGPT’s response was too superficial, might lack scientific evidence, and might need expert verification. Conclusions: Medical students generally had a positive perception of using ChatGPT for guiding treatment and medical education, whereas graduated doctors were more cautious in this regard. Nonetheless, both medical students and graduated doctors positively perceived using ChatGPT for creating patient educational materials. 
%M 38133908 %R 10.2196/50658 %U https://mededu.jmir.org/2023/1/e50658 %U https://doi.org/10.2196/50658 %U http://www.ncbi.nlm.nih.gov/pubmed/38133908 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e49183 %T ChatGPT Versus Consultants: Blinded Evaluation on Answering Otorhinolaryngology Case–Based Questions %A Buhr,Christoph Raphael %A Smith,Harry %A Huppertz,Tilman %A Bahr-Hamm,Katharina %A Matthias,Christoph %A Blaikie,Andrew %A Kelsey,Tom %A Kuhn,Sebastian %A Eckrich,Jonas %+ Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Langenbeckstraße 1, Mainz, 55131, Germany, 49 6131 17 7361, buhrchri@uni-mainz.de %K large language models %K LLMs %K LLM %K artificial intelligence %K AI %K ChatGPT %K otorhinolaryngology %K ORL %K digital health %K chatbots %K global health %K low- and middle-income countries %K telemedicine %K telehealth %K language model %K chatbot %D 2023 %7 5.12.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Large language models (LLMs), such as ChatGPT (Open AI), are increasingly used in medicine and supplement standard search engines as information sources. This leads to more “consultations” of LLMs about personal medical symptoms. Objective: This study aims to evaluate ChatGPT’s performance in answering clinical case–based questions in otorhinolaryngology (ORL) in comparison to ORL consultants’ answers. Methods: We used 41 case-based questions from established ORL study books and past German state examinations for doctors. The questions were answered by both ORL consultants and ChatGPT 3. ORL consultants rated all responses, except their own, on medical adequacy, conciseness, coherence, and comprehensibility using a 6-point Likert scale. They also identified (in a blinded setting) if the answer was created by an ORL consultant or ChatGPT. Additionally, the character count was compared. Due to the rapidly evolving pace of technology, a comparison between responses generated by ChatGPT 3 and ChatGPT 4 was included to give an insight into the evolving potential of LLMs. Results: Ratings in all categories were significantly higher for ORL consultants (P<.001). Although inferior to the scores of the ORL consultants, ChatGPT’s scores were relatively higher in semantic categories (conciseness, coherence, and comprehensibility) compared to medical adequacy. ORL consultants identified ChatGPT as the source correctly in 98.4% (121/123) of cases. ChatGPT’s answers had a significantly higher character count compared to ORL consultants (P<.001). Comparison between responses generated by ChatGPT 3 and ChatGPT 4 showed a slight improvement in medical accuracy as well as a better coherence of the answers provided. Contrarily, neither the conciseness (P=.06) nor the comprehensibility (P=.08) improved significantly despite the significant increase in the mean amount of characters by 52.5% (n= (1470-964)/964; P<.001). Conclusions: While ChatGPT provided longer answers to medical problems, medical adequacy and conciseness were significantly lower compared to ORL consultants’ answers. LLMs have potential as augmentative tools for medical care, but their “consultation” for medical problems carries a high risk of misinformation as their high semantic quality may mask contextual deficits. 
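[Editor's note] The abstract above does not name the statistical test behind its Likert-scale comparisons; the sketch below assumes a Mann-Whitney U test, a common choice for ordinal ratings, and uses made-up ratings purely for illustration.

from scipy.stats import mannwhitneyu

# Hypothetical 6-point Likert ratings of medical adequacy (higher = better).
consultant_ratings = [6, 5, 6, 6, 5, 6, 4, 6, 5, 6]
chatgpt_ratings = [4, 5, 3, 4, 5, 4, 3, 4, 4, 5]

u_statistic, p_value = mannwhitneyu(consultant_ratings, chatgpt_ratings, alternative="two-sided")
print(f"U={u_statistic:.1f}, P={p_value:.4f}")
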
%M 38051578 %R 10.2196/49183 %U https://mededu.jmir.org/2023/1/e49183 %U https://doi.org/10.2196/49183 %U http://www.ncbi.nlm.nih.gov/pubmed/38051578 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e51243 %T Can we use ChatGPT for Mental Health and Substance Use Education? Examining Its Quality and Potential Harms %A Spallek,Sophia %A Birrell,Louise %A Kershaw,Stephanie %A Devine,Emma Krogh %A Thornton,Louise %+ The Matilda Centre for Research in Mental Health and Substance Use, The University of Sydney, Level 6, Jane Foss Russell Building (G02), Sydney, 2006, Australia, 61 02 8627 9048, sophia.spallek@sydney.edu.au %K artificial intelligence %K generative artificial intelligence %K large language models %K ChatGPT %K medical education %K health education %K patient education handout %K preventive health services %K educational intervention %K mental health %K substance use %D 2023 %7 30.11.2023 %9 Viewpoint %J JMIR Med Educ %G English %X Background: The use of generative artificial intelligence, more specifically large language models (LLMs), is proliferating, and as such, it is vital to consider both the value and potential harms of its use in medical education. Their efficiency in a variety of writing styles makes LLMs, such as ChatGPT, attractive for tailoring educational materials. However, this technology can feature biases and misinformation, which can be particularly harmful in medical education settings, such as mental health and substance use education. This viewpoint investigates if ChatGPT is sufficient for 2 common health education functions in the field of mental health and substance use: (1) answering users’ direct queries and (2) aiding in the development of quality consumer educational health materials. Objective: This viewpoint includes a case study to provide insight into the accessibility, biases, and quality of ChatGPT’s query responses and educational health materials. We aim to provide guidance for the general public and health educators wishing to utilize LLMs. Methods: We collected real world queries from 2 large-scale mental health and substance use portals and engineered a variety of prompts to use on GPT-4 Pro with the Bing BETA internet browsing plug-in. The outputs were evaluated with tools from the Sydney Health Literacy Lab to determine the accessibility, the adherence to Mindframe communication guidelines to identify biases, and author assessments on quality, including tailoring to audiences, duty of care disclaimers, and evidence-based internet references. Results: GPT-4’s outputs had good face validity, but upon detailed analysis were substandard in comparison to expert-developed materials. Without engineered prompting, the reading level, adherence to communication guidelines, and use of evidence-based websites were poor. Therefore, all outputs still required cautious human editing and oversight. Conclusions: GPT-4 is currently not reliable enough for direct-consumer queries, but educators and researchers can use it for creating educational materials with caution. Materials created with LLMs should disclose the use of generative artificial intelligence and be evaluated on their efficacy with the target audience. 
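[Editor's note] The study above scored outputs with Sydney Health Literacy Lab tools and Mindframe guidelines. As a rough stand-in for the readability part of such an evaluation (an assumption, not the authors' toolchain), the sketch below applies the textstat package to a made-up snippet of generated health text.

import textstat

# Hypothetical snippet of AI-generated consumer health material.
generated_text = (
    "Feeling low for a long time can be a sign of depression. "
    "Talking to a doctor or another trusted person is a good first step."
)

# Standard readability indices; lower grade levels indicate easier reading.
print("Flesch Reading Ease:", textstat.flesch_reading_ease(generated_text))
print("Flesch-Kincaid Grade:", textstat.flesch_kincaid_grade(generated_text))
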
%M 38032714 %R 10.2196/51243 %U https://mededu.jmir.org/2023/1/e51243 %U https://doi.org/10.2196/51243 %U http://www.ncbi.nlm.nih.gov/pubmed/38032714 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e47274 %T The Intersection of ChatGPT, Clinical Medicine, and Medical Education %A Wong,Rebecca Shin-Yee %A Ming,Long Chiau %A Raja Ali,Raja Affendi %+ School of Medical and Life Sciences, Sunway University, No 5, Jalan Universiti, Bandar Sunway, Selangor, 47500, Malaysia, 60 374918622 ext 7452, longchiauming@gmail.com %K ChatGPT %K clinical research %K large language model %K artificial intelligence %K ethical considerations %K AI %K OpenAI %D 2023 %7 21.11.2023 %9 Viewpoint %J JMIR Med Educ %G English %X As we progress deeper into the digital age, the robust development and application of advanced artificial intelligence (AI) technology, specifically generative language models like ChatGPT (OpenAI), have potential implications in all sectors including medicine. This viewpoint article aims to present the authors’ perspective on the integration of AI models such as ChatGPT in clinical medicine and medical education. The unprecedented capacity of ChatGPT to generate human-like responses, refined through Reinforcement Learning with Human Feedback, could significantly reshape the pedagogical methodologies within medical education. Through a comprehensive review and the authors’ personal experiences, this viewpoint article elucidates the pros, cons, and ethical considerations of using ChatGPT within clinical medicine and notably, its implications for medical education. This exploration is crucial in a transformative era where AI could potentially augment human capability in the process of knowledge creation and dissemination, potentially revolutionizing medical education and clinical practice. The importance of maintaining academic integrity and professional standards is highlighted. The relevance of establishing clear guidelines for the responsible and ethical use of AI technologies in clinical medicine and medical education is also emphasized. %M 37988149 %R 10.2196/47274 %U https://mededu.jmir.org/2023/1/e47274 %U https://doi.org/10.2196/47274 %U http://www.ncbi.nlm.nih.gov/pubmed/37988149 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e49877 %T ChatGPT Interactive Medical Simulations for Early Clinical Education: Case Study %A Scherr,Riley %A Halaseh,Faris F %A Spina,Aidin %A Andalib,Saman %A Rivera,Ronald %+ Irvine School of Medicine, University of California, 1001 Health Sciences Rd, Irvine, CA, 92617, United States, 1 949 824 6119, rscherr@hs.uci.edu %K ChatGPT %K medical school simulations %K preclinical curriculum %K artificial intelligence %K AI %K AI in medical education %K medical education %K simulation %K generative %K curriculum %K clinical education %K simulations %D 2023 %7 10.11.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: The transition to clinical clerkships can be difficult for medical students, as it requires the synthesis and application of preclinical information into diagnostic and therapeutic decisions. ChatGPT—a generative language model with many medical applications due to its creativity, memory, and accuracy—can help students in this transition. Objective: This paper models ChatGPT 3.5’s ability to perform interactive clinical simulations and shows this tool’s benefit to medical education. Methods: Simulation starting prompts were refined using ChatGPT 3.5 in Google Chrome. 
Starting prompts were selected based on assessment format, stepwise progression of simulation events and questions, free-response question type, responsiveness to user inputs, postscenario feedback, and medical accuracy of the feedback. The chosen scenarios were advanced cardiac life support and medical intensive care (for sepsis and pneumonia). Results: Two starting prompts were chosen. Prompt 1 was developed through 3 test simulations and used successfully in 2 simulations. Prompt 2 was developed through 10 additional test simulations and used successfully in 1 simulation. Conclusions: ChatGPT is capable of creating simulations for early clinical education. These simulations let students practice novel parts of the clinical curriculum, such as forming independent diagnostic and therapeutic impressions over an entire patient encounter. Furthermore, the simulations can adapt to user inputs in a way that replicates real life more accurately than premade question bank clinical vignettes. Finally, ChatGPT can create potentially unlimited free simulations with specific feedback, which increases access for medical students with lower socioeconomic status and underresourced medical schools. However, no tool is perfect, and ChatGPT is no exception; there are concerns about simulation accuracy and replicability that need to be addressed to further optimize ChatGPT’s performance as an educational resource. %M 37948112 %R 10.2196/49877 %U https://mededu.jmir.org/2023/1/e49877 %U https://doi.org/10.2196/49877 %U http://www.ncbi.nlm.nih.gov/pubmed/37948112 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 7 %N %P e49459 %T Strengths and Weaknesses of ChatGPT Models for Scientific Writing About Medical Vitamin B12: Mixed Methods Study %A Abuyaman,Omar %+ Department of Medical Laboratory Sciences, Faculty of Applied Medical Sciences, The Hashemite University, FAMS Bldg, 2nd fl, Zarqa, 13133, Jordan, 962 781074280, o.abuyaman@gmail.com %K AI %K ChatGPT %K GPT-4 %K GPT-3.5 %K vitamin B12 %K artificial intelligence %K language editing %K wide range information %K AI solutions %K scientific content %D 2023 %7 10.11.2023 %9 Original Paper %J JMIR Form Res %G English %X Background: ChatGPT is a large language model developed by OpenAI designed to generate human-like responses to prompts. Objective: This study aims to evaluate the ability of GPT-4 to generate scientific content and assist in scientific writing using medical vitamin B12 as the topic. Furthermore, the study will compare the performance of GPT-4 to its predecessor, GPT-3.5. Methods: The study examined responses from GPT-4 and GPT-3.5 to vitamin B12–related prompts, focusing on their quality and characteristics and comparing them to established scientific literature. Results: The results indicated that GPT-4 can potentially streamline scientific writing through its ability to edit language and write abstracts, keywords, and abbreviation lists. However, significant limitations of ChatGPT were revealed, including its inability to identify and address bias, inability to include recent information, lack of transparency, and inclusion of inaccurate information. Additionally, it cannot check for plagiarism or provide proper references. The accuracy of GPT-4’s answers was found to be superior to GPT-3.5. Conclusions: ChatGPT can be considered a helpful assistant in the writing process but not a replacement for a scientist’s expertise. Researchers must remain aware of its limitations and use it appropriately. 
The improvements in consecutive ChatGPT versions suggest the possibility of overcoming some present limitations in the near future. %M 37948100 %R 10.2196/49459 %U https://formative.jmir.org/2023/1/e49459 %U https://doi.org/10.2196/49459 %U http://www.ncbi.nlm.nih.gov/pubmed/37948100 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e47191 %T Assessing the Performance of ChatGPT in Medical Biochemistry Using Clinical Case Vignettes: Observational Study %A Surapaneni,Krishna Mohan %+ Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai, 600123, India, 91 9789099989, krishnamohan.surapaneni@gmail.com %K ChatGPT %K artificial intelligence %K medical education %K medical Biochemistry %K biochemistry %K chatbot %K case study %K case scenario %K medical exam %K medical examination %K computer generated %D 2023 %7 7.11.2023 %9 Short Paper %J JMIR Med Educ %G English %X Background: ChatGPT has gained global attention recently owing to its high performance in generating a wide range of information and retrieving any kind of data instantaneously. ChatGPT has also been tested for the United States Medical Licensing Examination (USMLE) and has successfully cleared it. Thus, its usability in medical education is now one of the key discussions worldwide. Objective: The objective of this study is to evaluate the performance of ChatGPT in medical biochemistry using clinical case vignettes. Methods: The performance of ChatGPT was evaluated in medical biochemistry using 10 clinical case vignettes. Clinical case vignettes were randomly selected and inputted in ChatGPT along with the response options. We tested the responses for each clinical case twice. The answers generated by ChatGPT were saved and checked using our reference material. Results: ChatGPT generated correct answers for 4 questions on the first attempt. For the other cases, there were differences in responses generated by ChatGPT in the first and second attempts. In the second attempt, ChatGPT provided correct answers for 6 questions and incorrect answers for 4 questions out of the 10 cases that were used. But, to our surprise, for case 3, different answers were obtained with multiple attempts. We believe this to have happened owing to the complexity of the case, which involved addressing various critical medical aspects related to amino acid metabolism in a balanced approach. Conclusions: According to the findings of our study, ChatGPT may not be considered an accurate information provider for application in medical education to improve learning and assessment. However, our study was limited by a small sample size (10 clinical case vignettes) and the use of the publicly available version of ChatGPT (version 3.5). Although artificial intelligence (AI) has the capability to transform medical education, we emphasize the validation of such data produced by such AI systems for correctness and dependability before it could be implemented in practice. 
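[Editor's note] A small sketch of the repeat-consistency check described in the case-vignette study above: the same vignettes are posed twice and the answers compared. The letter choices here are hypothetical, not the study's responses.

# Hypothetical answer letters returned by the model on two attempts at 10 vignettes.
first_attempt = ["A", "C", "B", "D", "A", "B", "C", "A", "D", "B"]
second_attempt = ["A", "B", "B", "D", "A", "C", "C", "A", "D", "B"]

matches = sum(a == b for a, b in zip(first_attempt, second_attempt))
print(f"Consistent answers across attempts: {matches}/{len(first_attempt)}")
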
%M 37934568 %R 10.2196/47191 %U https://mededu.jmir.org/2023/1/e47191 %U https://doi.org/10.2196/47191 %U http://www.ncbi.nlm.nih.gov/pubmed/37934568 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e47532 %T The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study %A Ito,Naoki %A Kadomatsu,Sakina %A Fujisawa,Mineto %A Fukaguchi,Kiyomitsu %A Ishizawa,Ryo %A Kanda,Naoki %A Kasugai,Daisuke %A Nakajima,Mikio %A Goto,Tadahiro %A Tsugawa,Yusuke %+ TXP Medical Co Ltd, 41-1 H¹O Kanda 706, Tokyo, 101-0042, Japan, 81 03 5615 8433, tag695@mail.harvard.edu %K GPT-4 %K racial and ethnic bias %K typical clinical vignettes %K diagnosis %K triage %K artificial intelligence %K AI %K race %K clinical vignettes %K physician %K efficiency %K decision-making %K bias %K GPT %D 2023 %7 2.11.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Whether GPT-4, the conversational artificial intelligence, can accurately diagnose and triage health conditions and whether it presents racial and ethnic biases in its decisions remain unclear. Objective: We aim to assess the accuracy of GPT-4 in the diagnosis and triage of health conditions and whether its performance varies by patient race and ethnicity. Methods: We compared the performance of GPT-4 and physicians, using 45 typical clinical vignettes, each with a correct diagnosis and triage level, in February and March 2023. For each of the 45 clinical vignettes, GPT-4 and 3 board-certified physicians provided the most likely primary diagnosis and triage level (emergency, nonemergency, or self-care). Independent reviewers evaluated the diagnoses as “correct” or “incorrect.” Physician diagnosis was defined as the consensus of the 3 physicians. We evaluated whether the performance of GPT-4 varies by patient race and ethnicity, by adding the information on patient race and ethnicity to the clinical vignettes. Results: The accuracy of diagnosis was comparable between GPT-4 and physicians (the percentage of correct diagnosis was 97.8% (44/45; 95% CI 88.2%-99.9%) for GPT-4 and 91.1% (41/45; 95% CI 78.8%-97.5%) for physicians; P=.38). GPT-4 provided appropriate reasoning for 97.8% (44/45) of the vignettes. The appropriateness of triage was comparable between GPT-4 and physicians (GPT-4: 30/45, 66.7%; 95% CI 51.0%-80.0%; physicians: 30/45, 66.7%; 95% CI 51.0%-80.0%; P=.99). The performance of GPT-4 in diagnosing health conditions did not vary among different races and ethnicities (Black, White, Asian, and Hispanic), with an accuracy of 100% (95% CI 78.2%-100%). P values, compared to the GPT-4 output without incorporating race and ethnicity information, were all .99. The accuracy of triage was not significantly different even if patients’ race and ethnicity information was added. The accuracy of triage was 62.2% (95% CI 46.5%-76.2%; P=.50) for Black patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for White patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for Asian patients, and 62.2% (95% CI 46.5%-76.2%; P=.69) for Hispanic patients. P values were calculated by comparing the outputs with and without conditioning on race and ethnicity. Conclusions: GPT-4’s ability to diagnose and triage typical clinical vignettes was comparable to that of board-certified physicians. The performance of GPT-4 did not vary by patient race and ethnicity. 
These findings should be informative for health systems looking to introduce conversational artificial intelligence to improve the efficiency of patient diagnosis and triage. %M 37917120 %R 10.2196/47532 %U https://mededu.jmir.org/2023/1/e47532 %U https://doi.org/10.2196/47532 %U http://www.ncbi.nlm.nih.gov/pubmed/37917120 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e51421 %T Exploring the Possible Use of AI Chatbots in Public Health Education: Feasibility Study %A Baglivo,Francesco %A De Angelis,Luigi %A Casigliani,Virginia %A Arzilli,Guglielmo %A Privitera,Gaetano Pierpaolo %A Rizzo,Caterina %+ Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, Via San Zeno 35, Pisa (PI), 56123, Italy, 39 3288348649, f.baglivo@studenti.unipi.it %K artificial intelligence %K chatbots %K medical education %K vaccination %K public health %K medical students %K large language model %K generative AI %K ChatGPT %K Google Bard %K AI chatbot %K health education %K public health %K health care %K medical training %K educational support tool %K chatbot model %D 2023 %7 1.11.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Artificial intelligence (AI) is a rapidly developing field with the potential to transform various aspects of health care and public health, including medical training. During the “Hygiene and Public Health” course for fifth-year medical students, a practical training session was conducted on vaccination using AI chatbots as an educational supportive tool. Before receiving specific training on vaccination, the students were given a web-based test extracted from the Italian National Medical Residency Test. After completing the test, a critical correction of each question was performed assisted by AI chatbots. Objective: The main aim of this study was to identify whether AI chatbots can be considered educational support tools for training in public health. The secondary objective was to assess the performance of different AI chatbots on complex multiple-choice medical questions in the Italian language. Methods: A test composed of 15 multiple-choice questions on vaccination was extracted from the Italian National Medical Residency Test using targeted keywords and administered to medical students via Google Forms and to different AI chatbot models (Bing Chat, ChatGPT, Chatsonic, Google Bard, and YouChat). The correction of the test was conducted in the classroom, focusing on the critical evaluation of the explanations provided by the chatbot. A Mann-Whitney U test was conducted to compare the performances of medical students and AI chatbots. Student feedback was collected anonymously at the end of the training experience. Results: In total, 36 medical students and 5 AI chatbot models completed the test. The students achieved an average score of 8.22 (SD 2.65) out of 15, while the AI chatbots scored an average of 12.22 (SD 2.77). The results indicated a statistically significant difference in performance between the 2 groups (U=49.5, P<.001), with a large effect size (r=0.69). When divided by question type (direct, scenario-based, and negative), significant differences were observed in direct (P<.001) and scenario-based (P<.001) questions, but not in negative questions (P=.48). The students reported a high level of satisfaction (7.9/10) with the educational experience, expressing a strong desire to repeat the experience (7.6/10). 
Conclusions: This study demonstrated the efficacy of AI chatbots in answering complex medical questions related to vaccination and providing valuable educational support. Their performance significantly surpassed that of medical students in direct and scenario-based questions. The responsible and critical use of AI chatbots can enhance medical education, making it an essential aspect to integrate into the educational system. %M 37910155 %R 10.2196/51421 %U https://mededu.jmir.org/2023/1/e51421 %U https://doi.org/10.2196/51421 %U http://www.ncbi.nlm.nih.gov/pubmed/37910155 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48452 %T The Potential of GPT-4 as a Support Tool for Pharmacists: Analytical Study Using the Japanese National Examination for Pharmacists %A Kunitsu,Yuki %+ Department of Pharmacy, Shiga University of Medical Science Hospital, Seta Tukinowacho, Otsu, Shiga, 520-2121, Japan, 81 75 548 2111, ykunitsu@belle.shiga-med.ac.jp %K natural language processing %K generative pretrained transformer %K GPT-4 %K ChatGPT %K artificial intelligence %K AI %K chatbot %K pharmacy %K pharmacist %D 2023 %7 30.10.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: The advancement of artificial intelligence (AI), as well as machine learning, has led to its application in various industries, including health care. AI chatbots, such as GPT-4, developed by OpenAI, have demonstrated potential in supporting health care professionals by providing medical information, answering examination questions, and assisting in medical education. However, the applicability of GPT-4 in the field of pharmacy remains unexplored. Objective: This study aimed to evaluate GPT-4’s ability to answer questions from the Japanese National Examination for Pharmacists (JNEP) and assess its potential as a support tool for pharmacists in their daily practice. Methods: The question texts and answer choices from the 107th and 108th JNEP, held in February 2022 and February 2023, were input into GPT-4. As GPT-4 cannot process diagrams, questions that included diagram interpretation were not analyzed and were initially given a score of 0. The correct answer rates were calculated and compared with the passing criteria of each examination to evaluate GPT-4’s performance. Results: For the 107th and 108th JNEP, GPT-4 achieved an accuracy rate of 64.5% (222/344) and 62.9% (217/345), respectively, for all questions. When considering only the questions that GPT-4 could answer, the accuracy rates increased to 78.2% (222/284) and 75.3% (217/287), respectively. The accuracy rates tended to be lower for physics, chemistry, and calculation questions. Conclusions: Although GPT-4 demonstrated the potential to answer questions from the JNEP and support pharmacists’ capabilities, it also showed limitations in handling highly specialized questions, calculation questions, and questions requiring diagram recognition. Further evaluation is necessary to explore its applicability in real-world clinical settings, considering the complexities of patient scenarios and collaboration with health care professionals. By addressing these limitations, GPT-4 could become a more reliable tool for pharmacists in their daily practice. 
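[Editor's note] The two accuracy figures reported above for the 107th JNEP follow directly from scoring diagram questions as 0 versus excluding them; the short calculation below reproduces that arithmetic (a worked check, not the study's code).

# Reported counts for the 107th JNEP: 222 correct answers, 344 questions in total,
# of which 284 were text-only (diagram questions were scored 0 or excluded).
correct = 222
total_questions = 344
text_only_questions = 284

print(f"All questions: {correct / total_questions:.1%}")             # ~64.5%
print(f"Text-only questions: {correct / text_only_questions:.1%}")   # ~78.2%
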
%M 37837968 %R 10.2196/48452 %U https://mededu.jmir.org/2023/1/e48452 %U https://doi.org/10.2196/48452 %U http://www.ncbi.nlm.nih.gov/pubmed/37837968 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48785 %T Opportunities, Challenges, and Future Directions of Generative Artificial Intelligence in Medical Education: Scoping Review %A Preiksaitis,Carl %A Rose,Christian %+ Department of Emergency Medicine, Stanford University School of Medicine, 900 Welch Road, Suite 350, Palo Alto, CA, 94304, United States, 1 650 723 6576, cpreiksaitis@stanford.edu %K medical education %K artificial intelligence %K ChatGPT %K Bard %K AI %K educator %K scoping %K review %K learner %K generative %D 2023 %7 20.10.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Generative artificial intelligence (AI) technologies are increasingly being utilized across various fields, with considerable interest and concern regarding their potential application in medical education. These technologies, such as Chat GPT and Bard, can generate new content and have a wide range of possible applications. Objective: This study aimed to synthesize the potential opportunities and limitations of generative AI in medical education. It sought to identify prevalent themes within recent literature regarding potential applications and challenges of generative AI in medical education and use these to guide future areas for exploration. Methods: We conducted a scoping review, following the framework by Arksey and O'Malley, of English language articles published from 2022 onward that discussed generative AI in the context of medical education. A literature search was performed using PubMed, Web of Science, and Google Scholar databases. We screened articles for inclusion, extracted data from relevant studies, and completed a quantitative and qualitative synthesis of the data. Results: Thematic analysis revealed diverse potential applications for generative AI in medical education, including self-directed learning, simulation scenarios, and writing assistance. However, the literature also highlighted significant challenges, such as issues with academic integrity, data accuracy, and potential detriments to learning. Based on these themes and the current state of the literature, we propose the following 3 key areas for investigation: developing learners’ skills to evaluate AI critically, rethinking assessment methodology, and studying human-AI interactions. Conclusions: The integration of generative AI in medical education presents exciting opportunities, alongside considerable challenges. There is a need to develop new skills and competencies related to AI as well as thoughtful, nuanced approaches to examine the growing use of generative AI in medical education. 
%R 10.2196/48785 %U https://mededu.jmir.org/2023/1/e48785/ %U https://doi.org/10.2196/48785 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 7 %N %P e48023 %T Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study %A Yanagita,Yasutaka %A Yokokawa,Daiki %A Uchida,Shun %A Tawara,Junsuke %A Ikusaka,Masatomi %+ Department of General Medicine, Chiba University Hospital, 1-8-1 Inohana, Chuo-ku, Chiba, 260-8677, Japan, 81 43 222 7171 ext 6438, y.yanagita@gmail.com %K artificial intelligence %K ChatGPT %K GPT-4 %K AI %K National Medical Licensing Examination %K Japanese %K NMLE %D 2023 %7 13.10.2023 %9 Original Paper %J JMIR Form Res %G English %X Background: ChatGPT (OpenAI) has gained considerable attention because of its natural and intuitive responses. ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers, as stated by OpenAI as a limitation. However, considering that ChatGPT is an interactive AI that has been trained to reduce the output of unethical sentences, the reliability of the training data is high and the usefulness of the output content is promising. Fortunately, in March 2023, a new version of ChatGPT, GPT-4, was released, which, according to internal evaluations, was expected to increase the likelihood of producing factual responses by 40% compared with its predecessor, GPT-3.5. The usefulness of this version of ChatGPT in English is widely appreciated. It is also increasingly being evaluated as a system for obtaining medical information in languages other than English. Although it does not reach a passing score on the national medical examination in Chinese, its accuracy is expected to gradually improve. Evaluation of ChatGPT with Japanese input is limited, although there have been reports on the accuracy of ChatGPT’s answers to clinical questions regarding the Japanese Society of Hypertension guidelines and on the performance of the National Nursing Examination. Objective: The objective of this study is to evaluate whether ChatGPT can provide accurate diagnoses and medical knowledge for Japanese input. Methods: Questions from the National Medical Licensing Examination (NMLE) in Japan, administered by the Japanese Ministry of Health, Labour and Welfare in 2022, were used. All 400 questions were included. Exclusion criteria were figures and tables that ChatGPT could not recognize; only text questions were extracted. We instructed GPT-3.5 and GPT-4 to input the Japanese questions as they were and to output the correct answers for each question. The output of ChatGPT was verified by 2 general practice physicians. In case of discrepancies, they were checked by another physician to make a final decision. The overall performance was evaluated by calculating the percentage of correct answers output by GPT-3.5 and GPT-4. Results: Of the 400 questions, 292 were analyzed. Questions containing charts, which are not supported by ChatGPT, were excluded. The correct response rate for GPT-4 was 81.5% (237/292), which was significantly higher than the rate for GPT-3.5, 42.8% (125/292). Moreover, GPT-4 surpassed the passing standard (>72%) for the NMLE, indicating its potential as a diagnostic and therapeutic decision aid for physicians. Conclusions: GPT-4 reached the passing standard for the NMLE in Japan, entered in Japanese, although it is limited to written questions. 
As the accelerated progress in the past few months has shown, the performance of the AI will improve as the large language model continues to learn more, and it may well become a decision support system for medical professionals by providing more accurate information. %M 37831496 %R 10.2196/48023 %U https://formative.jmir.org/2023/1/e48023 %U https://doi.org/10.2196/48023 %U http://www.ncbi.nlm.nih.gov/pubmed/37831496 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48039 %T Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study %A Flores-Cohaila,Javier A %A García-Vicente,Abigaíl %A Vizcarra-Jiménez,Sonia F %A De la Cruz-Galán,Janith P %A Gutiérrez-Arratia,Jesús D %A Quiroga Torres,Blanca Geraldine %A Taype-Rondan,Alvaro %+ Academic Department, USAMEDIC, Jiron Leon Velarde 171. Lince, Lima, 15073, Peru, 51 924 341 073, javierfloresmed@gmail.com %K medical education %K generative pre-trained transformer %K ChatGPT %K licensing examination %K assessment %K Peru %K Examen Nacional de Medicina %K ENAM %K learning model %K artificial intelligence %K AI %K medical examination %D 2023 %7 28.9.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT has shown impressive performance in national medical licensing examinations, such as the United States Medical Licensing Examination (USMLE), even passing it with expert-level performance. However, there is a lack of research on its performance in low-income countries’ national licensing medical examinations. In Peru, where almost one out of three examinees fails the national licensing medical examination, ChatGPT has the potential to enhance medical education. Objective: We aimed to assess the accuracy of ChatGPT using GPT-3.5 and GPT-4 on the Peruvian National Licensing Medical Examination (Examen Nacional de Medicina [ENAM]). Additionally, we sought to identify factors associated with incorrect answers provided by ChatGPT. Methods: We used the ENAM 2022 data set, which consisted of 180 multiple-choice questions, to evaluate the performance of ChatGPT. Various prompts were used, and accuracy was evaluated. The performance of ChatGPT was compared to that of a sample of 1025 examinees. Factors such as question type, Peruvian-specific knowledge, discrimination, difficulty, quality of questions, and subject were analyzed to determine their influence on incorrect answers. Questions that received incorrect answers underwent a three-step process involving different prompts to explore the potential impact of adding roles and context on ChatGPT’s accuracy. Results: GPT-4 achieved an accuracy of 86% on the ENAM, followed by GPT-3.5 with 77%. The accuracy obtained by the 1025 examinees was 55%. There was a fair agreement (κ=0.38) between GPT-3.5 and GPT-4. Moderate-to-high-difficulty questions were associated with incorrect answers in the crude and adjusted model for GPT-3.5 (odds ratio [OR] 6.6, 95% CI 2.73-15.95) and GPT-4 (OR 33.23, 95% CI 4.3-257.12). After reinputting questions that received incorrect answers, GPT-3.5 went from 41 (100%) to 12 (29%) incorrect answers, and GPT-4 from 25 (100%) to 4 (16%). Conclusions: Our study found that ChatGPT (GPT-3.5 and GPT-4) can achieve expert-level performance on the ENAM, outperforming most of our examinees. We found fair agreement between both GPT-3.5 and GPT-4. Incorrect answers were associated with the difficulty of questions, which may resemble human performance. 
Furthermore, by reinputting questions that initially received incorrect answers with different prompts containing additional roles and context, ChatGPT achieved improved accuracy. %M 37768724 %R 10.2196/48039 %U https://mededu.jmir.org/2023/1/e48039 %U https://doi.org/10.2196/48039 %U http://www.ncbi.nlm.nih.gov/pubmed/37768724 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e50514 %T Assessment of Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: Comparative Study %A Huang,Ryan ST %A Lu,Kevin Jia Qi %A Meaney,Christopher %A Kemppainen,Joel %A Punnett,Angela %A Leung,Fok-Han %+ Temerty Faculty of Medicine, University of Toronto, 1 King’s College Cir, Toronto, ON, M5S 1A8, Canada, 1 416 978 6585, ry.huang@mail.utoronto.ca %K medical education %K medical knowledge exam %K artificial intelligence %K AI %K natural language processing %K NLP %K large language model %K LLM %K machine learning, ChatGPT %K GPT-3.5 %K GPT-4 %K education %K language model %K education examination %K testing %K utility %K family medicine %K medical residents %K test %K community %D 2023 %7 19.9.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Large language model (LLM)–based chatbots are evolving at an unprecedented pace with the release of ChatGPT, specifically GPT-3.5, and its successor, GPT-4. Their capabilities in general-purpose tasks and language generation have advanced to the point of performing excellently on various educational examination benchmarks, including medical knowledge tests. Comparing the performance of these 2 LLM models to that of Family Medicine residents on a multiple-choice medical knowledge test can provide insights into their potential as medical education tools. Objective: This study aimed to quantitatively and qualitatively compare the performance of GPT-3.5, GPT-4, and Family Medicine residents in a multiple-choice medical knowledge test appropriate for the level of a Family Medicine resident. Methods: An official University of Toronto Department of Family and Community Medicine Progress Test consisting of multiple-choice questions was inputted into GPT-3.5 and GPT-4. The artificial intelligence chatbot’s responses were manually reviewed to determine the selected answer, response length, response time, provision of a rationale for the outputted response, and the root cause of all incorrect responses (classified into arithmetic, logical, and information errors). The performance of the artificial intelligence chatbots were compared against a cohort of Family Medicine residents who concurrently attempted the test. Results: GPT-4 performed significantly better compared to GPT-3.5 (difference 25.0%, 95% CI 16.3%-32.8%; McNemar test: P<.001); it correctly answered 89/108 (82.4%) questions, while GPT-3.5 answered 62/108 (57.4%) questions correctly. Further, GPT-4 scored higher across all 11 categories of Family Medicine knowledge. In 86.1% (n=93) of the responses, GPT-4 provided a rationale for why other multiple-choice options were not chosen compared to the 16.7% (n=18) achieved by GPT-3.5. Qualitatively, for both GPT-3.5 and GPT-4 responses, logical errors were the most common, while arithmetic errors were the least common. The average performance of Family Medicine residents was 56.9% (95% CI 56.2%-57.6%). The performance of GPT-3.5 was similar to that of the average Family Medicine resident (P=.16), while the performance of GPT-4 exceeded that of the top-performing Family Medicine resident (P<.001). 
Conclusions: GPT-4 significantly outperforms both GPT-3.5 and Family Medicine residents on a multiple-choice medical knowledge test designed for Family Medicine residents. GPT-4 provides a logical rationale for its response choice, ruling out other answer choices efficiently and with concise justification. Its high degree of accuracy and advanced reasoning capabilities facilitate its potential applications in medical education, including the creation of exam questions and scenarios as well as serving as a resource for medical knowledge or information on community services. %M 37725411 %R 10.2196/50514 %U https://mededu.jmir.org/2023/1/e50514 %U https://doi.org/10.2196/50514 %U http://www.ncbi.nlm.nih.gov/pubmed/37725411 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e47049 %T The Potential and Concerns of Using AI in Scientific Research: ChatGPT Performance Evaluation %A Khlaif,Zuheir N %A Mousa,Allam %A Hattab,Muayad Kamal %A Itmazi,Jamil %A Hassan,Amjad A %A Sanmugam,Mageswaran %A Ayyoub,Abedalkarim %+ Faculty of Humanities and Educational Sciences, An-Najah National University, PO Box 7, Nablus, Occupied Palestinian Territory, 970 592754908, zkhlaif@najah.edu %K artificial intelligence %K AI %K ChatGPT %K scientific research %K research ethics %D 2023 %7 14.9.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Artificial intelligence (AI) has many applications in various aspects of our daily life, including health, criminal, education, civil, business, and liability law. One aspect of AI that has gained significant attention is natural language processing (NLP), which refers to the ability of computers to understand and generate human language. Objective: This study aims to examine the potential for, and concerns of, using AI in scientific research. For this purpose, high-impact research articles were generated by analyzing the quality of reports generated by ChatGPT and assessing the application’s impact on the research framework, data analysis, and the literature review. The study also explored concerns around ownership and the integrity of research when using AI-generated text. Methods: A total of 4 articles were generated using ChatGPT, and thereafter evaluated by 23 reviewers. The researchers developed an evaluation form to assess the quality of the articles generated. Additionally, 50 abstracts were generated using ChatGPT and their quality was evaluated. The data were subjected to ANOVA and thematic analysis to analyze the qualitative data provided by the reviewers. Results: When using detailed prompts and providing the context of the study, ChatGPT would generate high-quality research that could be published in high-impact journals. However, ChatGPT had a minor impact on developing the research framework and data analysis. The primary area needing improvement was the development of the literature review. Moreover, reviewers expressed concerns around ownership and the integrity of the research when using AI-generated text. Nonetheless, ChatGPT has a strong potential to increase human productivity in research and can be used in academic writing. Conclusions: AI-generated text has the potential to improve the quality of high-impact research articles. 
The findings of this study suggest that decision makers and researchers should focus more on the methodology of the research, including research design, the development of research tools, and in-depth data analysis, to draw strong theoretical and practical implications and thereby advance scientific research in the era of AI. The practical implications of this study can be applied in different fields, such as medical education, to deliver materials that develop the basic competencies of both medical students and faculty members. %M 37707884 %R 10.2196/47049 %U https://mededu.jmir.org/2023/1/e47049 %U https://doi.org/10.2196/47049 %U http://www.ncbi.nlm.nih.gov/pubmed/37707884 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48254 %T Assessing Health Students' Attitudes and Usage of ChatGPT in Jordan: Validation Study %A Sallam,Malik %A Salim,Nesreen A %A Barakat,Muna %A Al-Mahzoum,Kholoud %A Al-Tammemi,Ala'a B %A Malaeb,Diana %A Hallit,Rabih %A Hallit,Souheil %+ Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Queen Rania Al-Abdullah Street-Aljubeiha, Amman, 11942, Jordan, 962 0791845186, malik.sallam@ju.edu.jo %K artificial intelligence %K machine learning %K education %K technology %K healthcare %K survey %K opinion %K knowledge %K practices %K KAP %D 2023 %7 5.9.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT is a conversational large language model that has the potential to revolutionize knowledge acquisition. However, the impact of this technology on the quality of education is still unknown considering the risks and concerns surrounding ChatGPT use. Therefore, it is necessary to assess the usability and acceptability of this promising tool. As an innovative technology, the intention to use ChatGPT can be studied in the context of the technology acceptance model (TAM). Objective: This study aimed to develop and validate a TAM-based survey instrument called TAME-ChatGPT (Technology Acceptance Model Edited to Assess ChatGPT Adoption) that could be employed to examine the successful integration and use of ChatGPT in health care education. Methods: The survey tool was created based on the TAM framework. It comprised 13 items for participants who had heard of ChatGPT but had not used it and 23 items for participants who had used ChatGPT. Using a convenience sampling approach, the survey link was circulated electronically among university students between February and March 2023. Exploratory factor analysis (EFA) was used to assess the construct validity of the survey instrument. Results: The final sample comprised 458 respondents, the majority of whom were undergraduate students (n=442, 96.5%). Only 109 (23.8%) respondents had heard of ChatGPT prior to participation and only 55 (11.3%) self-reported ChatGPT use before the study. EFA of the attitude and usage scales showed significant Bartlett tests of sphericity (P<.001) and adequate Kaiser-Meyer-Olkin measures (0.823 for the attitude scale and 0.702 for the usage scale), confirming the factorability of the correlation matrices. The EFA showed that 3 constructs explained a cumulative total of 69.3% variance in the attitude scale, and these subscales represented perceived risks, attitude to technology/social influence, and anxiety.
For the ChatGPT usage scale, EFA showed that 4 constructs explained a cumulative total of 72% variance in the data and comprised the perceived usefulness, perceived risks, perceived ease of use, and behavior/cognitive factors. All the ChatGPT attitude and usage subscales showed good reliability with Cronbach α values >.78 for all the deduced subscales. Conclusions: The TAME-ChatGPT demonstrated good reliability, validity, and usefulness in assessing health care students’ attitudes toward ChatGPT. The findings highlighted the importance of considering risk perceptions, usefulness, ease of use, attitudes toward technology, and behavioral factors when adopting ChatGPT as a tool in health care education. This information can aid the stakeholders in creating strategies to support the optimal and ethical use of ChatGPT and to identify the potential challenges hindering its successful implementation. Future research is recommended to guide the effective adoption of ChatGPT in health care education. %M 37578934 %R 10.2196/48254 %U https://mededu.jmir.org/2023/1/e48254 %U https://doi.org/10.2196/48254 %U http://www.ncbi.nlm.nih.gov/pubmed/37578934 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e46482 %T Artificial Intelligence in Medical Education: Comparative Analysis of ChatGPT, Bing, and Medical Students in Germany %A Roos,Jonas %A Kasapovic,Adnan %A Jansen,Tom %A Kaczmarczyk,Robert %+ Department of Dermatology and Allergy, Technical University of Munich, Biedersteiner Str. 29, Munich, 80802, Germany, 49 08941403033, robert.kaczmarczyk@tum.de %K medical education %K state examinations %K exams %K large language models %K artificial intelligence %K ChatGPT %D 2023 %7 4.9.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Large language models (LLMs) have demonstrated significant potential in diverse domains, including medicine. Nonetheless, there is a scarcity of studies examining their performance in medical examinations, especially those conducted in languages other than English, and in direct comparison with medical students. Analyzing the performance of LLMs in state medical examinations can provide insights into their capabilities and limitations and evaluate their potential role in medical education and examination preparation.  Objective: This study aimed to assess and compare the performance of 3 LLMs, GPT-4, Bing, and GPT-3.5-Turbo, in the German Medical State Examinations of 2022 and to evaluate their performance relative to that of medical students.  Methods: The LLMs were assessed on a total of 630 questions from the spring and fall German Medical State Examinations of 2022. The performance was evaluated with and without media-related questions. Statistical analyses included 1-way ANOVA and independent samples t tests for pairwise comparisons. The relative strength of the LLMs in comparison with that of the students was also evaluated.  Results: GPT-4 achieved the highest overall performance, correctly answering 88.1% of questions, closely followed by Bing (86.0%) and GPT-3.5-Turbo (65.7%). The students had an average correct answer rate of 74.6%. Both GPT-4 and Bing significantly outperformed the students in both examinations. When media questions were excluded, Bing achieved the highest performance of 90.7%, closely followed by GPT-4 (90.4%), while GPT-3.5-Turbo lagged (68.2%). 
There was a significant decline in the performance of GPT-4 and Bing in the fall 2022 examination, which was attributed to a higher proportion of media-related questions and a potential increase in question difficulty.  Conclusions: LLMs, particularly GPT-4 and Bing, demonstrate potential as valuable tools in medical education and for pretesting examination questions. Their high performance, even relative to that of medical students, indicates promising avenues for further development and integration into the educational and clinical landscape.  %M 37665620 %R 10.2196/46482 %U https://mededu.jmir.org/2023/1/e46482 %U https://doi.org/10.2196/46482 %U http://www.ncbi.nlm.nih.gov/pubmed/37665620 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e51494 %T Can AI Mitigate Bias in Writing Letters of Recommendation? %A Leung,Tiffany I %A Sagar,Ankita %A Shroff,Swati %A Henry,Tracey L %+ JMIR Publications, 130 Queens Quay East, Unit 1100, Toronto, ON, M5A 0P6, Canada, 1 416 583 2040, tiffany.leung@jmir.org %K sponsorship %K implicit bias %K gender bias %K bias %K letters of recommendation %K artificial intelligence %K large language models %K medical education %K career advancement %K tenure and promotion %K promotion %K leadership %D 2023 %7 23.8.2023 %9 Editorial %J JMIR Med Educ %G English %X Letters of recommendation play a significant role in higher education and career progression, particularly for women and underrepresented groups in medicine and science. Already, there is evidence to suggest that written letters of recommendation contain language that expresses implicit biases, or unconscious biases, and that these biases occur for all recommenders regardless of the recommender’s sex. Given that all individuals have implicit biases that may influence language use, there may be opportunities to apply contemporary technologies, such as large language models or other forms of generative artificial intelligence (AI), to augment and potentially reduce implicit biases in the written language of letters of recommendation. In this editorial, we provide a brief overview of existing literature on the manifestations of implicit bias in letters of recommendation, with a focus on academia and medical education. We then highlight potential opportunities and drawbacks of applying this emerging technology in augmenting the focused, professional task of writing letters of recommendation. We also offer best practices for integrating their use into the routine writing of letters of recommendation and conclude with our outlook for the future of generative AI applications in supporting this task. 
%M 37610808 %R 10.2196/51494 %U https://mededu.jmir.org/2023/1/e51494 %U https://doi.org/10.2196/51494 %U http://www.ncbi.nlm.nih.gov/pubmed/37610808 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48433 %T Examining Real-World Medication Consultations and Drug-Herb Interactions: ChatGPT Performance Evaluation %A Hsu,Hsing-Yu %A Hsu,Kai-Cheng %A Hou,Shih-Yen %A Wu,Ching-Lung %A Hsieh,Yow-Wen %A Cheng,Yih-Dih %+ Department of Pharmacy, China Medical University Hospital, 2 Yuh-Der Road, Taichung, 404327, Taiwan, 886 4 22052121 ext 12261, yowenhsieh@gmail.com %K ChatGPT %K large language model %K natural language processing %K real-world medication consultation questions %K NLP %K drug-herb interactions %K pharmacist %K LLM %K language models %K chat generative pre-trained transformer %D 2023 %7 21.8.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Since OpenAI released ChatGPT, with its strong capability in handling natural tasks and its user-friendly interface, it has garnered significant attention. Objective: A prospective analysis is required to evaluate the accuracy and appropriateness of medication consultation responses generated by ChatGPT. Methods: A prospective cross-sectional study was conducted by the pharmacy department of a medical center in Taiwan. The test data set comprised retrospective medication consultation questions collected from February 1, 2023, to February 28, 2023, along with common questions about drug-herb interactions. Two distinct sets of questions were tested: real-world medication consultation questions and common questions about interactions between traditional Chinese and Western medicines. We used the conventional double-review mechanism. The appropriateness of each response from ChatGPT was assessed by 2 experienced pharmacists. In the event of a discrepancy between the assessments, a third pharmacist stepped in to make the final decision. Results: Of 293 real-world medication consultation questions, a random selection of 80 was used to evaluate ChatGPT’s performance. ChatGPT exhibited a higher appropriateness rate in responding to public medication consultation questions compared to those asked by health care providers in a hospital setting (31/51, 61% vs 20/51, 39%; P=.01). Conclusions: The findings from this study suggest that ChatGPT could potentially be used for answering basic medication consultation questions. Our analysis of the erroneous information allowed us to identify potential medical risks associated with certain questions; this problem deserves our close attention. %M 37561097 %R 10.2196/48433 %U https://mededu.jmir.org/2023/1/e48433 %U https://doi.org/10.2196/48433 %U http://www.ncbi.nlm.nih.gov/pubmed/37561097 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e47427 %T Using ChatGPT as a Learning Tool in Acupuncture Education: Comparative Study %A Lee,Hyeonhoon %+ Department of Anesthesiology and Pain Medicine, Seoul National University Hospital, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea, 82 2 2072 4627, hhoon@snu.ac.kr %K ChatGPT %K educational tool %K artificial intelligence %K acupuncture %K AI %K personalized education %K students %D 2023 %7 17.8.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT (Open AI) is a state-of-the-art artificial intelligence model with potential applications in the medical fields of clinical practice, research, and education. 
Objective: This study aimed to evaluate the potential of ChatGPT as an educational tool in college acupuncture programs, focusing on its ability to support students in learning acupuncture point selection, treatment planning, and decision-making. Methods: We collected case studies published in Acupuncture in Medicine between June 2022 and May 2023. Both ChatGPT-3.5 and ChatGPT-4 were used to generate suggestions for acupuncture points based on case presentations. A Wilcoxon signed-rank test was conducted to compare the number of acupuncture points generated by ChatGPT-3.5 and ChatGPT-4, and the overlapping ratio of acupuncture points was calculated. Results: Among the 21 case studies, 14 studies were included for analysis. ChatGPT-4 generated significantly more acupuncture points (9.0, SD 1.1) compared to ChatGPT-3.5 (5.6, SD 0.6; P<.001). The overlapping ratios of acupuncture points for ChatGPT-3.5 (0.40, SD 0.28) and ChatGPT-4 (0.34, SD 0.27; P=.67) were not significantly different. Conclusions: ChatGPT may be a useful educational tool for acupuncture students, providing valuable insights into personalized treatment plans. However, it cannot fully replace traditional diagnostic methods, and further studies are needed to ensure its safe and effective implementation in acupuncture education. %M 37590034 %R 10.2196/47427 %U https://mededu.jmir.org/2023/1/e47427 %U https://doi.org/10.2196/47427 %U http://www.ncbi.nlm.nih.gov/pubmed/37590034 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48978 %T Performance of ChatGPT on the Situational Judgement Test—A Professional Dilemmas–Based Examination for Doctors in the United Kingdom %A Borchert,Robin J %A Hickman,Charlotte R %A Pepys,Jack %A Sadler,Timothy J %+ Department of Radiology, University of Cambridge, Hills Road, Cambridge, CB2 0QQ, United Kingdom, 1 1223 805000, rb729@medschl.cam.ac.uk %K ChatGPT %K language models %K Situational Judgement Test %K medical education %K artificial intelligence %K language model %K exam %K examination %K SJT %K judgement %K reasoning %K communication %K chatbot %D 2023 %7 7.8.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT is a large language model that has performed well on professional examinations in the fields of medicine, law, and business. However, it is unclear how ChatGPT would perform on an examination assessing professionalism and situational judgement for doctors. Objective: We evaluated the performance of ChatGPT on the Situational Judgement Test (SJT): a national examination taken by all final-year medical students in the United Kingdom. This examination is designed to assess attributes such as communication, teamwork, patient safety, prioritization skills, professionalism, and ethics. Methods: All questions from the UK Foundation Programme Office’s (UKFPO’s) 2023 SJT practice examination were inputted into ChatGPT. For each question, ChatGPT’s answers and rationales were recorded and assessed on the basis of the official UK Foundation Programme Office scoring template. Questions were categorized into domains of Good Medical Practice on the basis of the domains referenced in the rationales provided in the scoring sheet. Questions without clear domain links were screened by reviewers and assigned one or multiple domains. ChatGPT's overall performance, as well as its performance across the domains of Good Medical Practice, was evaluated. 
Results: Overall, ChatGPT performed well, scoring 76% on the SJT but scoring full marks on only a few questions (9%), which may reflect possible flaws in ChatGPT’s situational judgement or inconsistencies in the reasoning across questions (or both) in the examination itself. ChatGPT demonstrated consistent performance across the 4 outlined domains in Good Medical Practice for doctors. Conclusions: Further research is needed to understand the potential applications of large language models, such as ChatGPT, in medical education for standardizing questions and providing consistent rationales for examinations assessing professionalism and ethics. %M 37548997 %R 10.2196/48978 %U https://mededu.jmir.org/2023/1/e48978 %U https://doi.org/10.2196/48978 %U http://www.ncbi.nlm.nih.gov/pubmed/37548997 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e50336 %T Authors’ Reply to: Variability in Large Language Models’ Responses to Medical Licensing and Certification Examinations %A Gilson,Aidan %A Safranek,Conrad W %A Huang,Thomas %A Socrates,Vimig %A Chi,Ling %A Taylor,Richard Andrew %A Chartash,David %+ Section for Biomedical Informatics and Data Science, Yale University School of Medicine, 100 College Street, 9th Fl, New Haven, CT, 06510, United States, 1 203 737 5379, david.chartash@yale.edu %K natural language processing %K NLP %K MedQA %K generative pre-trained transformer %K GPT %K medical education %K chatbot %K artificial intelligence %K AI %K education technology %K ChatGPT %K conversational agent %K machine learning %K large language models %K knowledge assessment %D 2023 %7 13.7.2023 %9 Letter to the Editor %J JMIR Med Educ %G English %X %M 37440299 %R 10.2196/50336 %U https://mededu.jmir.org/2023/1/e50336 %U https://doi.org/10.2196/50336 %U http://www.ncbi.nlm.nih.gov/pubmed/37440299 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48305 %T Variability in Large Language Models’ Responses to Medical Licensing and Certification Examinations. Comment on “How Does ChatGPT Perform on the United States Medical Licensing Examination? 
The Implications of Large Language Models for Medical Education and Knowledge Assessment” %A Epstein,Richard H %A Dexter,Franklin %+ Department of Anesthesiology, Perioperative Medicine and Pain Management, University of Miami Miller School of Medicine, 1400 NW 12th Ave, Suite 4022F, Miami, FL, 33136, United States, 1 215 896 7850, repstein@med.miami.edu %K natural language processing %K NLP %K MedQA %K generative pre-trained transformer %K GPT %K medical education %K chatbot %K artificial intelligence %K AI %K education technology %K ChatGPT %K Google Bard %K conversational agent %K machine learning %K large language models %K knowledge assessment %D 2023 %7 13.7.2023 %9 Letter to the Editor %J JMIR Med Educ %G English %X %M 37440293 %R 10.2196/48305 %U https://mededu.jmir.org/2023/1/e48305 %U https://doi.org/10.2196/48305 %U http://www.ncbi.nlm.nih.gov/pubmed/37440293 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e46344 %T Data Science as a Core Competency in Undergraduate Medical Education in the Age of Artificial Intelligence in Health Care %A Seth,Puneet %A Hueppchen,Nancy %A Miller,Steven D %A Rudzicz,Frank %A Ding,Jerry %A Parakh,Kapil %A Record,Janet D %+ Department of Family Medicine, McMaster University, 100 Main Street West, 6th Floor, Hamilton, ON, L8P 1H6, Canada, 1 4166715114, sethp1@mcmaster.ca %K data science %K medical education %K machine learning %K health data %K artificial intelligence %K AI %K application %K health care delivery %K health care %K develop %K medical educators %K physician %K education %K training %K barriers %K optimize %K integration %K competency %D 2023 %7 11.7.2023 %9 Viewpoint %J JMIR Med Educ %G English %X The increasingly sophisticated and rapidly evolving application of artificial intelligence in medicine is transforming how health care is delivered, highlighting a need for current and future physicians to develop basic competency in the data science that underlies this topic. Medical educators must consider how to incorporate central concepts in data science into their core curricula to train physicians of the future. Similar to how the advent of diagnostic imaging required the physician to understand, interpret, and explain the relevant results to patients, physicians of the future should be able to explain to patients the benefits and limitations of management plans guided by artificial intelligence. We outline major content domains and associated learning outcomes in data science applicable to medical student curricula, suggest ways to incorporate these themes into existing curricula, and note potential implementation barriers and solutions to optimize the integration of this content. 
%M 37432728 %R 10.2196/46344 %U https://mededu.jmir.org/2023/1/e46344 %U https://doi.org/10.2196/46344 %U http://www.ncbi.nlm.nih.gov/pubmed/37432728 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e46939 %T Putting ChatGPT’s Medical Advice to the (Turing) Test: Survey Study %A Nov,Oded %A Singh,Nina %A Mann,Devin %+ Department of Technology Management, Tandon School of Engineering, New York University, 5 Metrotech, Brooklyn, New York, NY, 11201, United States, 1 646 207 7864, onov@nyu.edu %K artificial intelligence %K AI %K ChatGPT %K large language model %K patient-provider interaction %K chatbot %K feasibility %K ethics %K privacy %K language model %K machine learning %D 2023 %7 10.7.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Chatbots are being piloted to draft responses to patient questions, but patients’ ability to distinguish between provider and chatbot responses and patients’ trust in chatbots’ functions are not well established. Objective: This study aimed to assess the feasibility of using ChatGPT (Chat Generative Pre-trained Transformer) or a similar artificial intelligence–based chatbot for patient-provider communication. Methods: A survey study was conducted in January 2023. Ten representative, nonadministrative patient-provider interactions were extracted from the electronic health record. Patients’ questions were entered into ChatGPT with a request for the chatbot to respond using approximately the same word count as the human provider’s response. In the survey, each patient question was followed by a provider- or ChatGPT-generated response. Participants were informed that 5 responses were provider generated and 5 were chatbot generated. Participants were asked, and incentivized financially, to correctly identify the response source. Participants were also asked about their trust in chatbots’ functions in patient-provider communication, using a Likert scale from 1 to 5. Results: A US-representative sample of 430 study participants aged 18 and older was recruited on Prolific, a crowdsourcing platform for academic studies. In all, 426 participants filled out the full survey. After removing participants who spent less than 3 minutes on the survey, 392 respondents remained. Overall, 53.3% (209/392) of respondents analyzed were women, and the average age was 47.1 (range 18-91) years. The correct classification of responses ranged from 49% (192/392) to 85.7% (336/392) for different questions. On average, chatbot responses were identified correctly in 65.5% (1284/1960) of the cases, and human provider responses were identified correctly in 65.1% (1276/1960) of the cases. On average, responses regarding patients’ trust in chatbots’ functions were weakly positive (mean Likert score 3.4 out of 5), with lower trust as the health-related complexity of the task in the questions increased. Conclusions: ChatGPT responses to patient questions were weakly distinguishable from provider responses. Laypeople appear to trust the use of chatbots to answer lower-risk health questions. It is important to continue studying patient-chatbot interaction as chatbots move from administrative to more clinical roles in health care.
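The preceding abstract reports that raters identified chatbot responses correctly in 65.5% (1284/1960) of cases and provider responses in 65.1% (1276/1960) of cases. The following is a minimal sketch, assuming Python with SciPy and a 50% chance-guessing baseline (an illustrative choice, not part of the published analysis), of how such counts could be compared against chance; the variable names are hypothetical.

    # Sketch: exact binomial test of source-identification rates against a 50% chance baseline.
    # Counts are taken from the abstract above; the test itself is illustrative only.
    from scipy.stats import binomtest

    identification_counts = {
        "chatbot responses": (1284, 1960),
        "provider responses": (1276, 1960),
    }

    for label, (correct, total) in identification_counts.items():
        result = binomtest(correct, total, p=0.5, alternative="greater")
        ci = result.proportion_ci(confidence_level=0.95)
        print(f"{label}: {correct}/{total} = {correct / total:.1%}, "
              f"P = {result.pvalue:.2e}, 95% CI {ci.low:.1%}-{ci.high:.1%}")

Under these assumptions, both proportions sit above the 50% baseline, which is consistent with the abstract's conclusion that responses were distinguishable, although only weakly.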
%M 37428540 %R 10.2196/46939 %U https://mededu.jmir.org/2023/1/e46939 %U https://doi.org/10.2196/46939 %U http://www.ncbi.nlm.nih.gov/pubmed/37428540 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48002 %T Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: Comparison Study %A Takagi,Soshi %A Watari,Takashi %A Erabi,Ayano %A Sakaguchi,Kota %+ General Medicine Center, Shimane University Hospital, 89-1, Enya, Izumo, 693-8501, Japan, 81 0853 20 2217, wataritari@gmail.com %K ChatGPT %K Chat Generative Pre-trained Transformer %K GPT-4 %K Generative Pre-trained Transformer 4 %K artificial intelligence %K AI %K medical education %K Japanese Medical Licensing Examination %K medical licensing %K clinical support %K learning model %D 2023 %7 29.6.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: The competence of ChatGPT (Chat Generative Pre-Trained Transformer) in non-English languages is not well studied. Objective: This study compared the performances of GPT-3.5 (Generative Pre-trained Transformer) and GPT-4 on the Japanese Medical Licensing Examination (JMLE) to evaluate the reliability of these models for clinical reasoning and medical knowledge in non-English languages. Methods: This study used the default mode of ChatGPT, which is based on GPT-3.5; the GPT-4 model of ChatGPT Plus; and the 117th JMLE in 2023. A total of 254 questions were included in the final analysis, which were categorized into 3 types, namely general, clinical, and clinical sentence questions. Results: The results indicated that GPT-4 outperformed GPT-3.5 in terms of accuracy, particularly for general, clinical, and clinical sentence questions. GPT-4 also performed better on difficult questions and specific disease questions. Furthermore, GPT-4 achieved the passing criteria for the JMLE, indicating its reliability for clinical reasoning and medical knowledge in non-English languages. Conclusions: GPT-4 could become a valuable tool for medical education and clinical support in non–English-speaking regions, such as Japan. %M 37384388 %R 10.2196/48002 %U https://mededu.jmir.org/2023/1/e48002 %U https://doi.org/10.2196/48002 %U http://www.ncbi.nlm.nih.gov/pubmed/37384388 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48163 %T The Advent of Generative Language Models in Medical Education %A Karabacak,Mert %A Ozkara,Burak Berksu %A Margetis,Konstantinos %A Wintermark,Max %A Bisdas,Sotirios %+ Department of Neuroradiology, The National Hospital for Neurology and Neurosurgery, University College London NHS Foundation Trust, National Hospital for Neurology and Neurosurgery, Queen Square, London, WC1N 3BG, United Kingdom, 44 020 3448 3446, s.bisdas@ucl.ac.uk %K generative language model %K artificial intelligence %K medical education %K ChatGPT %K academic integrity %K AI-driven feedback %K stimulation %K evaluation %K technology %K learning environment %K medical student %D 2023 %7 6.6.2023 %9 Viewpoint %J JMIR Med Educ %G English %X Artificial intelligence (AI) and generative language models (GLMs) present significant opportunities for enhancing medical education, including the provision of realistic simulations, digital patients, personalized feedback, evaluation methods, and the elimination of language barriers. These advanced technologies can facilitate immersive learning environments and enhance medical students' educational outcomes. However, ensuring content quality, addressing biases, and managing ethical and legal concerns present obstacles. 
To mitigate these challenges, it is necessary to evaluate the accuracy and relevance of AI-generated content, address potential biases, and develop guidelines and policies governing the use of AI-generated content in medical education. Collaboration among educators, researchers, and practitioners is essential for developing best practices, guidelines, and transparent AI models that encourage the ethical and responsible use of GLMs and AI in medical education. By sharing information about the data used for training, obstacles encountered, and evaluation methods, developers can increase their credibility and trustworthiness within the medical community. In order to realize the full potential of AI and GLMs in medical education while mitigating potential risks and obstacles, ongoing research and interdisciplinary collaboration are necessary. By collaborating, medical professionals can ensure that these technologies are effectively and responsibly integrated, contributing to enhanced learning experiences and patient care. %M 37279048 %R 10.2196/48163 %U https://mededu.jmir.org/2023/1/e48163 %U https://doi.org/10.2196/48163 %U http://www.ncbi.nlm.nih.gov/pubmed/37279048 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48291 %T Large Language Models in Medical Education: Opportunities, Challenges, and Future Directions %A Abd-alrazaq,Alaa %A AlSaad,Rawan %A Alhuwail,Dari %A Ahmed,Arfan %A Healy,Padraig Mark %A Latifi,Syed %A Aziz,Sarah %A Damseh,Rafat %A Alabed Alrazak,Sadam %A Sheikh,Javaid %+ AI Center for Precision Health, Weill Cornell Medicine-Qatar, PO Box 5825, Doha Al Luqta St, Ar-Rayyan, Doha, NA, Qatar, 974 55708549, alaa_alzoubi88@yahoo.com %K large language models %K artificial intelligence %K medical education %K ChatGPT %K GPT-4 %K generative AI %K students %K educators %D 2023 %7 1.6.2023 %9 Viewpoint %J JMIR Med Educ %G English %X The integration of large language models (LLMs), such as those in the Generative Pre-trained Transformers (GPT) series, into medical education has the potential to transform learning experiences for students and elevate their knowledge, skills, and competence. Drawing on a wealth of professional and academic experience, we propose that LLMs hold promise for revolutionizing medical curriculum development, teaching methodologies, personalized study plans and learning materials, student assessments, and more. However, we also critically examine the challenges that such integration might pose by addressing issues of algorithmic bias, overreliance, plagiarism, misinformation, inequity, privacy, and copyright concerns in medical education. As we navigate the shift from an information-driven educational paradigm to an artificial intelligence (AI)–driven educational paradigm, we argue that it is paramount to understand both the potential and the pitfalls of LLMs in medical education. This paper thus offers our perspective on the opportunities and challenges of using LLMs in this context. We believe that the insights gleaned from this analysis will serve as a foundation for future recommendations and best practices in the field, fostering the responsible and effective use of AI technologies in medical education. 
%M 37261894 %R 10.2196/48291 %U https://mededu.jmir.org/2023/1/e48291 %U https://doi.org/10.2196/48291 %U http://www.ncbi.nlm.nih.gov/pubmed/37261894 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e47737 %T Performance of ChatGPT on UK Standardized Admission Tests: Insights From the BMAT, TMUA, LNAT, and TSA Examinations %A Giannos,Panagiotis %A Delardas,Orestis %+ Department of Life Sciences, Faculty of Natural Sciences, Imperial College London, South Kensington, London, SW7 2AZ, United Kingdom, 44 7765071907, panagiotis.giannos19@imperial.ac.uk %K standardized admissions tests %K GPT %K ChatGPT %K medical education %K medicine %K law %K natural language processing %K BMAT %K TMUA %K LNAT %K TSA %D 2023 %7 26.4.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Large language models, such as ChatGPT by OpenAI, have demonstrated potential in various applications, including medical education. Previous studies have assessed ChatGPT’s performance in university or professional settings. However, the model’s potential in the context of standardized admission tests remains unexplored. Objective: This study evaluated ChatGPT’s performance on standardized admission tests in the United Kingdom, including the BioMedical Admissions Test (BMAT), Test of Mathematics for University Admission (TMUA), Law National Aptitude Test (LNAT), and Thinking Skills Assessment (TSA), to understand its potential as an innovative tool for education and test preparation. Methods: Recent public resources (2019-2022) were used to compile a data set of 509 questions from the BMAT, TMUA, LNAT, and TSA covering diverse topics in aptitude, scientific knowledge and applications, mathematical thinking and reasoning, critical thinking, problem-solving, reading comprehension, and logical reasoning. This evaluation assessed ChatGPT’s performance using the legacy GPT-3.5 model, focusing on multiple-choice questions for consistency. The model’s performance was analyzed based on question difficulty, the proportion of correct responses when aggregating exams from all years, and a comparison of test scores between papers of the same exam using binomial distribution and paired-sample (2-tailed) t tests. Results: The proportion of correct responses was significantly lower than incorrect ones in BMAT section 2 (P<.001) and TMUA paper 1 (P<.001) and paper 2 (P<.001). No significant differences were observed in BMAT section 1 (P=.2), TSA section 1 (P=.7), or LNAT papers 1 and 2, section A (P=.3). ChatGPT performed better in BMAT section 1 than section 2 (P=.047), with a maximum candidate ranking of 73% compared to a minimum of 1%. In the TMUA, it engaged with questions but had limited accuracy and no performance difference between papers (P=.6), with candidate rankings below 10%. In the LNAT, it demonstrated moderate success, especially in paper 2’s questions; however, student performance data were unavailable. TSA performance varied across years with generally moderate results and fluctuating candidate rankings. Similar trends were observed for easy to moderate difficulty questions (BMAT section 1, P=.3; BMAT section 2, P=.04; TMUA paper 1, P<.001; TMUA paper 2, P=.003; TSA section 1, P=.8; and LNAT papers 1 and 2, section A, P>.99) and hard to challenging ones (BMAT section 1, P=.7; BMAT section 2, P<.001; TMUA paper 1, P=.007; TMUA paper 2, P<.001; TSA section 1, P=.3; and LNAT papers 1 and 2, section A, P=.2). 
Conclusions: ChatGPT shows promise as a supplementary tool for subject areas and test formats that assess aptitude, problem-solving and critical thinking, and reading comprehension. However, its limitations in areas such as scientific and mathematical knowledge and applications highlight the need for continuous development and integration with conventional learning strategies in order to fully harness its potential. %M 37099373 %R 10.2196/47737 %U https://mededu.jmir.org/2023/1/e47737 %U https://doi.org/10.2196/47737 %U http://www.ncbi.nlm.nih.gov/pubmed/37099373 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e46599 %T Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care %A Thirunavukarasu,Arun James %A Hassan,Refaat %A Mahmood,Shathar %A Sanghera,Rohan %A Barzangi,Kara %A El Mukashfi,Mohanned %A Shah,Sachin %+ University of Cambridge School of Clinical Medicine, Box 111 Cambridge Biomedical Campus, Cambridge, CB2 0SP, United Kingdom, 44 0 1223 336732 ext 3, ajt205@cantab.ac.uk %K ChatGPT %K large language model %K natural language processing %K decision support techniques %K artificial intelligence %K AI %K deep learning %K primary care %K general practice %K family medicine %K chatbot %D 2023 %7 21.4.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Large language models exhibiting human-level performance in specialized tasks are emerging; examples include Generative Pretrained Transformer 3.5, which underlies the processing of ChatGPT. Rigorous trials are required to understand the capabilities of emerging technology, so that innovation can be directed to benefit patients and practitioners. Objective: Here, we evaluated the strengths and weaknesses of ChatGPT in primary care using the Membership of the Royal College of General Practitioners Applied Knowledge Test (AKT) as a medium. Methods: AKT questions were sourced from a web-based question bank and 2 AKT practice papers. In total, 674 unique AKT questions were inputted to ChatGPT, with the model’s answers recorded and compared to correct answers provided by the Royal College of General Practitioners. Each question was inputted twice in separate ChatGPT sessions, with answers on repeated trials compared to gauge consistency. Subject difficulty was gauged by referring to examiners’ reports from 2018 to 2022. Novel explanations from ChatGPT—defined as information provided that was not inputted within the question or multiple answer choices—were recorded. Performance was analyzed with respect to subject, difficulty, question source, and novel model outputs to explore ChatGPT’s strengths and weaknesses. Results: Average overall performance of ChatGPT was 60.17%, which is below the mean passing mark in the last 2 years (70.42%). Accuracy differed between sources (P=.04 and .06). ChatGPT’s performance varied with subject category (P=.02 and .02), but variation did not correlate with difficulty (Spearman ρ=–0.241 and –0.238; P=.19 and .20). The proclivity of ChatGPT to provide novel explanations did not affect accuracy (P>.99 and .23). Conclusions: Large language models are approaching human expert–level performance, although further development is required to match the performance of qualified primary care physicians in the AKT. Validated high-performance models may serve as assistants or autonomous clinical tools to ameliorate the general practice workforce crisis. 
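The Applied Knowledge Test abstract above reports Spearman rank correlations between ChatGPT's accuracy and subject difficulty (ρ=–0.241 and –0.238, both nonsignificant). Below is a minimal sketch of that kind of calculation, assuming Python with SciPy and entirely hypothetical per-subject values; the study's underlying data are not reproduced here.

    # Sketch: Spearman rank correlation between per-subject accuracy and subject difficulty.
    # The two lists below are hypothetical placeholders, not data from the AKT study.
    from scipy.stats import spearmanr

    subject_accuracy = [0.72, 0.58, 0.64, 0.49, 0.81, 0.55, 0.61]    # accuracy per subject category
    subject_difficulty = [0.35, 0.52, 0.41, 0.60, 0.30, 0.48, 0.44]  # examiner-derived difficulty index

    rho, p_value = spearmanr(subject_accuracy, subject_difficulty)
    print(f"Spearman rho = {rho:.3f}, P = {p_value:.3f}")

A weak, nonsignificant correlation of this kind matches what the abstract describes: accuracy varies across subject categories, but that variation does not track the examiners' difficulty ratings.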
%M 37083633 %R 10.2196/46599 %U https://mededu.jmir.org/2023/1/e46599 %U https://doi.org/10.2196/46599 %U http://www.ncbi.nlm.nih.gov/pubmed/37083633 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e43110 %T What Does DALL-E 2 Know About Radiology? %A Adams,Lisa C %A Busch,Felix %A Truhn,Daniel %A Makowski,Marcus R %A Aerts,Hugo J W L %A Bressem,Keno K %+ Department of Radiology, Charité Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Hindenburgdamm 30, Berlin, 12203, Germany, 49 30 450 527792, keno-kyrill.bressem@charite.de %K DALL-E %K creating images from text %K image creation %K image generation %K transformer language model %K machine learning %K generative model %K radiology %K x-ray %K artificial intelligence %K medical imaging %K text-to-image %K diagnostic imaging %D 2023 %7 16.3.2023 %9 Viewpoint %J J Med Internet Res %G English %X Generative models, such as DALL-E 2 (OpenAI), could represent promising future tools for image generation, augmentation, and manipulation for artificial intelligence research in radiology, provided that these models have sufficient medical domain knowledge. Herein, we show that DALL-E 2 has learned relevant representations of x-ray images, with promising capabilities in terms of zero-shot text-to-image generation of new images, the continuation of an image beyond its original boundaries, and the removal of elements; however, its capabilities for the generation of images with pathological abnormalities (eg, tumors, fractures, and inflammation) or computed tomography, magnetic resonance imaging, or ultrasound images are still limited. The use of generative models for augmenting and generating radiological data thus seems feasible, even if the further fine-tuning and adaptation of these models to their respective domains are required first. %M 36927634 %R 10.2196/43110 %U https://www.jmir.org/2023/1/e43110 %U https://doi.org/10.2196/43110 %U http://www.ncbi.nlm.nih.gov/pubmed/36927634 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e46876 %T ChatGPT in Clinical Toxicology %A Sabry Abdel-Messih,Mary %A Kamel Boulos,Maged N %+ School of Medicine, University of Lisbon, Av Prof Egas Moniz MB, Lisbon, 1649-028, Portugal, 351 92 053 1573, mnkboulos@ieee.org %K ChatGPT %K clinical toxicology %K organophosphates %K artificial intelligence %K AI %K medical education %D 2023 %7 8.3.2023 %9 Letter to the Editor %J JMIR Med Educ %G English %X ChatGPT has recently been shown to pass the United States Medical Licensing Examination (USMLE). We tested ChatGPT (Feb 13, 2023 release) using a typical clinical toxicology case of acute organophosphate poisoning. ChatGPT fared well in answering all of our queries regarding it. 
%M 36867743 %R 10.2196/46876 %U https://mededu.jmir.org/2023/1/e46876 %U https://doi.org/10.2196/46876 %U http://www.ncbi.nlm.nih.gov/pubmed/36867743 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e46885 %T The Role of ChatGPT, Generative Language Models, and Artificial Intelligence in Medical Education: A Conversation With ChatGPT and a Call for Papers %A Eysenbach,Gunther %+ JMIR Publications, 130 Queens Quay East, Suite 1100-1102, Toronto, ON, M5A 0P6, Canada, 1 416 786 6970, geysenba@gmail.com %K artificial intelligence %K AI %K ChatGPT %K generative language model %K medical education %K interview %K future of education %D 2023 %7 6.3.2023 %9 Editorial %J JMIR Med Educ %G English %X ChatGPT is a generative language model tool launched by OpenAI on November 30, 2022, enabling the public to converse with a machine on a broad range of topics. In January 2023, ChatGPT reached over 100 million users, making it the fastest-growing consumer application to date. This interview with ChatGPT is part 2 of a larger interview with ChatGPT. It provides a snapshot of the current capabilities of ChatGPT and illustrates the vast potential for medical education, research, and practice but also hints at current problems and limitations. In this conversation with Gunther Eysenbach, the founder and publisher of JMIR Publications, ChatGPT generated some ideas on how to use chatbots in medical education. It also illustrated its capabilities to generate a virtual patient simulation and quizzes for medical students; critiqued a simulated doctor-patient communication and attempts to summarize a research article (which turned out to be fabricated); commented on methods to detect machine-generated text to ensure academic integrity; generated a curriculum for health professionals to learn about artificial intelligence (AI); and helped to draft a call for papers for a new theme issue to be launched in JMIR Medical Education on ChatGPT. The conversation also highlighted the importance of proper “prompting.” Although the language generator does make occasional mistakes, it admits these when challenged. The well-known disturbing tendency of large language models to hallucinate became evident when ChatGPT fabricated references. The interview provides a glimpse into the capabilities and limitations of ChatGPT and the future of AI-supported medical education. Due to the impact of this new technology on medical education, JMIR Medical Education is launching a call for papers for a new e-collection and theme issue. The initial draft of the call for papers was entirely machine generated by ChatGPT, but will be edited by the human guest editors of the theme issue. %M 36863937 %R 10.2196/46885 %U https://mededu.jmir.org/2023/1/e46885 %U https://doi.org/10.2196/46885 %U http://www.ncbi.nlm.nih.gov/pubmed/36863937 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e45312 %T How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? 
The Implications of Large Language Models for Medical Education and Knowledge Assessment %A Gilson,Aidan %A Safranek,Conrad W %A Huang,Thomas %A Socrates,Vimig %A Chi,Ling %A Taylor,Richard Andrew %A Chartash,David %+ Section for Biomedical Informatics and Data Science, Yale University School of Medicine, 300 George Street, Suite 501, New Haven, CT, 06511, United States, 1 203 737 5379, david.chartash@yale.edu %K natural language processing %K NLP %K MedQA %K generative pre-trained transformer %K GPT %K medical education %K chatbot %K artificial intelligence %K education technology %K ChatGPT %K conversational agent %K machine learning %K USMLE %D 2023 %7 8.2.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input. Objective: This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 exams, as well as to analyze responses for user interpretability. Methods: We used 2 sets of multiple-choice questions to evaluate ChatGPT’s performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and the performance on an exam relative to the user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT’s performance was compared to 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question. Results: Of the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. We found that logical justification for ChatGPT’s answer selection was present in 100% of outputs of the NBME data sets. Internal information to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively. Conclusions: ChatGPT marks a significant improvement in natural language processing models on the tasks of medical question answering. By performing at a greater than 60% threshold on the NBME-Free-Step-1 data set, we show that the model achieves the equivalent of a passing score for a third-year medical student. Additionally, we highlight ChatGPT’s capacity to provide logic and informational context across the majority of answers. These facts taken together make a compelling case for the potential applications of ChatGPT as an interactive medical education tool to support learning. %M 36753318 %R 10.2196/45312 %U https://mededu.jmir.org/2023/1/e45312 %U https://doi.org/10.2196/45312 %U http://www.ncbi.nlm.nih.gov/pubmed/36753318
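The final abstract reports per-dataset accuracies for ChatGPT (44/100, 42/100, 56/87, and 59/102) and treats roughly 60% on NBME-Free-Step-1 as a passing-level threshold. The following is a minimal sketch, assuming Python with SciPy, of how such counts could be summarized with exact binomial 95% confidence intervals; comparing the point estimates against the 60% threshold is an illustrative addition, not the authors' analysis.

    # Sketch: per-dataset accuracy with exact binomial 95% confidence intervals.
    # Correct/total counts are taken from the abstract above; the summary is illustrative only.
    from scipy.stats import binomtest

    datasets = {
        "AMBOSS-Step1": (44, 100),
        "AMBOSS-Step2": (42, 100),
        "NBME-Free-Step1": (56, 87),
        "NBME-Free-Step2": (59, 102),
    }
    PASSING_THRESHOLD = 0.60  # approximate threshold cited in the abstract

    for name, (correct, total) in datasets.items():
        ci = binomtest(correct, total).proportion_ci(confidence_level=0.95)
        position = "above" if correct / total > PASSING_THRESHOLD else "below"
        print(f"{name}: {correct}/{total} = {correct / total:.1%}, "
              f"95% CI {ci.low:.1%}-{ci.high:.1%} (point estimate {position} {PASSING_THRESHOLD:.0%})")

Under this sketch, only the NBME-Free-Step-1 point estimate exceeds the 60% mark, mirroring the abstract's claim about passing-level performance on that data set.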