TY  - JOUR
AU  - Sutera, Philip
AU  - Bhatia, Rohini
AU  - Lin, Timothy
AU  - Chang, Leslie
AU  - Brown, Andrea
AU  - Jagsi, Reshma
PY  - 2025/6/24
TI  - Generative AI in Medicine: Pioneering Progress or Perpetuating Historical Inaccuracies? Cross-Sectional Study Evaluating Implicit Bias
JO  - JMIR AI
SP  - e56891
VL  - 4
KW  - Artificial Intelligence
KW  - generative artificial intelligence
KW  - workforce diversity
KW  - bias
KW  - historical inequity
KW  - social inequity
KW  - implicit bias
KW  - AI bias
N2  - Background: Generative artificial intelligence (gAI) models, such as DALL-E 2, are promising tools that can generate novel images or artwork based on text input. However, caution is warranted, as these tools generate information based on historical data and are thus at risk of propagating past learned inequities. Women in medicine have routinely been underrepresented in academic and clinical medicine and the stereotype of a male physician persists. Objective: The primary objective is to evaluate implicit bias among gAI across medical specialties. Methods: To evaluate for potential implicit bias, 100 photographs for each medical specialty were generated using the gAI platform DALL-E2. For each specialty, DALL-E2 was queried with ?An American [specialty name].? Our primary endpoint was to compare the gender distribution of gAI photos to the current distribution in the United States. Our secondary endpoint included evaluating the racial distribution. gAI photos were classified according to perceived gender and race based on a unanimous consensus among a diverse group of medical residents. The proportion of gAI women subjects was compared for each medical specialty to the most recent Association of American Medical Colleges report for physician workforce and active residents using ?2 analysis. Results: A total of 1900 photos across 19 medical specialties were generated. Compared to physician workforce data, AI significantly overrepresented women in 7/19 specialties and underrepresented women in 6/19 specialties. Women were significantly underrepresented compared to the physician workforce by 18%, 18%, and 27% in internal medicine, family medicine, and pediatrics, respectively. Compared to current residents, AI significantly underrepresented women in 12/19 specialties, ranging from 10% to 36%. Additionally, women represented <50% of the demographic for 17/19 specialties by gAI. Conclusions: gAI created a sample population of physicians that underrepresented women when compared to both the resident and active physician workforce. Steps must be taken to train datasets in order to represent the diversity of the incoming physician workforce. 
UR  - https://ai.jmir.org/2025/1/e56891
UR  - http://dx.doi.org/10.2196/56891
ID  - info:doi/10.2196/56891
ER  - 

TY  - JOUR
AU  - Luo, Xuexing
AU  - Li, Yiyuan
AU  - Xu, Jing
AU  - Zheng, Zhong
AU  - Ying, Fangtian
AU  - Huang, Guanghui
PY  - 2025/6/23
TI  - AI in Medical Questionnaires: Innovations, Diagnosis, and Implications
JO  - J Med Internet Res
SP  - e72398
VL  - 27
KW  - artificial intelligence
KW  - AI
KW  - medical questionnaires
KW  - questionnaire-based prediction
KW  - questionnaire development
KW  - diagnostic accuracy
UR  - https://www.jmir.org/2025/1/e72398
UR  - http://dx.doi.org/10.2196/72398
UR  - http://www.ncbi.nlm.nih.gov/pubmed/40549427
ID  - info:doi/10.2196/72398
ER  - 

TY  - JOUR
AU  - Mehta, Seysha
AU  - Haddad, N. Eliot
AU  - Burke, Bhavsar Indira
AU  - Majors, K. Alana
AU  - Maeda, Rie
AU  - Burke, M. Sean
AU  - Deshpande, Abhishek
AU  - Nowacki, S. Amy
AU  - Lindenmeyer, C. Christina
AU  - Mehta, Neil
PY  - 2025/6/16
TI  - Assessment of Large Language Model Performance on Medical School Essay-Style Concept Appraisal Questions: Exploratory Study
JO  - JMIR Med Educ
SP  - e72034
VL  - 11
KW  - essay-type questions
KW  - large language models
KW  - generative AI
KW  - Microsoft Copilot
KW  - artificial intelligence
UR  - https://mededu.jmir.org/2025/1/e72034
UR  - http://dx.doi.org/10.2196/72034
ID  - info:doi/10.2196/72034
ER  - 

TY  - JOUR
AU  - Choi, (Anna) ?yung-Eun
AU  - Fitzek, Sebastian
PY  - 2025/6/13
TI  - User and Provider Experiences With Health Education Chatbots: Qualitative Systematic Review
JO  - JMIR Hum Factors
SP  - e60205
VL  - 12
KW  - chatbot
KW  - health education
KW  - behavior change
KW  - user experience
KW  - privacy concerns
KW  - personalization
KW  - qualitative research.
N2  - Background: Chatbots, as dialog-based platforms, have the potential to transform health education and behavior-change interventions. Despite the growing use of chatbots, qualitative insights into user and provider experiences remain underexplored, particularly with respect to experiences and perceptions, adoption factors, and the role of theoretical frameworks in design. Objective: This systematic review of qualitative evidence aims to address three key research questions (RQs): (RQ1) user and provider experiences; (RQ2) facilitators and barriers to adoption; and (RQ3) role of theoretical frameworks. Methods: We systematically searched PubMed, the Cochrane Library, and ScienceDirect from January 1, 2018, to October 1, 2023, for English- or German-language, peer-reviewed qualitative or mixed methods studies. Studies were included if they examined users? or providers? experiences with chatbots in health education or behavior-change contexts. Two reviewers independently screened titles, abstracts, and full texts (Cohen ?=0.82). We used the Joanna Briggs Institute Critical Appraisal Checklist for quality assessment and conducted a reflexive thematic analysis following Braun and Clarke?s framework. Results: Among the 1754 records identified, 27 studies from 10 countries met the inclusion criteria, encompassing 241 qualitative-only participants and 10,802 mixed method participants (657 contributing qualitative data). For RQ1, users emphasized empathy and emotional connection. For RQ2, accessibility and ease of use emerged as facilitators, whereas trust deficits, technical glitches, and cultural misalignment were key barriers. For RQ3, the integration of behavior-change theories emerged as underutilized despite their potential to increase motivation. Conclusions: Chatbots demonstrate strong potential for health education and behavior-change interventions but must address privacy and trust issues, embed robust theoretical underpinnings, and overcome adoption barriers to fully realize their impact. Future directions should include evaluations of cultural adaptability and rigorous ethical considerations in chatbot design. Trial Registration: OSF Registries osf.io/4px23; https://osf.io/4px23 
UR  - https://humanfactors.jmir.org/2025/1/e60205
UR  - http://dx.doi.org/10.2196/60205
ID  - info:doi/10.2196/60205
ER  - 

TY  - JOUR
AU  - Will, John
AU  - Gupta, Mahin
AU  - Zaretsky, Jonah
AU  - Dowlath, Aliesha
AU  - Testa, Paul
AU  - Feldman, Jonah
PY  - 2025/6/4
TI  - Enhancing the Readability of Online Patient Education Materials Using Large Language Models: Cross-Sectional Study
JO  - J Med Internet Res
SP  - e69955
VL  - 27
KW  - patient education
KW  - health literacy
KW  - artificial intelligence
KW  - readability
KW  - health education
N2  - Background: Online accessible patient education materials (PEMs) are essential for patient empowerment. However, studies have shown that these materials often exceed the recommended sixth-grade reading level, making them difficult for many patients to understand. Large language models (LLMs) have the potential to simplify PEMs into more readable educational content. Objective: We sought to evaluate whether 3 LLMs (ChatGPT [OpenAI], Gemini [Google], and Claude [Anthropic PBC]) can optimize the readability of PEMs to the recommended reading level without compromising accuracy. Methods: This cross-sectional study used 60 randomly selected PEMs available online from 3 websites. We prompted LLMs to simplify the reading level of online PEMs. The primary outcome was the readability of the original online PEMs compared with the LLM-simplified versions. Readability scores were calculated using 4 validated indices Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index, and Simple Measure of Gobbledygook Index. Accuracy and understandability were also assessed as balancing measures, with understandability measured using the Patient Education Materials Assessment Tool-Understandability (PEMAT-U). Results: The original readability scores for the American Heart Association (AHA), American Cancer Society (ACS), and American Stroke Association (ASA) websites were above the recommended sixth-grade level, with mean grade level scores of 10.7,10.0, and 9.6, respectively. After optimization by the LLMs, readability scores significantly improved across all 3 websites when compared with the original text. Compared with the original website, Wilcoxon signed rank test showed ChatGPT improved the readability to 7.6 from 10.1 (P<.001); Gemini, to 6.6 (P<.001); and Claude, to 5.6 (P<.001). Word counts were significantly reduced by all LLMs, with a decrease from a mean range of 410.9-953.9 words to a mean range of 201.9-248.1 words. None of the ChatGPT LLM-simplified PEMs were inaccurate, while 3.3% of Gemini and Claude LLM-simplified PEMs were inaccurate. Baseline understandability scores, as measured by PEMAT-U, were preserved across all LLM-simplified versions. Conclusions: This cross-sectional study demonstrates that LLMs have the potential to significantly enhance the readability of online PEMs while maintaining accuracy and understandability, making them more accessible to a broader audience. However, variability in model performance and demonstrated inaccuracies underscore the need for human review of LLM output. Further study is needed to explore advanced LLM techniques and models trained for medical content. 
UR  - https://www.jmir.org/2025/1/e69955
UR  - http://dx.doi.org/10.2196/69955
UR  - http://www.ncbi.nlm.nih.gov/pubmed/40465378
ID  - info:doi/10.2196/69955
ER  - 

TY  - JOUR
AU  - Abouzeid, Enjy
AU  - Wassef, Rita
AU  - Jawwad, Ayesha
AU  - Harris, Patricia
PY  - 2025/5/30
TI  - Chatbots? Role in Generating Single Best Answer Questions for Undergraduate Medical Student Assessment: Comparative Analysis
JO  - JMIR Med Educ
SP  - e69521
VL  - 11
KW  - artificial intelligence
KW  - assessment
KW  - Bing
KW  - ChatGPT
KW  - Gemini
KW  - medical education
KW  - single best answer
N2  - Background: Programmatic assessment supports flexible learning and individual progression but challenges educators to develop frequent assessments reflecting different competencies. The continuous creation of large volumes of assessment items, in a consistent format and comparatively restricted time, is laborious. The application of technological innovations, including artificial intelligence (AI), has been tried to address this challenge. A major concern raised is the validity of the information produced by AI tools, and if not properly verified, it can produce inaccurate and therefore inappropriate assessments. Objective: This study was designed to examine the content validity and consistency of different AI chatbots in creating single best answer (SBA) questions, a refined format of multiple choice questions better suited to assess higher levels of knowledge, for undergraduate medical students. Methods: This study followed 3 steps. First, 3 researchers used a unified prompt script to generate 10 SBA questions across 4 chatbot platforms. Second, assessors evaluated the chatbot outputs for consistency by identifying similarities and differences between users and across chatbots. With 3 assessors and 10 learning objectives, the maximum possible score for any individual chatbot was 30. Third, 7 assessors internally moderated the questions using a rating scale developed by the research team to evaluate scientific accuracy and educational quality. Results: In response to the prompts, all chatbots generated 10 questions each, except Bing, which failed to respond to 1 prompt. ChatGPT-4 exhibited the highest variation in question generation but did not fully satisfy the ?cover test.? Gemini performed well across most evaluation criteria, except for item balance, and relied heavily on the vignette for answers but showed a preference for one answer option. Bing scored low in most evaluation areas but generated appropriately structured lead-in questions. SBA questions from GPT-3.5, Gemini, and ChatGPT-4 had similar Item Content Validity Index and Scale Level Content Validity Index values, while the Krippendorff alpha coefficient was low (0.016). Bing performed poorly in content clarity, overall validity, and item construction accuracy. A 2-way ANOVA without replication revealed statistically significant differences among chatbots and domains (P<.05). However, the Tukey-Kramer HSD (honestly significant difference) post hoc test showed no significant pairwise differences between individual chatbots, as all comparisons had P values >.05 and overlapping CIs. Conclusions: AI chatbots can aid the production of questions aligned with learning objectives, and individual chatbots have their own strengths and weaknesses. Nevertheless, all require expert evaluation to ensure their suitability for use. Using AI to generate SBA prompts us to reconsider Bloom?s taxonomy of the cognitive domain, which traditionally positions creation as the highest level of cognition. 
UR  - https://mededu.jmir.org/2025/1/e69521
UR  - http://dx.doi.org/10.2196/69521
ID  - info:doi/10.2196/69521
ER  - 

TY  - JOUR
AU  - Kühne, Simon
AU  - Jacobsen, Jannes
AU  - Legewie, Nicolas
AU  - Dollmann, Jörg
PY  - 2025/5/27
TI  - Attitudes Toward AI Usage in Patient Health Care: Evidence From a Population Survey Vignette Experiment
JO  - J Med Internet Res
SP  - e70179
VL  - 27
KW  - artificial intelligence
KW  - trust
KW  - public attitudes
KW  - patient health care
KW  - survey research
KW  - vignette experiment
N2  - Background: The integration of artificial intelligence (AI) holds substantial potential to alter diagnostics and treatment in health care settings. However, public attitudes toward AI, including trust and risk perception, are key to its ethical and effective adoption. Despite growing interest, empirical research on the factors shaping public support for AI in health care (particularly in large-scale, representative contexts) remains limited. Objective: This study aimed to investigate public attitudes toward AI in patient health care, focusing on how AI attributes (autonomy, costs, reliability, and transparency) shape perceptions of support, risk, and personalized care. In addition, it examines the moderating role of sociodemographic characteristics (gender, age, educational level, migration background, and subjective health status) in these evaluations. Our study offers novel insights into the relative importance of AI system characteristics for public attitudes and acceptance. Methods: We conducted a factorial vignette experiment with a probability-based survey of 3030 participants from Germany?s general population. Respondents were presented with hypothetical scenarios involving AI applications in diagnosis and treatment in a hospital setting. Linear regression models assessed the relative influence of AI attributes on the dependent variables (support, risk perception, and personalized care), with additional subgroup analyses to explore heterogeneity by sociodemographic characteristics. Results: Mean values between 4.2 and 4.4 on a 1-7 scale indicate a generally neutral to slightly negative stance toward AI integration in terms of general support, risk perception, and personalized care expectations, with responses spanning the full scale from strong support to strong opposition. Among the 4 dimensions, reliability emerges as the most influential factor (percentage of explained variance [EV] of up to 10.5%). Respondents expect AI to not only prevent errors but also exceed current reliability standards while strongly disapproving of nontraceable systems (transparency is another important factor, percentage of EV of up to 4%). Costs and autonomy play a comparatively minor role (percentage of EVs of up to 1.5% and 1.3%), with preferences favoring collaborative AI systems over autonomous ones, and higher costs generally leading to rejection. Heterogeneity analysis reveals limited sociodemographic differences, with education and migration background influencing attitudes toward transparency and autonomy, and gender differences primarily affecting cost-related perceptions. Overall, attitudes do not substantially differ between AI applications in diagnosis versus treatment. Conclusions: Our study fills a critical research gap by identifying the key factors that shape public trust and acceptance of AI in health care, particularly reliability, transparency, and patient-centered approaches. Our findings provide evidence-based recommendations for policy makers, health care providers, and AI developers to enhance trust and accountability, key concerns often overlooked in system development and real-world applications. The study highlights the need for targeted policy and educational initiatives to support the responsible integration of AI in patient care. 
UR  - https://www.jmir.org/2025/1/e70179
UR  - http://dx.doi.org/10.2196/70179
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/70179
ER  - 

TY  - JOUR
AU  - Oftring, S. Zoe
AU  - Deutsch, Kim
AU  - Tolks, Daniel
AU  - Jungmann, Florian
AU  - Kuhn, Sebastian
PY  - 2025/5/26
TI  - Novel Blended Learning on Artificial Intelligence for Medical Students: Qualitative Interview Study
JO  - JMIR Med Educ
SP  - e65220
VL  - 11
KW  - digital transformation
KW  - artificial intelligence
KW  - clinical AI
KW  - chatbot
KW  - digital literacy
KW  - medical education
KW  - medical students
KW  - medical curriculum
KW  - qualitative content analysis
KW  - medical studies
N2  - Background: Artificial intelligence (AI) systems are becoming increasingly relevant in everyday clinical practice, with Food and Drug Administration?approved AI solutions now available in many specialties. This development has far-reaching implications for doctors and the future medical profession, highlighting the need for both practicing physicians and medical students to acquire the knowledge, skills, and attitudes necessary to effectively use and evaluate these technologies. Currently, however, there is limited experience with AI-focused curricular training and continuing education. Objective: This paper first introduces a novel blended learning curriculum including one module on AI for medical students in Germany. Second, this paper presents findings from a qualitative postcourse evaluation of students? knowledge and attitudes toward AI and their overall perception of the course. Methods: Clinical-year medical students can attend a 5-day elective course called ?Medicine in the Digital Age,? which includes one dedicated AI module alongside 4 others on digital doctor-patient communication; digital health applications and smart devices; telemedicine; and virtual/augmented reality and robotics. After course completion, participants were interviewed in semistructured small group interviews. The interview guide was developed deductively from existing evidence and research questions compiled by our group. A subset of interview questions focused on students? knowledge, skills, and attitudes regarding medical AI, and their overall course assessment. Responses were analyzed using Mayring?s qualitative content analysis. This paper reports on the subset of students? statements about their perception and attitudes toward AI and the elective?s general evaluation. Results: We conducted a total of 18 group interviews, in which all 35 (100%) participants (female=11, male=24) from 3 consecutive course runs participated. This produced a total of 214 statements on AI, which were assigned to the 3 main categories ?Areas of Application,? ?Future Work,? and ?Critical Reflection.? The findings indicate that students have a nuanced and differentiated understanding of AI. Additionally, 610 statements concerned the elective?s overall assessment, demonstrating great learning benefits and high levels of acceptance of the teaching concept. All 35 students would recommend the elective to peers. Conclusions: The evaluation demonstrated that the AI module effectively generates competences regarding AI technology, fosters a critical perspective, and prepares medical students to engage with the technology in a differentiated manner. The curriculum is feasible, beneficial, and highly accepted among students, suggesting it could serve as a teaching model for other medical institutions. Given the growing number and impact of medical AI applications, there is a pressing need for more AI-focused curricula and further research on their educational impact. 
UR  - https://mededu.jmir.org/2025/1/e65220
UR  - http://dx.doi.org/10.2196/65220
ID  - info:doi/10.2196/65220
ER  - 

TY  - JOUR
AU  - Cross, Joseph
AU  - Kayalackakom, Tarron
AU  - Robinson, E. Raymond
AU  - Vaughans, Andrea
AU  - Sebastian, Roopa
AU  - Hood, Ricardo
AU  - Lewis, Courtney
AU  - Devaraju, Sumanth
AU  - Honnavar, Prasanna
AU  - Naik, Sheetal
AU  - Joseph, Jillwin
AU  - Anand, Nikhilesh
AU  - Mohammed, Abdalla
AU  - Johnson, Asjah
AU  - Cohen, Eliran
AU  - Adeniji, Teniola
AU  - Nnenna Nnaji, Aisling
AU  - George, Elizabeth Julia
PY  - 2025/5/20
TI  - Assessing ChatGPT?s Capability as a New Age Standardized Patient: Qualitative Study
JO  - JMIR Med Educ
SP  - e63353
VL  - 11
KW  - medical education
KW  - standardized patient
KW  - AI
KW  - ChatGPT
KW  - virtual patient
KW  - assessment
KW  - standardized patients
KW  - LLM
KW  - effectiveness
KW  - medical school
KW  - qualitative
KW  - flexibility
KW  - diagnostic
N2  - Background: Standardized patients (SPs) have been crucial in medical education, offering realistic patient interactions to students. Despite their benefits, SP training is resource-intensive and access can be limited. Advances in artificial intelligence (AI), particularly with large language models such as ChatGPT, present new opportunities for virtual SPs, potentially addressing these limitations. Objectives: This study aims to assess medical students? perceptions and experiences of using ChatGPT as an SP and to evaluate ChatGPT?s effectiveness in performing as a virtual SP in a medical school setting. Methods: This qualitative study, approved by the American University of Antigua Institutional Review Board, involved 9 students (5 females and 4 males, aged 22?48 years) from the American University of Antigua College of Medicine. Students were observed during a live role-play, interacting with ChatGPT as an SP using a predetermined prompt. A structured 15-question survey was administered before and after the interaction. Thematic analysis was conducted on the transcribed and coded responses, with inductive category formation. Results: Thematic analysis identified key themes preinteraction including technology limitations (eg, prompt engineering difficulties), learning efficacy (eg, potential for personalized learning and reduced interview stress), verisimilitude (eg, absence of visual cues), and trust (eg, concerns about AI accuracy). Postinteraction, students noted improvements in prompt engineering, some alignment issues (eg, limited responses on sensitive topics), maintained learning efficacy (eg, convenience and repetition), and continued verisimilitude challenges (eg, lack of empathy and nonverbal cues). No significant trust issues were reported postinteraction. Despite some limitations, students found ChatGPT as a valuable supplement to traditional SPs, enhancing practice flexibility and diagnostic skills. Conclusions: ChatGPT can effectively augment traditional SPs in medical education, offering accessible, flexible practice opportunities. However, it cannot fully replace human SPs due to limitations in verisimilitude and prompt engineering challenges. Integrating prompt engineering into medical curricula and continuous advancements in AI are recommended to enhance the use of virtual SPs. 
UR  - https://mededu.jmir.org/2025/1/e63353
UR  - http://dx.doi.org/10.2196/63353
ID  - info:doi/10.2196/63353
ER  - 

TY  - JOUR
AU  - Ong, Chwen Qi
AU  - Ang, Chin-Siang
AU  - Lai, Ming Nai
AU  - Atun, Rifat
AU  - Car, Josip
PY  - 2025/5/9
TI  - Differences in Expert Perspectives on AI Training in Medical Education: Secondary Analysis of a Multinational Delphi Study
JO  - J Med Internet Res
SP  - e72186
VL  - 27
KW  - artificial intelligence
KW  - medical education
KW  - competencies
KW  - health professions education
KW  - Delphi study
KW  - global health education
KW  - AI
UR  - https://www.jmir.org/2025/1/e72186
UR  - http://dx.doi.org/10.2196/72186
ID  - info:doi/10.2196/72186
ER  - 

TY  - JOUR
AU  - Rodrigues Alessi, Mateus
AU  - Gomes, Augusto Heitor
AU  - Oliveira, Gabriel
AU  - Lopes de Castro, Matheus
AU  - Grenteski, Fabiano
AU  - Miyashiro, Leticia
AU  - do Valle, Camila
AU  - Tozzini Tavares da Silva, Leticia
AU  - Okamoto, Cristina
PY  - 2025/5/8
TI  - Comparative Performance of Medical Students, ChatGPT-3.5 and ChatGPT-4.0 in Answering Questions From a Brazilian National Medical Exam: Cross-Sectional Questionnaire Study
JO  - JMIR AI
SP  - e66552
VL  - 4
KW  - artificial intelligence
KW  - intelligent systems
KW  - biomedical technology
KW  - medical ethics
KW  - exam questions
KW  - academic performance
KW  - AI
KW  - ethics
KW  - medical education
KW  - ChatGPT
KW  - medical exam
KW  - accuracy
KW  - medical student
KW  - observational study
KW  - medical data
KW  - medical school
N2  - Background: Artificial intelligence has advanced significantly in various fields, including medicine, where tools like ChatGPT (GPT) have demonstrated remarkable capabilities in interpreting and synthesizing complex medical data. Since its launch in 2019, GPT has evolved, with version 4.0 offering enhanced processing power, image interpretation, and more accurate responses. In medicine, GPT has been used for diagnosis, research, and education, achieving significant milestones like passing the United States Medical Licensing Examination. Recent studies show that GPT 4.0 outperforms earlier versions and even medical students on medical exams. Objective: This study aimed to evaluate and compare the performance of GPT versions 3.5 and 4.0 on Brazilian Progress Tests (PT) from 2021 to 2023, analyzing their accuracy compared to medical students. Methods: A cross-sectional observational study was conducted using 333 multiple-choice questions from the PT, excluding questions with images and those nullified or repeated. All questions were presented sequentially without modification to their structure. The performance of GPT versions was compared using statistical methods and medical students? scores were included for context. Results: There was a statistically significant difference in total performance scores across the 2021, 2022, and 2023 exams between GPT-3.5 and GPT-4.0 (P=.03). However, this significance did not remain after Bonferroni correction. On average, GPT v3.5 scored 68.4%, whereas v4.0 achieved 87.2%, reflecting an absolute improvement of 18.8% and a relative increase of 27.4% in accuracy. When broken down by subject, the average scores for GPT-3.5 and GPT-4.0, respectively, were as follows: surgery (73.5% vs 88.0%, P=.03), basic sciences (77.5% vs 96.2%, P=.004), internal medicine (61.5% vs 75.1%, P=.14), gynecology and obstetrics (64.5% vs 94.8%, P=.002), pediatrics (58.5% vs 80.0%, P=.02), and public health (77.8% vs 89.6%, P=.02). After Bonferroni correction, only basic sciences and gynecology and obstetrics retained statistically significant differences. Conclusions: GPT-4.0 demonstrates superior accuracy compared to its predecessor in answering medical questions on the PT. These results are similar to other studies, indicating that we are approaching a new revolution in medicine. 
UR  - https://ai.jmir.org/2025/1/e66552
UR  - http://dx.doi.org/10.2196/66552
ID  - info:doi/10.2196/66552
ER  - 

TY  - JOUR
AU  - Tolentino, Raymond
AU  - Hersson-Edery, Fanny
AU  - Yaffe, Mark
AU  - Abbasgholizadeh-Rahimi, Samira
PY  - 2025/4/25
TI  - AIFM-ed Curriculum Framework for Postgraduate Family Medicine Education on Artificial Intelligence: Mixed Methods Study
JO  - JMIR Med Educ
SP  - e66828
VL  - 11
KW  - artificial intelligence
KW  - family medicine
KW  - curriculum
KW  - framework
KW  - postgraduate education
N2  - Background: As health care moves to a more digital environment, there is a growing need to train future family doctors on the clinical uses of artificial intelligence (AI). However, family medicine training in AI has often been inconsistent or lacking. Objective: The aim of the study is to develop a curriculum framework for family medicine postgraduate education on AI called ?Artificial Intelligence Training in Postgraduate Family Medicine Education? (AIFM-ed). Methods: First, we conducted a comprehensive scoping review on existing AI education frameworks guided by the methodological framework developed by Arksey and O?Malley and Joanna Briggs Institute methodological framework for scoping reviews. We adhered to the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) checklist for reporting the results. Next, 2 national expert panels were conducted. Panelists included family medicine educators and residents knowledgeable in AI from family medicine residency programs across Canada. Participants were purposively sampled, and panels were held via Zoom, recorded, and transcribed. Data were analyzed using content analysis. We followed the Standards for Reporting Qualitative Research for panels. Results: An integration of the scoping review results and 2 panel discussions of 14 participants led to the development of the AIFM-ed curriculum framework for AI training in postgraduate family medicine education with five key elements: (1) need and purpose of the curriculum, (2) learning objectives, (3) curriculum content, (4) organization of curriculum content, and (5) implementation aspects of the curriculum. Conclusions: Using the results of this study, we developed the AIFM-ed curriculum framework for AI training in postgraduate family medicine education. This framework serves as a structured guide for integrating AI competencies into medical education, ensuring that future family physicians are equipped with the necessary skills to use AI effectively in their clinical practice. Future research should focus on the validation and implementation of the AIFM-ed framework within family medicine education. Institutions also are encouraged to consider adapting the AIFM-ed framework within their own programs, tailoring it to meet the specific needs of their trainees and health care environments. 
UR  - https://mededu.jmir.org/2025/1/e66828
UR  - http://dx.doi.org/10.2196/66828
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/66828
ER  - 

TY  - JOUR
AU  - Arvai, Nora
AU  - Katonai, Gellért
AU  - Mesko, Bertalan
PY  - 2025/4/23
TI  - Health Care Professionals? Concerns About Medical AI and Psychological Barriers and Strategies for Successful Implementation: Scoping Review
JO  - J Med Internet Res
SP  - e66986
VL  - 27
KW  - artificial intelligence
KW  - attitudes
KW  - health care professionals
KW  - digital health
KW  - fear
KW  - anxiety
KW  - reluctance
KW  - resistance
KW  - skepticism
N2  - Background: The rapid progress in the development of artificial intelligence (AI) is having a substantial impact on health care (HC) delivery and the physician-patient interaction. Objective: This scoping review aims to offer a thorough analysis of the current status of integrating AI into medical practice as well as the apprehensions expressed by HC professionals (HCPs) over its application. Methods: This scoping review used the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines to examine articles that investigated the apprehensions of HCPs about medical AI. Following the application of inclusion and exclusion criteria, 32 of an initial 217 studies (14.7%) were selected for the final analysis. We aimed to develop an attitude range that accurately captured the unfavorable emotions of HCPs toward medical AI. We achieved this by selecting attitudes and ranking them on a scale that represented the degree of aversion, ranging from mild skepticism to intense fear. The ultimate depiction of the scale was as follows: skepticism, reluctance, anxiety, resistance, and fear. Results: In total, 3 themes were identified through the process of thematic analysis. National surveys performed among HCPs aimed to comprehensively analyze their current emotions, worries, and attitudes regarding the integration of AI in the medical industry. Research on technostress primarily focused on the psychological dimensions of adopting AI, examining the emotional reactions, fears, and difficulties experienced by HCPs when they encountered AI-powered technology. The high-level perspective category included studies that took a broad and comprehensive approach to evaluating overarching themes, trends, and implications related to the integration of AI technology in HC. We discovered 15 sources of attitudes, which we classified into 2 distinct groups: intrinsic and extrinsic. The intrinsic group focused on HCPs? inherent professional identity, encompassing their tasks and capacities. Conversely, the extrinsic group pertained to their patients and the influence of AI on patient care. Next, we examined the shared themes and made suggestions to potentially tackle the problems discovered. Ultimately, we analyzed the results in relation to the attitude scale, assessing the degree to which each attitude was portrayed. Conclusions: The solution to addressing resistance toward medical AI appears to be centered on comprehensive education, the implementation of suitable legislation, and the delineation of roles. Addressing these issues may foster acceptance and optimize AI integration, enhancing HC delivery while maintaining ethical standards. Due to the current prominence and extensive research on regulation, we suggest that further research could be dedicated to education. 
UR  - https://www.jmir.org/2025/1/e66986
UR  - http://dx.doi.org/10.2196/66986
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/66986
ER  - 

TY  - JOUR
AU  - Teng, Joyce
AU  - Novoa, Andres Roberto
AU  - Aleshin, Alexandrovna Maria
AU  - Lester, Jenna
AU  - Seiger, Kira
AU  - Dzuali, Fiatsogbe
AU  - Daneshjou, Roxana
PY  - 2025/4/11
TI  - Authors? Reply: Enhancing AI-Driven Medical Translations: Considerations for Language Concordance
JO  - JMIR Med Educ
SP  - e71721
VL  - 11
KW  - ChatGPT
KW  - artificial intelligence
KW  - language
KW  - translation
KW  - health care disparity
KW  - natural language model
KW  - survey
KW  - patient education
KW  - accessibility
KW  - preference
KW  - human language
KW  - communication
KW  - language-concordant care
UR  - https://mededu.jmir.org/2025/1/e71721
UR  - http://dx.doi.org/10.2196/71721
ID  - info:doi/10.2196/71721
ER  - 

TY  - JOUR
AU  - Quon, Stephanie
AU  - Zhou, Sarah
PY  - 2025/4/11
TI  - Enhancing AI-Driven Medical Translations: Considerations for Language Concordance
JO  - JMIR Med Educ
SP  - e70420
VL  - 11
KW  - letter to the editor
KW  - ChatGPT
KW  - AI
KW  - artificial intelligence
KW  - language
KW  - translation
KW  - health care disparity
KW  - natural language model
KW  - survey
KW  - patient education
KW  - accessibility
KW  - preference
KW  - human language
KW  - communication
KW  - language-concordant care
UR  - https://mededu.jmir.org/2025/1/e70420
UR  - http://dx.doi.org/10.2196/70420
ID  - info:doi/10.2196/70420
ER  - 

TY  - JOUR
AU  - K?yak, Selim Yavuz
AU  - Kononowicz, A. Andrzej
PY  - 2025/4/4
TI  - Using a Hybrid of AI and Template-Based Method in Automatic Item Generation to Create Multiple-Choice Questions in Medical Education: Hybrid AIG
JO  - JMIR Form Res
SP  - e65726
VL  - 9
KW  - automatic item generation
KW  - ChatGPT
KW  - artificial intelligence
KW  - large language models
KW  - medical education
KW  - AI
KW  - hybrid
KW  - template-based method
KW  - hybrid AIG
KW  - mixed-method
KW  - multiple-choice question
KW  - multiple-choice
KW  - human-AI collaboration
KW  - human-AI
KW  - algorithm
KW  - expert
N2  - Background: Template-based automatic item generation (AIG) is more efficient than traditional item writing but it still heavily relies on expert effort in model development. While nontemplate-based AIG, leveraging artificial intelligence (AI), offers efficiency, it faces accuracy challenges. Medical education, a field that relies heavily on both formative and summative assessments with multiple choice questions, is in dire need of AI-based support for the efficient automatic generation of items. Objective: We aimed to propose a hybrid AIG to demonstrate whether it is possible to generate item templates using AI in the field of medical education. Methods: This is a mixed-methods methodological study with proof-of-concept elements. We propose the hybrid AIG method as a structured series of interactions between a human subject matter expert and AI, designed as a collaborative authoring effort. The method leverages AI to generate item models (templates) and cognitive models to combine the advantages of the two AIG approaches. To demonstrate how to create item models using hybrid AIG, we used 2 medical multiple-choice questions: one on respiratory infections in adults and another on acute allergic reactions in the pediatric population. Results: The hybrid AIG method we propose consists of 7 steps. The first 5 steps are performed by an expert in a customized AI environment. These involve providing a parent item, identifying elements for manipulation, selecting options and assigning values to elements, and generating the cognitive model. After a final expert review (Step 6), the content in the template can be used for item generation through a traditional (non-AI) software (Step 7). We showed that AI is capable of generating item templates for AIG under the control of a human expert in only 10 minutes. Leveraging AI in template development made it less challenging. Conclusions: The hybrid AIG method transcends the traditional template-based approach by marrying the ?art? that comes from AI as a ?black box? with the ?science? of algorithmic generation under the oversight of expert as a ?marriage registrar?. It does not only capitalize on the strengths of both approaches but also mitigates their weaknesses, offering a human-AI collaboration to increase efficiency in medical education. 
UR  - https://formative.jmir.org/2025/1/e65726
UR  - http://dx.doi.org/10.2196/65726
ID  - info:doi/10.2196/65726
ER  - 

TY  - JOUR
AU  - Cook, A. David
AU  - Overgaard, Joshua
AU  - Pankratz, Shane V.
AU  - Del Fiol, Guilherme
AU  - Aakre, A. Chris
PY  - 2025/4/4
TI  - Virtual Patients Using Large Language Models: Scalable, Contextualized Simulation of Clinician-Patient Dialogue With Feedback
JO  - J Med Internet Res
SP  - e68486
VL  - 27
KW  - simulation training
KW  - natural language processing
KW  - computer-assisted instruction
KW  - clinical decision-making
KW  - clinical reasoning
KW  - machine learning
KW  - virtual patient
KW  - natural language generation
N2  - Background: Virtual patients (VPs) are computer screen?based simulations of patient-clinician encounters. VP use is limited by cost and low scalability. Objective: We aimed to show that VPs powered by large language models (LLMs) can generate authentic dialogues, accurately represent patient preferences, and provide personalized feedback on clinical performance. We also explored using LLMs to rate the quality of dialogues and feedback. Methods: We conducted an intrinsic evaluation study rating 60 VP-clinician conversations. We used carefully engineered prompts to direct OpenAI?s generative pretrained transformer (GPT) to emulate a patient and provide feedback. Using 2 outpatient medicine topics (chronic cough diagnosis and diabetes management), each with permutations representing different patient preferences, we created 60 conversations (dialogues plus feedback): 48 with a human clinician and 12 ?self-chat? dialogues with GPT role-playing both the VP and clinician. Primary outcomes were dialogue authenticity and feedback quality, rated using novel instruments for which we conducted a validation study collecting evidence of content, internal structure (reproducibility), relations with other variables, and response process. Each conversation was rated by 3 physicians and by GPT. Secondary outcomes included user experience, bias, patient preferences represented in the dialogues, and conversation features that influenced authenticity. Results: The average cost per conversation was US $0.51 for GPT-4.0-Turbo and US $0.02 for GPT-3.5-Turbo. Mean (SD) conversation ratings, maximum 6, were overall dialogue authenticity 4.7 (0.7), overall user experience 4.9 (0.7), and average feedback quality 4.7 (0.6). For dialogues created using GPT-4.0-Turbo, physician ratings of patient preferences aligned with intended preferences in 20 to 47 of 48 dialogues (42%-98%). Subgroup comparisons revealed higher ratings for dialogues using GPT-4.0-Turbo versus GPT-3.5-Turbo and for human-generated versus self-chat dialogues. Feedback ratings were similar for human-generated versus GPT-generated ratings, whereas authenticity ratings were lower. We did not perceive bias in any conversation. Dialogue features that detracted from authenticity included that GPT was verbose or used atypical vocabulary (93/180, 51.7% of conversations), was overly agreeable (n=56, 31%), repeated the question as part of the response (n=47, 26%), was easily convinced by clinician suggestions (n=35, 19%), or was not disaffected by poor clinician performance (n=32, 18%). For feedback, detractors included excessively positive feedback (n=42, 23%), failure to mention important weaknesses or strengths (n=41, 23%), or factual inaccuracies (n=39, 22%). Regarding validation of dialogue and feedback scores, items were meticulously developed (content evidence), and we confirmed expected relations with other variables (higher ratings for advanced LLMs and human-generated dialogues). Reproducibility was suboptimal, due largely to variation in LLM performance rather than rater idiosyncrasies. Conclusions: LLM-powered VPs can simulate patient-clinician dialogues, demonstrably represent patient preferences, and provide personalized performance feedback. This approach is scalable, globally accessible, and inexpensive. LLM-generated ratings of feedback quality are similar to human ratings. 
UR  - https://www.jmir.org/2025/1/e68486
UR  - http://dx.doi.org/10.2196/68486
UR  - http://www.ncbi.nlm.nih.gov/pubmed/39854611
ID  - info:doi/10.2196/68486
ER  - 

TY  - JOUR
AU  - Zhang, Manlin
AU  - Zhao, Tianyu
PY  - 2025/4/2
TI  - Citation Accuracy Challenges Posed by Large Language Models
JO  - JMIR Med Educ
SP  - e72998
VL  - 11
KW  - chatGPT
KW  - medical education
KW  - Saudi Arabia
KW  - perceptions
KW  - knowledge
KW  - medical students
KW  - faculty
KW  - chatbot
KW  - qualitative study
KW  - artificial intelligence
KW  - AI
KW  - AI-based tools
KW  - universities
KW  - thematic analysis
KW  - learning
KW  - satisfaction
KW  - LLM
KW  - large language model
UR  - https://mededu.jmir.org/2025/1/e72998
UR  - http://dx.doi.org/10.2196/72998
ID  - info:doi/10.2196/72998
ER  - 

TY  - JOUR
AU  - Temsah, Mohamad-Hani
AU  - Al-Eyadhy, Ayman
AU  - Jamal, Amr
AU  - Alhasan, Khalid
AU  - Malki, H. Khalid
PY  - 2025/4/2
TI  - Authors? Reply: Citation Accuracy Challenges Posed by Large Language Models
JO  - JMIR Med Educ
SP  - e73698
VL  - 11
KW  - ChatGPT
KW  - Gemini
KW  - DeepSeek
KW  - medical education
KW  - AI
KW  - artificial intelligence
KW  - Saudi Arabia
KW  - perceptions
KW  - medical students
KW  - faculty
KW  - LLM
KW  - chatbot
KW  - qualitative study
KW  - thematic analysis
KW  - satisfaction
KW  - RAG retrieval-augmented generation
UR  - https://mededu.jmir.org/2025/1/e73698
UR  - http://dx.doi.org/10.2196/73698
ID  - info:doi/10.2196/73698
ER  - 

TY  - JOUR
AU  - Yan, Zelin
AU  - Liu, Jingwen
AU  - Fan, Yihong
AU  - Lu, Shiyuan
AU  - Xu, Dingting
AU  - Yang, Yun
AU  - Wang, Honggang
AU  - Mao, Jie
AU  - Tseng, Hou-Chiang
AU  - Chang, Tao-Hsing
AU  - Chen, Yan
PY  - 2025/3/31
TI  - Ability of ChatGPT to Replace Doctors in Patient Education: Cross-Sectional Comparative Analysis of Inflammatory Bowel Disease
JO  - J Med Internet Res
SP  - e62857
VL  - 27
KW  - AI-assisted
KW  - patient education
KW  - inflammatory bowel disease
KW  - artificial intelligence
KW  - ChatGPT
KW  - patient communities
KW  - social media
KW  - disease management
KW  - readability
KW  - online health information
KW  - conversational agents
N2  - Background: Although large language models (LLMs) such as ChatGPT show promise for providing specialized information, their quality requires further evaluation. This is especially true considering that these models are trained on internet text and the quality of health-related information available online varies widely. Objective: The aim of this study was to evaluate the performance of ChatGPT in the context of patient education for individuals with chronic diseases, comparing it with that of industry experts to elucidate its strengths and limitations. Methods: This evaluation was conducted in September 2023 by analyzing the responses of ChatGPT and specialist doctors to questions posed by patients with inflammatory bowel disease (IBD). We compared their performance in terms of subjective accuracy, empathy, completeness, and overall quality, as well as readability to support objective analysis. Results: In a series of 1578 binary choice assessments, ChatGPT was preferred in 48.4% (95% CI 45.9%-50.9%) of instances. There were 12 instances where ChatGPT?s responses were unanimously preferred by all evaluators, compared with 17 instances for specialist doctors. In terms of overall quality, there was no significant difference between the responses of ChatGPT (3.98, 95% CI 3.93-4.02) and those of specialist doctors (3.95, 95% CI 3.90-4.00; t524=0.95, P=.34), both being considered ?good.? Although differences in accuracy (t521=0.48, P=.63) and empathy (t511=2.19, P=.03) lacked statistical significance, the completeness of textual output (t509=9.27, P<.001) was a distinct advantage of the LLM (ChatGPT). In the sections of the questionnaire where patients and doctors responded together (Q223-Q242), ChatGPT demonstrated inferior performance (t36=2.91, P=.006). Regarding readability, no statistical difference was found between the responses of specialist doctors (median: 7th grade; Q1: 4th grade; Q3: 8th grade) and those of ChatGPT (median: 7th grade; Q1: 7th grade; Q3: 8th grade) according to the Mann-Whitney U test (P=.09). The overall quality of ChatGPT?s output exhibited strong correlations with other subdimensions (with empathy: r=0.842; with accuracy: r=0.839; with completeness: r=0.795), and there was also a high correlation between the subdimensions of accuracy and completeness (r=0.762). Conclusions: ChatGPT demonstrated more stable performance across various dimensions. Its output of health information content is more structurally sound, addressing the issue of variability in the information from individual specialist doctors. ChatGPT?s performance highlights its potential as an auxiliary tool for health information, despite limitations such as artificial intelligence hallucinations. It is recommended that patients be involved in the creation and evaluation of health information to enhance the quality and relevance of the information. 
UR  - https://www.jmir.org/2025/1/e62857
UR  - http://dx.doi.org/10.2196/62857
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/62857
ER  - 

TY  - JOUR
AU  - Madrid, Julian
AU  - Diehl, Philipp
AU  - Selig, Mischa
AU  - Rolauffs, Bernd
AU  - Hans, Patricius Felix
AU  - Busch, Hans-Jörg
AU  - Scheef, Tobias
AU  - Benning, Leo
PY  - 2025/3/21
TI  - Performance of Plug-In Augmented ChatGPT and Its Ability to Quantify Uncertainty: Simulation Study on the German Medical Board Examination
JO  - JMIR Med Educ
SP  - e58375
VL  - 11
KW  - medical education
KW  - artificial intelligence
KW  - generative AI
KW  - large language model
KW  - LLM
KW  - ChatGPT
KW  - GPT-4
KW  - board licensing examination
KW  - professional education
KW  - examination
KW  - student
KW  - experimental
KW  - bootstrapping
KW  - confidence interval
N2  - Background: The GPT-4 is a large language model (LLM) trained and fine-tuned on an extensive dataset. After the public release of its predecessor in November 2022, the use of LLMs has seen a significant spike in interest, and a multitude of potential use cases have been proposed. In parallel, however, important limitations have been outlined. Particularly, current LLMs encounter limitations, especially in symbolic representation and accessing contemporary data. The recent version of GPT-4, alongside newly released plugin features, has been introduced to mitigate some of these limitations. Objective: Before this background, this work aims to investigate the performance of GPT-3.5, GPT-4, GPT-4 with plugins, and GPT-4 with plugins using pretranslated English text on the German medical board examination. Recognizing the critical importance of quantifying uncertainty for LLM applications in medicine, we furthermore assess this ability and develop a new metric termed ?confidence accuracy? to evaluate it. Methods: We used GPT-3.5, GPT-4, GPT-4 with plugins, and GPT-4 with plugins and translation to answer questions from the German medical board examination. Additionally, we conducted an analysis to assess how the models justify their answers, the accuracy of their responses, and the error structure of their answers. Bootstrapping and CIs were used to evaluate the statistical significance of our findings. Results: This study demonstrated that available GPT models, as LLM examples, exceeded the minimum competency threshold established by the German medical board for medical students to obtain board certification to practice medicine. Moreover, the models could assess the uncertainty in their responses, albeit exhibiting overconfidence. Additionally, this work unraveled certain justification and reasoning structures that emerge when GPT generates answers. Conclusions: The high performance of GPTs in answering medical questions positions it well for applications in academia and, potentially, clinical practice. Its capability to quantify uncertainty in answers suggests it could be a valuable artificial intelligence agent within the clinical decision-making loop. Nevertheless, significant challenges must be addressed before artificial intelligence agents can be robustly and safely implemented in the medical domain. 
UR  - https://mededu.jmir.org/2025/1/e58375
UR  - http://dx.doi.org/10.2196/58375
ID  - info:doi/10.2196/58375
ER  - 

TY  - JOUR
AU  - Andalib, Saman
AU  - Spina, Aidin
AU  - Picton, Bryce
AU  - Solomon, S. Sean
AU  - Scolaro, A. John
AU  - Nelson, M. Ariana
PY  - 2025/3/21
TI  - Using AI to Translate and Simplify Spanish Orthopedic Medical Text: Instrument Validation Study
JO  - JMIR AI
SP  - e70222
VL  - 4
KW  - large language models
KW  - LLM
KW  - patient education
KW  - translation
KW  - bilingual evaluation understudy
KW  - GPT-4
KW  - Google Translate
N2  - Background: Language barriers contribute significantly to health care disparities in the United States, where a sizable proportion of patients are exclusively Spanish speakers. In orthopedic surgery, such barriers impact both patients? comprehension of and patients? engagement with available resources. Studies have explored the utility of large language models (LLMs) for medical translation but have yet to robustly evaluate artificial intelligence (AI)?driven translation and simplification of orthopedic materials for Spanish speakers. Objective: This study used the bilingual evaluation understudy (BLEU) method to assess translation quality and investigated the ability of AI to simplify patient education materials (PEMs) in Spanish. Methods: PEMs (n=78) from the American Academy of Orthopaedic Surgery were translated from English to Spanish, using 2 LLMs (GPT-4 and Google Translate). The BLEU methodology was applied to compare AI translations with professionally human-translated PEMs. The Friedman test and Dunn multiple comparisons test were used to statistically quantify differences in translation quality. A readability analysis and feature analysis were subsequently performed to evaluate text simplification success and the impact of English text features on BLEU scores. The capability of an LLM to simplify medical language written in Spanish was also assessed. Results: As measured by BLEU scores, GPT-4 showed moderate success in translating PEMs into Spanish but was less successful than Google Translate. Simplified PEMs demonstrated improved readability when compared to original versions (P<.001) but were unable to reach the targeted grade level for simplification. The feature analysis revealed that the total number of syllables and average number of syllables per sentence had the highest impact on BLEU scores. GPT-4 was able to significantly reduce the complexity of medical text written in Spanish (P<.001). Conclusions: Although Google Translate outperformed GPT-4 in translation accuracy, LLMs, such as GPT-4, may provide significant utility in translating medical texts into Spanish and simplifying such texts. We recommend considering a dual approach?using Google Translate for translation and GPT-4 for simplification?to improve medical information accessibility and orthopedic surgery education among Spanish-speaking patients. 
UR  - https://ai.jmir.org/2025/1/e70222
UR  - http://dx.doi.org/10.2196/70222
ID  - info:doi/10.2196/70222
ER  - 

TY  - JOUR
AU  - Tseng, Liang-Wei
AU  - Lu, Yi-Chin
AU  - Tseng, Liang-Chi
AU  - Chen, Yu-Chun
AU  - Chen, Hsing-Yu
PY  - 2025/3/19
TI  - Performance of ChatGPT-4 on Taiwanese Traditional Chinese Medicine Licensing Examinations: Cross-Sectional Study
JO  - JMIR Med Educ
SP  - e58897
VL  - 11
KW  - artificial intelligence
KW  - AI language understanding tools
KW  - ChatGPT
KW  - natural language processing
KW  - machine learning
KW  - Chinese medicine license exam
KW  - Chinese medical licensing examination
KW  - medical education
KW  - traditional Chinese medicine
KW  - large language model
N2  - Background: The integration of artificial intelligence (AI), notably ChatGPT, into medical education, has shown promising results in various medical fields. Nevertheless, its efficacy in traditional Chinese medicine (TCM) examinations remains understudied. Objective: This study aims to (1) assess the performance of ChatGPT on the TCM licensing examination in Taiwan and (2) evaluate the model?s explainability in answering TCM-related questions to determine its suitability as a TCM learning tool. Methods: We used the GPT-4 model to respond to 480 questions from the 2022 TCM licensing examination. This study compared the performance of the model against that of licensed TCM doctors using 2 approaches, namely direct answer selection and provision of explanations before answer selection. The accuracy and consistency of AI-generated responses were analyzed. Moreover, a breakdown of question characteristics was performed based on the cognitive level, depth of knowledge, types of questions, vignette style, and polarity of questions. Results: ChatGPT achieved an overall accuracy of 43.9%, which was lower than that of 2 human participants (70% and 78.4%). The analysis did not reveal a significant correlation between the accuracy of the model and the characteristics of the questions. An in-depth examination indicated that errors predominantly resulted from a misunderstanding of TCM concepts (55.3%), emphasizing the limitations of the model with regard to its TCM knowledge base and reasoning capability. Conclusions: Although ChatGPT shows promise as an educational tool, its current performance on TCM licensing examinations is lacking. This highlights the need for enhancing AI models with specialized TCM training and suggests a cautious approach to utilizing AI for TCM education. Future research should focus on model improvement and the development of tailored educational applications to support TCM learning. 
UR  - https://mededu.jmir.org/2025/1/e58897
UR  - http://dx.doi.org/10.2196/58897
ID  - info:doi/10.2196/58897
ER  - 

TY  - JOUR
AU  - Pastrak, Mila
AU  - Kajitani, Sten
AU  - Goodings, James Anthony
AU  - Drewek, Austin
AU  - LaFree, Andrew
AU  - Murphy, Adrian
PY  - 2025/3/12
TI  - Evaluation of ChatGPT Performance on Emergency Medicine Board Examination Questions: Observational Study
JO  - JMIR AI
SP  - e67696
VL  - 4
KW  - artificial intelligence
KW  - ChatGPT-4
KW  - medical education
KW  - emergency medicine
KW  - examination
KW  - examination preparation
N2  - Background: The ever-evolving field of medicine has highlighted the potential for ChatGPT as an assistive platform. However, its use in medical board examination preparation and completion remains unclear. Objective: This study aimed to evaluate the performance of a custom-modified version of ChatGPT-4, tailored with emergency medicine board examination preparatory materials (Anki flashcard deck), compared to its default version and previous iteration (3.5). The goal was to assess the accuracy of ChatGPT-4 answering board-style questions and its suitability as a tool to aid students and trainees in standardized examination preparation. Methods: A comparative analysis was conducted using a random selection of 598 questions from the Rosh In-Training Examination Question Bank. The subjects of the study included three versions of ChatGPT: the Default, a Custom, and ChatGPT-3.5. The accuracy, response length, medical discipline subgroups, and underlying causes of error were analyzed. Results: The Custom version did not demonstrate a significant improvement in accuracy over the Default version (P=.61), although both significantly outperformed ChatGPT-3.5 (P<.001). The Default version produced significantly longer responses than the Custom version, with the mean (SD) values being 1371 (444) and 929 (408), respectively (P<.001). Subgroup analysis revealed no significant difference in the performance across different medical subdisciplines between the versions (P>.05 in all cases). Both the versions of ChatGPT-4 had similar underlying error types (P>.05 in all cases) and had a 99% predicted probability of passing while ChatGPT-3.5 had an 85% probability. Conclusions: The findings suggest that while newer versions of ChatGPT exhibit improved performance in emergency medicine board examination preparation, specific enhancement with a comprehensive Anki flashcard deck on the topic does not significantly impact accuracy. The study highlights the potential of ChatGPT-4 as a tool for medical education, capable of providing accurate support across a wide range of topics in emergency medicine in its default form. 
UR  - https://ai.jmir.org/2025/1/e67696
UR  - http://dx.doi.org/10.2196/67696
ID  - info:doi/10.2196/67696
ER  - 

TY  - JOUR
AU  - Monzon, Noahlana
AU  - Hays, Alan Franklin
PY  - 2025/3/11
TI  - Leveraging Generative Artificial Intelligence to Improve Motivation and Retrieval in Higher Education Learners
JO  - JMIR Med Educ
SP  - e59210
VL  - 11
KW  - educational technology
KW  - retrieval practice
KW  - flipped classroom
KW  - cognitive engagement
KW  - personalized learning
KW  - generative artificial intelligence
KW  - higher education
KW  - university education
KW  - learners
KW  - instructors
KW  - curriculum structure
KW  - learning
KW  - technologies
KW  - innovation
KW  - academic misconduct
KW  - gamification
KW  - self-directed
KW  - socio-economic disparities
KW  - interactive approach
KW  - medical education
KW  - chatGPT
KW  - machine learning
KW  - AI
KW  - large language models
UR  - https://mededu.jmir.org/2025/1/e59210
UR  - http://dx.doi.org/10.2196/59210
ID  - info:doi/10.2196/59210
ER  - 

TY  - JOUR
AU  - Zada, Troy
AU  - Tam, Natalie
AU  - Barnard, Francois
AU  - Van Sittert, Marlize
AU  - Bhat, Venkat
AU  - Rambhatla, Sirisha
PY  - 2025/3/10
TI  - Medical Misinformation in AI-Assisted Self-Diagnosis: Development of a Method (EvalPrompt) for Analyzing Large Language Models
JO  - JMIR Form Res
SP  - e66207
VL  - 9
KW  - ChatGPT
KW  - health care
KW  - LLM
KW  - misinformation
KW  - self-diagnosis
KW  - large language model
N2  - Background: Rapid integration of large language models (LLMs) in health care is sparking global discussion about their potential to revolutionize health care quality and accessibility. At a time when improving health care quality and access remains a critical concern for countries worldwide, the ability of these models to pass medical examinations is often cited as a reason to use them for medical training and diagnosis. However, the impact of their inevitable use as a self-diagnostic tool and their role in spreading health care misinformation has not been evaluated. Objective: This study aims to assess the effectiveness of LLMs, particularly ChatGPT, from the perspective of an individual self-diagnosing to better understand the clarity, correctness, and robustness of the models. Methods: We propose the comprehensive testing methodology evaluation of LLM prompts (EvalPrompt). This evaluation methodology uses multiple-choice medical licensing examination questions to evaluate LLM responses. Experiment 1 prompts ChatGPT with open-ended questions to mimic real-world self-diagnosis use cases, and experiment 2 performs sentence dropout on the correct responses from experiment 1 to mimic self-diagnosis with missing information. Humans then assess the responses returned by ChatGPT for both experiments to evaluate the clarity, correctness, and robustness of ChatGPT. Results: In experiment 1, we found that ChatGPT-4.0 was deemed correct for 31% (29/94) of the questions by both nonexperts and experts, with only 34% (32/94) agreement between the 2 groups. Similarly, in experiment 2, which assessed robustness, 61% (92/152) of the responses continued to be categorized as correct by all assessors. As a result, in comparison to a passing threshold of 60%, ChatGPT-4.0 is considered incorrect and unclear, though robust. This indicates that sole reliance on ChatGPT-4.0 for self-diagnosis could increase the risk of individuals being misinformed. Conclusions: The results highlight the modest capabilities of LLMs, as their responses are often unclear and inaccurate. Any medical advice provided by LLMs should be cautiously approached due to the significant risk of misinformation. However, evidence suggests that LLMs are steadily improving and could potentially play a role in health care systems in the future. To address the issue of medical misinformation, there is a pressing need for the development of a comprehensive self-diagnosis dataset. This dataset could enhance the reliability of LLMs in medical applications by featuring more realistic prompt styles with minimal information across a broader range of medical fields. 
UR  - https://formative.jmir.org/2025/1/e66207
UR  - http://dx.doi.org/10.2196/66207
ID  - info:doi/10.2196/66207
ER  - 

TY  - JOUR
AU  - Kammies, Chamandra
AU  - Archer, Elize
AU  - Engel-Hills, Penelope
AU  - Volschenk, Mariette
PY  - 2025/3/6
TI  - Exploring Curriculum Considerations to Prepare Future Radiographers for an AI-Assisted Health Care Environment: Protocol for Scoping Review
JO  - JMIR Res Protoc
SP  - e60431
VL  - 14
KW  - artificial intelligence
KW  - machine learning
KW  - radiography
KW  - education
KW  - scoping review
N2  - Background: The use of artificial intelligence (AI) technologies in radiography practice is increasing. As this advanced technology becomes more embedded in radiography systems and clinical practice, the role of radiographers will evolve. In the context of these anticipated changes, it may be reasonable to expect modifications to the competencies and educational requirements of current and future practitioners to ensure successful AI adoption. Objective: The aim of this scoping review is to explore and synthesize the literature on the adjustments needed in the radiography curriculum to prepare radiography students for the demands of AI-assisted health care environments. Methods: Using the Joanna Briggs Institute methodology, an initial search was run in Scopus to determine whether the search strategy that was developed with a library specialist would capture the relevant literature by screening the title and abstract of the first 50 articles. Additional search terms identified in the articles were added to the search strategy. Next, EBSCOhost, PubMed, and Web of Science databases were searched. In total, 2 reviewers will independently review the title, abstract, and full-text articles according to the predefined inclusion and exclusion criteria, with conflicts resolved by a third reviewer. Results: The search results will be reported using the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) checklist. The final scoping review will present the data analysis as findings in tabular form and through narrative descriptions. The final database searches were completed in October 2024 and yielded 2224 records. Title and abstract screening of 1930 articles is underway after removing 294 duplicates. The scoping review is expected to be finalized by the end of March 2025. Conclusions: A scoping review aims to systematically map the evidence on the adjustments needed in the radiography curriculum to prepare radiography students for the integration of AI technologies in the health care environment. It is relevant to map the evidence because increased integration of AI-based technologies in clinical practice has been noted and changes in practice must be underpinned by appropriate education and training. The findings in this study will provide a better understanding of how the radiography curriculum should adapt to meet the educational needs of current and future radiographers to ensure competent and safe practice in response to AI technologies. Trial Registration: Open Science Framework 3nx2a; https://osf.io/3nx2a International Registered Report Identifier (IRRID): PRR1-10.2196/60431 
UR  - https://www.researchprotocols.org/2025/1/e60431
UR  - http://dx.doi.org/10.2196/60431
UR  - http://www.ncbi.nlm.nih.gov/pubmed/40053777
ID  - info:doi/10.2196/60431
ER  - 

TY  - JOUR
AU  - Prazeres, Filipe
PY  - 2025/3/5
TI  - ChatGPT?s Performance on Portuguese Medical Examination Questions: Comparative Analysis of ChatGPT-3.5 Turbo and ChatGPT-4o Mini
JO  - JMIR Med Educ
SP  - e65108
VL  - 11
KW  - ChatGPT-3.5 Turbo
KW  - ChatGPT-4o mini
KW  - medical examination
KW  - European Portuguese
KW  - AI performance evaluation
KW  - Portuguese
KW  - evaluation
KW  - medical examination questions
KW  - examination question
KW  - chatbot
KW  - ChatGPT
KW  - model
KW  - artificial intelligence
KW  - AI
KW  - GPT
KW  - LLM
KW  - NLP
KW  - natural language processing
KW  - machine learning
KW  - large language model
N2  - Background: Advancements in ChatGPT are transforming medical education by providing new tools for assessment and learning, potentially enhancing evaluations for doctors and improving instructional effectiveness. Objective: This study evaluates the performance and consistency of ChatGPT-3.5 Turbo and ChatGPT-4o mini in solving European Portuguese medical examination questions (2023 National Examination for Access to Specialized Training; Prova Nacional de Acesso à Formação Especializada [PNA]) and compares their performance to human candidates. Methods: ChatGPT-3.5 Turbo was tested on the first part of the examination (74 questions) on July 18, 2024, and ChatGPT-4o mini on the second part (74 questions) on July 19, 2024. Each model generated an answer using its natural language processing capabilities. To test consistency, each model was asked, ?Are you sure?? after providing an answer. Differences between the first and second responses of each model were analyzed using the McNemar test with continuity correction. A single-parameter t test compared the models? performance to human candidates. Frequencies and percentages were used for categorical variables, and means and CIs for numerical variables. Statistical significance was set at P<.05. Results: ChatGPT-4o mini achieved an accuracy rate of 65% (48/74) on the 2023 PNA examination, surpassing ChatGPT-3.5 Turbo. ChatGPT-4o mini outperformed medical candidates, while ChatGPT-3.5 Turbo had a more moderate performance. Conclusions: This study highlights the advancements and potential of ChatGPT models in medical education, emphasizing the need for careful implementation with teacher oversight and further research. 
UR  - https://mededu.jmir.org/2025/1/e65108
UR  - http://dx.doi.org/10.2196/65108
ID  - info:doi/10.2196/65108
ER  - 

TY  - JOUR
AU  - Doru, Berin
AU  - Maier, Christoph
AU  - Busse, Sophie Johanna
AU  - Lücke, Thomas
AU  - Schönhoff, Judith
AU  - Enax- Krumova, Elena
AU  - Hessler, Steffen
AU  - Berger, Maria
AU  - Tokic, Marianne
PY  - 2025/3/3
TI  - Detecting Artificial Intelligence?Generated Versus Human-Written Medical Student Essays: Semirandomized Controlled Study
JO  - JMIR Med Educ
SP  - e62779
VL  - 11
KW  - artificial intelligence
KW  - ChatGPT
KW  - large language models
KW  - textual analysis
KW  - writing style
KW  - AI
KW  - chatbot
KW  - LLMs
KW  - detection
KW  - authorship
KW  - medical student
KW  - linguistic quality
KW  - decision-making
KW  - logical coherence
N2  - Background: Large language models, exemplified by ChatGPT, have reached a level of sophistication that makes distinguishing between human- and artificial intelligence (AI)?generated texts increasingly challenging. This has raised concerns in academia, particularly in medicine, where the accuracy and authenticity of written work are paramount. Objective: This semirandomized controlled study aims to examine the ability of 2 blinded expert groups with different levels of content familiarity?medical professionals and humanities scholars with expertise in textual analysis?to distinguish between longer scientific texts in German written by medical students and those generated by ChatGPT. Additionally, the study sought to analyze the reasoning behind their identification choices, particularly the role of content familiarity and linguistic features. Methods: Between May and August 2023, a total of 35 experts (medical: n=22; humanities: n=13) were each presented with 2 pairs of texts on different medical topics. Each pair had similar content and structure: 1 text was written by a medical student, and the other was generated by ChatGPT (version 3.5, March 2023). Experts were asked to identify the AI-generated text and justify their choice. These justifications were analyzed through a multistage, interdisciplinary qualitative analysis to identify relevant textual features. Before unblinding, experts rated each text on 6 characteristics: linguistic fluency and spelling/grammatical accuracy, scientific quality, logical coherence, expression of knowledge limitations, formulation of future research questions, and citation quality. Univariate tests and multivariate logistic regression analyses were used to examine associations between participants? characteristics, their stated reasons for author identification, and the likelihood of correctly determining a text?s authorship. Results: Overall, in 48 out of 69 (70%) decision rounds, participants accurately identified the AI-generated texts, with minimal difference between groups (medical: 31/43, 72%; humanities: 17/26, 65%; odds ratio [OR] 1.37, 95% CI 0.5-3.9). While content errors had little impact on identification accuracy, stylistic features?particularly redundancy (OR 6.90, 95% CI 1.01-47.1), repetition (OR 8.05, 95% CI 1.25-51.7), and thread/coherence (OR 6.62, 95% CI 1.25-35.2)?played a crucial role in participants? decisions to identify a text as AI-generated. Conclusions: The findings suggest that both medical and humanities experts were able to identify ChatGPT-generated texts in medical contexts, with their decisions largely based on linguistic attributes. The accuracy of identification appears to be independent of experts? familiarity with the text content. As the decision-making process primarily relies on linguistic attributes?such as stylistic features and text coherence?further quasi-experimental studies using texts from other academic disciplines should be conducted to determine whether instructions based on these features can enhance lecturers? ability to distinguish between student-authored and AI-generated work. 
UR  - https://mededu.jmir.org/2025/1/e62779
UR  - http://dx.doi.org/10.2196/62779
UR  - http://www.ncbi.nlm.nih.gov/pubmed/40053752
ID  - info:doi/10.2196/62779
ER  - 

TY  - JOUR
AU  - Scherr, Riley
AU  - Spina, Aidin
AU  - Dao, Allen
AU  - Andalib, Saman
AU  - Halaseh, F. Faris
AU  - Blair, Sarah
AU  - Wiechmann, Warren
AU  - Rivera, Ronald
PY  - 2025/2/27
TI  - Novel Evaluation Metric and Quantified Performance of ChatGPT-4 Patient Management Simulations for Early Clinical Education: Experimental Study
JO  - JMIR Form Res
SP  - e66478
VL  - 9
KW  - medical school simulations
KW  - AI in medical education
KW  - preclinical curriculum
KW  - ChatGPT
KW  - ChatGPT-4
KW  - medical simulation
KW  - simulation
KW  - multimedia
KW  - feedback
KW  - medical education
KW  - medical student
KW  - clinical education
KW  - pilot study
KW  - patient management
N2  - Background: Case studies have shown ChatGPT can run clinical simulations at the medical student level. However, no data have assessed ChatGPT?s reliability in meeting desired simulation criteria such as medical accuracy, simulation formatting, and robust feedback mechanisms. Objective: This study aims to quantify ChatGPT?s ability to consistently follow formatting instructions and create simulations for preclinical medical student learners according to principles of medical simulation and multimedia educational technology. Methods: Using ChatGPT-4 and a prevalidated starting prompt, the authors ran 360 separate simulations of an acute asthma exacerbation. A total of 180 simulations were given correct answers and 180 simulations were given incorrect answers. ChatGPT was evaluated for its ability to adhere to basic simulation parameters (stepwise progression, free response, interactivity), advanced simulation parameters (autonomous conclusion, delayed feedback, comprehensive feedback), and medical accuracy (vignette, treatment updates, feedback). Significance was determined with ?² analyses using 95% CIs for odds ratios. Results: In total, 100% (n=360) of simulations met basic simulation parameters and were medically accurate. For advanced parameters, 55% (200/360) of all simulations delayed feedback, while the Correct arm (157/180, 87%) delayed feedback was significantly more than the Incorrect arm (43/180, 24%; P<.001). A total of 79% (285/360) of simulations concluded autonomously, and there was no difference between the Correct and Incorrect arms in autonomous conclusion (146/180, 81% and 139/180, 77%; P=.36). Overall, 78% (282/360) of simulations gave comprehensive feedback, and there was no difference between the Correct and Incorrect arms in comprehensive feedback (137/180, 76% and 145/180, 81%; P=.31). ChatGPT-4 was not significantly more likely to conclude simulations autonomously (P=.34) and provide comprehensive feedback (P=.27) when feedback was delayed compared to when feedback was not delayed. Conclusions: These simulations have the potential to be a reliable educational tool for simple simulations and can be evaluated by a novel 9-part metric. Per this metric, ChatGPT simulations performed perfectly on medical accuracy and basic simulation parameters. It performed well on comprehensive feedback and autonomous conclusion. Delayed feedback depended on the accuracy of user inputs. A simulation meeting one advanced parameter was not more likely to meet all advanced parameters. Further work must be done to ensure consistent performance across a broader range of simulation scenarios. 
UR  - https://formative.jmir.org/2025/1/e66478
UR  - http://dx.doi.org/10.2196/66478
ID  - info:doi/10.2196/66478
ER  - 

TY  - JOUR
AU  - Abouammoh, Noura
AU  - Alhasan, Khalid
AU  - Aljamaan, Fadi
AU  - Raina, Rupesh
AU  - Malki, H. Khalid
AU  - Altamimi, Ibraheem
AU  - Muaygil, Ruaim
AU  - Wahabi, Hayfaa
AU  - Jamal, Amr
AU  - Alhaboob, Ali
AU  - Assiri, Assad Rasha
AU  - Al-Tawfiq, A. Jaffar
AU  - Al-Eyadhy, Ayman
AU  - Soliman, Mona
AU  - Temsah, Mohamad-Hani
PY  - 2025/2/20
TI  - Perceptions and Earliest Experiences of Medical Students and Faculty With ChatGPT in Medical Education: Qualitative Study
JO  - JMIR Med Educ
SP  - e63400
VL  - 11
KW  - ChatGPT
KW  - medical education
KW  - Saudi Arabia
KW  - perceptions
KW  - knowledge
KW  - medical students
KW  - faculty
KW  - chatbot
KW  - qualitative study
KW  - artificial intelligence
KW  - AI
KW  - AI-based tools
KW  - universities
KW  - thematic analysis
KW  - learning
KW  - satisfaction
N2  - Background: With the rapid development of artificial intelligence technologies, there is a growing interest in the potential use of artificial intelligence?based tools like ChatGPT in medical education. However, there is limited research on the initial perceptions and experiences of faculty and students with ChatGPT, particularly in Saudi Arabia. Objective: This study aimed to explore the earliest knowledge, perceived benefits, concerns, and limitations of using ChatGPT in medical education among faculty and students at a leading Saudi Arabian university. Methods: A qualitative exploratory study was conducted in April 2023, involving focused meetings with medical faculty and students with varying levels of ChatGPT experience. A thematic analysis was used to identify key themes and subthemes emerging from the discussions. Results: Participants demonstrated good knowledge of ChatGPT and its functions. The main themes were perceptions of ChatGPT use, potential benefits, and concerns about ChatGPT in research and medical education. The perceived benefits included collecting and summarizing information and saving time and effort. However, concerns and limitations centered around the potential lack of critical thinking in the information provided, the ambiguity of references, limitations of access, trust in the output of ChatGPT, and ethical concerns. Conclusions: This study provides valuable insights into the perceptions and experiences of medical faculty and students regarding the use of newly introduced large language models like ChatGPT in medical education. While the benefits of ChatGPT were recognized, participants also expressed concerns and limitations requiring further studies for effective integration into medical education, exploring the impact of ChatGPT on learning outcomes, student and faculty satisfaction, and the development of critical thinking skills. 
UR  - https://mededu.jmir.org/2025/1/e63400
UR  - http://dx.doi.org/10.2196/63400
UR  - http://www.ncbi.nlm.nih.gov/pubmed/39977012
ID  - info:doi/10.2196/63400
ER  - 

TY  - JOUR
AU  - Potter, Alison
AU  - Munsch, Chris
AU  - Watson, Elaine
AU  - Hopkins, Emily
AU  - Kitromili, Sofia
AU  - O'Neill, Cameron Iain
AU  - Larbie, Judy
AU  - Niittymaki, Essi
AU  - Ramsay, Catriona
AU  - Burke, Joshua
AU  - Ralph, Neil
PY  - 2025/2/19
TI  - Identifying Research Priorities in Digital Education for Health Care: Umbrella Review and Modified Delphi Method Study
JO  - J Med Internet Res
SP  - e66157
VL  - 27
KW  - digital education
KW  - health professions education
KW  - research priorities
KW  - umbrella review
KW  - Delphi
KW  - artificial intelligence
KW  - AI
N2  - Background: In recent years, the use of digital technology in the education of health care professionals has surged, partly driven by the COVID-19 pandemic. However, there is still a need for focused research to establish evidence of its effectiveness. Objective: This study aimed to define the gaps in the evidence for the efficacy of digital education and to identify priority areas where future research has the potential to contribute to our understanding and use of digital education. Methods: We used a 2-stage approach to identify research priorities. First, an umbrella review of the recent literature (published between 2020 and 2023) was performed to identify and build on existing work. Second, expert consensus on the priority research questions was obtained using a modified Delphi method. Results: A total of 8857 potentially relevant papers were identified. Using the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) methodology, we included 217 papers for full review. All papers were either systematic reviews or meta-analyses. A total of 151 research recommendations were extracted from the 217 papers. These were analyzed, recategorized, and consolidated to create a final list of 63 questions. From these, a modified Delphi process with 42 experts was used to produce the top-five rated research priorities: (1) How do we measure the learning transfer from digital education into the clinical setting? (2) How can we optimize the use of artificial intelligence, machine learning, and deep learning to facilitate education and training? (3) What are the methodological requirements for high-quality rigorous studies assessing the outcomes of digital health education? (4) How does the design of digital education interventions (eg, format and modality) in health professionals? education and training curriculum affect learning outcomes? and (5) How should learning outcomes in the field of health professions? digital education be defined and standardized? Conclusions: This review provides a prioritized list of research gaps in digital education in health care, which will be of use to researchers, educators, education providers, and funding agencies. Additional proposals are discussed regarding the next steps needed to advance this agenda, aiming to promote meaningful and practical research on the use of digital technologies and drive excellence in health care education. 
UR  - https://www.jmir.org/2025/1/e66157
UR  - http://dx.doi.org/10.2196/66157
UR  - http://www.ncbi.nlm.nih.gov/pubmed/39969988
ID  - info:doi/10.2196/66157
ER  - 

TY  - JOUR
AU  - Chow, L. James C.
AU  - Li, Kay
PY  - 2025/2/18
TI  - Developing Effective Frameworks for Large Language Model?Based Medical Chatbots: Insights From Radiotherapy Education With ChatGPT
JO  - JMIR Cancer
SP  - e66633
VL  - 11
KW  - artificial intelligence
KW  - AI
KW  - AI in medical education
KW  - radiotherapy chatbot
KW  - large language models
KW  - LLMs
KW  - medical chatbots
KW  - health care AI
KW  - ethical AI in health care
KW  - personalized learning
KW  - natural language processing
KW  - NLP
KW  - radiotherapy education
KW  - AI-driven learning tools
UR  - https://cancer.jmir.org/2025/1/e66633
UR  - http://dx.doi.org/10.2196/66633
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/66633
ER  - 

TY  - JOUR
AU  - Ichikawa, Tsunagu
AU  - Olsen, Elizabeth
AU  - Vinod, Arathi
AU  - Glenn, Noah
AU  - Hanna, Karim
AU  - Lund, C. Gregg
AU  - Pierce-Talsma, Stacey
PY  - 2025/2/11
TI  - Generative Artificial Intelligence in Medical Education?Policies and Training at US Osteopathic Medical Schools: Descriptive Cross-Sectional Survey
JO  - JMIR Med Educ
SP  - e58766
VL  - 11
KW  - artificial intelligence
KW  - medical education
KW  - faculty development
KW  - policy
KW  - AI
KW  - training
KW  - United States
KW  - school
KW  - university
KW  - college
KW  - institution
KW  - osteopathic
KW  - osteopathy
KW  - curriculum
KW  - student
KW  - faculty
KW  - administrator
KW  - survey
KW  - cross-sectional
N2  - Background: Interest has recently increased in generative artificial intelligence (GenAI), a subset of artificial intelligence that can create new content. Although the publicly available GenAI tools are not specifically trained in the medical domain, they have demonstrated proficiency in a wide range of medical assessments. The future integration of GenAI in medicine remains unknown. However, the rapid availability of GenAI with a chat interface and the potential risks and benefits are the focus of great interest. As with any significant medical advancement or change, medical schools must adapt their curricula to equip students with the skills necessary to become successful physicians. Furthermore, medical schools must ensure that faculty members have the skills to harness these new opportunities to increase their effectiveness as educators. How medical schools currently fulfill their responsibilities is unclear. Colleges of Osteopathic Medicine (COMs) in the United States currently train a significant proportion of the total number of medical students. These COMs are in academic settings ranging from large public research universities to small private institutions. Therefore, studying COMs will offer a representative sample of the current GenAI integration in medical education. Objective: This study aims to describe the policies and training regarding the specific aspect of GenAI in US COMs, targeting students, faculty, and administrators. Methods: Web-based surveys were sent to deans and Student Government Association (SGA) presidents of the main campuses of fully accredited US COMs. The dean survey included questions regarding current and planned policies and training related to GenAI for students, faculty, and administrators. The SGA president survey included only those questions related to current student policies and training. Results: Responses were received from 81% (26/32) of COMs surveyed. This included 47% (15/32) of the deans and 50% (16/32) of the SGA presidents (with 5 COMs represented by both the deans and the SGA presidents). Most COMs did not have a policy on the student use of GenAI, as reported by the dean (14/15, 93%) and the SGA president (14/16, 88%). Of the COMs with no policy, 79% (11/14) had no formal plans for policy development. Only 1 COM had training for students, which focused entirely on the ethics of using GenAI. Most COMs had no formal plans to provide mandatory (11/14, 79%) or elective (11/15, 73%) training. No COM had GenAI policies for faculty or administrators. Eighty percent had no formal plans for policy development. Furthermore, 33.3% (5/15) of COMs had faculty or administrator GenAI training. Except for examination question development, there was no training to increase faculty or administrator capabilities and efficiency or to decrease their workload. Conclusions: The survey revealed that most COMs lack GenAI policies and training for students, faculty, and administrators. The few institutions with policies or training were extremely limited in scope. Most institutions without current training or policies had no formal plans for development. The lack of current policies and training initiatives suggests inadequate preparedness for integrating GenAI into the medical school environment, therefore, relegating the responsibility for ethical guidance and training to the individual COM member. 
UR  - https://mededu.jmir.org/2025/1/e58766
UR  - http://dx.doi.org/10.2196/58766
ID  - info:doi/10.2196/58766
ER  - 

TY  - JOUR
AU  - Burisch, Christian
AU  - Bellary, Abhav
AU  - Breuckmann, Frank
AU  - Ehlers, Jan
AU  - Thal, C. Serge
AU  - Sellmann, Timur
AU  - Gödde, Daniel
PY  - 2025/2/6
TI  - ChatGPT-4 Performance on German Continuing Medical Education?Friend or Foe (Trick or Treat)? Protocol for a Randomized Controlled Trial
JO  - JMIR Res Protoc
SP  - e63887
VL  - 14
KW  - ChatGPT
KW  - artificial intelligence
KW  - large language model
KW  - postgraduate education
KW  - continuing medical education
KW  - self-assessment program
N2  - Background: The increasing development and spread of artificial and assistive intelligence is opening up new areas of application not only in applied medicine but also in related fields such as continuing medical education (CME), which is part of the mandatory training program for medical doctors in Germany. This study aimed to determine whether medical laypersons can successfully conduct training courses specifically for physicians with the help of a large language model (LLM) such as ChatGPT-4. This study aims to qualitatively and quantitatively investigate the impact of using artificial intelligence (AI; specifically ChatGPT) on the acquisition of credit points in German postgraduate medical education. Objective: Using this approach, we wanted to test further possible applications of AI in the postgraduate medical education setting and obtain results for practical use. Depending on the results, the potential influence of LLMs such as ChatGPT-4 on CME will be discussed, for example, as part of a SWOT (strengths, weaknesses, opportunities, threats) analysis. Methods: We designed a randomized controlled trial, in which adult high school students attempt to solve CME tests across six medical specialties in three study arms in total with 18 CME training courses per study arm under different interventional conditions with varying amounts of permitted use of ChatGPT-4. Sample size calculation was performed including guess probability (20% correct answers, SD=40%; confidence level of 1??=.95/?=.05; test power of 1??=.95; P<.05). The study was registered at open scientific framework. Results: As of October 2024, the acquisition of data and students to participate in the trial is ongoing. Upon analysis of our acquired data, we predict our findings to be ready for publication as soon as early 2025. Conclusions: We aim to prove that the advances in AI, especially LLMs such as ChatGPT-4 have considerable effects on medical laypersons? ability to successfully pass CME tests. The implications that this holds on how the concept of continuous medical education requires reevaluation are yet to be contemplated. Trial Registration: OSF Registries 10.17605/OSF.IO/MZNUF; https://osf.io/mznuf International Registered Report Identifier (IRRID): PRR1-10.2196/63887 
UR  - https://www.researchprotocols.org/2025/1/e63887
UR  - http://dx.doi.org/10.2196/63887
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/63887
ER  - 

TY  - JOUR
AU  - Gazquez-Garcia, Javier
AU  - Sánchez-Bocanegra, Luis Carlos
AU  - Sevillano, Luis Jose
PY  - 2025/2/5
TI  - AI in the Health Sector: Systematic Review of Key Skills for Future Health Professionals
JO  - JMIR Med Educ
SP  - e58161
VL  - 11
KW  - artificial intelligence
KW  - healthcare competencies
KW  - systematic review
KW  - healthcare education
KW  - AI regulation
N2  - Background: Technological advancements have significantly reshaped health care, introducing digital solutions that enhance diagnostics and patient care. Artificial intelligence (AI) stands out, offering unprecedented capabilities in data analysis, diagnostic support, and personalized medicine. However, effectively integrating AI into health care necessitates specialized competencies among professionals, an area still in its infancy in terms of comprehensive literature and formalized training programs. Objective: This systematic review aims to consolidate the essential skills and knowledge health care professionals need to integrate AI into their clinical practice effectively, according to the published literature. Methods: We conducted a systematic review, across databases PubMed, Scopus, and Web of Science, of peer-reviewed literature that directly explored the required skills for health care professionals to integrate AI into their practice, published in English or Spanish from 2018 onward. Studies that did not refer to specific skills or training in digital health were not included, discarding those that did not directly contribute to understanding the competencies necessary to integrate AI into health care practice. Bias in the examined works was evaluated following Cochrane?s domain-based recommendations. Results: The initial database search yielded a total of 2457 articles. After deleting duplicates and screening titles and abstracts, 37 articles were selected for full-text review. Out of these, only 7 met all the inclusion criteria for this systematic review. The review identified a diverse range of skills and competencies, that we categorized into 14 key areas classified based on their frequency of appearance in the selected studies, including AI fundamentals, data analytics and management, and ethical considerations. Conclusions: Despite the broadening of search criteria to capture the evolving nature of AI in health care, the review underscores a significant gap in focused studies on the required competencies. Moreover, the review highlights the critical role of regulatory bodies such as the US Food and Drug Administration in facilitating the adoption of AI technologies by establishing trust and standardizing algorithms. Key areas were identified for developing competencies among health care professionals for the implementation of AI, including: AI fundamentals knowledge (more focused on assessing the accuracy, reliability, and validity of AI algorithms than on more technical abilities such as programming or mathematics), data analysis skills (including data acquisition, cleaning, visualization, management, and governance), and ethical and legal considerations. In an AI-enhanced health care landscape, the ability to humanize patient care through effective communication is paramount. This balance ensures that while AI streamlines tasks and potentially increases patient interaction time, health care professionals maintain a focus on compassionate care, thereby leveraging AI to enhance, rather than detract from, the patient experience.? 
UR  - https://mededu.jmir.org/2025/1/e58161
UR  - http://dx.doi.org/10.2196/58161
ID  - info:doi/10.2196/58161
ER  - 

TY  - JOUR
AU  - Elhassan, Elwaleed Safia
AU  - Sajid, Raihan Muhammad
AU  - Syed, Mariam Amina
AU  - Fathima, Afreen Sidrah
AU  - Khan, Shehroz Bushra
AU  - Tamim, Hala
PY  - 2025/1/30
TI  - Assessing Familiarity, Usage Patterns, and Attitudes of Medical Students Toward ChatGPT and Other Chat-Based AI Apps in Medical Education: Cross-Sectional Questionnaire Study
JO  - JMIR Med Educ
SP  - e63065
VL  - 11
KW  - ChatGPT
KW  - artificial intelligence
KW  - large language model
KW  - medical students
KW  - ethics
KW  - chat-based
KW  - AI apps
KW  - medical education
KW  - social media
KW  - attitude
KW  - AI
N2  - Background: There has been a rise in the popularity of ChatGPT and other chat-based artificial intelligence (AI) apps in medical education. Despite data being available from other parts of the world, there is a significant lack of information on this topic in medical education and research, particularly in Saudi Arabia. Objective: The primary objective of the study was to examine the familiarity, usage patterns, and attitudes of Alfaisal University medical students toward ChatGPT and other chat-based AI apps in medical education. Methods: This was a cross-sectional study conducted from October 8, 2023, through November 22, 2023. A questionnaire was distributed through social media channels to medical students at Alfaisal University who were 18 years or older. Current Alfaisal University medical students in years 1 through 6, of both genders, were exclusively targeted by the questionnaire. The study was approved by Alfaisal University Institutional Review Board. A ?2 test was conducted to assess the relationships between gender, year of study, familiarity, and reasons for usage. Results: A total of 293 responses were received, of which 95 (32.4%) were from men and 198 (67.6%) were from women. There were 236 (80.5%) responses from preclinical students and 57 (19.5%) from clinical students, respectively. Overall, males (n=93, 97.9%) showed more familiarity with ChatGPT compared to females (n=180, 90.09%; P=.03). Additionally, males also used Google Bard and Microsoft Bing ChatGPT more than females (P<.001). Clinical-year students used ChatGPT significantly more for general writing purposes compared to preclinical students (P=.005). Additionally, 136 (46.4%) students believed that using ChatGPT and other chat-based AI apps for coursework was ethical, 86 (29.4%) were neutral, and 71 (24.2%) considered it unethical (all Ps>.05). Conclusions: Familiarity with and usage of ChatGPT and other chat-based AI apps were common among the students of Alfaisal University. The usage patterns of these apps differ between males and females and between preclinical and clinical-year students. 
UR  - https://mededu.jmir.org/2025/1/e63065
UR  - http://dx.doi.org/10.2196/63065
ID  - info:doi/10.2196/63065
ER  - 

TY  - JOUR
AU  - Li, Rui
AU  - Wu, Tong
PY  - 2025/1/30
TI  - Evolution of Artificial Intelligence in Medical Education From 2000 to 2024: Bibliometric Analysis
JO  - Interact J Med Res
SP  - e63775
VL  - 14
KW  - artificial intelligence
KW  - medical education
KW  - bibliometric
KW  - citation trends
KW  - academic pattern
KW  - VOSviewer
KW  - Citespace
KW  - AI
N2  - Background: Incorporating artificial intelligence (AI) into medical education has gained significant attention for its potential to enhance teaching and learning outcomes. However, it lacks a comprehensive study depicting the academic performance and status of AI in the medical education domain. Objective: This study aims to analyze the social patterns, productive contributors, knowledge structure, and clusters since the 21st century. Methods: Documents were retrieved from the Web of Science Core Collection database from 2000 to 2024. VOSviewer, Incites, and Citespace were used to analyze the bibliometric metrics, which were categorized by country, institution, authors, journals, and keywords. The variables analyzed encompassed counts, citations, H-index, impact factor, and collaboration metrics. Results: Altogether, 7534 publications were initially retrieved and 2775 were included for analysis. The annual count and citation of papers exhibited exponential trends since 2018. The United States emerged as the lead contributor due to its high productivity and recognition levels. Stanford University, Johns Hopkins University, National University of Singapore, Mayo Clinic, University of Arizona, and University of Toronto were representative institutions in their respective fields. Cureus, JMIR Medical Education, Medical Teacher, and BMC Medical Education ranked as the top four most productive journals. The resulting heat map highlighted several high-frequency keywords, including performance, education, AI, and model. The citation burst time of terms revealed that AI technologies shifted from imaging processing (2000), augmented reality (2013), and virtual reality (2016) to decision-making (2020) and model (2021). Keywords such as mortality and robotic surgery persisted into 2023, suggesting the ongoing recognition and interest in these areas. Conclusions: This study provides valuable insights and guidance for researchers who are interested in educational technology, as well as recommendations for pioneering institutions and journal submissions. Along with the rapid growth of AI, medical education is expected to gain much more benefits. 
UR  - https://www.i-jmr.org/2025/1/e63775
UR  - http://dx.doi.org/10.2196/63775
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/63775
ER  - 

TY  - JOUR
AU  - Taira, Kazuya
AU  - Itaya, Takahiro
AU  - Yada, Shuntaro
AU  - Hiyama, Kirara
AU  - Hanada, Ayame
PY  - 2025/1/22
TI  - Impact of Attached File Formats on the Performance of ChatGPT-4 on the Japanese National Nursing Examination: Evaluation Study
JO  - JMIR Nursing
SP  - e67197
VL  - 8
KW  - nursing examination
KW  - machine learning
KW  - ML
KW  - artificial intelligence
KW  - AI
KW  - large language models
KW  - ChatGPT
KW  - generative AI
N2  - Abstract: This research letter discusses the impact of different file formats on ChatGPT-4?s performance on the Japanese National Nursing Examination, highlighting the need for standardized reporting protocols to enhance the integration of artificial intelligence in nursing education and practice. 
UR  - https://nursing.jmir.org/2025/1/e67197
UR  - http://dx.doi.org/10.2196/67197
ID  - info:doi/10.2196/67197
ER  - 

TY  - JOUR
AU  - Wei, Boxiong
PY  - 2025/1/16
TI  - Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis
JO  - JMIR Med Educ
SP  - e64284
VL  - 11
KW  - large language models
KW  - LLM
KW  - artificial intelligence
KW  - AI
KW  - GPT-4
KW  - radiology exams
KW  - medical education
KW  - diagnostics
KW  - medical training
KW  - radiology
KW  - ultrasound
N2  - Background: Artificial intelligence advancements have enabled large language models to significantly impact radiology education and diagnostic accuracy. Objective: This study evaluates the performance of mainstream large language models, including GPT-4, Claude, Bard, Tongyi Qianwen, and Gemini Pro, in radiology board exams. Methods: A comparative analysis of 150 multiple-choice questions from radiology board exams without images was conducted. Models were assessed on their accuracy for text-based questions and were categorized by cognitive levels and medical specialties using ?2 tests and ANOVA. Results: GPT-4 achieved the highest accuracy (83.3%, 125/150), significantly outperforming all other models. Specifically, Claude achieved an accuracy of 62% (93/150; P<.001), Bard 54.7% (82/150; P<.001), Tongyi Qianwen 70.7% (106/150; P=.009), and Gemini Pro 55.3% (83/150; P<.001). The odds ratios compared to GPT-4 were 0.33 (95% CI 0.18?0.60) for Claude, 0.24 (95% CI 0.13?0.44) for Bard, and 0.25 (95% CI 0.14?0.45) for Gemini Pro. Tongyi Qianwen performed relatively well with an accuracy of 70.7% (106/150; P=0.02) and had an odds ratio of 0.48 (95% CI 0.27?0.87) compared to GPT-4. Performance varied across question types and specialties, with GPT-4 excelling in both lower-order and higher-order questions, while Claude and Bard struggled with complex diagnostic questions. Conclusions: GPT-4 and Tongyi Qianwen show promise in medical education and training. The study emphasizes the need for domain-specific training datasets to enhance large language models? effectiveness in specialized fields like radiology. 
UR  - https://mededu.jmir.org/2025/1/e64284
UR  - http://dx.doi.org/10.2196/64284
ID  - info:doi/10.2196/64284
ER  - 

TY  - JOUR
AU  - Kim, JaeYong
AU  - Vajravelu, Narayan Bathri
PY  - 2025/1/16
TI  - Assessing the Current Limitations of Large Language Models in Advancing Health Care Education
JO  - JMIR Form Res
SP  - e51319
VL  - 9
KW  - large language model
KW  - generative pretrained transformer
KW  - health care education
KW  - health care delivery
KW  - artificial intelligence
KW  - LLM
KW  - ChatGPT
KW  - AI
UR  - https://formative.jmir.org/2025/1/e51319
UR  - http://dx.doi.org/10.2196/51319
ID  - info:doi/10.2196/51319
ER  - 

TY  - JOUR
AU  - Kaewboonlert, Naritsaret
AU  - Poontananggul, Jiraphon
AU  - Pongsuwan, Natthipong
AU  - Bhakdisongkhram, Gun
PY  - 2025/1/13
TI  - Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study
JO  - JMIR Med Educ
SP  - e58898
VL  - 11
KW  - accuracy
KW  - performance
KW  - artificial intelligence
KW  - AI
KW  - ChatGPT
KW  - large language model
KW  - LLM
KW  - difficulty index
KW  - basic medical science examination
KW  - cross-sectional study
KW  - medical education
KW  - datasets
KW  - assessment
KW  - medical science
KW  - tool
KW  - Google
N2  - Background: Artificial intelligence (AI) has become widely applied across many fields, including medical education. Content validation and its answers are based on training datasets and the optimization of each model. The accuracy of large language model (LLMs) in basic medical examinations and factors related to their accuracy have also been explored. Objective: We evaluated factors associated with the accuracy of LLMs (GPT-3.5, GPT-4, Google Bard, and Microsoft Bing) in answering multiple-choice questions from basic medical science examinations. Methods: We used questions that were closely aligned with the content and topic distribution of Thailand?s Step 1 National Medical Licensing Examination. Variables such as the difficulty index, discrimination index, and question characteristics were collected. These questions were then simultaneously input into ChatGPT (with GPT-3.5 and GPT-4), Microsoft Bing, and Google Bard, and their responses were recorded. The accuracy of these LLMs and the associated factors were analyzed using multivariable logistic regression. This analysis aimed to assess the effect of various factors on model accuracy, with results reported as odds ratios (ORs). Results: The study revealed that GPT-4 was the top-performing model, with an overall accuracy of 89.07% (95% CI 84.76%?92.41%), significantly outperforming the others (P<.001). Microsoft Bing followed with an accuracy of 83.69% (95% CI 78.85%?87.80%), GPT-3.5 at 67.02% (95% CI 61.20%?72.48%), and Google Bard at 63.83% (95% CI 57.92%?69.44%). The multivariable logistic regression analysis showed a correlation between question difficulty and model performance, with GPT-4 demonstrating the strongest association. Interestingly, no significant correlation was found between model accuracy and question length, negative wording, clinical scenarios, or the discrimination index for most models, except for Google Bard, which showed varying correlations. Conclusions: The GPT-4 and Microsoft Bing models demonstrated equal and superior accuracy compared to GPT-3.5 and Google Bard in the domain of basic medical science. The accuracy of these models was significantly influenced by the item?s difficulty index, indicating that the LLMs are more accurate when answering easier questions. This suggests that the more accurate models, such as GPT-4 and Bing, can be valuable tools for understanding and learning basic medical science concepts. 
UR  - https://mededu.jmir.org/2025/1/e58898
UR  - http://dx.doi.org/10.2196/58898
ID  - info:doi/10.2196/58898
ER  - 

TY  - JOUR
AU  - Rjoop, Anwar
AU  - Al-Qudah, Mohammad
AU  - Alkhasawneh, Raja
AU  - Bataineh, Nesreen
AU  - Abdaljaleel, Maram
AU  - Rjoub, A. Moayad
AU  - Alkhateeb, Mustafa
AU  - Abdelraheem, Mohammad
AU  - Al-Omari, Salem
AU  - Bani-Mari, Omar
AU  - Alkabalan, Anas
AU  - Altulaih, Saoud
AU  - Rjoub, Iyad
AU  - Alshimi, Rula
PY  - 2025/1/10
TI  - Awareness and Attitude Toward Artificial Intelligence Among Medical Students and Pathology Trainees: Survey Study
JO  - JMIR Med Educ
SP  - e62669
VL  - 11
KW  - artificial intelligence
KW  - AI
KW  - deep learning
KW  - medical schools
KW  - pathology
KW  - Jordan
KW  - medical education
KW  - awareness
KW  - attitude
KW  - medical students
KW  - pathology trainees
KW  - national survey study
KW  - medical practice
KW  - training
KW  - web-based survey
KW  - survey
KW  - questionnaire
N2  - Background: Artificial intelligence (AI) is set to shape the future of medical practice. The perspective and understanding of medical students are critical for guiding the development of educational curricula and training. Objective: This study aims to assess and compare medical AI-related attitudes among medical students in general medicine and in one of the visually oriented fields (pathology), along with illuminating their anticipated role of AI in the rapidly evolving landscape of AI-enhanced health care. Methods: This was a cross-sectional study that used a web-based survey composed of a closed-ended questionnaire. The survey addressed medical students at all educational levels across the 5 public medical schools, along with pathology residents in 4 residency programs in Jordan. Results: A total of 394 respondents participated (328 medical students and 66 pathology residents). The majority of respondents (272/394, 69%) were already aware of AI and deep learning in medicine, mainly relying on websites for information on AI, while only 14% (56/394) were aware of AI through medical schools. There was a statistically significant difference in awareness among respondents who consider themselves tech experts compared with those who do not (P=.03). More than half of the respondents believed that AI could be used to diagnose diseases automatically (213/394, 54.1% agreement), with medical students agreeing more than pathology residents (P=.04). However, more than one-third expressed fear about recent AI developments (167/394, 42.4% agreed). Two-thirds of respondents disagreed that their medical schools had educated them about AI and its potential use (261/394, 66.2% disagreed), while 46.2% (182/394) expressed interest in learning about AI in medicine. In terms of pathology-specific questions, 75.4% (297/394) agreed that AI could be used to identify pathologies in slide examinations automatically. There was a significant difference between medical students and pathology residents in their agreement (P=.001). Overall, medical students and pathology trainees had similar responses. Conclusions: AI education should be introduced into medical school curricula to improve medical students? understanding and attitudes. Students agreed that they need to learn about AI?s applications, potential hazards, and legal and ethical implications. This is the first study to analyze medical students? views and awareness of AI in Jordan, as well as the first to include pathology residents? perspectives. The findings are consistent with earlier research internationally. In comparison with prior research, these attitudes are similar in low-income and industrialized countries, highlighting the need for a global strategy to introduce AI instruction to medical students everywhere in this era of rapidly expanding technology. 
UR  - https://mededu.jmir.org/2025/1/e62669
UR  - http://dx.doi.org/10.2196/62669
ID  - info:doi/10.2196/62669
ER  - 

TY  - JOUR
AU  - Zhu, Shiben
AU  - Hu, Wanqin
AU  - Yang, Zhi
AU  - Yan, Jiani
AU  - Zhang, Fang
PY  - 2025/1/10
TI  - Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study
JO  - JMIR Med Inform
SP  - e63731
VL  - 13
KW  - large language models
KW  - LLMs
KW  - Chinese National Nursing Licensing Examination
KW  - ChatGPT
KW  - Qwen-2.5
KW  - multiple-choice questions
KW  - 
N2  - Background: Large language models (LLMs) have been proposed as valuable tools in medical education and practice. The Chinese National Nursing Licensing Examination (CNNLE) presents unique challenges for LLMs due to its requirement for both deep domain?specific nursing knowledge and the ability to make complex clinical decisions, which differentiates it from more general medical examinations. However, their potential application in the CNNLE remains unexplored. Objective: This study aims to evaluates the accuracy of 7 LLMs including GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5 on the CNNLE, focusing on their ability to handle domain-specific nursing knowledge and clinical decision-making. We also explore whether combining their outputs using machine learning techniques can improve their overall accuracy. Methods: This retrospective cross-sectional study analyzed all 1200 multiple-choice questions from the CNNLE conducted between 2019 and 2023. Seven LLMs were evaluated on these multiple-choice questions, and 9 machine learning models, including Logistic Regression, Support Vector Machine, Multilayer Perceptron, k-nearest neighbors, Random Forest, LightGBM, AdaBoost, XGBoost, and CatBoost, were used to optimize overall performance through ensemble techniques. Results: Qwen-2.5 achieved the highest overall accuracy of 88.9%, followed by GPT-4o (80.7%), ERNIE Bot-3.5 (78.1%), GPT-4.0 (70.3%), SPARK (65.0%), and GPT-3.5 (49.5%). Qwen-2.5 demonstrated superior accuracy in the Practical Skills section compared with the Professional Practice section across most years. It also performed well in brief clinical case summaries and questions involving shared clinical scenarios. When the outputs of the 7 LLMs were combined using 9 machine learning models, XGBoost yielded the best performance, increasing accuracy to 90.8%. XGBoost also achieved an area under the curve of 0.961, sensitivity of 0.905, specificity of 0.978, F1-score of 0.901, positive predictive value of 0.901, and negative predictive value of 0.977. Conclusions: This study is the first to evaluate the performance of 7 LLMs on the CNNLE and that the integration of models via machine learning significantly boosted accuracy, reaching 90.8%. These findings demonstrate the transformative potential of LLMs in revolutionizing health care education and call for further research to refine their capabilities and expand their impact on examination preparation and professional training. 
UR  - https://medinform.jmir.org/2025/1/e63731
UR  - http://dx.doi.org/10.2196/63731
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/63731
ER  - 

TY  - JOUR
AU  - Zhang, Yong
AU  - Lu, Xiao
AU  - Luo, Yan
AU  - Zhu, Ying
AU  - Ling, Wenwu
PY  - 2025/1/9
TI  - Performance of Artificial Intelligence Chatbots on Ultrasound Examinations: Cross-Sectional Comparative Analysis
JO  - JMIR Med Inform
SP  - e63924
VL  - 13
KW  - chatbots
KW  - ChatGPT
KW  - ERNIE Bot
KW  - performance
KW  - accuracy rates
KW  - ultrasound
KW  - language
KW  - examination
N2  - Background: Artificial intelligence chatbots are being increasingly used for medical inquiries, particularly in the field of ultrasound medicine. However, their performance varies and is influenced by factors such as language, question type, and topic. Objective: This study aimed to evaluate the performance of ChatGPT and ERNIE Bot in answering ultrasound-related medical examination questions, providing insights for users and developers. Methods: We curated 554 questions from ultrasound medicine examinations, covering various question types and topics. The questions were posed in both English and Chinese. Objective questions were scored based on accuracy rates, whereas subjective questions were rated by 5 experienced doctors using a Likert scale. The data were analyzed in Excel. Results: Of the 554 questions included in this study, single-choice questions comprised the largest share (354/554, 64%), followed by short answers (69/554, 12%) and noun explanations (63/554, 11%). The accuracy rates for objective questions ranged from 8.33% to 80%, with true or false questions scoring highest. Subjective questions received acceptability rates ranging from 47.62% to 75.36%. ERNIE Bot was superior to ChatGPT in many aspects (P<.05). Both models showed a performance decline in English, but ERNIE Bot?s decline was less significant. The models performed better in terms of basic knowledge, ultrasound methods, and diseases than in terms of ultrasound signs and diagnosis. Conclusions: Chatbots can provide valuable ultrasound-related answers, but performance differs by model and is influenced by language, question type, and topic. In general, ERNIE Bot outperforms ChatGPT. Users and developers should understand model performance characteristics and select appropriate models for different questions and languages to optimize chatbot use. 
UR  - https://medinform.jmir.org/2025/1/e63924
UR  - http://dx.doi.org/10.2196/63924
ID  - info:doi/10.2196/63924
ER  - 

TY  - JOUR
AU  - Bland, Tyler
PY  - 2025/1/6
TI  - Enhancing Medical Student Engagement Through Cinematic Clinical Narratives: Multimodal Generative AI?Based Mixed Methods Study
JO  - JMIR Med Educ
SP  - e63865
VL  - 11
KW  - artificial intelligence
KW  - cinematic clinical narratives
KW  - cinemeducation
KW  - medical education
KW  - narrative learning
KW  - AI
KW  - medical student
KW  - pharmacology
KW  - preclinical education
KW  - long-term retention
KW  - AI tools
KW  - GPT-4
KW  - image
KW  - applicability
N2  - Background: Medical students often struggle to engage with and retain complex pharmacology topics during their preclinical education. Traditional teaching methods can lead to passive learning and poor long-term retention of critical concepts. Objective: This study aims to enhance the teaching of clinical pharmacology in medical school by using a multimodal generative artificial intelligence (genAI) approach to create compelling, cinematic clinical narratives (CCNs). Methods: We transformed a standard clinical case into an engaging, interactive multimedia experience called ?Shattered Slippers.? This CCN used various genAI tools for content creation: GPT-4 for developing the storyline, Leonardo.ai and Stable Diffusion for generating images, Eleven Labs for creating audio narrations, and Suno for composing a theme song. The CCN integrated narrative styles and pop culture references to enhance student engagement. It was applied in teaching first-year medical students about immune system pharmacology. Student responses were assessed through the Situational Interest Survey for Multimedia and examination performance. The target audience comprised first-year medical students (n=40), with 18 responding to the Situational Interest Survey for Multimedia survey (n=18). Results: The study revealed a marked preference for the genAI-enhanced CCNs over traditional teaching methods. Key findings include the majority of surveyed students preferring the CCN over traditional clinical cases (14/18), as well as high average scores for triggered situational interest (mean 4.58, SD 0.53), maintained interest (mean 4.40, SD 0.53), maintained-feeling interest (mean 4.38, SD 0.51), and maintained-value interest (mean 4.42, SD 0.54). Students achieved an average score of 88% on examination questions related to the CCN material, indicating successful learning and retention. Qualitative feedback highlighted increased engagement, improved recall, and appreciation for the narrative style and pop culture references. Conclusions: This study demonstrates the potential of using a multimodal genAI-driven approach to create CCNs in medical education. The ?Shattered Slippers? case effectively enhanced student engagement and promoted knowledge retention in complex pharmacological topics. This innovative method suggests a novel direction for curriculum development that could improve learning outcomes and student satisfaction in medical education. Future research should explore the long-term retention of knowledge and the applicability of learned material in clinical settings, as well as the potential for broader implementation of this approach across various medical education contexts. 
UR  - https://mededu.jmir.org/2025/1/e63865
UR  - http://dx.doi.org/10.2196/63865
ID  - info:doi/10.2196/63865
ER  - 

TY  - JOUR
AU  - Wang, Heng
AU  - Zheng, Danni
AU  - Wang, Mengying
AU  - Ji, Hong
AU  - Han, Jiangli
AU  - Wang, Yan
AU  - Shen, Ning
AU  - Qiao, Jie
PY  - 2025/1/3
TI  - Artificial Intelligence?Powered Training Database for Clinical Thinking: App Development Study
JO  - JMIR Form Res
SP  - e58426
VL  - 9
KW  - artificial intelligence
KW  - clinical thinking ability
KW  - virtual medical records
KW  - distance education
KW  - medical education
KW  - online learning
N2  - Background: With the development of artificial intelligence (AI), medicine has entered the era of intelligent medicine, and various aspects, such as medical education and talent cultivation, are also being redefined. The cultivation of clinical thinking abilities poses a formidable challenge even for seasoned clinical educators, as offline training modalities often fall short in bridging the divide between current practice and the desired ideal. Consequently, there arises an imperative need for the expeditious development of a web-based database, tailored to empower physicians in their quest to learn and hone their clinical reasoning skills. Objective: This study aimed to introduce an app named ?XueYiKu,? which includes consultations, physical examinations, auxiliary examinations, and diagnosis, incorporating AI and actual complete hospital medical records to build an online-learning platform using human-computer interaction. Methods: The ?XueYiKu? app was designed as a contactless, self-service, trial-and-error system application based on actual complete hospital medical records and natural language processing technology to comprehensively assess the ?clinical competence? of residents at different stages. Case extraction was performed at a hospital?s case data center, and the best-matching cases were differentiated through natural language processing, word segmentation, synonym conversion, and sorting. More than 400 teaching cases covering 65 kinds of diseases were released for students to learn, and the subjects covered internal medicine, surgery, gynecology and obstetrics, and pediatrics. The difficulty of learning cases was divided into four levels in ascending order. Moreover, the learning and teaching effects were evaluated using 6 dimensions covering systematicness, agility, logic, knowledge expansion, multidimensional evaluation indicators, and preciseness. Results: From the app?s first launch on the Android platform in May 2019 to the last version updated in May 2023, the total number of teacher and student users was 6209 and 1180, respectively. The top 3 subjects most frequently learned were respirology (n=606, 24.1%), general surgery (n=506, 20.1%), and urinary surgery (n=390, 15.5%). For diseases, pneumonia was the most frequently learned, followed by cholecystolithiasis (n=216, 14.1%), benign prostate hyperplasia (n=196, 12.8%), and bladder tumor (n=193, 12.6%). Among 479 students, roughly a third (n=168, 35.1%) scored in the 60 to 80 range, and half of them scored over 80 points (n=238, 49.7%). The app enabled medical students? learning to become more active and self-motivated, with a variety of formats, and provided real-time feedback through assessments on the platform. The learning effect was satisfactory overall and provided important precedence for establishing scientific models and methods for assessing clinical thinking skills in the future. Conclusions: The integration of AI and medical education will undoubtedly assist in the restructuring of education processes; promote the evolution of the education ecosystem; and provide new convenient ways for independent learning, interactive communication, and educational resource sharing. 
UR  - https://formative.jmir.org/2025/1/e58426
UR  - http://dx.doi.org/10.2196/58426
ID  - info:doi/10.2196/58426
ER  - 

TY  - JOUR
AU  - Wang, Chenxu
AU  - Li, Shuhan
AU  - Lin, Nuoxi
AU  - Zhang, Xinyu
AU  - Han, Ying
AU  - Wang, Xiandi
AU  - Liu, Di
AU  - Tan, Xiaomei
AU  - Pu, Dan
AU  - Li, Kang
AU  - Qian, Guangwu
AU  - Yin, Rong
PY  - 2025/1/1
TI  - Application of Large Language Models in Medical Training Evaluation?Using ChatGPT as a Standardized Patient: Multimetric Assessment
JO  - J Med Internet Res
SP  - e59435
VL  - 27
KW  - ChatGPT
KW  - artificial intelligence
KW  - standardized patient
KW  - health care
KW  - prompt engineering
KW  - accuracy
KW  - large language models
KW  - performance evaluation
KW  - medical training
KW  - inflammatory bowel disease
N2  - Background: With the increasing interest in the application of large language models (LLMs) in the medical field, the feasibility of its potential use as a standardized patient in medical assessment is rarely evaluated. Specifically, we delved into the potential of using ChatGPT, a representative LLM, in transforming medical education by serving as a cost-effective alternative to standardized patients, specifically for history-taking tasks. Objective: The study aims to explore ChatGPT?s viability and performance as a standardized patient, using prompt engineering to refine its accuracy and use in medical assessments. Methods: A 2-phase experiment was conducted. The first phase assessed feasibility by simulating conversations about inflammatory bowel disease (IBD) across 3 quality groups (good, medium, and bad). Responses were categorized based on their relevance and accuracy. Each group consisted of 30 runs, with responses scored to determine whether they were related to the inquiries. For the second phase, we evaluated ChatGPT?s performance against specific criteria, focusing on its anthropomorphism, clinical accuracy, and adaptability. Adjustments were made to prompts based on ChatGPT?s response shortcomings, with a comparative analysis of ChatGPT?s performance between original and revised prompts. A total of 300 runs were conducted and compared against standard reference scores. Finally, the generalizability of the revised prompt was tested using other scripts for another 60 runs, together with the exploration of the impact of the used language on the performance of the chatbot. Results: The feasibility test confirmed ChatGPT?s ability to simulate a standardized patient effectively, differentiating among poor, medium, and good medical inquiries with varying degrees of accuracy. Score differences between the poor (74.7, SD 5.44) and medium (82.67, SD 5.30) inquiry groups (P<.001), between the poor and good (85, SD 3.27) inquiry groups (P<.001) were significant at a significance level (?) of .05, while the score differences between the medium and good inquiry groups were not statistically significant (P=.16). The revised prompt significantly improved ChatGPT?s realism, clinical accuracy, and adaptability, leading to a marked reduction in scoring discrepancies. The score accuracy of ChatGPT improved 4.926 times compared to unrevised prompts. The score difference percentage drops from 29.83% to 6.06%, with a drop in SD from 0.55 to 0.068. The performance of the chatbot on a separate script is acceptable with an average score difference percentage of 3.21%. Moreover, the performance differences between test groups using various language combinations were found to be insignificant. Conclusions: ChatGPT, as a representative LLM, is a viable tool for simulating standardized patients in medical assessments, with the potential to enhance medical training. By incorporating proper prompts, ChatGPT?s scoring accuracy and response realism significantly improved, approaching the feasibility of actual clinical use. Also, the influence of the adopted language is nonsignificant on the outcome of the chatbot. 
UR  - https://www.jmir.org/2025/1/e59435
UR  - http://dx.doi.org/10.2196/59435
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/59435
ER  - 

TY  - JOUR
AU  - Miyazaki, Yuki
AU  - Hata, Masahiro
AU  - Omori, Hisaki
AU  - Hirashima, Atsuya
AU  - Nakagawa, Yuta
AU  - Eto, Mitsuhiro
AU  - Takahashi, Shun
AU  - Ikeda, Manabu
PY  - 2024/12/24
TI  - Performance of ChatGPT-4o on the Japanese Medical Licensing Examination: Evalution of Accuracy in Text-Only and Image-Based Questions
JO  - JMIR Med Educ
SP  - e63129
VL  - 10
KW  - medical education
KW  - artificial intelligence
KW  - clinical decision-making
KW  - GPT-4o
KW  - medical licensing examination
KW  - Japan
KW  - images
KW  - accuracy
KW  - AI technology
KW  - application
KW  - decision-making
KW  - image-based
KW  - reliability
KW  - ChatGPT
UR  - https://mededu.jmir.org/2024/1/e63129
UR  - http://dx.doi.org/10.2196/63129
ID  - info:doi/10.2196/63129
ER  - 

TY  - JOUR
AU  - Ogundiya, Oluwadamilola
AU  - Rahman, Jasmine Thahmina
AU  - Valnarov-Boulter, Ioan
AU  - Young, Michael Tim
PY  - 2024/12/19
TI  - Looking Back on Digital Medical Education Over the Last 25 Years and Looking to the Future: Narrative Review
JO  - J Med Internet Res
SP  - e60312
VL  - 26
KW  - digital health
KW  - digital medical education
KW  - health education
KW  - medical education
KW  - mobile phone
KW  - artificial intelligence
KW  - AI
N2  - Background: The last 25 years have seen enormous progression in digital technologies across the whole of the health service, including health education. The rapid evolution and use of web-based and digital techniques have been significantly transforming this field since the beginning of the new millennium. These advancements continue to progress swiftly, even more so after the COVID-19 pandemic. Objective: This narrative review aims to outline and discuss the developments that have taken place in digital medical education across the defined time frame. In addition, evidence for potential opportunities and challenges facing digital medical education in the near future was collated for analysis. Methods: Literature reviews were conducted using PubMed, Web of Science Core Collection, Scopus, Google Scholar, and Embase. The participants and learners in this study included medical students, physicians in training or continuing professional development, nurses, paramedics, and patients. Results: Evidence of the significant steps in the development of digital medical education in the past 25 years was presented and analyzed in terms of application, impact, and implications for the future. The results were grouped into the following themes for discussion: learning management systems; telemedicine (in digital medical education); mobile health; big data analytics; the metaverse, augmented reality, and virtual reality; the COVID-19 pandemic; artificial intelligence; and ethics and cybersecurity. Conclusions: Major changes and developments in digital medical education have occurred from around the start of the new millennium. Key steps in this journey include technical developments in teleconferencing and learning management systems, along with a marked increase in mobile device use for accessing learning over this time. While the pace of evolution in digital medical education accelerated during the COVID-19 pandemic, further rapid progress has continued since the resolution of the pandemic. Many of these changes are currently being widely used in health education and other fields, such as augmented reality, virtual reality, and artificial intelligence, providing significant future potential. The opportunities these technologies offer must be balanced against the associated challenges in areas such as cybersecurity, the integrity of web-based assessments, ethics, and issues of digital privacy to ensure that digital medical education continues to thrive in the future. 
UR  - https://www.jmir.org/2024/1/e60312
UR  - http://dx.doi.org/10.2196/60312
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/60312
ER  - 

TY  - JOUR
AU  - Roos, Jonas
AU  - Martin, Ron
AU  - Kaczmarczyk, Robert
PY  - 2024/12/17
TI  - Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study
JO  - JMIR Form Res
SP  - e57592
VL  - 8
KW  - medical education
KW  - visual question answering
KW  - image analysis
KW  - large language model
KW  - LLM
KW  - student
KW  - performance
KW  - comparative
KW  - case study
KW  - artificial intelligence
KW  - AI
KW  - ChatGPT
KW  - effectiveness
KW  - diagnostic
KW  - training
KW  - accuracy
KW  - utility
KW  - image-based
KW  - question
KW  - image
KW  - AMBOSS
KW  - English
KW  - German
KW  - question and answer
KW  - Python
KW  - AI in health care
KW  - health care
N2  - Background: The rapid development of large language models (LLMs) such as OpenAI?s ChatGPT has significantly impacted medical research and education. These models have shown potential in fields ranging from radiological imaging interpretation to medical licensing examination assistance. Recently, LLMs have been enhanced with image recognition capabilities. Objective: This study aims to critically examine the effectiveness of these LLMs in medical diagnostics and training by assessing their accuracy and utility in answering image-based questions from medical licensing examinations. Methods: This study analyzed 1070 image-based multiple-choice questions from the AMBOSS learning platform, divided into 605 in English and 465 in German. Customized prompts in both languages directed the models to interpret medical images and provide the most likely diagnosis. Student performance data were obtained from AMBOSS, including metrics such as the ?student passed mean? and ?majority vote.? Statistical analysis was conducted using Python (Python Software Foundation), with key libraries for data manipulation and visualization. Results: GPT-4 1106 Vision Preview (OpenAI) outperformed Bard Gemini Pro (Google), correctly answering 56.9% (609/1070) of questions compared to Bard?s 44.6% (477/1070), a statistically significant difference (?2?=32.1, P<.001). However, GPT-4 1106 left 16.1% (172/1070) of questions unanswered, significantly higher than Bard?s 4.1% (44/1070; ?2?=83.1, P<.001). When considering only answered questions, GPT-4 1106?s accuracy increased to 67.8% (609/898), surpassing both Bard (477/1026, 46.5%; ?2?=87.7, P<.001) and the student passed mean of 63% (674/1070, SE 1.48%; ?2?=4.8, P=.03). Language-specific analysis revealed both models performed better in German than English, with GPT-4 1106 showing greater accuracy in German (282/465, 60.65% vs 327/605, 54.1%; ?2?=4.4, P=.04) and Bard Gemini Pro exhibiting a similar trend (255/465, 54.8% vs 222/605, 36.7%; ?2?=34.3, P<.001). The student majority vote achieved an overall accuracy of 94.5% (1011/1070), significantly outperforming both artificial intelligence models (GPT-4 1106: ?2?=408.5, P<.001; Bard Gemini Pro: ?2?=626.6, P<.001). Conclusions: Our study shows that GPT-4 1106 Vision Preview and Bard Gemini Pro have potential in medical visual question-answering tasks and to serve as a support for students. However, their performance varies depending on the language used, with a preference for German. They also have limitations in responding to non-English content. The accuracy rates, particularly when compared to student responses, highlight the potential of these models in medical education, yet the need for further optimization and understanding of their limitations in diverse linguistic contexts remains critical. 
UR  - https://formative.jmir.org/2024/1/e57592
UR  - http://dx.doi.org/10.2196/57592
ID  - info:doi/10.2196/57592
ER  - 

TY  - JOUR
AU  - Dzuali, Fiatsogbe
AU  - Seiger, Kira
AU  - Novoa, Roberto
AU  - Aleshin, Maria
AU  - Teng, Joyce
AU  - Lester, Jenna
AU  - Daneshjou, Roxana
PY  - 2024/12/10
TI  - ChatGPT May Improve Access to Language-Concordant Care for Patients With Non?English Language Preferences
JO  - JMIR Med Educ
SP  - e51435
VL  - 10
KW  - ChatGPT
KW  - artificial intelligence
KW  - language
KW  - translation
KW  - health care disparity
KW  - natural language model
KW  - survey
KW  - patient education
KW  - preference
KW  - human language
KW  - language-concordant care
UR  - https://mededu.jmir.org/2024/1/e51435
UR  - http://dx.doi.org/10.2196/51435
ID  - info:doi/10.2196/51435
ER  - 

TY  - JOUR
AU  - Jin, Kyung Hye
AU  - Kim, EunYoung
PY  - 2024/12/4
TI  - Performance of GPT-3.5 and GPT-4 on the Korean Pharmacist Licensing Examination: Comparison Study
JO  - JMIR Med Educ
SP  - e57451
VL  - 10
KW  - GPT-3.5
KW  - GPT-4
KW  - Korean
KW  - Korean Pharmacist Licensing Examination
KW  - KPLE
N2  - Background: ChatGPT, a recently developed artificial intelligence chatbot and a notable large language model, has demonstrated improved performance on medical field examinations. However, there is currently little research on its efficacy in languages other than English or in pharmacy-related examinations. Objective: This study aimed to evaluate the performance of GPT models on the Korean Pharmacist Licensing Examination (KPLE). Methods: We evaluated the percentage of correct answers provided by 2 different versions of ChatGPT (GPT-3.5 and GPT-4) for all multiple-choice single-answer KPLE questions, excluding image-based questions. In total, 320, 317, and 323 questions from the 2021, 2022, and 2023 KPLEs, respectively, were included in the final analysis, which consisted of 4 units: Biopharmacy, Industrial Pharmacy, Clinical and Practical Pharmacy, and Medical Health Legislation. Results: The 3-year average percentage of correct answers was 86.5% (830/960) for GPT-4 and 60.7% (583/960) for GPT-3.5. GPT model accuracy was highest in Biopharmacy (GPT-3.5 77/96, 80.2% in 2022; GPT-4 87/90, 96.7% in 2021) and lowest in Medical Health Legislation (GPT-3.5 8/20, 40% in 2022; GPT-4 12/20, 60% in 2022). Additionally, when comparing the performance of artificial intelligence with that of human participants, pharmacy students outperformed GPT-3.5 but not GPT-4. Conclusions: In the last 3 years, GPT models have performed very close to or exceeded the passing threshold for the KPLE. This study demonstrates the potential of large language models in the pharmacy domain; however, extensive research is needed to evaluate their reliability and ensure their secure application in pharmacy contexts due to several inherent challenges. Addressing these limitations could make GPT models more effective auxiliary tools for pharmacy education. 
UR  - https://mededu.jmir.org/2024/1/e57451
UR  - http://dx.doi.org/10.2196/57451
ID  - info:doi/10.2196/57451
ER  - 

TY  - JOUR
AU  - Luo, Yuan
AU  - Miao, Yiqun
AU  - Zhao, Yuhan
AU  - Li, Jiawei
AU  - Chen, Yuling
AU  - Yue, Yuexue
AU  - Wu, Ying
PY  - 2024/12/2
TI  - Comparing the Accuracy of Two Generated Large Language Models in Identifying Health-Related Rumors or Misconceptions and the Applicability in Health Science Popularization: Proof-of-Concept Study
JO  - JMIR Form Res
SP  - e63188
VL  - 8
KW  - rumor
KW  - misconception
KW  - health science popularization
KW  - health education
KW  - large language model
KW  - LLM
KW  - applicability
KW  - accuracy
KW  - effectiveness
KW  - health related
KW  - education
KW  - health science
KW  - proof of concept
N2  - Background: Health-related rumors and misconceptions are spreading at an alarming rate, fueled by the rapid development of the internet and the exponential growth of social media platforms. This phenomenon has become a pressing global concern, as the dissemination of false information can have severe consequences, including widespread panic, social instability, and even public health crises. Objective: The aim of the study is to compare the accuracy of rumor identification and the effectiveness of health science popularization between 2 generated large language models in Chinese (GPT-4 by OpenAI and Enhanced Representation through Knowledge Integration Bot [ERNIE Bot] 4.0 by Baidu). Methods: In total, 20 health rumors and misconceptions, along with 10 health truths, were randomly inputted into GPT-4 and ERNIE Bot 4.0. We prompted them to determine whether the statements were rumors or misconceptions and provide explanations for their judgment. Further, we asked them to generate a health science popularization essay. We evaluated the outcomes in terms of accuracy, effectiveness, readability, and applicability. Accuracy was assessed by the rate of correctly identifying health-related rumors, misconceptions, and truths. Effectiveness was determined by the accuracy of the generated explanation, which was assessed collaboratively by 2 research team members with a PhD in nursing. Readability was calculated by the readability formula of Chinese health education materials. Applicability was evaluated by the Chinese Suitability Assessment of Materials. Results: GPT-4 and ERNIE Bot 4.0 correctly identified all health rumors and misconceptions (100% accuracy rate). For truths, the accuracy rate was 70% (7/10) and 100% (10/10), respectively. Both mostly provided widely recognized viewpoints without obvious errors. The average readability score for the health essays was 2.92 (SD 0.85) for GPT-4 and 3.02 (SD 0.84) for ERNIE Bot 4.0 (P=.65). For applicability, except for the content and cultural appropriateness category, significant differences were observed in the total score and scores in other dimensions between them (P<.05). Conclusions: ERNIE Bot 4.0 demonstrated similar accuracy to GPT-4 in identifying Chinese rumors. Both provided widely accepted views, despite some inaccuracies. These insights enhance understanding and correct misunderstandings. For health essays, educators can learn from readable language styles of GLLMs. Finally, ERNIE Bot 4.0 aligns with Chinese expression habits, making it a good choice for a better Chinese reading experience. 
UR  - https://formative.jmir.org/2024/1/e63188
UR  - http://dx.doi.org/10.2196/63188
ID  - info:doi/10.2196/63188
ER  - 

TY  - JOUR
AU  - Ehrett, Carl
AU  - Hegde, Sudeep
AU  - Andre, Kwame
AU  - Liu, Dixizi
AU  - Wilson, Timothy
PY  - 2024/11/19
TI  - Leveraging Open-Source Large Language Models for Data Augmentation in Hospital Staff Surveys: Mixed Methods Study
JO  - JMIR Med Educ
SP  - e51433
VL  - 10
KW  - data augmentation
KW  - large language models
KW  - medical education
KW  - natural language processing
KW  - data security
KW  - ethics
KW  - AI
KW  - artificial intelligence
KW  - data privacy
KW  - medical staff
N2  - Background: Generative large language models (LLMs) have the potential to revolutionize medical education by generating tailored learning materials, enhancing teaching efficiency, and improving learner engagement. However, the application of LLMs in health care settings, particularly for augmenting small datasets in text classification tasks, remains underexplored, particularly for cost- and privacy-conscious applications that do not permit the use of third-party services such as OpenAI?s ChatGPT. Objective: This study aims to explore the use of open-source LLMs, such as Large Language Model Meta AI (LLaMA) and Alpaca models, for data augmentation in a specific text classification task related to hospital staff surveys. Methods: The surveys were designed to elicit narratives of everyday adaptation by frontline radiology staff during the initial phase of the COVID-19 pandemic. A 2-step process of data augmentation and text classification was conducted. The study generated synthetic data similar to the survey reports using 4 generative LLMs for data augmentation. A different set of 3 classifier LLMs was then used to classify the augmented text for thematic categories. The study evaluated performance on the classification task. Results: The overall best-performing combination of LLMs, temperature, classifier, and number of synthetic data cases is via augmentation with LLaMA 7B at temperature 0.7 with 100 augments, using Robustly Optimized BERT Pretraining Approach (RoBERTa) for the classification task, achieving an average area under the receiver operating characteristic (AUC) curve of 0.87 (SD 0.02; ie, 1 SD). The results demonstrate that open-source LLMs can enhance text classifiers? performance for small datasets in health care contexts, providing promising pathways for improving medical education processes and patient care practices. Conclusions: The study demonstrates the value of data augmentation with open-source LLMs, highlights the importance of privacy and ethical considerations when using LLMs, and suggests future directions for research in this field. 
UR  - https://mededu.jmir.org/2024/1/e51433
UR  - http://dx.doi.org/10.2196/51433
ID  - info:doi/10.2196/51433
ER  - 

TY  - JOUR
AU  - Zhou, You
AU  - Li, Si-Jia
AU  - Tang, Xing-Yi
AU  - He, Yi-Chen
AU  - Ma, Hao-Ming
AU  - Wang, Ao-Qi
AU  - Pei, Run-Yuan
AU  - Piao, Mei-Hua
PY  - 2024/11/19
TI  - Using ChatGPT in Nursing: Scoping Review of Current Opinions
JO  - JMIR Med Educ
SP  - e54297
VL  - 10
KW  - ChatGPT
KW  - large language model
KW  - nursing
KW  - artificial intelligence
KW  - scoping review
KW  - generative AI
KW  - nursing education
N2  - Background: Since the release of ChatGPT in November 2022, this emerging technology has garnered a lot of attention in various fields, and nursing is no exception. However, to date, no study has comprehensively summarized the status and opinions of using ChatGPT across different nursing fields. Objective: We aim to synthesize the status and opinions of using ChatGPT according to different nursing fields, as well as assess ChatGPT?s strengths, weaknesses, and the potential impacts it may cause. Methods: This scoping review was conducted following the framework of Arksey and O?Malley and guided by the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews). A comprehensive literature research was conducted in 4 web-based databases (PubMed, Embase, Web of Science, and CINHAL) to identify studies reporting the opinions of using ChatGPT in nursing fields from 2022 to September 3, 2023. The references of the included studies were screened manually to further identify relevant studies. Two authors conducted studies screening, eligibility assessments, and data extraction independently. Results: A total of 30 studies were included. The United States (7 studies), Canada (5 studies), and China (4 studies) were countries with the most publications. In terms of fields of concern, studies mainly focused on ?ChatGPT and nursing education? (20 studies), ?ChatGPT and nursing practice? (10 studies), and ?ChatGPT and nursing research, writing, and examination? (6 studies). Six studies addressed the use of ChatGPT in multiple nursing fields. Conclusions: As an emerging artificial intelligence technology, ChatGPT has great potential to revolutionize nursing education, nursing practice, and nursing research. However, researchers, institutions, and administrations still need to critically examine its accuracy, safety, and privacy, as well as academic misconduct and potential ethical issues that it may lead to before applying ChatGPT to practice. 
UR  - https://mededu.jmir.org/2024/1/e54297
UR  - http://dx.doi.org/10.2196/54297
ID  - info:doi/10.2196/54297
ER  - 

TY  - JOUR
AU  - Ros-Arlanzón, Pablo
AU  - Perez-Sempere, Angel
PY  - 2024/11/14
TI  - Evaluating AI Competence in Specialized Medicine: Comparative Analysis of ChatGPT and Neurologists in a Neurology Specialist Examination in Spain
JO  - JMIR Med Educ
SP  - e56762
VL  - 10
KW  - artificial intelligence
KW  - ChatGPT
KW  - clinical decision-making
KW  - medical education
KW  - medical knowledge assessment
KW  - OpenAI
N2  - Background: With the rapid advancement of artificial intelligence (AI) in various fields, evaluating its application in specialized medical contexts becomes crucial. ChatGPT, a large language model developed by OpenAI, has shown potential in diverse applications, including medicine. Objective: This study aims to compare the performance of ChatGPT with that of attending neurologists in a real neurology specialist examination conducted in the Valencian Community, Spain, assessing the AI?s capabilities and limitations in medical knowledge. Methods: We conducted a comparative analysis using the 2022 neurology specialist examination results from 120 neurologists and responses generated by ChatGPT versions 3.5 and 4. The examination consisted of 80 multiple-choice questions, with a focus on clinical neurology and health legislation. Questions were classified according to Bloom?s Taxonomy. Statistical analysis of performance, including the ? coefficient for response consistency, was performed. Results: Human participants exhibited a median score of 5.91 (IQR: 4.93-6.76), with 32 neurologists failing to pass. ChatGPT-3.5 ranked 116th out of 122, answering 54.5% of questions correctly (score 3.94). ChatGPT-4 showed marked improvement, ranking 17th with 81.8% of correct answers (score 7.57), surpassing several human specialists. No significant variations were observed in the performance on lower-order questions versus higher-order questions. Additionally, ChatGPT-4 demonstrated increased interrater reliability, as reflected by a higher ? coefficient of 0.73, compared to ChatGPT-3.5?s coefficient of 0.69. Conclusions: This study underscores the evolving capabilities of AI in medical knowledge assessment, particularly in specialized fields. ChatGPT-4?s performance, outperforming the median score of human participants in a rigorous neurology examination, represents a significant milestone in AI development, suggesting its potential as an effective tool in specialized medical education and assessment. 
UR  - https://mededu.jmir.org/2024/1/e56762
UR  - http://dx.doi.org/10.2196/56762
ID  - info:doi/10.2196/56762
ER  - 

TY  - JOUR
AU  - Ming, Shuai
AU  - Yao, Xi
AU  - Guo, Xiaohong
AU  - Guo, Qingge
AU  - Xie, Kunpeng
AU  - Chen, Dandan
AU  - Lei, Bo
PY  - 2024/11/14
TI  - Performance of ChatGPT in Ophthalmic Registration and Clinical Diagnosis: Cross-Sectional Study
JO  - J Med Internet Res
SP  - e60226
VL  - 26
KW  - artificial intelligence
KW  - chatbot
KW  - ChatGPT
KW  - ophthalmic registration
KW  - clinical diagnosis
KW  - AI
KW  - cross-sectional study
KW  - eye disease
KW  - eye disorder
KW  - ophthalmology
KW  - health care
KW  - outpatient registration
KW  - clinical
KW  - decision-making
KW  - generative AI
KW  - vision impairment
N2  - Background: Artificial intelligence (AI) chatbots such as ChatGPT are expected to impact vision health care significantly. Their potential to optimize the consultation process and diagnostic capabilities across range of ophthalmic subspecialties have yet to be fully explored. Objective: This study aims to investigate the performance of AI chatbots in recommending ophthalmic outpatient registration and diagnosing eye diseases within clinical case profiles. Methods: This cross-sectional study used clinical cases from Chinese Standardized Resident Training?Ophthalmology (2nd Edition). For each case, 2 profiles were created: patient with history (Hx) and patient with history and examination (Hx+Ex). These profiles served as independent queries for GPT-3.5 and GPT-4.0 (accessed from March 5 to 18, 2024). Similarly, 3 ophthalmic residents were posed the same profiles in a questionnaire format. The accuracy of recommending ophthalmic subspecialty registration was primarily evaluated using Hx profiles. The accuracy of the top-ranked diagnosis and the accuracy of the diagnosis within the top 3 suggestions (do-not-miss diagnosis) were assessed using Hx+Ex profiles. The gold standard for judgment was the published, official diagnosis. Characteristics of incorrect diagnoses by ChatGPT were also analyzed. Results: A total of 208 clinical profiles from 12 ophthalmic subspecialties were analyzed (104 Hx and 104 Hx+Ex profiles). For Hx profiles, GPT-3.5, GPT-4.0, and residents showed comparable accuracy in registration suggestions (66/104, 63.5%; 81/104, 77.9%; and 72/104, 69.2%, respectively; P=.07), with ocular trauma, retinal diseases, and strabismus and amblyopia achieving the top 3 accuracies. For Hx+Ex profiles, both GPT-4.0 and residents demonstrated higher diagnostic accuracy than GPT-3.5 (62/104, 59.6% and 63/104, 60.6% vs 41/104, 39.4%; P=.003 and P=.001, respectively). Accuracy for do-not-miss diagnoses also improved (79/104, 76% and 68/104, 65.4% vs 51/104, 49%; P<.001 and P=.02, respectively). The highest diagnostic accuracies were observed in glaucoma; lens diseases; and eyelid, lacrimal, and orbital diseases. GPT-4.0 recorded fewer incorrect top-3 diagnoses (25/42, 60% vs 53/63, 84%; P=.005) and more partially correct diagnoses (21/42, 50% vs 7/63 11%; P<.001) than GPT-3.5, while GPT-3.5 had more completely incorrect (27/63, 43% vs 7/42, 17%; P=.005) and less precise diagnoses (22/63, 35% vs 5/42, 12%; P=.009). Conclusions: GPT-3.5 and GPT-4.0 showed intermediate performance in recommending ophthalmic subspecialties for registration. While GPT-3.5 underperformed, GPT-4.0 approached and numerically surpassed residents in differential diagnosis. AI chatbots show promise in facilitating ophthalmic patient registration. However, their integration into diagnostic decision-making requires more validation. 
UR  - https://www.jmir.org/2024/1/e60226
UR  - http://dx.doi.org/10.2196/60226
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/60226
ER  - 

TY  - JOUR
AU  - Bicknell, T. Brenton
AU  - Butler, Danner
AU  - Whalen, Sydney
AU  - Ricks, James
AU  - Dixon, J. Cory
AU  - Clark, B. Abigail
AU  - Spaedy, Olivia
AU  - Skelton, Adam
AU  - Edupuganti, Neel
AU  - Dzubinski, Lance
AU  - Tate, Hudson
AU  - Dyess, Garrett
AU  - Lindeman, Brenessa
AU  - Lehmann, Soleymani Lisa
PY  - 2024/11/6
TI  - ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis
JO  - JMIR Med Educ
SP  - e63430
VL  - 10
KW  - large language model
KW  - ChatGPT
KW  - medical education
KW  - USMLE
KW  - AI in medical education
KW  - medical student resources
KW  - educational technology
KW  - artificial intelligence in medicine
KW  - clinical skills
KW  - LLM
KW  - medical licensing examination
KW  - medical students
KW  - United States Medical Licensing Examination
KW  - ChatGPT 4 Omni
KW  - ChatGPT 4
KW  - ChatGPT 3.5
N2  - Background: Recent studies, including those by the National Board of Medical Examiners, have highlighted the remarkable capabilities of recent large language models (LLMs) such as ChatGPT in passing the United States Medical Licensing Examination (USMLE). However, there is a gap in detailed analysis of LLM performance in specific medical content areas, thus limiting an assessment of their potential utility in medical education. Objective: This study aimed to assess and compare the accuracy of successive ChatGPT versions (GPT-3.5, GPT-4, and GPT-4 Omni) in USMLE disciplines, clinical clerkships, and the clinical skills of diagnostics and management. Methods: This study used 750 clinical vignette-based multiple-choice questions to characterize the performance of successive ChatGPT versions (ChatGPT 3.5 [GPT-3.5], ChatGPT 4 [GPT-4], and ChatGPT 4 Omni [GPT-4o]) across USMLE disciplines, clinical clerkships, and in clinical skills (diagnostics and management). Accuracy was assessed using a standardized protocol, with statistical analyses conducted to compare the models? performances. Results: GPT-4o achieved the highest accuracy across 750 multiple-choice questions at 90.4%, outperforming GPT-4 and GPT-3.5, which scored 81.1% and 60.0%, respectively. GPT-4o?s highest performances were in social sciences (95.5%), behavioral and neuroscience (94.2%), and pharmacology (93.2%). In clinical skills, GPT-4o?s diagnostic accuracy was 92.7% and management accuracy was 88.8%, significantly higher than its predecessors. Notably, both GPT-4o and GPT-4 significantly outperformed the medical student average accuracy of 59.3% (95% CI 58.3?60.3). Conclusions: GPT-4o?s performance in USMLE disciplines, clinical clerkships, and clinical skills indicates substantial improvements over its predecessors, suggesting significant potential for the use of this technology as an educational aid for medical students. These findings underscore the need for careful consideration when integrating LLMs into medical education, emphasizing the importance of structured curricula to guide their appropriate use and the need for ongoing critical analyses to ensure their reliability and effectiveness. 
UR  - https://mededu.jmir.org/2024/1/e63430
UR  - http://dx.doi.org/10.2196/63430
ID  - info:doi/10.2196/63430
ER  - 

TY  - JOUR
AU  - Alli, Rabia Sauliha
AU  - Hossain, Qahh?r Soaad
AU  - Das, Sunit
AU  - Upshur, Ross
PY  - 2024/11/4
TI  - The Potential of Artificial Intelligence Tools for Reducing Uncertainty in Medicine and Directions for Medical Education
JO  - JMIR Med Educ
SP  - e51446
VL  - 10
KW  - artificial intelligence
KW  - machine learning
KW  - uncertainty
KW  - clinical decision-making
KW  - medical education
KW  - generative AI
KW  - generative artificial intelligence
UR  - https://mededu.jmir.org/2024/1/e51446
UR  - http://dx.doi.org/10.2196/51446
ID  - info:doi/10.2196/51446
ER  - 

TY  - JOUR
AU  - Tao, Wenjuan
AU  - Yang, Jinming
AU  - Qu, Xing
PY  - 2024/10/28
TI  - Utilization of, Perceptions on, and Intention to Use AI Chatbots Among Medical Students in China: National Cross-Sectional Study
JO  - JMIR Med Educ
SP  - e57132
VL  - 10
KW  - medical education
KW  - artificial intelligence
KW  - UTAUT model
KW  - utilization
KW  - medical students
KW  - cross-sectional study
KW  - AI chatbots
KW  - China
KW  - acceptance
KW  - electronic survey
KW  - social media
KW  - medical information
KW  - risk
KW  - training
KW  - support
N2  - Background: Artificial intelligence (AI) chatbots are poised to have a profound impact on medical education. Medical students, as early adopters of technology and future health care providers, play a crucial role in shaping the future of health care. However, little is known about the utilization of, perceptions on, and intention to use AI chatbots among medical students in China. Objective: This study aims to explore the utilization of, perceptions on, and intention to use generative AI chatbots among medical students in China, using the Unified Theory of Acceptance and Use of Technology (UTAUT) framework. By conducting a national cross-sectional survey, we sought to identify the key determinants that influence medical students? acceptance of AI chatbots, thereby providing a basis for enhancing their integration into medical education. Understanding these factors is crucial for educators, policy makers, and technology developers to design and implement effective AI-driven educational tools that align with the needs and expectations of future health care professionals. Methods: A web-based electronic survey questionnaire was developed and distributed via social media to medical students across the country. The UTAUT was used as a theoretical framework to design the questionnaire and analyze the data. The relationship between behavioral intention to use AI chatbots and UTAUT predictors was examined using multivariable regression. Results: A total of 693 participants were from 57 universities covering 21 provinces or municipalities in China. Only a minority (199/693, 28.72%) reported using AI chatbots for studying, with ChatGPT (129/693, 18.61%) being the most commonly used. Most of the participants used AI chatbots for quickly obtaining medical information and knowledge (631/693, 91.05%) and increasing learning efficiency (594/693, 85.71%). Utilization behavior, social influence, facilitating conditions, perceived risk, and personal innovativeness showed significant positive associations with the behavioral intention to use AI chatbots (all P values were <.05). Conclusions: Chinese medical students hold positive perceptions toward and high intentions to use AI chatbots, but there are gaps between intention and actual adoption. This highlights the need for strategies to improve access, training, and support and provide peer usage examples to fully harness the potential benefits of chatbot technology. 
UR  - https://mededu.jmir.org/2024/1/e57132
UR  - http://dx.doi.org/10.2196/57132
ID  - info:doi/10.2196/57132
ER  - 

TY  - JOUR
AU  - Wang, Shuang
AU  - Yang, Liuying
AU  - Li, Min
AU  - Zhang, Xinghe
AU  - Tai, Xiantao
PY  - 2024/10/10
TI  - Medical Education and Artificial Intelligence: Web of Science?Based Bibliometric Analysis (2013-2022)
JO  - JMIR Med Educ
SP  - e51411
VL  - 10
KW  - artificial intelligence
KW  - medical education
KW  - bibliometric analysis
KW  - CiteSpace
KW  - VOSviewer
N2  - Background: Incremental advancements in artificial intelligence (AI) technology have facilitated its integration into various disciplines. In particular, the infusion of AI into medical education has emerged as a significant trend, with noteworthy research findings. Consequently, a comprehensive review and analysis of the current research landscape of AI in medical education is warranted. Objective: This study aims to conduct a bibliometric analysis of pertinent papers, spanning the years 2013?2022, using CiteSpace and VOSviewer. The study visually represents the existing research status and trends of AI in medical education. Methods: Articles related to AI and medical education, published between 2013 and 2022, were systematically searched in the Web of Science core database. Two reviewers scrutinized the initially retrieved papers, based on their titles and abstracts, to eliminate papers unrelated to the topic. The selected papers were then analyzed and visualized for country, institution, author, reference, and keywords using CiteSpace and VOSviewer. Results: A total of 195 papers pertaining to AI in medical education were identified from 2013 to 2022. The annual publications demonstrated an increasing trend over time. The United States emerged as the most active country in this research arena, and Harvard Medical School and the University of Toronto were the most active institutions. Prolific authors in this field included Vincent Bissonnette, Charlotte Blacketer, Rolando F Del Maestro, Nicole Ledows, Nykan Mirchi, Alexander Winkler-Schwartz, and Recai Yilamaz. The paper with the highest citation was ?Medical Students? Attitude Towards Artificial Intelligence: A Multicentre Survey.? Keyword analysis revealed that ?radiology,? ?medical physics,? ?ehealth,? ?surgery,? and ?specialty? were the primary focus, whereas ?big data? and ?management? emerged as research frontiers. Conclusions: The study underscores the promising potential of AI in medical education research. Current research directions encompass radiology, medical information management, and other aspects. Technological progress is expected to broaden these directions further. There is an urgent need to bolster interregional collaboration and enhance research quality. These findings offer valuable insights for researchers to identify perspectives and guide future research directions. 
UR  - https://mededu.jmir.org/2024/1/e51411
UR  - http://dx.doi.org/10.2196/51411
ID  - info:doi/10.2196/51411
ER  - 

TY  - JOUR
AU  - Miao, Jing
AU  - Thongprayoon, Charat
AU  - Garcia Valencia, Oscar
AU  - Craici, M. Iasmina
AU  - Cheungpasitporn, Wisit
PY  - 2024/10/10
TI  - Navigating Nephrology's Decline Through a GPT-4 Analysis of Internal Medicine Specialties in the United States: Qualitative Study
JO  - JMIR Med Educ
SP  - e57157
VL  - 10
KW  - artificial intelligence
KW  - ChatGPT
KW  - nephrology fellowship training
KW  - fellowship matching
KW  - medical education
KW  - AI
KW  - nephrology
KW  - fellowship
KW  - United States
KW  - factor
KW  - chatbots
KW  - intellectual
KW  - complexity
KW  - work-life balance
KW  - procedural involvement
KW  - opportunity
KW  - career demand
KW  - financial compensation
N2  - Background: The 2024 Nephrology fellowship match data show the declining interest in nephrology in the United States, with an 11% drop in candidates and a mere 66% (321/488) of positions filled. Objective: The study aims to discern the factors influencing this trend using ChatGPT, a leading chatbot model, for insights into the comparative appeal of nephrology versus other internal medicine specialties. Methods: Using the GPT-4 model, the study compared nephrology with 13 other internal medicine specialties, evaluating each on 7 criteria including intellectual complexity, work-life balance, procedural involvement, research opportunities, patient relationships, career demand, and financial compensation. Each criterion was assigned scores from 1 to 10, with the cumulative score determining the ranking. The approach included counteracting potential bias by instructing GPT-4 to favor other specialties over nephrology in reverse scenarios. Results: GPT-4 ranked nephrology only above sleep medicine. While nephrology scored higher than hospice and palliative medicine, it fell short in key criteria such as work-life balance, patient relationships, and career demand. When examining the percentage of filled positions in the 2024 appointment year match, nephrology?s filled rate was 66%, only higher than the 45% (155/348) filled rate of geriatric medicine. Nephrology?s score decreased by 4%?14% in 5 criteria including intellectual challenge and complexity, procedural involvement, career opportunity and demand, research and academic opportunities, and financial compensation. Conclusions: ChatGPT does not favor nephrology over most internal medicine specialties, highlighting its diminishing appeal as a career choice. This trend raises significant concerns, especially considering the overall physician shortage, and prompts a reevaluation of factors affecting specialty choice among medical residents. 
UR  - https://mededu.jmir.org/2024/1/e57157
UR  - http://dx.doi.org/10.2196/57157
ID  - info:doi/10.2196/57157
ER  - 

TY  - JOUR
AU  - Goodings, James Anthony
AU  - Kajitani, Sten
AU  - Chhor, Allison
AU  - Albakri, Ahmad
AU  - Pastrak, Mila
AU  - Kodancha, Megha
AU  - Ives, Rowan
AU  - Lee, Bin Yoo
AU  - Kajitani, Kari
PY  - 2024/10/8
TI  - Assessment of ChatGPT-4 in Family Medicine Board Examinations Using Advanced AI Learning and Analytical Methods: Observational Study
JO  - JMIR Med Educ
SP  - e56128
VL  - 10
KW  - ChatGPT-4
KW  - Family Medicine Board Examination
KW  - artificial intelligence in medical education
KW  - AI performance assessment
KW  - prompt engineering
KW  - ChatGPT
KW  - artificial intelligence
KW  - AI
KW  - medical education
KW  - assessment
KW  - observational
KW  - analytical method
KW  - data analysis
KW  - examination
N2  - Background: This research explores the capabilities of ChatGPT-4 in passing the American Board of Family Medicine (ABFM) Certification Examination. Addressing a gap in existing literature, where earlier artificial intelligence (AI) models showed limitations in medical board examinations, this study evaluates the enhanced features and potential of ChatGPT-4, especially in document analysis and information synthesis. Objective: The primary goal is to assess whether ChatGPT-4, when provided with extensive preparation resources and when using sophisticated data analysis, can achieve a score equal to or above the passing threshold for the Family Medicine Board Examinations. Methods: In this study, ChatGPT-4 was embedded in a specialized subenvironment, ?AI Family Medicine Board Exam Taker,? designed to closely mimic the conditions of the ABFM Certification Examination. This subenvironment enabled the AI to access and analyze a range of relevant study materials, including a primary medical textbook and supplementary web-based resources. The AI was presented with a series of ABFM-type examination questions, reflecting the breadth and complexity typical of the examination. Emphasis was placed on assessing the AI?s ability to interpret and respond to these questions accurately, leveraging its advanced data processing and analysis capabilities within this controlled subenvironment. Results: In our study, ChatGPT-4?s performance was quantitatively assessed on 300 practice ABFM examination questions. The AI achieved a correct response rate of 88.67% (95% CI 85.08%-92.25%) for the Custom Robot version and 87.33% (95% CI 83.57%-91.10%) for the Regular version. Statistical analysis, including the McNemar test (P=.45), indicated no significant difference in accuracy between the 2 versions. In addition, the chi-square test for error-type distribution (P=.32) revealed no significant variation in the pattern of errors across versions. These results highlight ChatGPT-4?s capacity for high-level performance and consistency in responding to complex medical examination questions under controlled conditions. Conclusions: The study demonstrates that ChatGPT-4, particularly when equipped with specialized preparation and when operating in a tailored subenvironment, shows promising potential in handling the intricacies of medical board examinations. While its performance is comparable with the expected standards for passing the ABFM Certification Examination, further enhancements in AI technology and tailored training methods could push these capabilities to new heights. This exploration opens avenues for integrating AI tools such as ChatGPT-4 in medical education and assessment, emphasizing the importance of continuous advancement and specialized training in medical applications of AI. 
UR  - https://mededu.jmir.org/2024/1/e56128
UR  - http://dx.doi.org/10.2196/56128
ID  - info:doi/10.2196/56128
ER  - 

TY  - JOUR
AU  - Choi, K. Yong
AU  - Lin, Shih-Yin
AU  - Fick, Marie Donna
AU  - Shulman, W. Richard
AU  - Lee, Sangil
AU  - Shrestha, Priyanka
AU  - Santoso, Kate
PY  - 2024/10/1
TI  - Optimizing ChatGPT?s Interpretation and Reporting of Delirium Assessment Outcomes: Exploratory Study
JO  - JMIR Form Res
SP  - e51383
VL  - 8
KW  - generative artificial intelligence
KW  - generative AI
KW  - large language models
KW  - ChatGPT
KW  - delirium detection
KW  - Sour Seven Questionnaire
KW  - prompt engineering
KW  - clinical vignettes
KW  - medical education
KW  - caregiver education
N2  - Background: Generative artificial intelligence (AI) and large language models, such as OpenAI?s ChatGPT, have shown promising potential in supporting medical education and clinical decision-making, given their vast knowledge base and natural language processing capabilities. As a general purpose AI system, ChatGPT can complete a wide range of tasks, including differential diagnosis without additional training. However, the specific application of ChatGPT in learning and applying a series of specialized, context-specific tasks mimicking the workflow of a human assessor, such as administering a standardized assessment questionnaire, followed by inputting assessment results in a standardized form, and interpretating assessment results strictly following credible, published scoring criteria, have not been thoroughly studied. Objective: This exploratory study aims to evaluate and optimize ChatGPT?s capabilities in administering and interpreting the Sour Seven Questionnaire, an informant-based delirium assessment tool. Specifically, the objectives were to train ChatGPT-3.5 and ChatGPT-4 to understand and correctly apply the Sour Seven Questionnaire to clinical vignettes using prompt engineering, assess the performance of these AI models in identifying and scoring delirium symptoms against scores from human experts, and refine and enhance the models? interpretation and reporting accuracy through iterative prompt optimization. Methods: We used prompt engineering to train ChatGPT-3.5 and ChatGPT-4 models on the Sour Seven Questionnaire, a tool for assessing delirium through caregiver input. Prompt engineering is a methodology used to enhance the AI?s processing of inputs by meticulously structuring the prompts to improve accuracy and consistency in outputs. In this study, prompt engineering involved creating specific, structured commands that guided the AI models in understanding and applying the assessment tool?s criteria accurately to clinical vignettes. This approach also included designing prompts to explicitly instruct the AI on how to format its responses, ensuring they were consistent with clinical documentation standards. Results: Both ChatGPT models demonstrated promising proficiency in applying the Sour Seven Questionnaire to the vignettes, despite initial inconsistencies and errors. Performance notably improved through iterative prompt engineering, enhancing the models? capacity to detect delirium symptoms and assign scores. Prompt optimizations included adjusting the scoring methodology to accept only definitive ?Yes? or ?No? responses, revising the evaluation prompt to mandate responses in a tabular format, and guiding the models to adhere to the 2 recommended actions specified in the Sour Seven Questionnaire. Conclusions: Our findings provide preliminary evidence supporting the potential utility of AI models such as ChatGPT in administering standardized clinical assessment tools. The results highlight the significance of context-specific training and prompt engineering in harnessing the full potential of these AI models for health care applications. Despite the encouraging results, broader generalizability and further validation in real-world settings warrant additional research. 
UR  - https://formative.jmir.org/2024/1/e51383
UR  - http://dx.doi.org/10.2196/51383
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/51383
ER  - 

TY  - JOUR
AU  - Claman, Daniel
AU  - Sezgin, Emre
PY  - 2024/9/27
TI  - Artificial Intelligence in Dental Education: Opportunities and Challenges of Large Language Models and Multimodal Foundation Models
JO  - JMIR Med Educ
SP  - e52346
VL  - 10
KW  - artificial intelligence
KW  - large language models
KW  - dental education
KW  - GPT
KW  - ChatGPT
KW  - periodontal health
KW  - AI
KW  - LLM
KW  - LLMs
KW  - chatbot
KW  - natural language
KW  - generative pretrained transformer
KW  - innovation
KW  - technology
KW  - large language model
UR  - https://mededu.jmir.org/2024/1/e52346
UR  - http://dx.doi.org/10.2196/52346
ID  - info:doi/10.2196/52346
ER  - 

TY  - JOUR
AU  - Yamamoto, Akira
AU  - Koda, Masahide
AU  - Ogawa, Hiroko
AU  - Miyoshi, Tomoko
AU  - Maeda, Yoshinobu
AU  - Otsuka, Fumio
AU  - Ino, Hideo
PY  - 2024/9/23
TI  - Enhancing Medical Interview Skills Through AI-Simulated Patient Interactions: Nonrandomized Controlled Trial
JO  - JMIR Med Educ
SP  - e58753
VL  - 10
KW  - medical interview
KW  - generative pretrained transformer
KW  - large language model
KW  - simulation-based learning
KW  - OSCE
KW  - artificial intelligence
KW  - medical education
KW  - simulated patients
KW  - nonrandomized controlled trial
N2  - Background: Medical interviewing is a critical skill in clinical practice, yet opportunities for practical training are limited in Japanese medical schools, necessitating urgent measures. Given advancements in artificial intelligence (AI) technology, its application in the medical field is expanding. However, reports on its application in medical interviews in medical education are scarce. Objective: This study aimed to investigate whether medical students? interview skills could be improved by engaging with AI-simulated patients using large language models, including the provision of feedback. Methods: This nonrandomized controlled trial was conducted with fourth-year medical students in Japan. A simulation program using large language models was provided to 35 students in the intervention group in 2023, while 110 students from 2022 who did not participate in the intervention were selected as the control group. The primary outcome was the score on the Pre-Clinical Clerkship Objective Structured Clinical Examination (pre-CC OSCE), a national standardized clinical skills examination, in medical interviewing. Secondary outcomes included surveys such as the Simulation-Based Training Quality Assurance Tool (SBT-QA10), administered at the start and end of the study. Results: The AI intervention group showed significantly higher scores on medical interviews than the control group (AI group vs control group: mean 28.1, SD 1.6 vs 27.1, SD 2.2; P=.01). There was a trend of inverse correlation between the SBT-QA10 and pre-CC OSCE scores (regression coefficient ?2.0 to ?2.1). No significant safety concerns were observed. Conclusions: Education through medical interviews using AI-simulated patients has demonstrated safety and a certain level of educational effectiveness. However, at present, the educational effects of this platform on nonverbal communication skills are limited, suggesting that it should be used as a supplementary tool to traditional simulation education. 
UR  - https://mededu.jmir.org/2024/1/e58753
UR  - http://dx.doi.org/10.2196/58753
UR  - http://www.ncbi.nlm.nih.gov/pubmed/39312284
ID  - info:doi/10.2196/58753
ER  - 

TY  - JOUR
AU  - Yoon, Soo-Hyuk
AU  - Oh, Kyeong Seok
AU  - Lim, Gun Byung
AU  - Lee, Ho-Jin
PY  - 2024/9/16
TI  - Performance of ChatGPT in the In-Training Examination for Anesthesiology and Pain Medicine Residents in South Korea: Observational Study
JO  - JMIR Med Educ
SP  - e56859
VL  - 10
KW  - AI tools
KW  - problem solving
KW  - anesthesiology
KW  - artificial intelligence
KW  - pain medicine
KW  - ChatGPT
KW  - health care
KW  - medical education
KW  - South Korea
N2  - Background: ChatGPT has been tested in health care, including the US Medical Licensing Examination and specialty exams, showing near-passing results. Its performance in the field of anesthesiology has been assessed using English board examination questions; however, its effectiveness in Korea remains unexplored. Objective: This study investigated the problem-solving performance of ChatGPT in the fields of anesthesiology and pain medicine in the Korean language context, highlighted advancements in artificial intelligence (AI), and explored its potential applications in medical education. Methods: We investigated the performance (number of correct answers/number of questions) of GPT-4, GPT-3.5, and CLOVA X in the fields of anesthesiology and pain medicine, using in-training examinations that have been administered to Korean anesthesiology residents over the past 5 years, with an annual composition of 100 questions. Questions containing images, diagrams, or photographs were excluded from the analysis. Furthermore, to assess the performance differences of the GPT across different languages, we conducted a comparative analysis of the GPT-4?s problem-solving proficiency using both the original Korean texts and their English translations. Results: A total of 398 questions were analyzed. GPT-4 (67.8%) demonstrated a significantly better overall performance than GPT-3.5 (37.2%) and CLOVA-X (36.7%). However, GPT-3.5 and CLOVA X did not show significant differences in their overall performance. Additionally, the GPT-4 showed superior performance on questions translated into English, indicating a language processing discrepancy (English: 75.4% vs Korean: 67.8%; difference 7.5%; 95% CI 3.1%-11.9%; P=.001). Conclusions: This study underscores the potential of AI tools, such as ChatGPT, in medical education and practice but emphasizes the need for cautious application and further refinement, especially in non-English medical contexts. The findings suggest that although AI advancements are promising, they require careful evaluation and development to ensure acceptable performance across diverse linguistic and professional settings. 
UR  - https://mededu.jmir.org/2024/1/e56859
UR  - http://dx.doi.org/10.2196/56859
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/56859
ER  - 

TY  - JOUR
AU  - Holderried, Friederike
AU  - Stegemann-Philipps, Christian
AU  - Herrmann-Werner, Anne
AU  - Festl-Wietek, Teresa
AU  - Holderried, Martin
AU  - Eickhoff, Carsten
AU  - Mahling, Moritz
PY  - 2024/8/16
TI  - A Language Model?Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study
JO  - JMIR Med Educ
SP  - e59213
VL  - 10
KW  - virtual patients communication
KW  - communication skills
KW  - technology enhanced education
KW  - TEL
KW  - medical education
KW  - ChatGPT
KW  - GPT: LLM
KW  - LLMs
KW  - NLP
KW  - natural language processing
KW  - machine learning
KW  - artificial intelligence
KW  - language model
KW  - language models
KW  - communication
KW  - relationship
KW  - relationships
KW  - chatbot
KW  - chatbots
KW  - conversational agent
KW  - conversational agents
KW  - history
KW  - histories
KW  - simulated
KW  - student
KW  - students
KW  - interaction
KW  - interactions
N2  - Background: Although history taking is fundamental for diagnosing medical conditions, teaching and providing feedback on the skill can be challenging due to resource constraints. Virtual simulated patients and web-based chatbots have thus emerged as educational tools, with recent advancements in artificial intelligence (AI) such as large language models (LLMs) enhancing their realism and potential to provide feedback. Objective: In our study, we aimed to evaluate the effectiveness of a Generative Pretrained Transformer (GPT) 4 model to provide structured feedback on medical students? performance in history taking with a simulated patient. Methods: We conducted a prospective study involving medical students performing history taking with a GPT-powered chatbot. To that end, we designed a chatbot to simulate patients? responses and provide immediate feedback on the comprehensiveness of the students? history taking. Students? interactions with the chatbot were analyzed, and feedback from the chatbot was compared with feedback from a human rater. We measured interrater reliability and performed a descriptive analysis to assess the quality of feedback. Results: Most of the study?s participants were in their third year of medical school. A total of 1894 question-answer pairs from 106 conversations were included in our analysis. GPT-4?s role-play and responses were medically plausible in more than 99% of cases. Interrater reliability between GPT-4 and the human rater showed ?almost perfect? agreement (Cohen ?=0.832). Less agreement (?<0.6) detected for 8 out of 45 feedback categories highlighted topics about which the model?s assessments were overly specific or diverged from human judgement. Conclusions: The GPT model was effective in providing structured feedback on history-taking dialogs provided by medical students. Although we unraveled some limitations regarding the specificity of feedback for certain feedback categories, the overall high agreement with human raters suggests that LLMs can be a valuable tool for medical education. Our findings, thus, advocate the careful integration of AI-driven feedback mechanisms in medical training and highlight important aspects when LLMs are used in that context. 
UR  - https://mededu.jmir.org/2024/1/e59213
UR  - http://dx.doi.org/10.2196/59213
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/59213
ER  - 

TY  - JOUR
AU  - Ming, Shuai
AU  - Guo, Qingge
AU  - Cheng, Wenjun
AU  - Lei, Bo
PY  - 2024/8/13
TI  - Influence of Model Evolution and System Roles on ChatGPT?s Performance in Chinese Medical Licensing Exams: Comparative Study
JO  - JMIR Med Educ
SP  - e52784
VL  - 10
KW  - ChatGPT
KW  - Chinese National Medical Licensing Examination
KW  - large language models
KW  - medical education
KW  - system role
KW  - LLM
KW  - LLMs
KW  - language model
KW  - language models
KW  - artificial intelligence
KW  - chatbot
KW  - chatbots
KW  - conversational agent
KW  - conversational agents
KW  - exam
KW  - exams
KW  - examination
KW  - examinations
KW  - OpenAI
KW  - answer
KW  - answers
KW  - response
KW  - responses
KW  - accuracy
KW  - performance
KW  - China
KW  - Chinese
N2  - Background: With the increasing application of large language models like ChatGPT in various industries, its potential in the medical domain, especially in standardized examinations, has become a focal point of research. Objective: The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE). Methods: The CNMLE 2022 question set, consisting of 500 single-answer multiple choices questions, were reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the version of GPT-3.5 and 4.0, the prompt?s designation of system roles tailored to medical subspecialties, and repetition for coherence. A passing accuracy threshold was established as 60%. The ?2 tests and ? values were employed to evaluate the model?s accuracy and consistency. Results: GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with ? values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%?3.7%) and GPT-3.5 (1.3%?4.5%), and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response. Conclusions: GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role insignificantly enhanced the model?s reliability and answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study. 
UR  - https://mededu.jmir.org/2024/1/e52784
UR  - http://dx.doi.org/10.2196/52784
ID  - info:doi/10.2196/52784
ER  - 

TY  - JOUR
AU  - Cherrez-Ojeda, Ivan
AU  - Gallardo-Bastidas, C. Juan
AU  - Robles-Velasco, Karla
AU  - Osorio, F. María
AU  - Velez Leon, Maria Eleonor
AU  - Leon Velastegui, Manuel
AU  - Pauletto, Patrícia
AU  - Aguilar-Díaz, C. F.
AU  - Squassi, Aldo
AU  - González Eras, Patricia Susana
AU  - Cordero Carrasco, Erita
AU  - Chavez Gonzalez, Leonor Karol
AU  - Calderon, C. Juan
AU  - Bousquet, Jean
AU  - Bedbrook, Anna
AU  - Faytong-Haro, Marco
PY  - 2024/8/13
TI  - Understanding Health Care Students? Perceptions, Beliefs, and Attitudes Toward AI-Powered Language Models: Cross-Sectional Study
JO  - JMIR Med Educ
SP  - e51757
VL  - 10
KW  - artificial intelligence
KW  - ChatGPT
KW  - education
KW  - health care
KW  - students
N2  - Background: ChatGPT was not intended for use in health care, but it has potential benefits that depend on end-user understanding and acceptability, which is where health care students become crucial. There is still a limited amount of research in this area. Objective: The primary aim of our study was to assess the frequency of ChatGPT use, the perceived level of knowledge, the perceived risks associated with its use, and the ethical issues, as well as attitudes toward the use of ChatGPT in the context of education in the field of health. In addition, we aimed to examine whether there were differences across groups based on demographic variables. The second part of the study aimed to assess the association between the frequency of use, the level of perceived knowledge, the level of risk perception, and the level of perception of ethics as predictive factors for participants? attitudes toward the use of ChatGPT. Methods: A cross-sectional survey was conducted from May to June 2023 encompassing students of medicine, nursing, dentistry, nutrition, and laboratory science across the Americas. The study used descriptive analysis, chi-square tests, and ANOVA to assess statistical significance across different categories. The study used several ordinal logistic regression models to analyze the impact of predictive factors (frequency of use, perception of knowledge, perception of risk, and ethics perception scores) on attitude as the dependent variable. The models were adjusted for gender, institution type, major, and country. Stata was used to conduct all the analyses. Results: Of 2661 health care students, 42.99% (n=1144) were unaware of ChatGPT. The median score of knowledge was ?minimal? (median 2.00, IQR 1.00-3.00). Most respondents (median 2.61, IQR 2.11-3.11) regarded ChatGPT as neither ethical nor unethical. Most participants (median 3.89, IQR 3.44-4.34) ?somewhat agreed? that ChatGPT (1) benefits health care settings, (2) provides trustworthy data, (3) is a helpful tool for clinical and educational medical information access, and (4) makes the work easier. In total, 70% (7/10) of people used it for homework. As the perceived knowledge of ChatGPT increased, there was a stronger tendency with regard to having a favorable attitude toward ChatGPT. Higher ethical consideration perception ratings increased the likelihood of considering ChatGPT as a source of trustworthy health care information (odds ratio [OR] 1.620, 95% CI 1.498-1.752), beneficial in medical issues (OR 1.495, 95% CI 1.452-1.539), and useful for medical literature (OR 1.494, 95% CI 1.426-1.564; P<.001 for all results). Conclusions: Over 40% of American health care students (1144/2661, 42.99%) were unaware of ChatGPT despite its extensive use in the health field. Our data revealed the positive attitudes toward ChatGPT and the desire to learn more about it. Medical educators must explore how chatbots may be included in undergraduate health care education programs. 
UR  - https://mededu.jmir.org/2024/1/e51757
UR  - http://dx.doi.org/10.2196/51757
UR  - http://www.ncbi.nlm.nih.gov/pubmed/39137029
ID  - info:doi/10.2196/51757
ER  - 

TY  - JOUR
AU  - Takahashi, Hiromizu
AU  - Shikino, Kiyoshi
AU  - Kondo, Takeshi
AU  - Komori, Akira
AU  - Yamada, Yuji
AU  - Saita, Mizue
AU  - Naito, Toshio
PY  - 2024/8/13
TI  - Educational Utility of Clinical Vignettes Generated in Japanese by ChatGPT-4: Mixed Methods Study
JO  - JMIR Med Educ
SP  - e59133
VL  - 10
KW  - generative AI
KW  - ChatGPT-4
KW  - medical case generation
KW  - medical education
KW  - clinical vignettes
KW  - AI
KW  - artificial intelligence
KW  - Japanese
KW  - Japan
N2  - Background: Evaluating the accuracy and educational utility of artificial intelligence?generated medical cases, especially those produced by large language models such as ChatGPT-4 (developed by OpenAI), is crucial yet underexplored. Objective: This study aimed to assess the educational utility of ChatGPT-4?generated clinical vignettes and their applicability in educational settings. Methods: Using a convergent mixed methods design, a web-based survey was conducted from January 8 to 28, 2024, to evaluate 18 medical cases generated by ChatGPT-4 in Japanese. In the survey, 6 main question items were used to evaluate the quality of the generated clinical vignettes and their educational utility, which are information quality, information accuracy, educational usefulness, clinical match, terminology accuracy (TA), and diagnosis difficulty. Feedback was solicited from physicians specializing in general internal medicine or general medicine and experienced in medical education. Chi-square and Mann-Whitney U tests were performed to identify differences among cases, and linear regression was used to examine trends associated with physicians? experience. Thematic analysis of qualitative feedback was performed to identify areas for improvement and confirm the educational utility of the cases. Results: Of the 73 invited participants, 71 (97%) responded. The respondents, primarily male (64/71, 90%), spanned a broad range of practice years (from 1976 to 2017) and represented diverse hospital sizes throughout Japan. The majority deemed the information quality (mean 0.77, 95% CI 0.75-0.79) and information accuracy (mean 0.68, 95% CI 0.65-0.71) to be satisfactory, with these responses being based on binary data. The average scores assigned were 3.55 (95% CI 3.49-3.60) for educational usefulness, 3.70 (95% CI 3.65-3.75) for clinical match, 3.49 (95% CI 3.44-3.55) for TA, and 2.34 (95% CI 2.28-2.40) for diagnosis difficulty, based on a 5-point Likert scale. Statistical analysis showed significant variability in content quality and relevance across the cases (P<.001 after Bonferroni correction). Participants suggested improvements in generating physical findings, using natural language, and enhancing medical TA. The thematic analysis highlighted the need for clearer documentation, clinical information consistency, content relevance, and patient-centered case presentations. Conclusions: ChatGPT-4?generated medical cases written in Japanese possess considerable potential as resources in medical education, with recognized adequacy in quality and accuracy. Nevertheless, there is a notable need for enhancements in the precision and realism of case details. This study emphasizes ChatGPT-4?s value as an adjunctive educational tool in the medical field, requiring expert oversight for optimal application. 
UR  - https://mededu.jmir.org/2024/1/e59133
UR  - http://dx.doi.org/10.2196/59133
UR  - http://www.ncbi.nlm.nih.gov/pubmed/39137031
ID  - info:doi/10.2196/59133
ER  - 

TY  - JOUR
AU  - McBee, C. Joseph
AU  - Han, Y. Daniel
AU  - Liu, Li
AU  - Ma, Leah
AU  - Adjeroh, A. Donald
AU  - Xu, Dong
AU  - Hu, Gangqing
PY  - 2024/8/7
TI  - Assessing ChatGPT?s Competency in Addressing Interdisciplinary Inquiries on Chatbot Uses in Sports Rehabilitation: Simulation Study
JO  - JMIR Med Educ
SP  - e51157
VL  - 10
KW  - ChatGPT
KW  - chatbots
KW  - multirole-playing
KW  - interdisciplinary inquiry
KW  - medical education
KW  - sports medicine
N2  - Background: ChatGPT showcases exceptional conversational capabilities and extensive cross-disciplinary knowledge. In addition, it can perform multiple roles in a single chat session. This unique multirole-playing feature positions ChatGPT as a promising tool for exploring interdisciplinary subjects. Objective: The aim of this study was to evaluate ChatGPT?s competency in addressing interdisciplinary inquiries based on a case study exploring the opportunities and challenges of chatbot uses in sports rehabilitation. Methods: We developed a model termed PanelGPT to assess ChatGPT?s competency in addressing interdisciplinary topics through simulated panel discussions. Taking chatbot uses in sports rehabilitation as an example of an interdisciplinary topic, we prompted ChatGPT through PanelGPT to role-play a physiotherapist, psychologist, nutritionist, artificial intelligence expert, and athlete in a simulated panel discussion. During the simulation, we posed questions to the panel while ChatGPT acted as both the panelists for responses and the moderator for steering the discussion. We performed the simulation using ChatGPT-4 and evaluated the responses by referring to the literature and our human expertise. Results: By tackling questions related to chatbot uses in sports rehabilitation with respect to patient education, physiotherapy, physiology, nutrition, and ethical considerations, responses from the ChatGPT-simulated panel discussion reasonably pointed to various benefits such as 24/7 support, personalized advice, automated tracking, and reminders. ChatGPT also correctly emphasized the importance of patient education, and identified challenges such as limited interaction modes, inaccuracies in emotion-related advice, assurance of data privacy and security, transparency in data handling, and fairness in model training. It also stressed that chatbots are to assist as a copilot, not to replace human health care professionals in the rehabilitation process. Conclusions: ChatGPT exhibits strong competency in addressing interdisciplinary inquiry by simulating multiple experts from complementary backgrounds, with significant implications in assisting medical education. 
UR  - https://mededu.jmir.org/2024/1/e51157
UR  - http://dx.doi.org/10.2196/51157
UR  - http://www.ncbi.nlm.nih.gov/pubmed/39042885
ID  - info:doi/10.2196/51157
ER  - 

TY  - JOUR
AU  - Aljamaan, Fadi
AU  - Temsah, Mohamad-Hani
AU  - Altamimi, Ibraheem
AU  - Al-Eyadhy, Ayman
AU  - Jamal, Amr
AU  - Alhasan, Khalid
AU  - Mesallam, A. Tamer
AU  - Farahat, Mohamed
AU  - Malki, H. Khalid
PY  - 2024/7/31
TI  - Reference Hallucination Score for Medical Artificial Intelligence Chatbots: Development and Usability Study
JO  - JMIR Med Inform
SP  - e54345
VL  - 12
KW  - artificial intelligence (AI) chatbots
KW  - reference hallucination
KW  - bibliographic verification
KW  - ChatGPT
KW  - Perplexity
KW  - SciSpace
KW  - Elicit
KW  - Bing
N2  - Background: Artificial intelligence (AI) chatbots have recently gained use in medical practice by health care practitioners. Interestingly, the output of these AI chatbots was found to have varying degrees of hallucination in content and references. Such hallucinations generate doubts about their output and their implementation. Objective: The aim of our study was to propose a reference hallucination score (RHS) to evaluate the authenticity of AI chatbots? citations. Methods: Six AI chatbots were challenged with the same 10 medical prompts, requesting 10 references per prompt. The RHS is composed of 6 bibliographic items and the reference?s relevance to prompts? keywords. RHS was calculated for each reference, prompt, and type of prompt (basic vs complex). The average RHS was calculated for each AI chatbot and compared across the different types of prompts and AI chatbots. Results: Bard failed to generate any references. ChatGPT 3.5 and Bing generated the highest RHS (score=11), while Elicit and SciSpace generated the lowest RHS (score=1), and Perplexity generated a middle RHS (score=7). The highest degree of hallucination was observed for reference relevancy to the prompt keywords (308/500, 61.6%), while the lowest was for reference titles (169/500, 33.8%). ChatGPT and Bing had comparable RHS (? coefficient=?0.069; P=.32), while Perplexity had significantly lower RHS than ChatGPT (? coefficient=?0.345; P<.001). AI chatbots generally had significantly higher RHS when prompted with scenarios or complex format prompts (? coefficient=0.486; P<.001). Conclusions: The variation in RHS underscores the necessity for a robust reference evaluation tool to improve the authenticity of AI chatbots. Further, the variations highlight the importance of verifying their output and citations. Elicit and SciSpace had negligible hallucination, while ChatGPT and Bing had critical hallucination levels. The proposed AI chatbots? RHS could contribute to ongoing efforts to enhance AI?s general reliability in medical research. 
UR  - https://medinform.jmir.org/2024/1/e54345
UR  - http://dx.doi.org/10.2196/54345
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/54345
ER  - 

TY  - JOUR
AU  - Zhui, Li
AU  - Yhap, Nina
AU  - Liping, Liu
AU  - Zhengjie, Wang
AU  - Zhonghao, Xiong
AU  - Xiaoshu, Yuan
AU  - Hong, Cui
AU  - Xuexiu, Liu
AU  - Wei, Ren
PY  - 2024/7/25
TI  - Impact of Large Language Models on Medical Education and Teaching Adaptations
JO  - JMIR Med Inform
SP  - e55933
VL  - 12
KW  - large language models
KW  - medical education
KW  - opportunities
KW  - challenges
KW  - critical thinking
KW  - educator
UR  - https://medinform.jmir.org/2024/1/e55933
UR  - http://dx.doi.org/10.2196/55933
ID  - info:doi/10.2196/55933
ER  - 

TY  - JOUR
AU  - Burke, B. Harry
AU  - Hoang, Albert
AU  - Lopreiato, O. Joseph
AU  - King, Heidi
AU  - Hemmer, Paul
AU  - Montgomery, Michael
AU  - Gagarin, Viktoria
PY  - 2024/7/25
TI  - Assessing the Ability of a Large Language Model to Score Free-Text Medical Student Clinical Notes: Quantitative Study
JO  - JMIR Med Educ
SP  - e56342
VL  - 10
KW  - medical education
KW  - generative artificial intelligence
KW  - natural language processing
KW  - ChatGPT
KW  - generative pretrained transformer
KW  - standardized patients
KW  - clinical notes
KW  - free-text notes
KW  - history and physical examination
KW  - large language model
KW  - LLM
KW  - medical student
KW  - medical students
KW  - clinical information
KW  - artificial intelligence
KW  - AI
KW  - patients
KW  - patient
KW  - medicine
N2  - Background: Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback regarding the quality of their free-text clinical notes. Objective: The goal of this study was to assess the ability of ChatGPT 3.5, a large language model, to score medical students? free-text history and physical notes. Methods: This is a single-institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free-text history and physical note of their interaction. The students? notes were scored independently by the standardized patients and ChatGPT using a prespecified scoring rubric that consisted of 85 case elements. The measure of accuracy was percent correct. Results: The study population consisted of 168 first-year medical students. There was a total of 14,280 scores. The ChatGPT incorrect scoring rate was 1.0%, and the standardized patient incorrect scoring rate was 7.2%. The ChatGPT error rate was 86%, lower than the standardized patient error rate. The ChatGPT mean incorrect scoring rate of 12 (SD 11) was significantly lower than the standardized patient mean incorrect scoring rate of 85 (SD 74; P=.002). Conclusions: ChatGPT demonstrated a significantly lower error rate compared to standardized patients. This is the first study to assess the ability of a generative pretrained transformer (GPT) program to score medical students? standardized patient-based free-text clinical notes. It is expected that, in the near future, large language models will provide real-time feedback to practicing physicians regarding their free-text notes. GPT artificial intelligence programs represent an important advance in medical education and medical practice. 
UR  - https://mededu.jmir.org/2024/1/e56342
UR  - http://dx.doi.org/10.2196/56342
ID  - info:doi/10.2196/56342
ER  - 

TY  - JOUR
AU  - Noroozi, Mohammad
AU  - St John, Ace
AU  - Masino, Caterina
AU  - Laplante, Simon
AU  - Hunter, Jaryd
AU  - Brudno, Michael
AU  - Madani, Amin
AU  - Kersten-Oertel, Marta
PY  - 2024/7/25
TI  - Education in Laparoscopic Cholecystectomy: Design and Feasibility Study of the LapBot Safe Chole Mobile Game
JO  - JMIR Form Res
SP  - e52878
VL  - 8
KW  - gamification
KW  - serious games
KW  - surgery
KW  - education
KW  - laparoscopic cholecystectomy
KW  - artificial intelligence
KW  - AI
KW  - laparoscope
KW  - gallbladder
KW  - cholecystectomy
KW  - mobile game
KW  - gamify
KW  - educational game
KW  - interactive
KW  - decision-making
KW  - mobile phone
N2  - Background:  Major bile duct injuries during laparoscopic cholecystectomy (LC), often stemming from errors in surgical judgment and visual misperception of critical anatomy, significantly impact morbidity, mortality, disability, and health care costs. Objective:  To enhance safe LC learning, we developed an educational mobile game, LapBot Safe Chole, which uses an artificial intelligence (AI) model to provide real-time coaching and feedback, improving intraoperative decision-making. Methods:  LapBot Safe Chole offers a free, accessible simulated learning experience with real-time AI feedback. Players engage with intraoperative LC scenarios (short video clips) and identify ideal dissection zones. After the response, users receive an accuracy score from a validated AI algorithm. The game consists of 5 levels of increasing difficulty based on the Parkland grading scale for cholecystitis. Results:  Beta testing (n=29) showed score improvements with each round, with attendings and senior trainees achieving top scores faster than junior residents. Learning curves and progression distinguished candidates, with a significant association between user level and scores (P=.003). Players found LapBot enjoyable and educational. Conclusions:  LapBot Safe Chole effectively integrates safe LC principles into a fun, accessible, and educational game using AI-generated feedback. Initial beta testing supports the validity of the assessment scores and suggests high adoption and engagement potential among surgical trainees. 
UR  - https://formative.jmir.org/2024/1/e52878
UR  - http://dx.doi.org/10.2196/52878
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/52878
ER  - 

TY  - JOUR
AU  - Cherif, Hela
AU  - Moussa, Chirine
AU  - Missaoui, Mouhaymen Abdel
AU  - Salouage, Issam
AU  - Mokaddem, Salma
AU  - Dhahri, Besma
PY  - 2024/7/23
TI  - Appraisal of ChatGPT?s Aptitude for Medical Education: Comparative Analysis With Third-Year Medical Students in a Pulmonology Examination
JO  - JMIR Med Educ
SP  - e52818
VL  - 10
KW  - medical education
KW  - ChatGPT
KW  - GPT
KW  - artificial intelligence
KW  - natural language processing
KW  - NLP
KW  - pulmonary medicine
KW  - pulmonary
KW  - lung
KW  - lungs
KW  - respiratory
KW  - respiration
KW  - pneumology
KW  - comparative analysis
KW  - large language models
KW  - LLMs
KW  - LLM
KW  - language model
KW  - generative AI
KW  - generative artificial intelligence
KW  - generative
KW  - exams
KW  - exam
KW  - examinations
KW  - examination
N2  - Background: The rapid evolution of ChatGPT has generated substantial interest and led to extensive discussions in both public and academic domains, particularly in the context of medical education. Objective: This study aimed to evaluate ChatGPT?s performance in a pulmonology examination through a comparative analysis with that of third-year medical students. Methods: In this cross-sectional study, we conducted a comparative analysis with 2 distinct groups. The first group comprised 244 third-year medical students who had previously taken our institution?s 2020 pulmonology examination, which was conducted in French. The second group involved ChatGPT-3.5 in 2 separate sets of conversations: without contextualization (V1) and with contextualization (V2). In both V1 and V2, ChatGPT received the same set of questions administered to the students. Results: V1 demonstrated exceptional proficiency in radiology, microbiology, and thoracic surgery, surpassing the majority of medical students in these domains. However, it faced challenges in pathology, pharmacology, and clinical pneumology. In contrast, V2 consistently delivered more accurate responses across various question categories, regardless of the specialization. ChatGPT exhibited suboptimal performance in multiple choice questions compared to medical students. V2 excelled in responding to structured open-ended questions. Both ChatGPT conversations, particularly V2, outperformed students in addressing questions of low and intermediate difficulty. Interestingly, students showcased enhanced proficiency when confronted with highly challenging questions. V1 fell short of passing the examination. Conversely, V2 successfully achieved examination success, outperforming 139 (62.1%) medical students. Conclusions: While ChatGPT has access to a comprehensive web-based data set, its performance closely mirrors that of an average medical student. Outcomes are influenced by question format, item complexity, and contextual nuances. The model faces challenges in medical contexts requiring information synthesis, advanced analytical aptitude, and clinical judgment, as well as in non-English language assessments and when confronted with data outside mainstream internet sources. 
UR  - https://mededu.jmir.org/2024/1/e52818
UR  - http://dx.doi.org/10.2196/52818
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/52818
ER  - 

TY  - JOUR
AU  - Laymouna, Moustafa
AU  - Ma, Yuanchao
AU  - Lessard, David
AU  - Schuster, Tibor
AU  - Engler, Kim
AU  - Lebouché, Bertrand
PY  - 2024/7/23
TI  - Roles, Users, Benefits, and Limitations of Chatbots in Health Care: Rapid Review
JO  - J Med Internet Res
SP  - e56930
VL  - 26
KW  - chatbot
KW  - conversational agent
KW  - conversational assistant
KW  - user-computer interface
KW  - digital health
KW  - mobile health
KW  - electronic health
KW  - telehealth
KW  - artificial intelligence
KW  - AI
KW  - health information technology
N2  - Background: Chatbots, or conversational agents, have emerged as significant tools in health care, driven by advancements in artificial intelligence and digital technology. These programs are designed to simulate human conversations, addressing various health care needs. However, no comprehensive synthesis of health care chatbots? roles, users, benefits, and limitations is available to inform future research and application in the field. Objective: This review aims to describe health care chatbots? characteristics, focusing on their diverse roles in the health care pathway, user groups, benefits, and limitations. Methods: A rapid review of published literature from 2017 to 2023 was performed with a search strategy developed in collaboration with a health sciences librarian and implemented in the MEDLINE and Embase databases. Primary research studies reporting on chatbot roles or benefits in health care were included. Two reviewers dual-screened the search results. Extracted data on chatbot roles, users, benefits, and limitations were subjected to content analysis. Results: The review categorized chatbot roles into 2 themes: delivery of remote health services, including patient support, care management, education, skills building, and health behavior promotion, and provision of administrative assistance to health care providers. User groups spanned across patients with chronic conditions as well as patients with cancer; individuals focused on lifestyle improvements; and various demographic groups such as women, families, and older adults. Professionals and students in health care also emerged as significant users, alongside groups seeking mental health support, behavioral change, and educational enhancement. The benefits of health care chatbots were also classified into 2 themes: improvement of health care quality and efficiency and cost-effectiveness in health care delivery. The identified limitations encompassed ethical challenges, medicolegal and safety concerns, technical difficulties, user experience issues, and societal and economic impacts. Conclusions: Health care chatbots offer a wide spectrum of applications, potentially impacting various aspects of health care. While they are promising tools for improving health care efficiency and quality, their integration into the health care system must be approached with consideration of their limitations to ensure optimal, safe, and equitable use. 
UR  - https://www.jmir.org/2024/1/e56930
UR  - http://dx.doi.org/10.2196/56930
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/56930
ER  - 

TY  - JOUR
AU  - Tolentino, Raymond
AU  - Baradaran, Ashkan
AU  - Gore, Genevieve
AU  - Pluye, Pierre
AU  - Abbasgholizadeh-Rahimi, Samira
PY  - 2024/7/18
TI  - Curriculum Frameworks and Educational Programs in AI for Medical Students, Residents, and Practicing Physicians: Scoping Review
JO  - JMIR Med Educ
SP  - e54793
VL  - 10
KW  - artificial intelligence
KW  - machine learning
KW  - curriculum
KW  - framework
KW  - medical education
KW  - review
N2  - Background: The successful integration of artificial intelligence (AI) into clinical practice is contingent upon physicians? comprehension of AI principles and its applications. Therefore, it is essential for medical education curricula to incorporate AI topics and concepts, providing future physicians with the foundational knowledge and skills needed. However, there is a knowledge gap in the current understanding and availability of structured AI curriculum frameworks tailored for medical education, which serve as vital guides for instructing and facilitating the learning process. Objective: The overall aim of this study is to synthesize knowledge from the literature on curriculum frameworks and current educational programs that focus on the teaching and learning of AI for medical students, residents, and practicing physicians. Methods: We followed a validated framework and the Joanna Briggs Institute methodological guidance for scoping reviews. An information specialist performed a comprehensive search from 2000 to May 2023 in the following bibliographic databases: MEDLINE (Ovid), Embase (Ovid), CENTRAL (Cochrane Library), CINAHL (EBSCOhost), and Scopus as well as the gray literature. Papers were limited to English and French languages. This review included papers that describe curriculum frameworks for teaching and learning AI in medicine, irrespective of country. All types of papers and study designs were included, except conference abstracts and protocols. Two reviewers independently screened the titles and abstracts, read the full texts, and extracted data using a validated data extraction form. Disagreements were resolved by consensus, and if this was not possible, the opinion of a third reviewer was sought. We adhered to the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) checklist for reporting the results. Results: Of the 5104 papers screened, 21 papers relevant to our eligibility criteria were identified. In total, 90% (19/21) of the papers altogether described 30 current or previously offered educational programs, and 10% (2/21) of the papers described elements of a curriculum framework. One framework describes a general approach to integrating AI curricula throughout the medical learning continuum and another describes a core curriculum for AI in ophthalmology. No papers described a theory, pedagogy, or framework that guided the educational programs. Conclusions: This review synthesizes recent advancements in AI curriculum frameworks and educational programs within the domain of medical education. To build on this foundation, future researchers are encouraged to engage in a multidisciplinary approach to curriculum redesign. In addition, it is encouraged to initiate dialogues on the integration of AI into medical curriculum planning and to investigate the development, deployment, and appraisal of these innovative educational programs. International Registered Report Identifier (IRRID): RR2-10.11124/JBIES-22-00374 
UR  - https://mededu.jmir.org/2024/1/e54793
UR  - http://dx.doi.org/10.2196/54793
UR  - http://www.ncbi.nlm.nih.gov/pubmed/39023999
ID  - info:doi/10.2196/54793
ER  - 

TY  - JOUR
AU  - Jo, Eunbeen
AU  - Song, Sanghoun
AU  - Kim, Jong-Ho
AU  - Lim, Subin
AU  - Kim, Hyeon Ju
AU  - Cha, Jung-Joon
AU  - Kim, Young-Min
AU  - Joo, Joon Hyung
PY  - 2024/7/8
TI  - Assessing GPT-4?s Performance in Delivering Medical Advice: Comparative Analysis With Human Experts
JO  - JMIR Med Educ
SP  - e51282
VL  - 10
KW  - GPT-4
KW  - medical advice
KW  - ChatGPT
KW  - cardiology
KW  - cardiologist
KW  - heart
KW  - advice
KW  - recommendation
KW  - recommendations
KW  - linguistic
KW  - linguistics
KW  - artificial intelligence
KW  - NLP
KW  - natural language processing
KW  - chatbot
KW  - chatbots
KW  - conversational agent
KW  - conversational agents
KW  - response
KW  - responses
N2  - Background: Accurate medical advice is paramount in ensuring optimal patient care, and misinformation can lead to misguided decisions with potentially detrimental health outcomes. The emergence of large language models (LLMs) such as OpenAI?s GPT-4 has spurred interest in their potential health care applications, particularly in automated medical consultation. Yet, rigorous investigations comparing their performance to human experts remain sparse. Objective: This study aims to compare the medical accuracy of GPT-4 with human experts in providing medical advice using real-world user-generated queries, with a specific focus on cardiology. It also sought to analyze the performance of GPT-4 and human experts in specific question categories, including drug or medication information and preliminary diagnoses. Methods: We collected 251 pairs of cardiology-specific questions from general users and answers from human experts via an internet portal. GPT-4 was tasked with generating responses to the same questions. Three independent cardiologists (SL, JHK, and JJC) evaluated the answers provided by both human experts and GPT-4. Using a computer interface, each evaluator compared the pairs and determined which answer was superior, and they quantitatively measured the clarity and complexity of the questions as well as the accuracy and appropriateness of the responses, applying a 3-tiered grading scale (low, medium, and high). Furthermore, a linguistic analysis was conducted to compare the length and vocabulary diversity of the responses using word count and type-token ratio. Results: GPT-4 and human experts displayed comparable efficacy in medical accuracy (?GPT-4 is better? at 132/251, 52.6% vs ?Human expert is better? at 119/251, 47.4%). In accuracy level categorization, humans had more high-accuracy responses than GPT-4 (50/237, 21.1% vs 30/238, 12.6%) but also a greater proportion of low-accuracy responses (11/237, 4.6% vs 1/238, 0.4%; P=.001). GPT-4 responses were generally longer and used a less diverse vocabulary than those of human experts, potentially enhancing their comprehensibility for general users (sentence count: mean 10.9, SD 4.2 vs mean 5.9, SD 3.7; P<.001; type-token ratio: mean 0.69, SD 0.07 vs mean 0.79, SD 0.09; P<.001). Nevertheless, human experts outperformed GPT-4 in specific question categories, notably those related to drug or medication information and preliminary diagnoses. These findings highlight the limitations of GPT-4 in providing advice based on clinical experience. Conclusions: GPT-4 has shown promising potential in automated medical consultation, with comparable medical accuracy to human experts. However, challenges remain particularly in the realm of nuanced clinical judgment. Future improvements in LLMs may require the integration of specific clinical reasoning pathways and regulatory oversight for safe use. Further research is needed to understand the full potential of LLMs across various medical specialties and conditions. 
UR  - https://mededu.jmir.org/2024/1/e51282
UR  - http://dx.doi.org/10.2196/51282
ID  - info:doi/10.2196/51282
ER  - 

TY  - JOUR
AU  - Hassanipour, Soheil
AU  - Nayak, Sandeep
AU  - Bozorgi, Ali
AU  - Keivanlou, Mohammad-Hossein
AU  - Dave, Tirth
AU  - Alotaibi, Abdulhadi
AU  - Joukar, Farahnaz
AU  - Mellatdoust, Parinaz
AU  - Bakhshi, Arash
AU  - Kuriyakose, Dona
AU  - Polisetty, D. Lakshmi
AU  - Chimpiri, Mallika
AU  - Amini-Salehi, Ehsan
PY  - 2024/7/8
TI  - The Ability of ChatGPT in Paraphrasing Texts and Reducing Plagiarism: A Descriptive Analysis
JO  - JMIR Med Educ
SP  - e53308
VL  - 10
KW  - ChatGPT
KW  - paraphrasing
KW  - text generation
KW  - prompts
KW  - academic journals
KW  - plagiarize
KW  - plagiarism
KW  - paraphrase
KW  - wording
KW  - LLM
KW  - LLMs
KW  - language model
KW  - language models
KW  - prompt
KW  - generative
KW  - artificial intelligence
KW  - NLP
KW  - natural language processing
KW  - rephrase
KW  - plagiarizing
KW  - honesty
KW  - integrity
KW  - texts
KW  - text
KW  - textual
KW  - generation
KW  - large language model
KW  - large language models
N2  - Background: The introduction of ChatGPT by OpenAI has garnered significant attention. Among its capabilities, paraphrasing stands out. Objective: This study aims to investigate the satisfactory levels of plagiarism in the paraphrased text produced by this chatbot. Methods: Three texts of varying lengths were presented to ChatGPT. ChatGPT was then instructed to paraphrase the provided texts using five different prompts. In the subsequent stage of the study, the texts were divided into separate paragraphs, and ChatGPT was requested to paraphrase each paragraph individually. Lastly, in the third stage, ChatGPT was asked to paraphrase the texts it had previously generated. Results: The average plagiarism rate in the texts generated by ChatGPT was 45% (SD 10%). ChatGPT exhibited a substantial reduction in plagiarism for the provided texts (mean difference ?0.51, 95% CI ?0.54 to ?0.48; P<.001). Furthermore, when comparing the second attempt with the initial attempt, a significant decrease in the plagiarism rate was observed (mean difference ?0.06, 95% CI ?0.08 to ?0.03; P<.001). The number of paragraphs in the texts demonstrated a noteworthy association with the percentage of plagiarism, with texts consisting of a single paragraph exhibiting the lowest plagiarism rate (P<.001). Conclusion: Although ChatGPT demonstrates a notable reduction of plagiarism within texts, the existing levels of plagiarism remain relatively high. This underscores a crucial caution for researchers when incorporating this chatbot into their work. 
UR  - https://mededu.jmir.org/2024/1/e53308
UR  - http://dx.doi.org/10.2196/53308
ID  - info:doi/10.2196/53308
ER  - 

TY  - JOUR
AU  - Shikino, Kiyoshi
AU  - Shimizu, Taro
AU  - Otsuka, Yuki
AU  - Tago, Masaki
AU  - Takahashi, Hiromizu
AU  - Watari, Takashi
AU  - Sasaki, Yosuke
AU  - Iizuka, Gemmei
AU  - Tamura, Hiroki
AU  - Nakashima, Koichi
AU  - Kunitomo, Kotaro
AU  - Suzuki, Morika
AU  - Aoyama, Sayaka
AU  - Kosaka, Shintaro
AU  - Kawahigashi, Teiko
AU  - Matsumoto, Tomohiro
AU  - Orihara, Fumina
AU  - Morikawa, Toru
AU  - Nishizawa, Toshinori
AU  - Hoshina, Yoji
AU  - Yamamoto, Yu
AU  - Matsuo, Yuichiro
AU  - Unoki, Yuto
AU  - Kimura, Hirofumi
AU  - Tokushima, Midori
AU  - Watanuki, Satoshi
AU  - Saito, Takuma
AU  - Otsuka, Fumio
AU  - Tokuda, Yasuharu
PY  - 2024/6/21
TI  - Evaluation of ChatGPT-Generated Differential Diagnosis for Common Diseases With Atypical Presentation: Descriptive Research
JO  - JMIR Med Educ
SP  - e58758
VL  - 10
KW  - atypical presentation
KW  - ChatGPT
KW  - common disease
KW  - diagnostic accuracy
KW  - diagnosis
KW  - patient safety
N2  - Background: The persistence of diagnostic errors, despite advances in medical knowledge and diagnostics, highlights the importance of understanding atypical disease presentations and their contribution to mortality and morbidity. Artificial intelligence (AI), particularly generative pre-trained transformers like GPT-4, holds promise for improving diagnostic accuracy, but requires further exploration in handling atypical presentations. Objective: This study aimed to assess the diagnostic accuracy of ChatGPT in generating differential diagnoses for atypical presentations of common diseases, with a focus on the model?s reliance on patient history during the diagnostic process. Methods: We used 25 clinical vignettes from the Journal of Generalist Medicine characterizing atypical manifestations of common diseases. Two general medicine physicians categorized the cases based on atypicality. ChatGPT was then used to generate differential diagnoses based on the clinical information provided. The concordance between AI-generated and final diagnoses was measured, with a focus on the top-ranked disease (top 1) and the top 5 differential diagnoses (top 5). Results: ChatGPT?s diagnostic accuracy decreased with an increase in atypical presentation. For category 1 (C1) cases, the concordance rates were 17% (n=1) for the top 1 and 67% (n=4) for the top 5. Categories 3 (C3) and 4 (C4) showed a 0% concordance for top 1 and markedly lower rates for the top 5, indicating difficulties in handling highly atypical cases. The ?2 test revealed no significant difference in the top 1 differential diagnosis accuracy between less atypical (C1+C2) and more atypical (C3+C4) groups (?²1=2.07; n=25; P=.13). However, a significant difference was found in the top 5 analyses, with less atypical cases showing higher accuracy (?²1=4.01; n=25; P=.048). Conclusions: ChatGPT-4 demonstrates potential as an auxiliary tool for diagnosing typical and mildly atypical presentations of common diseases. However, its performance declines with greater atypicality. The study findings underscore the need for AI systems to encompass a broader range of linguistic capabilities, cultural understanding, and diverse clinical scenarios to improve diagnostic utility in real-world settings. 
UR  - https://mededu.jmir.org/2024/1/e58758
UR  - http://dx.doi.org/10.2196/58758
ID  - info:doi/10.2196/58758
ER  - 

TY  - JOUR
AU  - Zhang, Fang
AU  - Liu, Xiaoliu
AU  - Wu, Wenyan
AU  - Zhu, Shiben
PY  - 2024/6/13
TI  - Evolution of Chatbots in Nursing Education: Narrative Review
JO  - JMIR Med Educ
SP  - e54987
VL  - 10
KW  - nursing education
KW  - chatbots
KW  - artificial intelligence
KW  - narrative review
KW  - ChatGPT
N2  - Background: The integration of chatbots in nursing education is a rapidly evolving area with potential transformative impacts. This narrative review aims to synthesize and analyze the existing literature on chatbots in nursing education. Objective: This study aims to comprehensively examine the temporal trends, international distribution, study designs, and implications of chatbots in nursing education. Methods: A comprehensive search was conducted across 3 databases (PubMed, Web of Science, and Embase) following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram. Results: A total of 40 articles met the eligibility criteria, with a notable increase of publications in 2023 (n=28, 70%). Temporal analysis revealed a notable surge in publications from 2021 to 2023, emphasizing the growing scholarly interest. Geographically, Taiwan province made substantial contributions (n=8, 20%), followed by the United States (n=6, 15%) and South Korea (n=4, 10%). Study designs varied, with reviews (n=8, 20%) and editorials (n=7, 18%) being predominant, showcasing the richness of research in this domain. Conclusions: Integrating chatbots into nursing education presents a promising yet relatively unexplored avenue. This review highlights the urgent need for original research, emphasizing the importance of ethical considerations. 
UR  - https://mededu.jmir.org/2024/1/e54987
UR  - http://dx.doi.org/10.2196/54987
ID  - info:doi/10.2196/54987
ER  - 

TY  - JOUR
AU  - Srinivasan, Muthuvenkatachalam
AU  - Venugopal, Ambili
AU  - Venkatesan, Latha
AU  - Kumar, Rajesh
PY  - 2024/6/13
TI  - Navigating the Pedagogical Landscape: Exploring the Implications of AI and Chatbots in Nursing Education
JO  - JMIR Nursing
SP  - e52105
VL  - 7
KW  - AI
KW  - artificial intelligence
KW  - ChatGPT
KW  - chatbots
KW  - nursing education
KW  - education
KW  - chatbot
KW  - nursing
KW  - ethical
KW  - ethics
KW  - ethical consideration
KW  - accessible
KW  - learning
KW  - efficiency
KW  - student
KW  - student engagement
KW  - student learning
UR  - https://nursing.jmir.org/2024/1/e52105
UR  - http://dx.doi.org/10.2196/52105
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38870516
ID  - info:doi/10.2196/52105
ER  - 

TY  - JOUR
AU  - Moldt, Julia-Astrid
AU  - Festl-Wietek, Teresa
AU  - Fuhl, Wolfgang
AU  - Zabel, Susanne
AU  - Claassen, Manfred
AU  - Wagner, Samuel
AU  - Nieselt, Kay
AU  - Herrmann-Werner, Anne
PY  - 2024/6/12
TI  - Assessing AI Awareness and Identifying Essential Competencies: Insights From Key Stakeholders in Integrating AI Into Medical Education
JO  - JMIR Med Educ
SP  - e58355
VL  - 10
KW  - AI in medicine
KW  - artificial intelligence
KW  - medical education
KW  - medical students
KW  - qualitative approach
KW  - qualitative analysis
KW  - needs assessment
N2  - Background: The increasing importance of artificial intelligence (AI) in health care has generated a growing need for health care professionals to possess a comprehensive understanding of AI technologies, requiring an adaptation in medical education. Objective: This paper explores stakeholder perceptions and expectations regarding AI in medicine and examines their potential impact on the medical curriculum. This study project aims to assess the AI experiences and awareness of different stakeholders and identify essential AI-related topics in medical education to define necessary competencies for students. Methods: The empirical data were collected as part of the TüKITZMed project between August 2022 and March 2023, using a semistructured qualitative interview. These interviews were administered to a diverse group of stakeholders to explore their experiences and perspectives of AI in medicine. A qualitative content analysis of the collected data was conducted using MAXQDA software. Results: Semistructured interviews were conducted with 38 participants (6 lecturers, 9 clinicians, 10 students, 6 AI experts, and 7 institutional stakeholders). The qualitative content analysis revealed 6 primary categories with a total of 24 subcategories to answer the research questions. The evaluation of the stakeholders? statements revealed several commonalities and differences regarding their understanding of AI. Crucial identified AI themes based on the main categories were as follows: possible curriculum contents, skills, and competencies; programming skills; curriculum scope; and curriculum structure. Conclusions: The analysis emphasizes integrating AI into medical curricula to ensure students? proficiency in clinical applications. Standardized AI comprehension is crucial for defining and teaching relevant content. Considering diverse perspectives in implementation is essential to comprehensively define AI in the medical context, addressing gaps and facilitating effective solutions for future AI use in medical studies. The results provide insights into potential curriculum content and structure, including aspects of AI in medicine. 
UR  - https://mededu.jmir.org/2024/1/e58355
UR  - http://dx.doi.org/10.2196/58355
ID  - info:doi/10.2196/58355
ER  - 

TY  - JOUR
AU  - Arango-Ibanez, Pablo Juan
AU  - Posso-Nuñez, Alejandro Jose
AU  - Díaz-Solórzano, Pablo Juan
AU  - Cruz-Suárez, Gustavo
PY  - 2024/5/24
TI  - Evidence-Based Learning Strategies in Medicine Using AI
JO  - JMIR Med Educ
SP  - e54507
VL  - 10
KW  - artificial intelligence
KW  - large language models
KW  - ChatGPT
KW  - active recall
KW  - memory cues
KW  - LLMs
KW  - evidence-based
KW  - learning strategy
KW  - medicine
KW  - AI
KW  - medical education
KW  - knowledge
KW  - relevance
UR  - https://mededu.jmir.org/2024/1/e54507
UR  - http://dx.doi.org/10.2196/54507
ID  - info:doi/10.2196/54507
ER  - 

TY  - JOUR
AU  - Takagi, Soshi
AU  - Koda, Masahide
AU  - Watari, Takashi
PY  - 2024/5/23
TI  - The Performance of ChatGPT-4V in Interpreting Images and Tables in the Japanese Medical Licensing Exam
JO  - JMIR Med Educ
SP  - e54283
VL  - 10
KW  - ChatGPT
KW  - medical licensing examination
KW  - generative artificial intelligence
KW  - medical education
KW  - large language model
KW  - images
KW  - tables
KW  - artificial intelligence
KW  - AI
KW  - Japanese
KW  - reliability
KW  - medical application
KW  - medical applications
KW  - diagnostic
KW  - diagnostics
KW  - online data
KW  - web-based data
UR  - https://mededu.jmir.org/2024/1/e54283
UR  - http://dx.doi.org/10.2196/54283
ID  - info:doi/10.2196/54283
ER  - 

TY  - JOUR
AU  - Wang, Shangqiguo
AU  - Mo, Changgeng
AU  - Chen, Yuan
AU  - Dai, Xiaolu
AU  - Wang, Huiyi
AU  - Shen, Xiaoli
PY  - 2024/4/26
TI  - Exploring the Performance of ChatGPT-4 in the Taiwan Audiologist Qualification Examination: Preliminary Observational Study Highlighting the Potential of AI Chatbots in Hearing Care
JO  - JMIR Med Educ
SP  - e55595
VL  - 10
KW  - ChatGPT
KW  - medical education
KW  - artificial intelligence
KW  - AI
KW  - audiology
KW  - hearing care
KW  - natural language processing
KW  - large language model
KW  - Taiwan
KW  - hearing
KW  - hearing specialist
KW  - audiologist
KW  - examination
KW  - information accuracy
KW  - educational technology
KW  - healthcare services
KW  - chatbot
KW  - health care services
N2  - Background: Artificial intelligence (AI) chatbots, such as ChatGPT-4, have shown immense potential for application across various aspects of medicine, including medical education, clinical practice, and research. Objective: This study aimed to evaluate the performance of ChatGPT-4 in the 2023 Taiwan Audiologist Qualification Examination, thereby preliminarily exploring the potential utility of AI chatbots in the fields of audiology and hearing care services. Methods: ChatGPT-4 was tasked to provide answers and reasoning for the 2023 Taiwan Audiologist Qualification Examination. The examination encompassed six subjects: (1) basic auditory science, (2) behavioral audiology, (3) electrophysiological audiology, (4) principles and practice of hearing devices, (5) health and rehabilitation of the auditory and balance systems, and (6) auditory and speech communication disorders (including professional ethics). Each subject included 50 multiple-choice questions, with the exception of behavioral audiology, which had 49 questions, amounting to a total of 299 questions. Results: The correct answer rates across the 6 subjects were as follows: 88% for basic auditory science, 63% for behavioral audiology, 58% for electrophysiological audiology, 72% for principles and practice of hearing devices, 80% for health and rehabilitation of the auditory and balance systems, and 86% for auditory and speech communication disorders (including professional ethics). The overall accuracy rate for the 299 questions was 75%, which surpasses the examination?s passing criteria of an average 60% accuracy rate across all subjects. A comprehensive review of ChatGPT-4?s responses indicated that incorrect answers were predominantly due to information errors. Conclusions: ChatGPT-4 demonstrated a robust performance in the Taiwan Audiologist Qualification Examination, showcasing effective logical reasoning skills. Our results suggest that with enhanced information accuracy, ChatGPT-4?s performance could be further improved. This study indicates significant potential for the application of AI chatbots in audiology and hearing care services. 
UR  - https://mededu.jmir.org/2024/1/e55595
UR  - http://dx.doi.org/10.2196/55595
ID  - info:doi/10.2196/55595
ER  - 

TY  - JOUR
AU  - Choudhury, Avishek
AU  - Chaudhry, Zaira
PY  - 2024/4/25
TI  - Large Language Models and User Trust: Consequence of Self-Referential Learning Loop and the Deskilling of Health Care Professionals
JO  - J Med Internet Res
SP  - e56764
VL  - 26
KW  - trust
KW  - ChatGPT
KW  - human factors
KW  - healthcare
KW  - LLMs
KW  - large language models
KW  - LLM user trust
KW  - AI accountability
KW  - artificial intelligence
KW  - AI technology
KW  - technologies
KW  - effectiveness
KW  - policy
KW  - medical student
KW  - medical students
KW  - risk factor
KW  - quality of care
KW  - healthcare professional
KW  - healthcare professionals
KW  - human element
UR  - https://www.jmir.org/2024/1/e56764
UR  - http://dx.doi.org/10.2196/56764
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38662419
ID  - info:doi/10.2196/56764
ER  - 

TY  - JOUR
AU  - Wu, Yijun
AU  - Zheng, Yue
AU  - Feng, Baijie
AU  - Yang, Yuqi
AU  - Kang, Kai
AU  - Zhao, Ailin
PY  - 2024/4/10
TI  - Embracing ChatGPT for Medical Education: Exploring Its Impact on Doctors and Medical Students
JO  - JMIR Med Educ
SP  - e52483
VL  - 10
KW  - artificial intelligence
KW  - AI
KW  - ChatGPT
KW  - medical education
KW  - doctors
KW  - medical students
UR  - https://mededu.jmir.org/2024/1/e52483
UR  - http://dx.doi.org/10.2196/52483
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38598263
ID  - info:doi/10.2196/52483
ER  - 

TY  - JOUR
AU  - Fukuzawa, Fumitoshi
AU  - Yanagita, Yasutaka
AU  - Yokokawa, Daiki
AU  - Uchida, Shun
AU  - Yamashita, Shiho
AU  - Li, Yu
AU  - Shikino, Kiyoshi
AU  - Tsukamoto, Tomoko
AU  - Noda, Kazutaka
AU  - Uehara, Takanori
AU  - Ikusaka, Masatomi
PY  - 2024/4/8
TI  - Importance of Patient History in Artificial Intelligence?Assisted Medical Diagnosis: Comparison Study
JO  - JMIR Med Educ
SP  - e52674
VL  - 10
KW  - medical diagnosis
KW  - ChatGPT
KW  - AI in medicine
KW  - diagnostic accuracy
KW  - patient history
KW  - medical history
KW  - artificial intelligence
KW  - AI
KW  - physical examination
KW  - physical examinations
KW  - laboratory investigation
KW  - laboratory investigations
KW  - mHealth
KW  - accuracy
KW  - public health
KW  - United States
KW  - AI diagnosis
KW  - treatment
KW  - male
KW  - female
KW  - child
KW  - children
KW  - youth
KW  - adolescent
KW  - adolescents
KW  - teen
KW  - teens
KW  - teenager
KW  - teenagers
KW  - older adult
KW  - older adults
KW  - elder
KW  - elderly
KW  - older person
KW  - older people
KW  - investigative
KW  - mobile health
KW  - digital health
N2  - Background: Medical history contributes approximately 80% to a diagnosis, although physical examinations and laboratory investigations increase a physician?s confidence in the medical diagnosis. The concept of artificial intelligence (AI) was first proposed more than 70 years ago. Recently, its role in various fields of medicine has grown remarkably. However, no studies have evaluated the importance of patient history in AI-assisted medical diagnosis. Objective: This study explored the contribution of patient history to AI-assisted medical diagnoses and assessed the accuracy of ChatGPT in reaching a clinical diagnosis based on the medical history provided. Methods: Using clinical vignettes of 30 cases identified in The BMJ, we evaluated the accuracy of diagnoses generated by ChatGPT. We compared the diagnoses made by ChatGPT based solely on medical history with the correct diagnoses. We also compared the diagnoses made by ChatGPT after incorporating additional physical examination findings and laboratory data alongside history with the correct diagnoses. Results: ChatGPT accurately diagnosed 76.6% (23/30) of the cases with only the medical history, consistent with previous research targeting physicians. We also found that this rate was 93.3% (28/30) when additional information was included. Conclusions: Although adding additional information improves diagnostic accuracy, patient history remains a significant factor in AI-assisted medical diagnosis. Thus, when using AI in medical diagnosis, it is crucial to include pertinent and correct patient histories for an accurate diagnosis. Our findings emphasize the continued significance of patient history in clinical diagnoses in this age and highlight the need for its integration into AI-assisted medical diagnosis systems. 
UR  - https://mededu.jmir.org/2024/1/e52674
UR  - http://dx.doi.org/10.2196/52674
ID  - info:doi/10.2196/52674
ER  - 

TY  - JOUR
AU  - Noda, Masao
AU  - Ueno, Takayoshi
AU  - Koshu, Ryota
AU  - Takaso, Yuji
AU  - Shimada, Dias Mari
AU  - Saito, Chizu
AU  - Sugimoto, Hisashi
AU  - Fushiki, Hiroaki
AU  - Ito, Makoto
AU  - Nomura, Akihiro
AU  - Yoshizaki, Tomokazu
PY  - 2024/3/28
TI  - Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study
JO  - JMIR Med Educ
SP  - e57054
VL  - 10
KW  - artificial intelligence
KW  - GPT-4v
KW  - large language model
KW  - otolaryngology
KW  - GPT
KW  - ChatGPT
KW  - LLM
KW  - LLMs
KW  - language model
KW  - language models
KW  - head
KW  - respiratory
KW  - ENT: ear
KW  - nose
KW  - throat
KW  - neck
KW  - NLP
KW  - natural language processing
KW  - image
KW  - images
KW  - exam
KW  - exams
KW  - examination
KW  - examinations
KW  - answer
KW  - answers
KW  - answering
KW  - response
KW  - responses
N2  - Background: Artificial intelligence models can learn from medical literature and clinical cases and generate answers that rival human experts. However, challenges remain in the analysis of complex data containing images and diagrams. Objective: This study aims to assess the answering capabilities and accuracy of ChatGPT-4 Vision (GPT-4V) for a set of 100 questions, including image-based questions, from the 2023 otolaryngology board certification examination. Methods: Answers to 100 questions from the 2023 otolaryngology board certification examination, including image-based questions, were generated using GPT-4V. The accuracy rate was evaluated using different prompts, and the presence of images, clinical area of the questions, and variations in the answer content were examined. Results: The accuracy rate for text-only input was, on average, 24.7% but improved to 47.3% with the addition of English translation and prompts (P<.001). The average nonresponse rate for text-only input was 46.3%; this decreased to 2.7% with the addition of English translation and prompts (P<.001). The accuracy rate was lower for image-based questions than for text-only questions across all types of input, with a relatively high nonresponse rate. General questions and questions from the fields of head and neck allergies and nasal allergies had relatively high accuracy rates, which increased with the addition of translation and prompts. In terms of content, questions related to anatomy had the highest accuracy rate. For all content types, the addition of translation and prompts increased the accuracy rate. As for the performance based on image-based questions, the average of correct answer rate with text-only input was 30.4%, and that with text-plus-image input was 41.3% (P=.02). Conclusions: Examination of artificial intelligence?s answering capabilities for the otolaryngology board certification examination improves our understanding of its potential and limitations in this field. Although the improvement was noted with the addition of translation and prompts, the accuracy rate for image-based questions was lower than that for text-based questions, suggesting room for improvement in GPT-4V at this stage. Furthermore, text-plus-image input answers a higher rate in image-based questions. Our findings imply the usefulness and potential of GPT-4V in medicine; however, future consideration of safe use methods is needed. 
UR  - https://mededu.jmir.org/2024/1/e57054
UR  - http://dx.doi.org/10.2196/57054
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38546736
ID  - info:doi/10.2196/57054
ER  - 

TY  - JOUR
AU  - Gandhi, P. Aravind
AU  - Joesph, Karen Felista
AU  - Rajagopal, Vineeth
AU  - Aparnavi, P.
AU  - Katkuri, Sushma
AU  - Dayama, Sonal
AU  - Satapathy, Prakasini
AU  - Khatib, Nazli Mahalaqua
AU  - Gaidhane, Shilpa
AU  - Zahiruddin, Syed Quazi
AU  - Behera, Ashish
PY  - 2024/3/25
TI  - Performance of ChatGPT on the India Undergraduate Community Medicine Examination: Cross-Sectional Study
JO  - JMIR Form Res
SP  - e49964
VL  - 8
KW  - artificial intelligence
KW  - ChatGPT
KW  - community medicine
KW  - India
KW  - large language model
KW  - medical education
KW  - digitalization
N2  - Background: Medical students may increasingly use large language models (LLMs) in their learning. ChatGPT is an LLM at the forefront of this new development in medical education with the capacity to respond to multidisciplinary questions. Objective: The aim of this study was to evaluate the ability of ChatGPT 3.5 to complete the Indian undergraduate medical examination in the subject of community medicine. We further compared ChatGPT scores with the scores obtained by the students. Methods: The study was conducted at a publicly funded medical college in Hyderabad, India. The study was based on the internal assessment examination conducted in January 2023 for students in the Bachelor of Medicine and Bachelor of Surgery Final Year?Part I program; the examination of focus included 40 questions (divided between two papers) from the community medicine subject syllabus. Each paper had three sections with different weightage of marks for each section: section one had two long essay?type questions worth 15 marks each, section two had 8 short essay?type questions worth 5 marks each, and section three had 10 short-answer questions worth 3 marks each. The same questions were administered as prompts to ChatGPT 3.5 and the responses were recorded. Apart from scoring ChatGPT responses, two independent evaluators explored the responses to each question to further analyze their quality with regard to three subdomains: relevancy, coherence, and completeness. Each question was scored in these subdomains on a Likert scale of 1-5. The average of the two evaluators was taken as the subdomain score of the question. The proportion of questions with a score 50% of the maximum score (5) in each subdomain was calculated. Results: ChatGPT 3.5 scored 72.3% on paper 1 and 61% on paper 2. The mean score of the 94 students was 43% on paper 1 and 45% on paper 2. The responses of ChatGPT 3.5 were also rated to be satisfactorily relevant, coherent, and complete for most of the questions (>80%). Conclusions: ChatGPT 3.5 appears to have substantial and sufficient knowledge to understand and answer the Indian medical undergraduate examination in the subject of community medicine. ChatGPT may be introduced to students to enable the self-directed learning of community medicine in pilot mode. However, faculty oversight will be required as ChatGPT is still in the initial stages of development, and thus its potential and reliability of medical content from the Indian context need to be further explored comprehensively. 
UR  - https://formative.jmir.org/2024/1/e49964
UR  - http://dx.doi.org/10.2196/49964
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38526538
ID  - info:doi/10.2196/49964
ER  - 

TY  - JOUR
AU  - Magalhães Araujo, Sabrina
AU  - Cruz-Correia, Ricardo
PY  - 2024/3/20
TI  - Incorporating ChatGPT in Medical Informatics Education: Mixed Methods Study on Student Perceptions and Experiential Integration Proposals
JO  - JMIR Med Educ
SP  - e51151
VL  - 10
KW  - education
KW  - medical informatics
KW  - artificial intelligence
KW  - AI
KW  - generative language model
KW  - ChatGPT
N2  - Background: The integration of artificial intelligence (AI) technologies, such as ChatGPT, in the educational landscape has the potential to enhance the learning experience of medical informatics students and prepare them for using AI in professional settings. The incorporation of AI in classes aims to develop critical thinking by encouraging students to interact with ChatGPT and critically analyze the responses generated by the chatbot. This approach also helps students develop important skills in the field of biomedical and health informatics to enhance their interaction with AI tools. Objective: The aim of the study is to explore the perceptions of students regarding the use of ChatGPT as a learning tool in their educational context and provide professors with examples of prompts for incorporating ChatGPT into their teaching and learning activities, thereby enhancing the educational experience for students in medical informatics courses. Methods: This study used a mixed methods approach to gain insights from students regarding the use of ChatGPT in education. To accomplish this, a structured questionnaire was applied to evaluate students? familiarity with ChatGPT, gauge their perceptions of its use, and understand their attitudes toward its use in academic and learning tasks. Learning outcomes of 2 courses were analyzed to propose ChatGPT?s incorporation in master?s programs in medicine and medical informatics. Results: The majority of students expressed satisfaction with the use of ChatGPT in education, finding it beneficial for various purposes, including generating academic content, brainstorming ideas, and rewriting text. While some participants raised concerns about potential biases and the need for informed use, the overall perception was positive. Additionally, the study proposed integrating ChatGPT into 2 specific courses in the master?s programs in medicine and medical informatics. The incorporation of ChatGPT was envisioned to enhance student learning experiences and assist in project planning, programming code generation, examination preparation, workflow exploration, and technical interview preparation, thus advancing medical informatics education. In medical teaching, it will be used as an assistant for simplifying the explanation of concepts and solving complex problems, as well as for generating clinical narratives and patient simulators. Conclusions: The study?s valuable insights into medical faculty students? perspectives and integration proposals for ChatGPT serve as an informative guide for professors aiming to enhance medical informatics education. The research delves into the potential of ChatGPT, emphasizes the necessity of collaboration in academic environments, identifies subject areas with discernible benefits, and underscores its transformative role in fostering innovative and engaging learning experiences. The envisaged proposals hold promise in empowering future health care professionals to work in the rapidly evolving era of digital health care. 
UR  - https://mededu.jmir.org/2024/1/e51151
UR  - http://dx.doi.org/10.2196/51151
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38506920
ID  - info:doi/10.2196/51151
ER  - 

TY  - JOUR
AU  - Nakao, Takahiro
AU  - Miki, Soichiro
AU  - Nakamura, Yuta
AU  - Kikuchi, Tomohiro
AU  - Nomura, Yukihiro
AU  - Hanaoka, Shouhei
AU  - Yoshikawa, Takeharu
AU  - Abe, Osamu
PY  - 2024/3/12
TI  - Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study
JO  - JMIR Med Educ
SP  - e54393
VL  - 10
KW  - AI
KW  - artificial intelligence
KW  - LLM
KW  - large language model
KW  - language model
KW  - language models
KW  - ChatGPT
KW  - GPT-4
KW  - GPT-4V
KW  - generative pretrained transformer
KW  - image
KW  - images
KW  - imaging
KW  - response
KW  - responses
KW  - exam
KW  - examination
KW  - exams
KW  - examinations
KW  - answer
KW  - answers
KW  - NLP
KW  - natural language processing
KW  - chatbot
KW  - chatbots
KW  - conversational agent
KW  - conversational agents
KW  - medical education
N2  - Background: Previous research applying large language models (LLMs) to medicine was focused on text-based information. Recently, multimodal variants of LLMs acquired the capability of recognizing images. Objective: We aim to evaluate the image recognition capability of generative pretrained transformer (GPT)-4V, a recent multimodal LLM developed by OpenAI, in the medical field by testing how visual information affects its performance to answer questions in the 117th Japanese National Medical Licensing Examination. Methods: We focused on 108 questions that had 1 or more images as part of a question and presented GPT-4V with the same questions under two conditions: (1) with both the question text and associated images and (2) with the question text only. We then compared the difference in accuracy between the 2 conditions using the exact McNemar test. Results: Among the 108 questions with images, GPT-4V?s accuracy was 68% (73/108) when presented with images and 72% (78/108) when presented without images (P=.36). For the 2 question categories, clinical and general, the accuracies with and those without images were 71% (70/98) versus 78% (76/98; P=.21) and 30% (3/10) versus 20% (2/10; P?.99), respectively. Conclusions: The additional information from the images did not significantly improve the performance of GPT-4V in the Japanese National Medical Licensing Examination. 
UR  - https://mededu.jmir.org/2024/1/e54393
UR  - http://dx.doi.org/10.2196/54393
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38470459
ID  - info:doi/10.2196/54393
ER  - 

TY  - JOUR
AU  - Willms, Amanda
AU  - Liu, Sam
PY  - 2024/2/29
TI  - Exploring the Feasibility of Using ChatGPT to Create Just-in-Time Adaptive Physical Activity mHealth Intervention Content: Case Study
JO  - JMIR Med Educ
SP  - e51426
VL  - 10
KW  - ChatGPT
KW  - digital health
KW  - mobile health
KW  - mHealth
KW  - physical activity
KW  - application
KW  - mobile app
KW  - mobile apps
KW  - content creation
KW  - behavior change
KW  - app design
N2  - Background: Achieving physical activity (PA) guidelines? recommendation of 150 minutes of moderate-to-vigorous PA per week has been shown to reduce the risk of many chronic conditions. Despite the overwhelming evidence in this field, PA levels remain low globally. By creating engaging mobile health (mHealth) interventions through strategies such as just-in-time adaptive interventions (JITAIs) that are tailored to an individual?s dynamic state, there is potential to increase PA levels. However, generating personalized content can take a long time due to various versions of content required for the personalization algorithms. ChatGPT presents an incredible opportunity to rapidly produce tailored content; however, there is a lack of studies exploring its feasibility. Objective: This study aimed to (1) explore the feasibility of using ChatGPT to create content for a PA JITAI mobile app and (2) describe lessons learned and future recommendations for using ChatGPT in the development of mHealth JITAI content. Methods: During phase 1, we used Pathverse, a no-code app builder, and ChatGPT to develop a JITAI app to help parents support their child?s PA levels. The intervention was developed based on the Multi-Process Action Control (M-PAC) framework, and the necessary behavior change techniques targeting the M-PAC constructs were implemented in the app design to help parents support their child?s PA. The acceptability of using ChatGPT for this purpose was discussed to determine its feasibility. In phase 2, we summarized the lessons we learned during the JITAI content development process using ChatGPT and generated recommendations to inform future similar use cases. Results: In phase 1, by using specific prompts, we efficiently generated content for 13 lessons relating to increasing parental support for their child?s PA following the M-PAC framework. It was determined that using ChatGPT for this case study to develop PA content for a JITAI was acceptable. In phase 2, we summarized our recommendations into the following six steps when using ChatGPT to create content for mHealth behavior interventions: (1) determine target behavior, (2) ground the intervention in behavior change theory, (3) design the intervention structure, (4) input intervention structure and behavior change constructs into ChatGPT, (5) revise the ChatGPT response, and (6) customize the response to be used in the intervention. Conclusions: ChatGPT offers a remarkable opportunity for rapid content creation in the context of an mHealth JITAI. Although our case study demonstrated that ChatGPT was acceptable, it is essential to approach its use, along with other language models, with caution. Before delivering content to population groups, expert review is crucial to ensure accuracy and relevancy. Future research and application of these guidelines are imperative as we deepen our understanding of ChatGPT and its interactions with human input. 
UR  - https://mededu.jmir.org/2024/1/e51426
UR  - http://dx.doi.org/10.2196/51426
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38421689
ID  - info:doi/10.2196/51426
ER  - 

TY  - JOUR
AU  - Farhat, Faiza
AU  - Chaudhry, Moalla Beenish
AU  - Nadeem, Mohammad
AU  - Sohail, Saquib Shahab
AU  - Madsen, Øivind Dag
PY  - 2024/2/21
TI  - Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard
JO  - JMIR Med Educ
SP  - e51523
VL  - 10
KW  - accuracy
KW  - AI model
KW  - artificial intelligence
KW  - Bard
KW  - ChatGPT
KW  - educational task
KW  - GPT-4
KW  - Generative Pre-trained Transformers
KW  - large language models
KW  - medical education, medical exam
KW  - natural language processing
KW  - performance
KW  - premedical exams
KW  - suitability
N2  - Background: Large language models (LLMs) have revolutionized natural language processing with their ability to generate human-like text through extensive training on large data sets. These models, including Generative Pre-trained Transformers (GPT)-3.5 (OpenAI), GPT-4 (OpenAI), and Bard (Google LLC), find applications beyond natural language processing, attracting interest from academia and industry. Students are actively leveraging LLMs to enhance learning experiences and prepare for high-stakes exams, such as the National Eligibility cum Entrance Test (NEET) in India. Objective: This comparative analysis aims to evaluate the performance of GPT-3.5, GPT-4, and Bard in answering NEET-2023 questions. Methods: In this paper, we evaluated the performance of the 3 mainstream LLMs, namely GPT-3.5, GPT-4, and Google Bard, in answering questions related to the NEET-2023 exam. The questions of the NEET were provided to these artificial intelligence models, and the responses were recorded and compared against the correct answers from the official answer key. Consensus was used to evaluate the performance of all 3 models. Results: It was evident that GPT-4 passed the entrance test with flying colors (300/700, 42.9%), showcasing exceptional performance. On the other hand, GPT-3.5 managed to meet the qualifying criteria, but with a substantially lower score (145/700, 20.7%). However, Bard (115/700, 16.4%) failed to meet the qualifying criteria and did not pass the test. GPT-4 demonstrated consistent superiority over Bard and GPT-3.5 in all 3 subjects. Specifically, GPT-4 achieved accuracy rates of 73% (29/40) in physics, 44% (16/36) in chemistry, and 51% (50/99) in biology. Conversely, GPT-3.5 attained an accuracy rate of 45% (18/40) in physics, 33% (13/26) in chemistry, and 34% (34/99) in biology. The accuracy consensus metric showed that the matching responses between GPT-4 and Bard, as well as GPT-4 and GPT-3.5, had higher incidences of being correct, at 0.56 and 0.57, respectively, compared to the matching responses between Bard and GPT-3.5, which stood at 0.42. When all 3 models were considered together, their matching responses reached the highest accuracy consensus of 0.59. Conclusions: The study?s findings provide valuable insights into the performance of GPT-3.5, GPT-4, and Bard in answering NEET-2023 questions. GPT-4 emerged as the most accurate model, highlighting its potential for educational applications. Cross-checking responses across models may result in confusion as the compared models (as duos or a trio) tend to agree on only a little over half of the correct responses. Using GPT-4 as one of the compared models will result in higher accuracy consensus. The results underscore the suitability of LLMs for high-stakes exams and their positive impact on education. Additionally, the study establishes a benchmark for evaluating and enhancing LLMs? performance in educational tasks, promoting responsible and informed use of these models in diverse learning environments. 
UR  - https://mededu.jmir.org/2024/1/e51523
UR  - http://dx.doi.org/10.2196/51523
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38381486
ID  - info:doi/10.2196/51523
ER  - 

TY  - JOUR
AU  - Abid, Areeba
AU  - Murugan, Avinash
AU  - Banerjee, Imon
AU  - Purkayastha, Saptarshi
AU  - Trivedi, Hari
AU  - Gichoya, Judy
PY  - 2024/2/20
TI  - AI Education for Fourth-Year Medical Students: Two-Year Experience of a Web-Based, Self-Guided Curriculum and Mixed Methods Study
JO  - JMIR Med Educ
SP  - e46500
VL  - 10
KW  - medical education
KW  - machine learning
KW  - artificial intelligence
KW  - elective curriculum
KW  - medical student
KW  - student
KW  - students
KW  - elective
KW  - electives
KW  - curricula
KW  - curriculum
KW  - lesson plan
KW  - lesson plans
KW  - educators
KW  - educator
KW  - teacher
KW  - teachers
KW  - teaching
KW  - computer programming
KW  - programming
KW  - coding
KW  - programmer
KW  - programmers
KW  - self guided
KW  - self directed
N2  - Background: Artificial intelligence (AI) and machine learning (ML) are poised to have a substantial impact in the health care space. While a plethora of web-based resources exist to teach programming skills and ML model development, there are few introductory curricula specifically tailored to medical students without a background in data science or programming. Programs that do exist are often restricted to a specific specialty. Objective: We hypothesized that a 1-month elective for fourth-year medical students, composed of high-quality existing web-based resources and a project-based structure, would empower students to learn about the impact of AI and ML in their chosen specialty and begin contributing to innovation in their field of interest. This study aims to evaluate the success of this elective in improving self-reported confidence scores in AI and ML. The authors also share our curriculum with other educators who may be interested in its adoption. Methods: This elective was offered in 2 tracks: technical (for students who were already competent programmers) and nontechnical (with no technical prerequisites, focusing on building a conceptual understanding of AI and ML). Students established a conceptual foundation of knowledge using curated web-based resources and relevant research papers, and were then tasked with completing 3 projects in their chosen specialty: a data set analysis, a literature review, and an AI project proposal. The project-based nature of the elective was designed to be self-guided and flexible to each student?s interest area and career goals. Students? success was measured by self-reported confidence in AI and ML skills in pre and postsurveys. Qualitative feedback on students? experiences was also collected. Results: This web-based, self-directed elective was offered on a pass-or-fail basis each month to fourth-year students at Emory University School of Medicine beginning in May 2021. As of June 2022, a total of 19 students had successfully completed the elective, representing a wide range of chosen specialties: diagnostic radiology (n=3), general surgery (n=1), internal medicine (n=5), neurology (n=2), obstetrics and gynecology (n=1), ophthalmology (n=1), orthopedic surgery (n=1), otolaryngology (n=2), pathology (n=2), and pediatrics (n=1). Students? self-reported confidence scores for AI and ML rose by 66% after this 1-month elective. In qualitative surveys, students overwhelmingly reported enthusiasm and satisfaction with the course and commented that the self-direction and flexibility and the project-based design of the course were essential. Conclusions: Course participants were successful in diving deep into applications of AI in their widely-ranging specialties, produced substantial project deliverables, and generally reported satisfaction with their elective experience. The authors are hopeful that a brief, 1-month investment in AI and ML education during medical school will empower this next generation of physicians to pave the way for AI and ML innovation in health care. 
UR  - https://mededu.jmir.org/2024/1/e46500
UR  - http://dx.doi.org/10.2196/46500
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38376896
ID  - info:doi/10.2196/46500
ER  - 

TY  - JOUR
AU  - Weidener, Lukas
AU  - Fischer, Michael
PY  - 2024/2/9
TI  - Proposing a Principle-Based Approach for Teaching AI Ethics in Medical Education
JO  - JMIR Med Educ
SP  - e55368
VL  - 10
KW  - artificial intelligence
KW  - AI
KW  - ethics
KW  - artificial intelligence ethics
KW  - AI ethics
KW  - medical education
KW  - medicine
KW  - medical artificial intelligence ethics
KW  - medical AI ethics
KW  - medical ethics
KW  - public health ethics
UR  - https://mededu.jmir.org/2024/1/e55368
UR  - http://dx.doi.org/10.2196/55368
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38285931
ID  - info:doi/10.2196/55368
ER  - 

TY  - JOUR
AU  - Gray, Megan
AU  - Baird, Austin
AU  - Sawyer, Taylor
AU  - James, Jasmine
AU  - DeBroux, Thea
AU  - Bartlett, Michelle
AU  - Krick, Jeanne
AU  - Umoren, Rachel
PY  - 2024/2/1
TI  - Increasing Realism and Variety of Virtual Patient Dialogues for Prenatal Counseling Education Through a Novel Application of ChatGPT: Exploratory Observational Study
JO  - JMIR Med Educ
SP  - e50705
VL  - 10
KW  - prenatal counseling
KW  - virtual health
KW  - virtual patient
KW  - simulation
KW  - neonatology
KW  - ChatGPT
KW  - AI
KW  - artificial intelligence
N2  - Background: Using virtual patients, facilitated by natural language processing, provides a valuable educational experience for learners. Generating a large, varied sample of realistic and appropriate responses for virtual patients is challenging. Artificial intelligence (AI) programs can be a viable source for these responses, but their utility for this purpose has not been explored. Objective: In this study, we explored the effectiveness of generative AI (ChatGPT) in developing realistic virtual standardized patient dialogues to teach prenatal counseling skills. Methods: ChatGPT was prompted to generate a list of common areas of concern and questions that families expecting preterm delivery at 24 weeks gestation might ask during prenatal counseling. ChatGPT was then prompted to generate 2 role-plays with dialogues between a parent expecting a potential preterm delivery at 24 weeks and their counseling physician using each of the example questions. The prompt was repeated for 2 unique role-plays: one parent was characterized as anxious and the other as having low trust in the medical system. Role-play scripts were exported verbatim and independently reviewed by 2 neonatologists with experience in prenatal counseling, using a scale of 1-5 on realism, appropriateness, and utility for virtual standardized patient responses. Results: ChatGPT generated 7 areas of concern, with 35 example questions used to generate role-plays. The 35 role-play transcripts generated 176 unique parent responses (median 5, IQR 4-6, per role-play) with 268 unique sentences. Expert review identified 117 (65%) of the 176 responses as indicating an emotion, either directly or indirectly. Approximately half (98/176, 56%) of the responses had 2 or more sentences, and half (88/176, 50%) included at least 1 question. More than half (104/176, 58%) of the responses from role-played parent characters described a feeling, such as being scared, worried, or concerned. The role-plays of parents with low trust in the medical system generated many unique sentences (n=50). Most of the sentences in the responses were found to be reasonably realistic (214/268, 80%), appropriate for variable prenatal counseling conversation paths (233/268, 87%), and usable without more than a minimal modification in a virtual patient program (169/268, 63%). Conclusions: Generative AI programs, such as ChatGPT, may provide a viable source of training materials to expand virtual patient programs, with careful attention to the concerns and questions of patients and families. Given the potential for unrealistic or inappropriate statements and questions, an expert should review AI chat outputs before deploying them in an educational program. 
UR  - https://mededu.jmir.org/2024/1/e50705
UR  - http://dx.doi.org/10.2196/50705
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38300696
ID  - info:doi/10.2196/50705
ER  - 

TY  - JOUR
AU  - Haddad, Firas
AU  - Saade, S. Joanna
PY  - 2024/1/18
TI  - Performance of ChatGPT on Ophthalmology-Related Questions Across Various Examination Levels: Observational Study
JO  - JMIR Med Educ
SP  - e50842
VL  - 10
KW  - ChatGPT
KW  - artificial intelligence
KW  - AI
KW  - board examinations
KW  - ophthalmology
KW  - testing
N2  - Background: ChatGPT and language learning models have gained attention recently for their ability to answer questions on various examinations across various disciplines. The question of whether ChatGPT could be used to aid in medical education is yet to be answered, particularly in the field of ophthalmology. Objective: The aim of this study is to assess the ability of ChatGPT-3.5 (GPT-3.5) and ChatGPT-4.0 (GPT-4.0) to answer ophthalmology-related questions across different levels of ophthalmology training. Methods: Questions from the United States Medical Licensing Examination (USMLE) steps 1 (n=44), 2 (n=60), and 3 (n=28) were extracted from AMBOSS, and 248 questions (64 easy, 122 medium, and 62 difficult questions) were extracted from the book, Ophthalmology Board Review Q&A, for the Ophthalmic Knowledge Assessment Program and the Board of Ophthalmology (OB) Written Qualifying Examination (WQE). Questions were prompted identically and inputted to GPT-3.5 and GPT-4.0. Results: GPT-3.5 achieved a total of 55% (n=210) of correct answers, while GPT-4.0 achieved a total of 70% (n=270) of correct answers. GPT-3.5 answered 75% (n=33) of questions correctly in USMLE step 1, 73.33% (n=44) in USMLE step 2, 60.71% (n=17) in USMLE step 3, and 46.77% (n=116) in the OB-WQE. GPT-4.0 answered 70.45% (n=31) of questions correctly in USMLE step 1, 90.32% (n=56) in USMLE step 2, 96.43% (n=27) in USMLE step 3, and 62.90% (n=156) in the OB-WQE. GPT-3.5 performed poorer as examination levels advanced (P<.001), while GPT-4.0 performed better on USMLE steps 2 and 3 and worse on USMLE step 1 and the OB-WQE (P<.001). The coefficient of correlation (r) between ChatGPT answering correctly and human users answering correctly was 0.21 (P=.01) for GPT-3.5 as compared to ?0.31 (P<.001) for GPT-4.0. GPT-3.5 performed similarly across difficulty levels, while GPT-4.0 performed more poorly with an increase in the difficulty level. Both GPT models performed significantly better on certain topics than on others. Conclusions: ChatGPT is far from being considered a part of mainstream medical education. Future models with higher accuracy are needed for the platform to be effective in medical education. 
UR  - https://mededu.jmir.org/2024/1/e50842
UR  - http://dx.doi.org/10.2196/50842
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38236632
ID  - info:doi/10.2196/50842
ER  - 

TY  - JOUR
AU  - Kuo, I-Hsien Nicholas
AU  - Perez-Concha, Oscar
AU  - Hanly, Mark
AU  - Mnatzaganian, Emmanuel
AU  - Hao, Brandon
AU  - Di Sipio, Marcus
AU  - Yu, Guolin
AU  - Vanjara, Jash
AU  - Valerie, Cerelia Ivy
AU  - de Oliveira Costa, Juliana
AU  - Churches, Timothy
AU  - Lujic, Sanja
AU  - Hegarty, Jo
AU  - Jorm, Louisa
AU  - Barbieri, Sebastiano
PY  - 2024/1/16
TI  - Enriching Data Science and Health Care Education: Application and Impact of Synthetic Data Sets Through the Health Gym Project
JO  - JMIR Med Educ
SP  - e51388
VL  - 10
KW  - medical education
KW  - generative model
KW  - generative adversarial networks
KW  - privacy
KW  - antiretroviral therapy (ART)
KW  - human immunodeficiency virus (HIV)
KW  - data science
KW  - educational purposes
KW  - accessibility
KW  - data privacy
KW  - data sets
KW  - sepsis
KW  - hypotension
KW  - HIV
KW  - science education
KW  - health care AI
UR  - https://mededu.jmir.org/2024/1/e51388
UR  - http://dx.doi.org/10.2196/51388
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38227356
ID  - info:doi/10.2196/51388
ER  - 

TY  - JOUR
AU  - Knoedler, Leonard
AU  - Alfertshofer, Michael
AU  - Knoedler, Samuel
AU  - Hoch, C. Cosima
AU  - Funk, F. Paul
AU  - Cotofana, Sebastian
AU  - Maheta, Bhagvat
AU  - Frank, Konstantin
AU  - Brébant, Vanessa
AU  - Prantl, Lukas
AU  - Lamby, Philipp
PY  - 2024/1/5
TI  - Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis
JO  - JMIR Med Educ
SP  - e51148
VL  - 10
KW  - ChatGPT
KW  - United States Medical Licensing Examination
KW  - artificial intelligence
KW  - USMLE
KW  - USMLE Step 1
KW  - OpenAI
KW  - medical education
KW  - clinical decision-making
N2  - Background: The United States Medical Licensing Examination (USMLE) has been critical in medical education since 1992, testing various aspects of a medical student?s knowledge and skills through different steps, based on their training level. Artificial intelligence (AI) tools, including chatbots like ChatGPT, are emerging technologies with potential applications in medicine. However, comprehensive studies analyzing ChatGPT?s performance on USMLE Step 3 in large-scale scenarios and comparing different versions of ChatGPT are limited. Objective: This paper aimed to analyze ChatGPT?s performance on USMLE Step 3 practice test questions to better elucidate the strengths and weaknesses of AI use in medical education and deduce evidence-based strategies to counteract AI cheating. Methods: A total of 2069 USMLE Step 3 practice questions were extracted from the AMBOSS study platform. After including 229 image-based questions, a total of 1840 text-based questions were further categorized and entered into ChatGPT 3.5, while a subset of 229 questions were entered into ChatGPT 4. Responses were recorded, and the accuracy of ChatGPT answers as well as its performance in different test question categories and for different difficulty levels were compared between both versions. Results: Overall, ChatGPT 4 demonstrated a statistically significant superior performance compared to ChatGPT 3.5, achieving an accuracy of 84.7% (194/229) and 56.9% (1047/1840), respectively. A noteworthy correlation was observed between the length of test questions and the performance of ChatGPT 3.5 (?=?0.069; P=.003), which was absent in ChatGPT 4 (P=.87). Additionally, the difficulty of test questions, as categorized by AMBOSS hammer ratings, showed a statistically significant correlation with performance for both ChatGPT versions, with ?=?0.289 for ChatGPT 3.5 and ?=?0.344 for ChatGPT 4. ChatGPT 4 surpassed ChatGPT 3.5 in all levels of test question difficulty, except for the 2 highest difficulty tiers (4 and 5 hammers), where statistical significance was not reached. Conclusions: In this study, ChatGPT 4 demonstrated remarkable proficiency in taking the USMLE Step 3, with an accuracy rate of 84.7% (194/229), outshining ChatGPT 3.5 with an accuracy rate of 56.9% (1047/1840). Although ChatGPT 4 performed exceptionally, it encountered difficulties in questions requiring the application of theoretical concepts, particularly in cardiology and neurology. These insights are pivotal for the development of examination strategies that are resilient to AI and underline the promising role of AI in the realm of medical education and diagnostics. 
UR  - https://mededu.jmir.org/2024/1/e51148
UR  - http://dx.doi.org/10.2196/51148
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38180782
ID  - info:doi/10.2196/51148
ER  - 

TY  - JOUR
AU  - Watari, Takashi
AU  - Takagi, Soshi
AU  - Sakaguchi, Kota
AU  - Nishizaki, Yuji
AU  - Shimizu, Taro
AU  - Yamamoto, Yu
AU  - Tokuda, Yasuharu
PY  - 2023/12/6
TI  - Performance Comparison of ChatGPT-4 and Japanese Medical Residents in the General Medicine In-Training Examination: Comparison Study
JO  - JMIR Med Educ
SP  - e52202
VL  - 9
KW  - ChatGPT
KW  - artificial intelligence
KW  - medical education
KW  - clinical training
KW  - non-English language
KW  - ChatGPT-4
KW  - Japan
KW  - Japanese
KW  - Asia
KW  - Asian
KW  - exam
KW  - examination
KW  - exams
KW  - examinations
KW  - NLP
KW  - natural language processing
KW  - LLM
KW  - language model
KW  - language models
KW  - performance
KW  - response
KW  - responses
KW  - answer
KW  - answers
KW  - chatbot
KW  - chatbots
KW  - conversational agent
KW  - conversational agents
KW  - reasoning
KW  - clinical
KW  - GM-ITE
KW  - self-assessment
KW  - residency programs
N2  - Background: The reliability of GPT-4, a state-of-the-art expansive language model specializing in clinical reasoning and medical knowledge, remains largely unverified across non-English languages. Objective: This study aims to compare fundamental clinical competencies between Japanese residents and GPT-4 by using the General Medicine In-Training Examination (GM-ITE). Methods: We used the GPT-4 model provided by OpenAI and the GM-ITE examination questions for the years 2020, 2021, and 2022 to conduct a comparative analysis. This analysis focused on evaluating the performance of individuals who were concluding their second year of residency in comparison to that of GPT-4. Given the current abilities of GPT-4, our study included only single-choice exam questions, excluding those involving audio, video, or image data. The assessment included 4 categories: general theory (professionalism and medical interviewing), symptomatology and clinical reasoning, physical examinations and clinical procedures, and specific diseases. Additionally, we categorized the questions into 7 specialty fields and 3 levels of difficulty, which were determined based on residents? correct response rates. Results: Upon examination of 137 GM-ITE questions in Japanese, GPT-4 scores were significantly higher than the mean scores of residents (residents: 55.8%, GPT-4: 70.1%; P<.001). In terms of specific disciplines, GPT-4 scored 23.5 points higher in the ?specific diseases,? 30.9 points higher in ?obstetrics and gynecology,? and 26.1 points higher in ?internal medicine.? In contrast, GPT-4 scores in ?medical interviewing and professionalism,? ?general practice,? and ?psychiatry? were lower than those of the residents, although this discrepancy was not statistically significant. Upon analyzing scores based on question difficulty, GPT-4 scores were 17.2 points lower for easy problems (P=.007) but were 25.4 and 24.4 points higher for normal and difficult problems, respectively (P<.001). In year-on-year comparisons, GPT-4 scores were 21.7 and 21.5 points higher in the 2020 (P=.01) and 2022 (P=.003) examinations, respectively, but only 3.5 points higher in the 2021 examinations (no significant difference). Conclusions: In the Japanese language, GPT-4 also outperformed the average medical residents in the GM-ITE test, originally designed for them. Specifically, GPT-4 demonstrated a tendency to score higher on difficult questions with low resident correct response rates and those demanding a more comprehensive understanding of diseases. However, GPT-4 scored comparatively lower on questions that residents could readily answer, such as those testing attitudes toward patients and professionalism, as well as those necessitating an understanding of context and communication. These findings highlight the strengths and limitations of artificial intelligence applications in medical education and practice. 
UR  - https://mededu.jmir.org/2023/1/e52202
UR  - http://dx.doi.org/10.2196/52202
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38055323
ID  - info:doi/10.2196/52202
ER  - 

TY  - JOUR
AU  - Shimizu, Ikuo
AU  - Kasai, Hajime
AU  - Shikino, Kiyoshi
AU  - Araki, Nobuyuki
AU  - Takahashi, Zaiya
AU  - Onodera, Misaki
AU  - Kimura, Yasuhiko
AU  - Tsukamoto, Tomoko
AU  - Yamauchi, Kazuyo
AU  - Asahina, Mayumi
AU  - Ito, Shoichi
AU  - Kawakami, Eiryo
PY  - 2023/11/30
TI  - Developing Medical Education Curriculum Reform Strategies to Address the Impact of Generative AI: Qualitative Study
JO  - JMIR Med Educ
SP  - e53466
VL  - 9
KW  - artificial intelligence
KW  - curriculum reform
KW  - generative artificial intelligence
KW  - large language models
KW  - medical education
KW  - qualitative analysis
KW  - strengths-weaknesses-opportunities-threats (SWOT) framework
N2  - Background: Generative artificial intelligence (GAI), represented by large language models, have the potential to transform health care and medical education. In particular, GAI?s impact on higher education has the potential to change students? learning experience as well as faculty?s teaching. However, concerns have been raised about ethical consideration and decreased reliability of the existing examinations. Furthermore, in medical education, curriculum reform is required to adapt to the revolutionary changes brought about by the integration of GAI into medical practice and research. Objective: This study analyzes the impact of GAI on medical education curricula and explores strategies for adaptation. Methods: The study was conducted in the context of faculty development at a medical school in Japan. A workshop involving faculty and students was organized, and participants were divided into groups to address two research questions: (1) How does GAI affect undergraduate medical education curricula? and (2) How should medical school curricula be reformed to address the impact of GAI? The strength, weakness, opportunity, and threat (SWOT) framework was used, and cross-SWOT matrix analysis was used to devise strategies. Further, 4 researchers conducted content analysis on the data generated during the workshop discussions. Results: The data were collected from 8 groups comprising 55 participants. Further, 5 themes about the impact of GAI on medical education curricula emerged: improvement of teaching and learning, improved access to information, inhibition of existing learning processes, problems in GAI, and changes in physicians? professionality. Positive impacts included enhanced teaching and learning efficiency and improved access to information, whereas negative impacts included concerns about reduced independent thinking and the adaptability of existing assessment methods. Further, GAI was perceived to change the nature of physicians? expertise. Three themes emerged from the cross-SWOT analysis for curriculum reform: (1) learning about GAI, (2) learning with GAI, and (3) learning aside from GAI. Participants recommended incorporating GAI literacy, ethical considerations, and compliance into the curriculum. Learning with GAI involved improving learning efficiency, supporting information gathering and dissemination, and facilitating patient involvement. Learning aside from GAI emphasized maintaining GAI-free learning processes, fostering higher cognitive domains of learning, and introducing more communication exercises. Conclusions: This study highlights the profound impact of GAI on medical education curricula and provides insights into curriculum reform strategies. Participants recognized the need for GAI literacy, ethical education, and adaptive learning. Further, GAI was recognized as a tool that can enhance efficiency and involve patients in education. The study also suggests that medical education should focus on competencies that GAI hardly replaces, such as clinical experience and communication. Notably, involving both faculty and students in curriculum reform discussions fosters a sense of ownership and ensures broader perspectives are encompassed. 
UR  - https://mededu.jmir.org/2023/1/e53466
UR  - http://dx.doi.org/10.2196/53466
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38032695
ID  - info:doi/10.2196/53466
ER  - 

TY  - JOUR
AU  - Surapaneni, Mohan Krishna
PY  - 2023/11/7
TI  - Assessing the Performance of ChatGPT in Medical Biochemistry Using Clinical Case Vignettes: Observational Study
JO  - JMIR Med Educ
SP  - e47191
VL  - 9
KW  - ChatGPT
KW  - artificial intelligence
KW  - medical education
KW  - medical Biochemistry
KW  - biochemistry
KW  - chatbot
KW  - case study
KW  - case scenario
KW  - medical exam
KW  - medical examination
KW  - computer generated
N2  - Background: ChatGPT has gained global attention recently owing to its high performance in generating a wide range of information and retrieving any kind of data instantaneously. ChatGPT has also been tested for the United States Medical Licensing Examination (USMLE) and has successfully cleared it. Thus, its usability in medical education is now one of the key discussions worldwide. Objective: The objective of this study is to evaluate the performance of ChatGPT in medical biochemistry using clinical case vignettes. Methods: The performance of ChatGPT was evaluated in medical biochemistry using 10 clinical case vignettes. Clinical case vignettes were randomly selected and inputted in ChatGPT along with the response options. We tested the responses for each clinical case twice. The answers generated by ChatGPT were saved and checked using our reference material. Results: ChatGPT generated correct answers for 4 questions on the first attempt. For the other cases, there were differences in responses generated by ChatGPT in the first and second attempts. In the second attempt, ChatGPT provided correct answers for 6 questions and incorrect answers for 4 questions out of the 10 cases that were used. But, to our surprise, for case 3, different answers were obtained with multiple attempts. We believe this to have happened owing to the complexity of the case, which involved addressing various critical medical aspects related to amino acid metabolism in a balanced approach. Conclusions: According to the findings of our study, ChatGPT may not be considered an accurate information provider for application in medical education to improve learning and assessment. However, our study was limited by a small sample size (10 clinical case vignettes) and the use of the publicly available version of ChatGPT (version 3.5). Although artificial intelligence (AI) has the capability to transform medical education, we emphasize the validation of such data produced by such AI systems for correctness and dependability before it could be implemented in practice. 
UR  - https://mededu.jmir.org/2023/1/e47191
UR  - http://dx.doi.org/10.2196/47191
UR  - http://www.ncbi.nlm.nih.gov/pubmed/37934568
ID  - info:doi/10.2196/47191
ER  - 

TY  - JOUR
AU  - Ito, Naoki
AU  - Kadomatsu, Sakina
AU  - Fujisawa, Mineto
AU  - Fukaguchi, Kiyomitsu
AU  - Ishizawa, Ryo
AU  - Kanda, Naoki
AU  - Kasugai, Daisuke
AU  - Nakajima, Mikio
AU  - Goto, Tadahiro
AU  - Tsugawa, Yusuke
PY  - 2023/11/2
TI  - The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study
JO  - JMIR Med Educ
SP  - e47532
VL  - 9
KW  - GPT-4
KW  - racial and ethnic bias
KW  - typical clinical vignettes
KW  - diagnosis
KW  - triage
KW  - artificial intelligence
KW  - AI
KW  - race
KW  - clinical vignettes
KW  - physician
KW  - efficiency
KW  - decision-making
KW  - bias
KW  - GPT
N2  - Background: Whether GPT-4, the conversational artificial intelligence, can accurately diagnose and triage health conditions and whether it presents racial and ethnic biases in its decisions remain unclear. Objective: We aim to assess the accuracy of GPT-4 in the diagnosis and triage of health conditions and whether its performance varies by patient race and ethnicity. Methods: We compared the performance of GPT-4 and physicians, using 45 typical clinical vignettes, each with a correct diagnosis and triage level, in February and March 2023. For each of the 45 clinical vignettes, GPT-4 and 3 board-certified physicians provided the most likely primary diagnosis and triage level (emergency, nonemergency, or self-care). Independent reviewers evaluated the diagnoses as ?correct? or ?incorrect.? Physician diagnosis was defined as the consensus of the 3 physicians. We evaluated whether the performance of GPT-4 varies by patient race and ethnicity, by adding the information on patient race and ethnicity to the clinical vignettes. Results: The accuracy of diagnosis was comparable between GPT-4 and physicians (the percentage of correct diagnosis was 97.8% (44/45; 95% CI 88.2%-99.9%) for GPT-4 and 91.1% (41/45; 95% CI 78.8%-97.5%) for physicians; P=.38). GPT-4 provided appropriate reasoning for 97.8% (44/45) of the vignettes. The appropriateness of triage was comparable between GPT-4 and physicians (GPT-4: 30/45, 66.7%; 95% CI 51.0%-80.0%; physicians: 30/45, 66.7%; 95% CI 51.0%-80.0%; P=.99). The performance of GPT-4 in diagnosing health conditions did not vary among different races and ethnicities (Black, White, Asian, and Hispanic), with an accuracy of 100% (95% CI 78.2%-100%). P values, compared to the GPT-4 output without incorporating race and ethnicity information, were all .99. The accuracy of triage was not significantly different even if patients? race and ethnicity information was added. The accuracy of triage was 62.2% (95% CI 46.5%-76.2%; P=.50) for Black patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for White patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for Asian patients, and 62.2% (95% CI 46.5%-76.2%; P=.69) for Hispanic patients. P values were calculated by comparing the outputs with and without conditioning on race and ethnicity. Conclusions: GPT-4?s ability to diagnose and triage typical clinical vignettes was comparable to that of board-certified physicians. The performance of GPT-4 did not vary by patient race and ethnicity. These findings should be informative for health systems looking to introduce conversational artificial intelligence to improve the efficiency of patient diagnosis and triage. 
UR  - https://mededu.jmir.org/2023/1/e47532
UR  - http://dx.doi.org/10.2196/47532
UR  - http://www.ncbi.nlm.nih.gov/pubmed/37917120
ID  - info:doi/10.2196/47532
ER  - 

TY  - JOUR
AU  - Baglivo, Francesco
AU  - De Angelis, Luigi
AU  - Casigliani, Virginia
AU  - Arzilli, Guglielmo
AU  - Privitera, Pierpaolo Gaetano
AU  - Rizzo, Caterina
PY  - 2023/11/1
TI  - Exploring the Possible Use of AI Chatbots in Public Health Education: Feasibility Study
JO  - JMIR Med Educ
SP  - e51421
VL  - 9
KW  - artificial intelligence
KW  - chatbots
KW  - medical education
KW  - vaccination
KW  - public health
KW  - medical students
KW  - large language model
KW  - generative AI
KW  - ChatGPT
KW  - Google Bard
KW  - AI chatbot
KW  - health education
KW  - health care
KW  - medical training
KW  - educational support tool
KW  - chatbot model
N2  - Background: Artificial intelligence (AI) is a rapidly developing field with the potential to transform various aspects of health care and public health, including medical training. During the ?Hygiene and Public Health? course for fifth-year medical students, a practical training session was conducted on vaccination using AI chatbots as an educational supportive tool. Before receiving specific training on vaccination, the students were given a web-based test extracted from the Italian National Medical Residency Test. After completing the test, a critical correction of each question was performed assisted by AI chatbots. Objective: The main aim of this study was to identify whether AI chatbots can be considered educational support tools for training in public health. The secondary objective was to assess the performance of different AI chatbots on complex multiple-choice medical questions in the Italian language. Methods: A test composed of 15 multiple-choice questions on vaccination was extracted from the Italian National Medical Residency Test using targeted keywords and administered to medical students via Google Forms and to different AI chatbot models (Bing Chat, ChatGPT, Chatsonic, Google Bard, and YouChat). The correction of the test was conducted in the classroom, focusing on the critical evaluation of the explanations provided by the chatbot. A Mann-Whitney U test was conducted to compare the performances of medical students and AI chatbots. Student feedback was collected anonymously at the end of the training experience. Results: In total, 36 medical students and 5 AI chatbot models completed the test. The students achieved an average score of 8.22 (SD 2.65) out of 15, while the AI chatbots scored an average of 12.22 (SD 2.77). The results indicated a statistically significant difference in performance between the 2 groups (U=49.5, P<.001), with a large effect size (r=0.69). When divided by question type (direct, scenario-based, and negative), significant differences were observed in direct (P<.001) and scenario-based (P<.001) questions, but not in negative questions (P=.48). The students reported a high level of satisfaction (7.9/10) with the educational experience, expressing a strong desire to repeat the experience (7.6/10). Conclusions: This study demonstrated the efficacy of AI chatbots in answering complex medical questions related to vaccination and providing valuable educational support. Their performance significantly surpassed that of medical students in direct and scenario-based questions. The responsible and critical use of AI chatbots can enhance medical education, making it an essential aspect to integrate into the educational system. 
UR  - https://mededu.jmir.org/2023/1/e51421
UR  - http://dx.doi.org/10.2196/51421
UR  - http://www.ncbi.nlm.nih.gov/pubmed/37910155
ID  - info:doi/10.2196/51421
ER  - 

TY  - JOUR
AU  - Preiksaitis, Carl
AU  - Rose, Christian
PY  - 2023/10/20
TI  - Opportunities, Challenges, and Future Directions of Generative Artificial Intelligence in Medical Education: Scoping Review
JO  - JMIR Med Educ
SP  - e48785
VL  - 9
KW  - medical education
KW  - artificial intelligence
KW  - ChatGPT
KW  - Bard
KW  - AI
KW  - educator
KW  - scoping
KW  - review
KW  - learner
KW  - generative
N2  - Background: Generative artificial intelligence (AI) technologies are increasingly being utilized across various fields, with considerable interest and concern regarding their potential application in medical education. These technologies, such as Chat GPT and Bard, can generate new content and have a wide range of possible applications. Objective: This study aimed to synthesize the potential opportunities and limitations of generative AI in medical education. It sought to identify prevalent themes within recent literature regarding potential applications and challenges of generative AI in medical education and use these to guide future areas for exploration. Methods: We conducted a scoping review, following the framework by Arksey and O'Malley, of English language articles published from 2022 onward that discussed generative AI in the context of medical education. A literature search was performed using PubMed, Web of Science, and Google Scholar databases. We screened articles for inclusion, extracted data from relevant studies, and completed a quantitative and qualitative synthesis of the data. Results: Thematic analysis revealed diverse potential applications for generative AI in medical education, including self-directed learning, simulation scenarios, and writing assistance. However, the literature also highlighted significant challenges, such as issues with academic integrity, data accuracy, and potential detriments to learning. Based on these themes and the current state of the literature, we propose the following 3 key areas for investigation: developing learners? skills to evaluate AI critically, rethinking assessment methodology, and studying human-AI interactions. Conclusions: The integration of generative AI in medical education presents exciting opportunities, alongside considerable challenges. There is a need to develop new skills and competencies related to AI as well as thoughtful, nuanced approaches to examine the growing use of generative AI in medical education. 
UR  - https://mededu.jmir.org/2023/1/e48785/
UR  - http://dx.doi.org/10.2196/48785
UR  - http://www.ncbi.nlm.nih.gov/pubmed/
ID  - info:doi/10.2196/48785
ER  - 

TY  - JOUR
AU  - Chen, Yanhua
AU  - Wu, Ziye
AU  - Wang, Peicheng
AU  - Xie, Linbo
AU  - Yan, Mengsha
AU  - Jiang, Maoqing
AU  - Yang, Zhenghan
AU  - Zheng, Jianjun
AU  - Zhang, Jingfeng
AU  - Zhu, Jiming
PY  - 2023/10/19
TI  - Radiology Residents? Perceptions of Artificial Intelligence: Nationwide Cross-Sectional Survey Study
JO  - J Med Internet Res
SP  - e48249
VL  - 25
KW  - artificial intelligence
KW  - technology acceptance
KW  - radiology
KW  - residency
KW  - perceptions
KW  - health care services
KW  - resident
KW  - residents
KW  - perception
KW  - adoption
KW  - readiness
KW  - acceptance
KW  - cross sectional
KW  - survey
N2  - Background: Artificial intelligence (AI) is transforming various fields, with health care, especially diagnostic specialties such as radiology, being a key but controversial battleground. However, there is limited research systematically examining the response of ?human intelligence? to AI. Objective: This study aims to comprehend radiologists? perceptions regarding AI, including their views on its potential to replace them, its usefulness, and their willingness to accept it. We examine the influence of various factors, encompassing demographic characteristics, working status, psychosocial aspects, personal experience, and contextual factors. Methods: Between December 1, 2020, and April 30, 2021, a cross-sectional survey was completed by 3666 radiology residents in China. We used multivariable logistic regression models to examine factors and associations, reporting odds ratios (ORs) and 95% CIs. Results: In summary, radiology residents generally hold a positive attitude toward AI, with 29.90% (1096/3666) agreeing that AI may reduce the demand for radiologists, 72.80% (2669/3666) believing AI improves disease diagnosis, and 78.18% (2866/3666) feeling that radiologists should embrace AI. Several associated factors, including age, gender, education, region, eye strain, working hours, time spent on medical images, resilience, burnout, AI experience, and perceptions of residency support and stress, significantly influence AI attitudes. For instance, burnout symptoms were associated with greater concerns about AI replacement (OR 1.89; P<.001), less favorable views on AI usefulness (OR 0.77; P=.005), and reduced willingness to use AI (OR 0.71; P<.001). Moreover, after adjusting for all other factors, perceived AI replacement (OR 0.81; P<.001) and AI usefulness (OR 5.97; P<.001) were shown to significantly impact the intention to use AI. Conclusions: This study profiles radiology residents who are accepting of AI. Our comprehensive findings provide insights for a multidimensional approach to help physicians adapt to AI. Targeted policies, such as digital health care initiatives and medical education, can be developed accordingly. 
UR  - https://www.jmir.org/2023/1/e48249
UR  - http://dx.doi.org/10.2196/48249
UR  - http://www.ncbi.nlm.nih.gov/pubmed/37856181
ID  - info:doi/10.2196/48249
ER  - 

TY  - JOUR
AU  - Hu, Je-Ming
AU  - Liu, Feng-Cheng
AU  - Chu, Chi-Ming
AU  - Chang, Yu-Tien
PY  - 2023/10/18
TI  - Health Care Trainees? and Professionals? Perceptions of ChatGPT in Improving Medical Knowledge Training: Rapid Survey Study
JO  - J Med Internet Res
SP  - e49385
VL  - 25
KW  - ChatGPT
KW  - large language model
KW  - medicine
KW  - perception evaluation
KW  - internet survey
KW  - structural equation modeling
KW  - SEM
N2  - Background: ChatGPT is a powerful pretrained large language model. It has both demonstrated potential and raised concerns related to knowledge translation and knowledge transfer. To apply and improve knowledge transfer in the real world, it is essential to assess the perceptions and acceptance of the users of ChatGPT-assisted training. Objective: We aimed to investigate the perceptions of health care trainees and professionals on ChatGPT-assisted training, using biomedical informatics as an example. Methods: We used purposeful sampling to include all health care undergraduate trainees and graduate professionals (n=195) from January to May 2023 in the School of Public Health at the National Defense Medical Center in Taiwan. Subjects were asked to watch a 2-minute video introducing 5 scenarios about ChatGPT-assisted training in biomedical informatics and then answer a self-designed online (web- and mobile-based) questionnaire according to the Kirkpatrick model. The survey responses were used to develop 4 constructs: ?perceived knowledge acquisition,? ?perceived training motivation,? ?perceived training satisfaction,? and ?perceived training effectiveness.? The study used structural equation modeling (SEM) to evaluate and test the structural model and hypotheses. Results: The online questionnaire response rate was 152 of 195 (78%); 88 of 152 participants (58%) were undergraduate trainees and 90 of 152 participants (59%) were women. The ages ranged from 18 to 53 years (mean 23.3, SD 6.0 years). There was no statistical difference in perceptions of training evaluation between men and women. Most participants were enthusiastic about the ChatGPT-assisted training, while the graduate professionals were more enthusiastic than undergraduate trainees. Nevertheless, some concerns were raised about potential cheating on training assessment. The average scores for knowledge acquisition, training motivation, training satisfaction, and training effectiveness were 3.84 (SD 0.80), 3.76 (SD 0.93), 3.75 (SD 0.87), and 3.72 (SD 0.91), respectively (Likert scale 1-5: strongly disagree to strongly agree). Knowledge acquisition had the highest score and training effectiveness the lowest. In the SEM results, training effectiveness was influenced predominantly by knowledge acquisition and partially met the hypotheses in the research framework. Knowledge acquisition had a direct effect on training effectiveness, training satisfaction, and training motivation, with ? coefficients of .80, .87, and .97, respectively (all P<.001). Conclusions: Most health care trainees and professionals perceived ChatGPT-assisted training as an aid in knowledge transfer. However, to improve training effectiveness, it should be combined with empirical experts for proper guidance and dual interaction. In a future study, we recommend using a larger sample size for evaluation of internet-connected large language models in medical knowledge transfer. 
UR  - https://www.jmir.org/2023/1/e49385
UR  - http://dx.doi.org/10.2196/49385
UR  - http://www.ncbi.nlm.nih.gov/pubmed/37851495
ID  - info:doi/10.2196/49385
ER  - 

TY  - JOUR
AU  - Khlaif, N. Zuheir
AU  - Mousa, Allam
AU  - Hattab, Kamal Muayad
AU  - Itmazi, Jamil
AU  - Hassan, A. Amjad
AU  - Sanmugam, Mageswaran
AU  - Ayyoub, Abedalkarim
PY  - 2023/9/14
TI  - The Potential and Concerns of Using AI in Scientific Research: ChatGPT Performance Evaluation
JO  - JMIR Med Educ
SP  - e47049
VL  - 9
KW  - artificial intelligence
KW  - AI
KW  - ChatGPT
KW  - scientific research
KW  - research ethics
N2  - Background: Artificial intelligence (AI) has many applications in various aspects of our daily life, including health, criminal, education, civil, business, and liability law. One aspect of AI that has gained significant attention is natural language processing (NLP), which refers to the ability of computers to understand and generate human language. Objective: This study aims to examine the potential for, and concerns of, using AI in scientific research. For this purpose, high-impact research articles were generated by analyzing the quality of reports generated by ChatGPT and assessing the application?s impact on the research framework, data analysis, and the literature review. The study also explored concerns around ownership and the integrity of research when using AI-generated text. Methods: A total of 4 articles were generated using ChatGPT, and thereafter evaluated by 23 reviewers. The researchers developed an evaluation form to assess the quality of the articles generated. Additionally, 50 abstracts were generated using ChatGPT and their quality was evaluated. The data were subjected to ANOVA and thematic analysis to analyze the qualitative data provided by the reviewers. Results: When using detailed prompts and providing the context of the study, ChatGPT would generate high-quality research that could be published in high-impact journals. However, ChatGPT had a minor impact on developing the research framework and data analysis. The primary area needing improvement was the development of the literature review. Moreover, reviewers expressed concerns around ownership and the integrity of the research when using AI-generated text. Nonetheless, ChatGPT has a strong potential to increase human productivity in research and can be used in academic writing. Conclusions: AI-generated text has the potential to improve the quality of high-impact research articles. The findings of this study suggest that decision makers and researchers should focus more on the methodology part of the research, which includes research design, developing research tools, and analyzing data in depth, to draw strong theoretical and practical implications, thereby establishing a revolution in scientific research in the era of AI. The practical implications of this study can be used in different fields such as medical education to deliver materials to develop the basic competencies for both medicine students and faculty members. 
UR  - https://mededu.jmir.org/2023/1/e47049
UR  - http://dx.doi.org/10.2196/47049
UR  - http://www.ncbi.nlm.nih.gov/pubmed/37707884
ID  - info:doi/10.2196/47049
ER  - 

TY  - JOUR
AU  - Sallam, Malik
AU  - Salim, A. Nesreen
AU  - Barakat, Muna
AU  - Al-Mahzoum, Kholoud
AU  - Al-Tammemi, B. Ala'a
AU  - Malaeb, Diana
AU  - Hallit, Rabih
AU  - Hallit, Souheil
PY  - 2023/9/5
TI  - Assessing Health Students' Attitudes and Usage of ChatGPT in Jordan: Validation Study
JO  - JMIR Med Educ
SP  - e48254
VL  - 9
KW  - artificial intelligence
KW  - machine learning
KW  - education
KW  - technology
KW  - healthcare
KW  - survey
KW  - opinion
KW  - knowledge
KW  - practices
KW  - KAP
N2  - Background: ChatGPT is a conversational large language model that has the potential to revolutionize knowledge acquisition. However, the impact of this technology on the quality of education is still unknown considering the risks and concerns surrounding ChatGPT use. Therefore, it is necessary to assess the usability and acceptability of this promising tool. As an innovative technology, the intention to use ChatGPT can be studied in the context of the technology acceptance model (TAM). Objective: This study aimed to develop and validate a TAM-based survey instrument called TAME-ChatGPT (Technology Acceptance Model Edited to Assess ChatGPT Adoption) that could be employed to examine the successful integration and use of ChatGPT in health care education. Methods: The survey tool was created based on the TAM framework. It comprised 13 items for participants who heard of ChatGPT but did not use it and 23 items for participants who used ChatGPT. Using a convenient sampling approach, the survey link was circulated electronically among university students between February and March 2023. Exploratory factor analysis (EFA) was used to assess the construct validity of the survey instrument. Results: The final sample comprised 458 respondents, the majority among them undergraduate students (n=442, 96.5%). Only 109 (23.8%) respondents had heard of ChatGPT prior to participation and only 55 (11.3%) self-reported ChatGPT use before the study. EFA analysis on the attitude and usage scales showed significant Bartlett tests of sphericity scores (P<.001) and adequate Kaiser-Meyer-Olkin measures (0.823 for the attitude scale and 0.702 for the usage scale), confirming the factorability of the correlation matrices. The EFA showed that 3 constructs explained a cumulative total of 69.3% variance in the attitude scale, and these subscales represented perceived risks, attitude to technology/social influence, and anxiety. For the ChatGPT usage scale, EFA showed that 4 constructs explained a cumulative total of 72% variance in the data and comprised the perceived usefulness, perceived risks, perceived ease of use, and behavior/cognitive factors. All the ChatGPT attitude and usage subscales showed good reliability with Cronbach ? values >.78 for all the deduced subscales. Conclusions: The TAME-ChatGPT demonstrated good reliability, validity, and usefulness in assessing health care students? attitudes toward ChatGPT. The findings highlighted the importance of considering risk perceptions, usefulness, ease of use, attitudes toward technology, and behavioral factors when adopting ChatGPT as a tool in health care education. This information can aid the stakeholders in creating strategies to support the optimal and ethical use of ChatGPT and to identify the potential challenges hindering its successful implementation. Future research is recommended to guide the effective adoption of ChatGPT in health care education. 
UR  - https://mededu.jmir.org/2023/1/e48254
UR  - http://dx.doi.org/10.2196/48254
UR  - http://www.ncbi.nlm.nih.gov/pubmed/37578934
ID  - info:doi/10.2196/48254
ER  - 

TY  - JOUR
AU  - Roos, Jonas
AU  - Kasapovic, Adnan
AU  - Jansen, Tom
AU  - Kaczmarczyk, Robert
PY  - 2023/9/4
TI  - Artificial Intelligence in Medical Education: Comparative Analysis of ChatGPT, Bing, and Medical Students in Germany
JO  - JMIR Med Educ
SP  - e46482
VL  - 9
KW  - medical education
KW  - state examinations
KW  - exams
KW  - large language models
KW  - artificial intelligence
KW  - ChatGPT
N2  - Background: Large language models (LLMs) have demonstrated significant potential in diverse domains, including medicine. Nonetheless, there is a scarcity of studies examining their performance in medical examinations, especially those conducted in languages other than English, and in direct comparison with medical students. Analyzing the performance of LLMs in state medical examinations can provide insights into their capabilities and limitations and evaluate their potential role in medical education and examination preparation.  Objective: This study aimed to assess and compare the performance of 3 LLMs, GPT-4, Bing, and GPT-3.5-Turbo, in the German Medical State Examinations of 2022 and to evaluate their performance relative to that of medical students.  Methods: The LLMs were assessed on a total of 630 questions from the spring and fall German Medical State Examinations of 2022. The performance was evaluated with and without media-related questions. Statistical analyses included 1-way ANOVA and independent samples t tests for pairwise comparisons. The relative strength of the LLMs in comparison with that of the students was also evaluated.  Results: GPT-4 achieved the highest overall performance, correctly answering 88.1% of questions, closely followed by Bing (86.0%) and GPT-3.5-Turbo (65.7%). The students had an average correct answer rate of 74.6%. Both GPT-4 and Bing significantly outperformed the students in both examinations. When media questions were excluded, Bing achieved the highest performance of 90.7%, closely followed by GPT-4 (90.4%), while GPT-3.5-Turbo lagged (68.2%). There was a significant decline in the performance of GPT-4 and Bing in the fall 2022 examination, which was attributed to a higher proportion of media-related questions and a potential increase in question difficulty.  Conclusions: LLMs, particularly GPT-4 and Bing, demonstrate potential as valuable tools in medical education and for pretesting examination questions. Their high performance, even relative to that of medical students, indicates promising avenues for further development and integration into the educational and clinical landscape.  
UR  - https://mededu.jmir.org/2023/1/e46482
UR  - http://dx.doi.org/10.2196/46482
UR  - http://www.ncbi.nlm.nih.gov/pubmed/37665620
ID  - info:doi/10.2196/46482
ER  - 

TY  - JOUR
AU  - Leung, I. Tiffany
AU  - Sagar, Ankita
AU  - Shroff, Swati
AU  - Henry, L. Tracey
PY  - 2023/8/23
TI  - Can AI Mitigate Bias in Writing Letters of Recommendation?
JO  - JMIR Med Educ
SP  - e51494
VL  - 9
KW  - sponsorship
KW  - implicit bias
KW  - gender bias
KW  - bias
KW  - letters of recommendation
KW  - artificial intelligence
KW  - large language models
KW  - medical education
KW  - career advancement
KW  - tenure and promotion
KW  - promotion
KW  - leadership
UR  - https://mededu.jmir.org/2023/1/e51494
UR  - http://dx.doi.org/10.2196/51494
UR  - http://www.ncbi.nlm.nih.gov/pubmed/37610808
ID  - info:doi/10.2196/51494
ER  - 

TY  - JOUR
AU  - Safranek, W. Conrad
AU  - Sidamon-Eristoff, Elizabeth Anne
AU  - Gilson, Aidan
AU  - Chartash, David
PY  - 2023/8/14
TI  - The Role of Large Language Models in Medical Education: Applications and Implications
JO  - JMIR Med Educ
SP  - e50945
VL  - 9
KW  - large language models
KW  - ChatGPT
KW  - medical education
KW  - LLM
KW  - artificial intelligence in health care
KW  - AI
KW  - autoethnography
UR  - https://mededu.jmir.org/2023/1/e50945
UR  - http://dx.doi.org/10.2196/50945
UR  - http://www.ncbi.nlm.nih.gov/pubmed/37578830
ID  - info:doi/10.2196/50945
ER  - 

TY  - JOUR
AU  - Gilson, Aidan
AU  - Safranek, W. Conrad
AU  - Huang, Thomas
AU  - Socrates, Vimig
AU  - Chi, Ling
AU  - Taylor, Andrew Richard
AU  - Chartash, David
PY  - 2023/7/13
TI  - Authors? Reply to: Variability in Large Language Models? Responses to Medical Licensing and Certification Examinations
JO  - JMIR Med Educ
SP  - e50336
VL  - 9
KW  - natural language processing
KW  - NLP
KW  - MedQA
KW  - generative pre-trained transformer
KW  - GPT
KW  - medical education
KW  - chatbot
KW  - artificial intelligence
KW  - AI
KW  - education technology
KW  - ChatGPT
KW  - conversational agent
KW  - machine learning
KW  - large language models
KW  - knowledge assessment
UR  - https://mededu.jmir.org/2023/1/e50336
UR  - http://dx.doi.org/10.2196/50336
UR  - http://www.ncbi.nlm.nih.gov/pubmed/37440299
ID  - info:doi/10.2196/50336
ER  - 

TY  - JOUR
AU  - Epstein, H. Richard
AU  - Dexter, Franklin
PY  - 2023/7/13
TI  - Variability in Large Language Models? Responses to Medical Licensing and Certification Examinations. Comment on ?How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment?
JO  - JMIR Med Educ
SP  - e48305
VL  - 9
KW  - natural language processing
KW  - NLP
KW  - MedQA
KW  - generative pre-trained transformer
KW  - GPT
KW  - medical education
KW  - chatbot
KW  - artificial intelligence
KW  - AI
KW  - education technology
KW  - ChatGPT
KW  - Google Bard
KW  - conversational agent
KW  - machine learning
KW  - large language models
KW  - knowledge assessment
UR  - https://mededu.jmir.org/2023/1/e48305
UR  - http://dx.doi.org/10.2196/48305
UR  - http://www.ncbi.nlm.nih.gov/pubmed/37440293
ID  - info:doi/10.2196/48305
ER  - 

TY  - JOUR
AU  - Abd-alrazaq, Alaa
AU  - AlSaad, Rawan
AU  - Alhuwail, Dari
AU  - Ahmed, Arfan
AU  - Healy, Mark Padraig
AU  - Latifi, Syed
AU  - Aziz, Sarah
AU  - Damseh, Rafat
AU  - Alabed Alrazak, Sadam
AU  - Sheikh, Javaid
PY  - 2023/6/1
TI  - Large Language Models in Medical Education: Opportunities, Challenges, and Future Directions
JO  - JMIR Med Educ
SP  - e48291
VL  - 9
KW  - large language models
KW  - artificial intelligence
KW  - medical education
KW  - ChatGPT
KW  - GPT-4
KW  - generative AI
KW  - students
KW  - educators
UR  - https://mededu.jmir.org/2023/1/e48291
UR  - http://dx.doi.org/10.2196/48291
UR  - http://www.ncbi.nlm.nih.gov/pubmed/37261894
ID  - info:doi/10.2196/48291
ER  - 

TY  - JOUR
AU  - Thirunavukarasu, James Arun
AU  - Hassan, Refaat
AU  - Mahmood, Shathar
AU  - Sanghera, Rohan
AU  - Barzangi, Kara
AU  - El Mukashfi, Mohanned
AU  - Shah, Sachin
PY  - 2023/4/21
TI  - Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care
JO  - JMIR Med Educ
SP  - e46599
VL  - 9
KW  - ChatGPT
KW  - large language model
KW  - natural language processing
KW  - decision support techniques
KW  - artificial intelligence
KW  - AI
KW  - deep learning
KW  - primary care
KW  - general practice
KW  - family medicine
KW  - chatbot
N2  - Background: Large language models exhibiting human-level performance in specialized tasks are emerging; examples include Generative Pretrained Transformer 3.5, which underlies the processing of ChatGPT. Rigorous trials are required to understand the capabilities of emerging technology, so that innovation can be directed to benefit patients and practitioners. Objective: Here, we evaluated the strengths and weaknesses of ChatGPT in primary care using the Membership of the Royal College of General Practitioners Applied Knowledge Test (AKT) as a medium. Methods: AKT questions were sourced from a web-based question bank and 2 AKT practice papers. In total, 674 unique AKT questions were inputted to ChatGPT, with the model?s answers recorded and compared to correct answers provided by the Royal College of General Practitioners. Each question was inputted twice in separate ChatGPT sessions, with answers on repeated trials compared to gauge consistency. Subject difficulty was gauged by referring to examiners? reports from 2018 to 2022. Novel explanations from ChatGPT?defined as information provided that was not inputted within the question or multiple answer choices?were recorded. Performance was analyzed with respect to subject, difficulty, question source, and novel model outputs to explore ChatGPT?s strengths and weaknesses. Results: Average overall performance of ChatGPT was 60.17%, which is below the mean passing mark in the last 2 years (70.42%). Accuracy differed between sources (P=.04 and .06). ChatGPT?s performance varied with subject category (P=.02 and .02), but variation did not correlate with difficulty (Spearman ?=?0.241 and ?0.238; P=.19 and .20). The proclivity of ChatGPT to provide novel explanations did not affect accuracy (P>.99 and .23). Conclusions: Large language models are approaching human expert?level performance, although further development is required to match the performance of qualified primary care physicians in the AKT. Validated high-performance models may serve as assistants or autonomous clinical tools to ameliorate the general practice workforce crisis. 
UR  - https://mededu.jmir.org/2023/1/e46599
UR  - http://dx.doi.org/10.2196/46599
UR  - http://www.ncbi.nlm.nih.gov/pubmed/37083633
ID  - info:doi/10.2196/46599
ER  - 

TY  - JOUR
AU  - Sabry Abdel-Messih, Mary
AU  - Kamel Boulos, N. Maged
PY  - 2023/3/8
TI  - ChatGPT in Clinical Toxicology
JO  - JMIR Med Educ
SP  - e46876
VL  - 9
KW  - ChatGPT
KW  - clinical toxicology
KW  - organophosphates
KW  - artificial intelligence
KW  - AI
KW  - medical education
UR  - https://mededu.jmir.org/2023/1/e46876
UR  - http://dx.doi.org/10.2196/46876
UR  - http://www.ncbi.nlm.nih.gov/pubmed/36867743
ID  - info:doi/10.2196/46876
ER  - 

TY  - JOUR
AU  - Eysenbach, Gunther
PY  - 2023/3/6
TI  - The Role of ChatGPT, Generative Language Models, and Artificial Intelligence in Medical Education: A Conversation With ChatGPT and a Call for Papers
JO  - JMIR Med Educ
SP  - e46885
VL  - 9
KW  - artificial intelligence
KW  - AI
KW  - ChatGPT
KW  - generative language model
KW  - medical education
KW  - interview
KW  - future of education
UR  - https://mededu.jmir.org/2023/1/e46885
UR  - http://dx.doi.org/10.2196/46885
UR  - http://www.ncbi.nlm.nih.gov/pubmed/36863937
ID  - info:doi/10.2196/46885
ER  - 

TY  - JOUR
AU  - Gilson, Aidan
AU  - Safranek, W. Conrad
AU  - Huang, Thomas
AU  - Socrates, Vimig
AU  - Chi, Ling
AU  - Taylor, Andrew Richard
AU  - Chartash, David
PY  - 2023/2/8
TI  - How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment
JO  - JMIR Med Educ
SP  - e45312
VL  - 9
KW  - natural language processing
KW  - NLP
KW  - MedQA
KW  - generative pre-trained transformer
KW  - GPT
KW  - medical education
KW  - chatbot
KW  - artificial intelligence
KW  - education technology
KW  - ChatGPT
KW  - conversational agent
KW  - machine learning
KW  - USMLE
N2  - Background: Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input. Objective: This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 exams, as well as to analyze responses for user interpretability. Methods: We used 2 sets of multiple-choice questions to evaluate ChatGPT?s performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and the performance on an exam relative to the user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT?s performance was compared to 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question. Results: Of the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. We found that logical justification for ChatGPT?s answer selection was present in 100% of outputs of the NBME data sets. Internal information to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively. Conclusions: ChatGPT marks a significant improvement in natural language processing models on the tasks of medical question answering. By performing at a greater than 60% threshold on the NBME-Free-Step-1 data set, we show that the model achieves the equivalent of a passing score for a third-year medical student. Additionally, we highlight ChatGPT?s capacity to provide logic and informational context across the majority of answers. These facts taken together make a compelling case for the potential applications of ChatGPT as an interactive medical education tool to support learning. 
UR  - https://mededu.jmir.org/2023/1/e45312
UR  - http://dx.doi.org/10.2196/45312
UR  - http://www.ncbi.nlm.nih.gov/pubmed/36753318
ID  - info:doi/10.2196/45312
ER  - 

TY  - JOUR
AU  - Grunhut, Joel
AU  - Marques, Oge
AU  - Wyatt, M. Adam T.
PY  - 2022/6/7
TI  - Needs, Challenges, and Applications of Artificial Intelligence in Medical Education Curriculum
JO  - JMIR Med Educ
SP  - e35587
VL  - 8
IS  - 2
KW  - artificial intelligence
KW  - AI
KW  - medical education
KW  - medical student
UR  - https://mededu.jmir.org/2022/2/e35587
UR  - http://dx.doi.org/10.2196/35587
UR  - http://www.ncbi.nlm.nih.gov/pubmed/35671077
ID  - info:doi/10.2196/35587
ER  - 

TY  - JOUR
AU  - Gray, Kathleen
AU  - Slavotinek, John
AU  - Dimaguila, Luis Gerardo
AU  - Choo, Dawn
PY  - 2022/4/4
TI  - Artificial Intelligence Education for the Health Workforce: Expert Survey of Approaches and Needs
JO  - JMIR Med Educ
SP  - e35223
VL  - 8
IS  - 2
KW  - artificial intelligence
KW  - curriculum
KW  - ethics
KW  - human-computer interaction
KW  - interprofessional education
KW  - machine learning
KW  - natural language processing
KW  - professional development
KW  - robotics
N2  - Background: The preparation of the current and future health workforce for the possibility of using artificial intelligence (AI) in health care is a growing concern as AI applications emerge in various care settings and specializations. At present, there is no obvious consensus among educators about what needs to be learned or how this learning may be supported or assessed. Objective: Our study aims to explore health care education experts? ideas and plans for preparing the health workforce to work with AI and identify critical gaps in curriculum and educational resources across a national health care system. Methods: A survey canvassed expert views on AI education for the health workforce in terms of educational strategies, subject matter priorities, meaningful learning activities, desired attitudes, and skills. A total of 39 senior people from different health workforce subgroups across Australia provided ratings and free-text responses in late 2020. Results: The responses highlighted the importance of education on ethical implications, suitability of large data sets for use in AI clinical applications, principles of machine learning, and specific diagnosis and treatment applications of AI as well as alterations to cognitive load during clinical work and the interaction between humans and machines in clinical settings. Respondents also outlined barriers to implementation, such as lack of governance structures and processes, resource constraints, and cultural adjustment. Conclusions: Further work around the world of the kind reported in this survey can assist educators and education authorities who are responsible for preparing the health workforce to minimize the risks and realize the benefits of implementing AI in health care. 
UR  - https://mededu.jmir.org/2022/2/e35223
UR  - http://dx.doi.org/10.2196/35223
UR  - http://www.ncbi.nlm.nih.gov/pubmed/35249885
ID  - info:doi/10.2196/35223
ER  - 

TY  - JOUR
AU  - Teng, Minnie
AU  - Singla, Rohit
AU  - Yau, Olivia
AU  - Lamoureux, Daniel
AU  - Gupta, Aurinjoy
AU  - Hu, Zoe
AU  - Hu, Ricky
AU  - Aissiou, Amira
AU  - Eaton, Shane
AU  - Hamm, Camille
AU  - Hu, Sophie
AU  - Kelly, Dayton
AU  - MacMillan, M. Kathleen
AU  - Malik, Shamir
AU  - Mazzoli, Vienna
AU  - Teng, Yu-Wen
AU  - Laricheva, Maria
AU  - Jarus, Tal
AU  - Field, S. Thalia
PY  - 2022/1/31
TI  - Health Care Students? Perspectives on Artificial Intelligence: Countrywide Survey in Canada
JO  - JMIR Med Educ
SP  - e33390
VL  - 8
IS  - 1
KW  - medical education
KW  - artificial intelligence
KW  - allied health education
KW  - medical students
KW  - health care students
KW  - medical curriculum
KW  - education
N2  - Background: Artificial intelligence (AI) is no longer a futuristic concept; it is increasingly being integrated into health care. As studies on attitudes toward AI have primarily focused on physicians, there is a need to assess the perspectives of students across health care disciplines to inform future curriculum development. Objective: This study aims to explore and identify gaps in the knowledge that Canadian health care students have regarding AI, capture how health care students in different fields differ in their knowledge and perspectives on AI, and present student-identified ways that AI literacy may be incorporated into the health care curriculum. Methods: The survey was developed from a narrative literature review of topics in attitudinal surveys on AI. The final survey comprised 15 items, including multiple-choice questions, pick-group-rank questions, 11-point Likert scale items, slider scale questions, and narrative questions. We used snowball and convenience sampling methods by distributing an email with a description and a link to the web-based survey to representatives from 18 Canadian schools. Results: A total of 2167 students across 10 different health professions from 18 universities across Canada responded to the survey. Overall, 78.77% (1707/2167) predicted that AI technology would affect their careers within the coming decade and 74.5% (1595/2167) reported a positive outlook toward the emerging role of AI in their respective fields. Attitudes toward AI varied by discipline. Students, even those opposed to AI, identified the need to incorporate a basic understanding of AI into their curricula. Conclusions: We performed a nationwide survey of health care students across 10 different health professions in Canada. The findings would inform student-identified topics within AI and their preferred delivery formats, which would advance education across different health care professions. 
UR  - https://mededu.jmir.org/2022/1/e33390
UR  - http://dx.doi.org/10.2196/33390
UR  - http://www.ncbi.nlm.nih.gov/pubmed/35099397
ID  - info:doi/10.2196/33390
ER  - 

TY  - JOUR
AU  - Charow, Rebecca
AU  - Jeyakumar, Tharshini
AU  - Younus, Sarah
AU  - Dolatabadi, Elham
AU  - Salhia, Mohammad
AU  - Al-Mouaswas, Dalia
AU  - Anderson, Melanie
AU  - Balakumar, Sarmini
AU  - Clare, Megan
AU  - Dhalla, Azra
AU  - Gillan, Caitlin
AU  - Haghzare, Shabnam
AU  - Jackson, Ethan
AU  - Lalani, Nadim
AU  - Mattson, Jane
AU  - Peteanu, Wanda
AU  - Tripp, Tim
AU  - Waldorf, Jacqueline
AU  - Williams, Spencer
AU  - Tavares, Walter
AU  - Wiljer, David
PY  - 2021/12/13
TI  - Artificial Intelligence Education Programs for Health Care Professionals: Scoping Review
JO  - JMIR Med Educ
SP  - e31043
VL  - 7
IS  - 4
KW  - machine learning
KW  - deep learning
KW  - health care providers
KW  - education
KW  - learning
KW  - patient care
N2  - Background: As the adoption of artificial intelligence (AI) in health care increases, it will become increasingly crucial to involve health care professionals (HCPs) in developing, validating, and implementing AI-enabled technologies. However, because of a lack of AI literacy, most HCPs are not adequately prepared for this revolution. This is a significant barrier to adopting and implementing AI that will affect patients. In addition, the limited existing AI education programs face barriers to development and implementation at various levels of medical education. Objective: With a view to informing future AI education programs for HCPs, this scoping review aims to provide an overview of the types of current or past AI education programs that pertains to the programs? curricular content, modes of delivery, critical implementation factors for education delivery, and outcomes used to assess the programs? effectiveness. Methods: After the creation of a search strategy and keyword searches, a 2-stage screening process was conducted by 2 independent reviewers to determine study eligibility. When consensus was not reached, the conflict was resolved by consulting a third reviewer. This process consisted of a title and abstract scan and a full-text review. The articles were included if they discussed an actual training program or educational intervention, or a potential training program or educational intervention and the desired content to be covered, focused on AI, and were designed or intended for HCPs (at any stage of their career). Results: Of the 10,094 unique citations scanned, 41 (0.41%) studies relevant to our eligibility criteria were identified. Among the 41 included studies, 10 (24%) described 13 unique programs and 31 (76%) discussed recommended curricular content. The curricular content of the unique programs ranged from AI use, AI interpretation, and cultivating skills to explain results derived from AI algorithms. The curricular topics were categorized into three main domains: cognitive, psychomotor, and affective. Conclusions: This review provides an overview of the current landscape of AI in medical education and highlights the skills and competencies required by HCPs to effectively use AI in enhancing the quality of care and optimizing patient outcomes. Future education efforts should focus on the development of regulatory strategies, a multidisciplinary approach to curriculum redesign, a competency-based curriculum, and patient-clinician interaction. 
UR  - https://mededu.jmir.org/2021/4/e31043
UR  - http://dx.doi.org/10.2196/31043
UR  - http://www.ncbi.nlm.nih.gov/pubmed/34898458
ID  - info:doi/10.2196/31043
ER  - 

TY  - JOUR
AU  - Sapci, Hasan A.
AU  - Sapci, Aylin H.
PY  - 2020/6/30
TI  - Artificial Intelligence Education and Tools for Medical and Health Informatics Students: Systematic Review
JO  - JMIR Med Educ
SP  - e19285
VL  - 6
IS  - 1
KW  - artificial intelligence
KW  - education
KW  - machine learning
KW  - deep learning
KW  - medical education
KW  - health informatics
KW  - systematic review
N2  - Background: The use of artificial intelligence (AI) in medicine will generate numerous application possibilities to improve patient care, provide real-time data analytics, and enable continuous patient monitoring. Clinicians and health informaticians should become familiar with machine learning and deep learning. Additionally, they should have a strong background in data analytics and data visualization to use, evaluate, and develop AI applications in clinical practice. Objective: The main objective of this study was to evaluate the current state of AI training and the use of AI tools to enhance the learning experience. Methods: A comprehensive systematic review was conducted to analyze the use of AI in medical and health informatics education, and to evaluate existing AI training practices. PRISMA-P (Preferred Reporting Items for Systematic Reviews and Meta-Analysis Protocols) guidelines were followed. The studies that focused on the use of AI tools to enhance medical education and the studies that investigated teaching AI as a new competency were categorized separately to evaluate recent developments. Results: This systematic review revealed that recent publications recommend the integration of AI training into medical and health informatics curricula. Conclusions: To the best of our knowledge, this is the first systematic review exploring the current state of AI education in both medicine and health informatics. Since AI curricula have not been standardized and competencies have not been determined, a framework for specialized AI training in medical and health informatics education is proposed. 
UR  - http://mededu.jmir.org/2020/1/e19285/
UR  - http://dx.doi.org/10.2196/19285
UR  - http://www.ncbi.nlm.nih.gov/pubmed/32602844
ID  - info:doi/10.2196/19285
ER  - 

TY  - JOUR
AU  - Paranjape, Ketan
AU  - Schinkel, Michiel
AU  - Nannan Panday, Rishi
AU  - Car, Josip
AU  - Nanayakkara, Prabath
PY  - 2019/12/3
TI  - Introducing Artificial Intelligence Training in Medical Education
JO  - JMIR Med Educ
SP  - e16048
VL  - 5
IS  - 2
KW  - algorithm
KW  - artificial intelligence
KW  - black box
KW  - deep learning
KW  - machine learning
KW  - medical education
KW  - continuing education
KW  - data sciences
KW  - curriculum
UR  - http://mededu.jmir.org/2019/2/e16048/
UR  - http://dx.doi.org/10.2196/16048
UR  - http://www.ncbi.nlm.nih.gov/pubmed/31793895
ID  - info:doi/10.2196/16048
ER  - 

TY  - JOUR
AU  - Chan, Siang Kai
AU  - Zary, Nabil
PY  - 2019/6/15
TI  - Applications and Challenges of Implementing Artificial Intelligence in Medical Education: Integrative Review
JO  - JMIR Med Educ
SP  - e13930
VL  - 5
IS  - 1
KW  - medical education
KW  - evaluation of AIED systems
KW  - real world applications of AIED systems
KW  - artificial intelligence
N2  - Background: Since the advent of artificial intelligence (AI) in 1955, the applications of AI have increased over the years within a rapidly changing digital landscape where public expectations are on the rise, fed by social media, industry leaders, and medical practitioners. However, there has been little interest in AI in medical education until the last two decades, with only a recent increase in the number of publications and citations in the field. To our knowledge, thus far, a limited number of articles have discussed or reviewed the current use of AI in medical education. Objective: This study aims to review the current applications of AI in medical education as well as the challenges of implementing AI in medical education. Methods: Medline (Ovid), EBSCOhost Education Resources Information Center (ERIC) and Education Source, and Web of Science were searched with explicit inclusion and exclusion criteria. Full text of the selected articles was analyzed using the Extension of Technology Acceptance Model and the Diffusions of Innovations theory. Data were subsequently pooled together and analyzed quantitatively. Results: A total of 37 articles were identified. Three primary uses of AI in medical education were identified: learning support (n=32), assessment of students? learning (n=4), and curriculum review (n=1). The main reasons for use of AI are its ability to provide feedback and a guided learning pathway and to decrease costs. Subgroup analysis revealed that medical undergraduates are the primary target audience for AI use. In addition, 34 articles described the challenges of AI implementation in medical education; two main reasons were identified: difficulty in assessing the effectiveness of AI in medical education and technical challenges while developing AI applications. Conclusions: The primary use of AI in medical education was for learning support mainly due to its ability to provide individualized feedback. Little emphasis was placed on curriculum review and assessment of students? learning due to the lack of digitalization and sensitive nature of examinations, respectively. Big data manipulation also warrants the need to ensure data integrity. Methodological improvements are required to increase AI adoption by addressing the technical difficulties of creating an AI application and using novel methods to assess the effectiveness of AI. To better integrate AI into the medical profession, measures should be taken to introduce AI into the medical school curriculum for medical professionals to better understand AI algorithms and maximize its use. 
UR  - http://mededu.jmir.org/2019/1/e13930/
UR  - http://dx.doi.org/10.2196/13930
UR  - http://www.ncbi.nlm.nih.gov/pubmed/31199295
ID  - info:doi/10.2196/13930
ER  -