TY - JOUR AU - Quon, Stephanie AU - Zhou, Sarah PY - 2025/4/11 TI - Enhancing AI-Driven Medical Translations: Considerations for Language Concordance JO - JMIR Med Educ SP - e70420 VL - 11 KW - letter to the editor KW - ChatGPT KW - AI KW - artificial intelligence KW - language KW - translation KW - health care disparity KW - natural language model KW - survey KW - patient education KW - accessibility KW - preference KW - human language KW - communication KW - language-concordant care UR - https://mededu.jmir.org/2025/1/e70420 UR - http://dx.doi.org/10.2196/70420 ID - info:doi/10.2196/70420 ER - TY - JOUR AU - Teng, Joyce AU - Novoa, Andres Roberto AU - Aleshin, Alexandrovna Maria AU - Lester, Jenna AU - Seiger, Kira AU - Dzuali, Fiatsogbe AU - Daneshjou, Roxana PY - 2025/4/11 TI - Authors' Reply: Enhancing AI-Driven Medical Translations: Considerations for Language Concordance JO - JMIR Med Educ SP - e71721 VL - 11 KW - ChatGPT KW - artificial intelligence KW - language KW - translation KW - health care disparity KW - natural language model KW - survey KW - patient education KW - accessibility KW - preference KW - human language KW - communication KW - language-concordant care UR - https://mededu.jmir.org/2025/1/e71721 UR - http://dx.doi.org/10.2196/71721 ID - info:doi/10.2196/71721 ER - TY - JOUR AU - Kıyak, Selim Yavuz AU - Kononowicz, A. Andrzej PY - 2025/4/4 TI - Using a Hybrid of AI and Template-Based Method in Automatic Item Generation to Create Multiple-Choice Questions in Medical Education: Hybrid AIG JO - JMIR Form Res SP - e65726 VL - 9 KW - automatic item generation KW - ChatGPT KW - artificial intelligence KW - large language models KW - medical education KW - AI KW - hybrid KW - template-based method KW - hybrid AIG KW - mixed-method KW - multiple-choice question KW - multiple-choice KW - human-AI collaboration KW - human-AI KW - algorithm KW - expert N2 - Background: Template-based automatic item generation (AIG) is more efficient than traditional item writing, but it still heavily relies on expert effort in model development. While nontemplate-based AIG, leveraging artificial intelligence (AI), offers efficiency, it faces accuracy challenges. Medical education, a field that relies heavily on both formative and summative assessments with multiple-choice questions, is in dire need of AI-based support for the efficient automatic generation of items. Objective: We aimed to propose a hybrid AIG method to demonstrate whether it is possible to generate item templates using AI in the field of medical education. Methods: This is a mixed-methods methodological study with proof-of-concept elements. We propose the hybrid AIG method as a structured series of interactions between a human subject matter expert and AI, designed as a collaborative authoring effort. The method leverages AI to generate item models (templates) and cognitive models to combine the advantages of the two AIG approaches. To demonstrate how to create item models using hybrid AIG, we used 2 medical multiple-choice questions: one on respiratory infections in adults and another on acute allergic reactions in the pediatric population. Results: The hybrid AIG method we propose consists of 7 steps. The first 5 steps are performed by an expert in a customized AI environment. These involve providing a parent item, identifying elements for manipulation, selecting options and assigning values to elements, and generating the cognitive model.
After a final expert review (Step 6), the content in the template can be used for item generation through traditional (non-AI) software (Step 7). We showed that AI is capable of generating item templates for AIG under the control of a human expert in only 10 minutes. Leveraging AI made template development less challenging. Conclusions: The hybrid AIG method transcends the traditional template-based approach by marrying the "art" that comes from AI as a "black box" with the "science" of algorithmic generation under the oversight of an expert as a "marriage registrar." It not only capitalizes on the strengths of both approaches but also mitigates their weaknesses, offering a human-AI collaboration to increase efficiency in medical education. UR - https://formative.jmir.org/2025/1/e65726 UR - http://dx.doi.org/10.2196/65726 ID - info:doi/10.2196/65726 ER - TY - JOUR AU - Cook, A. David AU - Overgaard, Joshua AU - Pankratz, Shane V. AU - Del Fiol, Guilherme AU - Aakre, A. Chris PY - 2025/4/4 TI - Virtual Patients Using Large Language Models: Scalable, Contextualized Simulation of Clinician-Patient Dialogue With Feedback JO - J Med Internet Res SP - e68486 VL - 27 KW - simulation training KW - natural language processing KW - computer-assisted instruction KW - clinical decision-making KW - clinical reasoning KW - machine learning KW - virtual patient KW - natural language generation N2 - Background: Virtual patients (VPs) are computer screen-based simulations of patient-clinician encounters. VP use is limited by cost and low scalability. Objective: We aimed to show that VPs powered by large language models (LLMs) can generate authentic dialogues, accurately represent patient preferences, and provide personalized feedback on clinical performance. We also explored using LLMs to rate the quality of dialogues and feedback. Methods: We conducted an intrinsic evaluation study rating 60 VP-clinician conversations. We used carefully engineered prompts to direct OpenAI's generative pretrained transformer (GPT) to emulate a patient and provide feedback. Using 2 outpatient medicine topics (chronic cough diagnosis and diabetes management), each with permutations representing different patient preferences, we created 60 conversations (dialogues plus feedback): 48 with a human clinician and 12 "self-chat" dialogues with GPT role-playing both the VP and clinician. Primary outcomes were dialogue authenticity and feedback quality, rated using novel instruments for which we conducted a validation study collecting evidence of content, internal structure (reproducibility), relations with other variables, and response process. Each conversation was rated by 3 physicians and by GPT. Secondary outcomes included user experience, bias, patient preferences represented in the dialogues, and conversation features that influenced authenticity. Results: The average cost per conversation was US $0.51 for GPT-4.0-Turbo and US $0.02 for GPT-3.5-Turbo. Mean (SD) conversation ratings, maximum 6, were overall dialogue authenticity 4.7 (0.7), overall user experience 4.9 (0.7), and average feedback quality 4.7 (0.6). For dialogues created using GPT-4.0-Turbo, physician ratings of patient preferences aligned with intended preferences in 20 to 47 of 48 dialogues (42%-98%). Subgroup comparisons revealed higher ratings for dialogues using GPT-4.0-Turbo versus GPT-3.5-Turbo and for human-generated versus self-chat dialogues.
Feedback ratings were similar for human-generated versus GPT-generated ratings, whereas authenticity ratings were lower. We did not perceive bias in any conversation. Dialogue features that detracted from authenticity included that GPT was verbose or used atypical vocabulary (93/180, 51.7% of conversations), was overly agreeable (n=56, 31%), repeated the question as part of the response (n=47, 26%), was easily convinced by clinician suggestions (n=35, 19%), or was not disaffected by poor clinician performance (n=32, 18%). For feedback, detractors included excessively positive feedback (n=42, 23%), failure to mention important weaknesses or strengths (n=41, 23%), or factual inaccuracies (n=39, 22%). Regarding validation of dialogue and feedback scores, items were meticulously developed (content evidence), and we confirmed expected relations with other variables (higher ratings for advanced LLMs and human-generated dialogues). Reproducibility was suboptimal, due largely to variation in LLM performance rather than rater idiosyncrasies. Conclusions: LLM-powered VPs can simulate patient-clinician dialogues, demonstrably represent patient preferences, and provide personalized performance feedback. This approach is scalable, globally accessible, and inexpensive. LLM-generated ratings of feedback quality are similar to human ratings. UR - https://www.jmir.org/2025/1/e68486 UR - http://dx.doi.org/10.2196/68486 UR - http://www.ncbi.nlm.nih.gov/pubmed/39854611 ID - info:doi/10.2196/68486 ER - TY - JOUR AU - Zhang, Manlin AU - Zhao, Tianyu PY - 2025/4/2 TI - Citation Accuracy Challenges Posed by Large Language Models JO - JMIR Med Educ SP - e72998 VL - 11 KW - chatGPT KW - medical education KW - Saudi Arabia KW - perceptions KW - knowledge KW - medical students KW - faculty KW - chatbot KW - qualitative study KW - artificial intelligence KW - AI KW - AI-based tools KW - universities KW - thematic analysis KW - learning KW - satisfaction KW - LLM KW - large language model UR - https://mededu.jmir.org/2025/1/e72998 UR - http://dx.doi.org/10.2196/72998 ID - info:doi/10.2196/72998 ER - TY - JOUR AU - Temsah, Mohamad-Hani AU - Al-Eyadhy, Ayman AU - Jamal, Amr AU - Alhasan, Khalid AU - Malki, H. Khalid PY - 2025/4/2 TI - Authors'
Reply: Citation Accuracy Challenges Posed by Large Language Models JO - JMIR Med Educ SP - e73698 VL - 11 KW - ChatGPT KW - Gemini KW - DeepSeek KW - medical education KW - AI KW - artificial intelligence KW - Saudi Arabia KW - perceptions KW - medical students KW - faculty KW - LLM KW - chatbot KW - qualitative study KW - thematic analysis KW - satisfaction KW - RAG retrieval-augmented generation UR - https://mededu.jmir.org/2025/1/e73698 UR - http://dx.doi.org/10.2196/73698 ID - info:doi/10.2196/73698 ER - TY - JOUR AU - Yan, Zelin AU - Liu, Jingwen AU - Fan, Yihong AU - Lu, Shiyuan AU - Xu, Dingting AU - Yang, Yun AU - Wang, Honggang AU - Mao, Jie AU - Tseng, Hou-Chiang AU - Chang, Tao-Hsing AU - Chen, Yan PY - 2025/3/31 TI - Ability of ChatGPT to Replace Doctors in Patient Education: Cross-Sectional Comparative Analysis of Inflammatory Bowel Disease JO - J Med Internet Res SP - e62857 VL - 27 KW - AI-assisted KW - patient education KW - inflammatory bowel disease KW - artificial intelligence KW - ChatGPT KW - patient communities KW - social media KW - disease management KW - readability KW - online health information KW - conversational agents N2 - Background: Although large language models (LLMs) such as ChatGPT show promise for providing specialized information, their quality requires further evaluation. This is especially true considering that these models are trained on internet text and the quality of health-related information available online varies widely. Objective: The aim of this study was to evaluate the performance of ChatGPT in the context of patient education for individuals with chronic diseases, comparing it with that of industry experts to elucidate its strengths and limitations. Methods: This evaluation was conducted in September 2023 by analyzing the responses of ChatGPT and specialist doctors to questions posed by patients with inflammatory bowel disease (IBD). We compared their performance in terms of subjective accuracy, empathy, completeness, and overall quality, as well as readability to support objective analysis. Results: In a series of 1578 binary choice assessments, ChatGPT was preferred in 48.4% (95% CI 45.9%-50.9%) of instances. There were 12 instances where ChatGPT's responses were unanimously preferred by all evaluators, compared with 17 instances for specialist doctors. In terms of overall quality, there was no significant difference between the responses of ChatGPT (3.98, 95% CI 3.93-4.02) and those of specialist doctors (3.95, 95% CI 3.90-4.00; t524=0.95, P=.34), both being considered "good." Although differences in accuracy (t521=0.48, P=.63) and empathy (t511=2.19, P=.03) lacked statistical significance, the completeness of textual output (t509=9.27, P<.001) was a distinct advantage of the LLM (ChatGPT). In the sections of the questionnaire where patients and doctors responded together (Q223-Q242), ChatGPT demonstrated inferior performance (t36=2.91, P=.006). Regarding readability, no statistical difference was found between the responses of specialist doctors (median: 7th grade; Q1: 4th grade; Q3: 8th grade) and those of ChatGPT (median: 7th grade; Q1: 7th grade; Q3: 8th grade) according to the Mann-Whitney U test (P=.09). The overall quality of ChatGPT's output exhibited strong correlations with other subdimensions (with empathy: r=0.842; with accuracy: r=0.839; with completeness: r=0.795), and there was also a high correlation between the subdimensions of accuracy and completeness (r=0.762).
Conclusions: ChatGPT demonstrated more stable performance across various dimensions. Its output of health information content is more structurally sound, addressing the issue of variability in the information from individual specialist doctors. ChatGPT's performance highlights its potential as an auxiliary tool for health information, despite limitations such as artificial intelligence hallucinations. It is recommended that patients be involved in the creation and evaluation of health information to enhance the quality and relevance of the information. UR - https://www.jmir.org/2025/1/e62857 UR - http://dx.doi.org/10.2196/62857 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/62857 ER - TY - JOUR AU - Madrid, Julian AU - Diehl, Philipp AU - Selig, Mischa AU - Rolauffs, Bernd AU - Hans, Patricius Felix AU - Busch, Hans-Jörg AU - Scheef, Tobias AU - Benning, Leo PY - 2025/3/21 TI - Performance of Plug-In Augmented ChatGPT and Its Ability to Quantify Uncertainty: Simulation Study on the German Medical Board Examination JO - JMIR Med Educ SP - e58375 VL - 11 KW - medical education KW - artificial intelligence KW - generative AI KW - large language model KW - LLM KW - ChatGPT KW - GPT-4 KW - board licensing examination KW - professional education KW - examination KW - student KW - experimental KW - bootstrapping KW - confidence interval N2 - Background: GPT-4 is a large language model (LLM) trained and fine-tuned on an extensive dataset. After the public release of its predecessor in November 2022, the use of LLMs has seen a significant spike in interest, and a multitude of potential use cases have been proposed. In parallel, however, important limitations have been outlined. In particular, current LLMs encounter limitations in symbolic representation and in accessing contemporary data. The recent version of GPT-4, alongside newly released plugin features, has been introduced to mitigate some of these limitations. Objective: Against this background, this work aims to investigate the performance of GPT-3.5, GPT-4, GPT-4 with plugins, and GPT-4 with plugins using pretranslated English text on the German medical board examination. Recognizing the critical importance of quantifying uncertainty for LLM applications in medicine, we furthermore assess this ability and develop a new metric termed "confidence accuracy" to evaluate it. Methods: We used GPT-3.5, GPT-4, GPT-4 with plugins, and GPT-4 with plugins and translation to answer questions from the German medical board examination. Additionally, we conducted an analysis to assess how the models justify their answers, the accuracy of their responses, and the error structure of their answers. Bootstrapping and CIs were used to evaluate the statistical significance of our findings. Results: This study demonstrated that available GPT models, as LLM examples, exceeded the minimum competency threshold established by the German medical board for medical students to obtain board certification to practice medicine. Moreover, the models could assess the uncertainty in their responses, albeit exhibiting overconfidence. Additionally, this work unraveled certain justification and reasoning structures that emerge when GPT generates answers. Conclusions: The high performance of GPTs in answering medical questions positions them well for applications in academia and, potentially, clinical practice.
Its capability to quantify uncertainty in answers suggests it could be a valuable artificial intelligence agent within the clinical decision-making loop. Nevertheless, significant challenges must be addressed before artificial intelligence agents can be robustly and safely implemented in the medical domain. UR - https://mededu.jmir.org/2025/1/e58375 UR - http://dx.doi.org/10.2196/58375 ID - info:doi/10.2196/58375 ER - TY - JOUR AU - Andalib, Saman AU - Spina, Aidin AU - Picton, Bryce AU - Solomon, S. Sean AU - Scolaro, A. John AU - Nelson, M. Ariana PY - 2025/3/21 TI - Using AI to Translate and Simplify Spanish Orthopedic Medical Text: Instrument Validation Study JO - JMIR AI SP - e70222 VL - 4 KW - large language models KW - LLM KW - patient education KW - translation KW - bilingual evaluation understudy KW - GPT-4 KW - Google Translate N2 - Background: Language barriers contribute significantly to health care disparities in the United States, where a sizable proportion of patients are exclusively Spanish speakers. In orthopedic surgery, such barriers impact both patients' comprehension of and patients' engagement with available resources. Studies have explored the utility of large language models (LLMs) for medical translation but have yet to robustly evaluate artificial intelligence (AI)-driven translation and simplification of orthopedic materials for Spanish speakers. Objective: This study used the bilingual evaluation understudy (BLEU) method to assess translation quality and investigated the ability of AI to simplify patient education materials (PEMs) in Spanish. Methods: PEMs (n=78) from the American Academy of Orthopaedic Surgeons were translated from English to Spanish, using 2 LLMs (GPT-4 and Google Translate). The BLEU methodology was applied to compare AI translations with professionally human-translated PEMs. The Friedman test and Dunn multiple comparisons test were used to statistically quantify differences in translation quality. A readability analysis and feature analysis were subsequently performed to evaluate text simplification success and the impact of English text features on BLEU scores. The capability of an LLM to simplify medical language written in Spanish was also assessed. Results: As measured by BLEU scores, GPT-4 showed moderate success in translating PEMs into Spanish but was less successful than Google Translate. Simplified PEMs demonstrated improved readability when compared to original versions (P<.001) but were unable to reach the targeted grade level for simplification. The feature analysis revealed that the total number of syllables and average number of syllables per sentence had the highest impact on BLEU scores. GPT-4 was able to significantly reduce the complexity of medical text written in Spanish (P<.001). Conclusions: Although Google Translate outperformed GPT-4 in translation accuracy, LLMs, such as GPT-4, may provide significant utility in translating medical texts into Spanish and simplifying such texts. We recommend considering a dual approach, using Google Translate for translation and GPT-4 for simplification, to improve medical information accessibility and orthopedic surgery education among Spanish-speaking patients.
UR - https://ai.jmir.org/2025/1/e70222 UR - http://dx.doi.org/10.2196/70222 ID - info:doi/10.2196/70222 ER - TY - JOUR AU - Tseng, Liang-Wei AU - Lu, Yi-Chin AU - Tseng, Liang-Chi AU - Chen, Yu-Chun AU - Chen, Hsing-Yu PY - 2025/3/19 TI - Performance of ChatGPT-4 on Taiwanese Traditional Chinese Medicine Licensing Examinations: Cross-Sectional Study JO - JMIR Med Educ SP - e58897 VL - 11 KW - artificial intelligence KW - AI language understanding tools KW - ChatGPT KW - natural language processing KW - machine learning KW - Chinese medicine license exam KW - Chinese medical licensing examination KW - medical education KW - traditional Chinese medicine KW - large language model N2 - Background: The integration of artificial intelligence (AI), notably ChatGPT, into medical education has shown promising results in various medical fields. Nevertheless, its efficacy in traditional Chinese medicine (TCM) examinations remains understudied. Objective: This study aims to (1) assess the performance of ChatGPT on the TCM licensing examination in Taiwan and (2) evaluate the model's explainability in answering TCM-related questions to determine its suitability as a TCM learning tool. Methods: We used the GPT-4 model to respond to 480 questions from the 2022 TCM licensing examination. This study compared the performance of the model against that of licensed TCM doctors using 2 approaches, namely direct answer selection and provision of explanations before answer selection. The accuracy and consistency of AI-generated responses were analyzed. Moreover, a breakdown of question characteristics was performed based on the cognitive level, depth of knowledge, types of questions, vignette style, and polarity of questions. Results: ChatGPT achieved an overall accuracy of 43.9%, which was lower than that of 2 human participants (70% and 78.4%). The analysis did not reveal a significant correlation between the accuracy of the model and the characteristics of the questions. An in-depth examination indicated that errors predominantly resulted from a misunderstanding of TCM concepts (55.3%), emphasizing the limitations of the model with regard to its TCM knowledge base and reasoning capability. Conclusions: Although ChatGPT shows promise as an educational tool, its current performance on TCM licensing examinations is lacking. This highlights the need for enhancing AI models with specialized TCM training and suggests a cautious approach to utilizing AI for TCM education. Future research should focus on model improvement and the development of tailored educational applications to support TCM learning. UR - https://mededu.jmir.org/2025/1/e58897 UR - http://dx.doi.org/10.2196/58897 ID - info:doi/10.2196/58897 ER - TY - JOUR AU - Pastrak, Mila AU - Kajitani, Sten AU - Goodings, James Anthony AU - Drewek, Austin AU - LaFree, Andrew AU - Murphy, Adrian PY - 2025/3/12 TI - Evaluation of ChatGPT Performance on Emergency Medicine Board Examination Questions: Observational Study JO - JMIR AI SP - e67696 VL - 4 KW - artificial intelligence KW - ChatGPT-4 KW - medical education KW - emergency medicine KW - examination KW - examination preparation N2 - Background: The ever-evolving field of medicine has highlighted the potential for ChatGPT as an assistive platform. However, its use in medical board examination preparation and completion remains unclear.
Objective: This study aimed to evaluate the performance of a custom-modified version of ChatGPT-4, tailored with emergency medicine board examination preparatory materials (Anki flashcard deck), compared to its default version and previous iteration (3.5). The goal was to assess the accuracy of ChatGPT-4 answering board-style questions and its suitability as a tool to aid students and trainees in standardized examination preparation. Methods: A comparative analysis was conducted using a random selection of 598 questions from the Rosh In-Training Examination Question Bank. The subjects of the study included three versions of ChatGPT: the Default, a Custom, and ChatGPT-3.5. The accuracy, response length, medical discipline subgroups, and underlying causes of error were analyzed. Results: The Custom version did not demonstrate a significant improvement in accuracy over the Default version (P=.61), although both significantly outperformed ChatGPT-3.5 (P<.001). The Default version produced significantly longer responses than the Custom version, with the mean (SD) values being 1371 (444) and 929 (408), respectively (P<.001). Subgroup analysis revealed no significant difference in the performance across different medical subdisciplines between the versions (P>.05 in all cases). Both the versions of ChatGPT-4 had similar underlying error types (P>.05 in all cases) and had a 99% predicted probability of passing while ChatGPT-3.5 had an 85% probability. Conclusions: The findings suggest that while newer versions of ChatGPT exhibit improved performance in emergency medicine board examination preparation, specific enhancement with a comprehensive Anki flashcard deck on the topic does not significantly impact accuracy. The study highlights the potential of ChatGPT-4 as a tool for medical education, capable of providing accurate support across a wide range of topics in emergency medicine in its default form. UR - https://ai.jmir.org/2025/1/e67696 UR - http://dx.doi.org/10.2196/67696 ID - info:doi/10.2196/67696 ER - TY - JOUR AU - Monzon, Noahlana AU - Hays, Alan Franklin PY - 2025/3/11 TI - Leveraging Generative Artificial Intelligence to Improve Motivation and Retrieval in Higher Education Learners JO - JMIR Med Educ SP - e59210 VL - 11 KW - educational technology KW - retrieval practice KW - flipped classroom KW - cognitive engagement KW - personalized learning KW - generative artificial intelligence KW - higher education KW - university education KW - learners KW - instructors KW - curriculum structure KW - learning KW - technologies KW - innovation KW - academic misconduct KW - gamification KW - self-directed KW - socio-economic disparities KW - interactive approach KW - medical education KW - chatGPT KW - machine learning KW - AI KW - large language models UR - https://mededu.jmir.org/2025/1/e59210 UR - http://dx.doi.org/10.2196/59210 ID - info:doi/10.2196/59210 ER - TY - JOUR AU - Zada, Troy AU - Tam, Natalie AU - Barnard, Francois AU - Van Sittert, Marlize AU - Bhat, Venkat AU - Rambhatla, Sirisha PY - 2025/3/10 TI - Medical Misinformation in AI-Assisted Self-Diagnosis: Development of a Method (EvalPrompt) for Analyzing Large Language Models JO - JMIR Form Res SP - e66207 VL - 9 KW - ChatGPT KW - health care KW - LLM KW - misinformation KW - self-diagnosis KW - large language model N2 - Background: Rapid integration of large language models (LLMs) in health care is sparking global discussion about their potential to revolutionize health care quality and accessibility. 
At a time when improving health care quality and access remains a critical concern for countries worldwide, the ability of these models to pass medical examinations is often cited as a reason to use them for medical training and diagnosis. However, the impact of their inevitable use as a self-diagnostic tool and their role in spreading health care misinformation has not been evaluated. Objective: This study aims to assess the effectiveness of LLMs, particularly ChatGPT, from the perspective of an individual self-diagnosing to better understand the clarity, correctness, and robustness of the models. Methods: We propose the comprehensive testing methodology evaluation of LLM prompts (EvalPrompt). This evaluation methodology uses multiple-choice medical licensing examination questions to evaluate LLM responses. Experiment 1 prompts ChatGPT with open-ended questions to mimic real-world self-diagnosis use cases, and experiment 2 performs sentence dropout on the correct responses from experiment 1 to mimic self-diagnosis with missing information. Humans then assess the responses returned by ChatGPT for both experiments to evaluate the clarity, correctness, and robustness of ChatGPT. Results: In experiment 1, we found that ChatGPT-4.0 was deemed correct for 31% (29/94) of the questions by both nonexperts and experts, with only 34% (32/94) agreement between the 2 groups. Similarly, in experiment 2, which assessed robustness, 61% (92/152) of the responses continued to be categorized as correct by all assessors. As a result, in comparison to a passing threshold of 60%, ChatGPT-4.0 is considered incorrect and unclear, though robust. This indicates that sole reliance on ChatGPT-4.0 for self-diagnosis could increase the risk of individuals being misinformed. Conclusions: The results highlight the modest capabilities of LLMs, as their responses are often unclear and inaccurate. Any medical advice provided by LLMs should be cautiously approached due to the significant risk of misinformation. However, evidence suggests that LLMs are steadily improving and could potentially play a role in health care systems in the future. To address the issue of medical misinformation, there is a pressing need for the development of a comprehensive self-diagnosis dataset. This dataset could enhance the reliability of LLMs in medical applications by featuring more realistic prompt styles with minimal information across a broader range of medical fields. UR - https://formative.jmir.org/2025/1/e66207 UR - http://dx.doi.org/10.2196/66207 ID - info:doi/10.2196/66207 ER - TY - JOUR AU - Kammies, Chamandra AU - Archer, Elize AU - Engel-Hills, Penelope AU - Volschenk, Mariette PY - 2025/3/6 TI - Exploring Curriculum Considerations to Prepare Future Radiographers for an AI-Assisted Health Care Environment: Protocol for Scoping Review JO - JMIR Res Protoc SP - e60431 VL - 14 KW - artificial intelligence KW - machine learning KW - radiography KW - education KW - scoping review N2 - Background: The use of artificial intelligence (AI) technologies in radiography practice is increasing. As this advanced technology becomes more embedded in radiography systems and clinical practice, the role of radiographers will evolve. In the context of these anticipated changes, it may be reasonable to expect modifications to the competencies and educational requirements of current and future practitioners to ensure successful AI adoption. 
Objective: The aim of this scoping review is to explore and synthesize the literature on the adjustments needed in the radiography curriculum to prepare radiography students for the demands of AI-assisted health care environments. Methods: Using the Joanna Briggs Institute methodology, an initial search was run in Scopus to determine whether the search strategy that was developed with a library specialist would capture the relevant literature by screening the title and abstract of the first 50 articles. Additional search terms identified in the articles were added to the search strategy. Next, EBSCOhost, PubMed, and Web of Science databases were searched. In total, 2 reviewers will independently review the title, abstract, and full-text articles according to the predefined inclusion and exclusion criteria, with conflicts resolved by a third reviewer. Results: The search results will be reported using the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) checklist. The final scoping review will present the data analysis as findings in tabular form and through narrative descriptions. The final database searches were completed in October 2024 and yielded 2224 records. Title and abstract screening of 1930 articles is underway after removing 294 duplicates. The scoping review is expected to be finalized by the end of March 2025. Conclusions: A scoping review aims to systematically map the evidence on the adjustments needed in the radiography curriculum to prepare radiography students for the integration of AI technologies in the health care environment. It is relevant to map the evidence because increased integration of AI-based technologies in clinical practice has been noted and changes in practice must be underpinned by appropriate education and training. The findings in this study will provide a better understanding of how the radiography curriculum should adapt to meet the educational needs of current and future radiographers to ensure competent and safe practice in response to AI technologies. Trial Registration: Open Science Framework 3nx2a; https://osf.io/3nx2a International Registered Report Identifier (IRRID): PRR1-10.2196/60431 UR - https://www.researchprotocols.org/2025/1/e60431 UR - http://dx.doi.org/10.2196/60431 UR - http://www.ncbi.nlm.nih.gov/pubmed/40053777 ID - info:doi/10.2196/60431 ER - TY - JOUR AU - Prazeres, Filipe PY - 2025/3/5 TI - ChatGPT's Performance on Portuguese Medical Examination Questions: Comparative Analysis of ChatGPT-3.5 Turbo and ChatGPT-4o Mini JO - JMIR Med Educ SP - e65108 VL - 11 KW - ChatGPT-3.5 Turbo KW - ChatGPT-4o mini KW - medical examination KW - European Portuguese KW - AI performance evaluation KW - Portuguese KW - evaluation KW - medical examination questions KW - examination question KW - chatbot KW - ChatGPT KW - model KW - artificial intelligence KW - AI KW - GPT KW - LLM KW - NLP KW - natural language processing KW - machine learning KW - large language model N2 - Background: Advancements in ChatGPT are transforming medical education by providing new tools for assessment and learning, potentially enhancing evaluations for doctors and improving instructional effectiveness.
Objective: This study evaluates the performance and consistency of ChatGPT-3.5 Turbo and ChatGPT-4o mini in solving European Portuguese medical examination questions (2023 National Examination for Access to Specialized Training; Prova Nacional de Acesso à Formação Especializada [PNA]) and compares their performance to human candidates. Methods: ChatGPT-3.5 Turbo was tested on the first part of the examination (74 questions) on July 18, 2024, and ChatGPT-4o mini on the second part (74 questions) on July 19, 2024. Each model generated an answer using its natural language processing capabilities. To test consistency, each model was asked, "Are you sure?" after providing an answer. Differences between the first and second responses of each model were analyzed using the McNemar test with continuity correction. A single-parameter t test compared the models' performance to human candidates. Frequencies and percentages were used for categorical variables, and means and CIs for numerical variables. Statistical significance was set at P<.05. Results: ChatGPT-4o mini achieved an accuracy rate of 65% (48/74) on the 2023 PNA examination, surpassing ChatGPT-3.5 Turbo. ChatGPT-4o mini outperformed medical candidates, while ChatGPT-3.5 Turbo had a more moderate performance. Conclusions: This study highlights the advancements and potential of ChatGPT models in medical education, emphasizing the need for careful implementation with teacher oversight and further research. UR - https://mededu.jmir.org/2025/1/e65108 UR - http://dx.doi.org/10.2196/65108 ID - info:doi/10.2196/65108 ER - TY - JOUR AU - Doru, Berin AU - Maier, Christoph AU - Busse, Sophie Johanna AU - Lücke, Thomas AU - Schönhoff, Judith AU - Enax-Krumova, Elena AU - Hessler, Steffen AU - Berger, Maria AU - Tokic, Marianne PY - 2025/3/3 TI - Detecting Artificial Intelligence-Generated Versus Human-Written Medical Student Essays: Semirandomized Controlled Study JO - JMIR Med Educ SP - e62779 VL - 11 KW - artificial intelligence KW - ChatGPT KW - large language models KW - textual analysis KW - writing style KW - AI KW - chatbot KW - LLMs KW - detection KW - authorship KW - medical student KW - linguistic quality KW - decision-making KW - logical coherence N2 - Background: Large language models, exemplified by ChatGPT, have reached a level of sophistication that makes distinguishing between human- and artificial intelligence (AI)-generated texts increasingly challenging. This has raised concerns in academia, particularly in medicine, where the accuracy and authenticity of written work are paramount. Objective: This semirandomized controlled study aims to examine the ability of 2 blinded expert groups with different levels of content familiarity (medical professionals and humanities scholars with expertise in textual analysis) to distinguish between longer scientific texts in German written by medical students and those generated by ChatGPT. Additionally, the study sought to analyze the reasoning behind their identification choices, particularly the role of content familiarity and linguistic features. Methods: Between May and August 2023, a total of 35 experts (medical: n=22; humanities: n=13) were each presented with 2 pairs of texts on different medical topics. Each pair had similar content and structure: 1 text was written by a medical student, and the other was generated by ChatGPT (version 3.5, March 2023). Experts were asked to identify the AI-generated text and justify their choice.
These justifications were analyzed through a multistage, interdisciplinary qualitative analysis to identify relevant textual features. Before unblinding, experts rated each text on 6 characteristics: linguistic fluency and spelling/grammatical accuracy, scientific quality, logical coherence, expression of knowledge limitations, formulation of future research questions, and citation quality. Univariate tests and multivariate logistic regression analyses were used to examine associations between participants' characteristics, their stated reasons for author identification, and the likelihood of correctly determining a text's authorship. Results: Overall, in 48 out of 69 (70%) decision rounds, participants accurately identified the AI-generated texts, with minimal difference between groups (medical: 31/43, 72%; humanities: 17/26, 65%; odds ratio [OR] 1.37, 95% CI 0.5-3.9). While content errors had little impact on identification accuracy, stylistic features, particularly redundancy (OR 6.90, 95% CI 1.01-47.1), repetition (OR 8.05, 95% CI 1.25-51.7), and thread/coherence (OR 6.62, 95% CI 1.25-35.2), played a crucial role in participants' decisions to identify a text as AI-generated. Conclusions: The findings suggest that both medical and humanities experts were able to identify ChatGPT-generated texts in medical contexts, with their decisions largely based on linguistic attributes. The accuracy of identification appears to be independent of experts' familiarity with the text content. As the decision-making process primarily relies on linguistic attributes, such as stylistic features and text coherence, further quasi-experimental studies using texts from other academic disciplines should be conducted to determine whether instructions based on these features can enhance lecturers' ability to distinguish between student-authored and AI-generated work. UR - https://mededu.jmir.org/2025/1/e62779 UR - http://dx.doi.org/10.2196/62779 UR - http://www.ncbi.nlm.nih.gov/pubmed/40053752 ID - info:doi/10.2196/62779 ER - TY - JOUR AU - Scherr, Riley AU - Spina, Aidin AU - Dao, Allen AU - Andalib, Saman AU - Halaseh, F. Faris AU - Blair, Sarah AU - Wiechmann, Warren AU - Rivera, Ronald PY - 2025/2/27 TI - Novel Evaluation Metric and Quantified Performance of ChatGPT-4 Patient Management Simulations for Early Clinical Education: Experimental Study JO - JMIR Form Res SP - e66478 VL - 9 KW - medical school simulations KW - AI in medical education KW - preclinical curriculum KW - ChatGPT KW - ChatGPT-4 KW - medical simulation KW - simulation KW - multimedia KW - feedback KW - medical education KW - medical student KW - clinical education KW - pilot study KW - patient management N2 - Background: Case studies have shown ChatGPT can run clinical simulations at the medical student level. However, no data have assessed ChatGPT's reliability in meeting desired simulation criteria such as medical accuracy, simulation formatting, and robust feedback mechanisms. Objective: This study aims to quantify ChatGPT's ability to consistently follow formatting instructions and create simulations for preclinical medical student learners according to principles of medical simulation and multimedia educational technology. Methods: Using ChatGPT-4 and a prevalidated starting prompt, the authors ran 360 separate simulations of an acute asthma exacerbation. A total of 180 simulations were given correct answers and 180 simulations were given incorrect answers.
ChatGPT was evaluated for its ability to adhere to basic simulation parameters (stepwise progression, free response, interactivity), advanced simulation parameters (autonomous conclusion, delayed feedback, comprehensive feedback), and medical accuracy (vignette, treatment updates, feedback). Significance was determined with χ2 analyses using 95% CIs for odds ratios. Results: In total, 100% (n=360) of simulations met basic simulation parameters and were medically accurate. For advanced parameters, 55% (200/360) of all simulations delayed feedback, with the Correct arm (157/180, 87%) delaying feedback significantly more often than the Incorrect arm (43/180, 24%; P<.001). A total of 79% (285/360) of simulations concluded autonomously, and there was no difference between the Correct and Incorrect arms in autonomous conclusion (146/180, 81% and 139/180, 77%; P=.36). Overall, 78% (282/360) of simulations gave comprehensive feedback, and there was no difference between the Correct and Incorrect arms in comprehensive feedback (137/180, 76% and 145/180, 81%; P=.31). ChatGPT-4 was not significantly more likely to conclude simulations autonomously (P=.34) or provide comprehensive feedback (P=.27) when feedback was delayed compared to when feedback was not delayed. Conclusions: These simulations have the potential to be a reliable educational tool for simple simulations and can be evaluated by a novel 9-part metric. Per this metric, ChatGPT simulations performed perfectly on medical accuracy and basic simulation parameters. It performed well on comprehensive feedback and autonomous conclusion. Delayed feedback depended on the accuracy of user inputs. A simulation meeting one advanced parameter was not more likely to meet all advanced parameters. Further work must be done to ensure consistent performance across a broader range of simulation scenarios. UR - https://formative.jmir.org/2025/1/e66478 UR - http://dx.doi.org/10.2196/66478 ID - info:doi/10.2196/66478 ER - TY - JOUR AU - Abouammoh, Noura AU - Alhasan, Khalid AU - Aljamaan, Fadi AU - Raina, Rupesh AU - Malki, H. Khalid AU - Altamimi, Ibraheem AU - Muaygil, Ruaim AU - Wahabi, Hayfaa AU - Jamal, Amr AU - Alhaboob, Ali AU - Assiri, Assad Rasha AU - Al-Tawfiq, A. Jaffar AU - Al-Eyadhy, Ayman AU - Soliman, Mona AU - Temsah, Mohamad-Hani PY - 2025/2/20 TI - Perceptions and Earliest Experiences of Medical Students and Faculty With ChatGPT in Medical Education: Qualitative Study JO - JMIR Med Educ SP - e63400 VL - 11 KW - ChatGPT KW - medical education KW - Saudi Arabia KW - perceptions KW - knowledge KW - medical students KW - faculty KW - chatbot KW - qualitative study KW - artificial intelligence KW - AI KW - AI-based tools KW - universities KW - thematic analysis KW - learning KW - satisfaction N2 - Background: With the rapid development of artificial intelligence technologies, there is a growing interest in the potential use of artificial intelligence-based tools like ChatGPT in medical education. However, there is limited research on the initial perceptions and experiences of faculty and students with ChatGPT, particularly in Saudi Arabia. Objective: This study aimed to explore the earliest knowledge, perceived benefits, concerns, and limitations of using ChatGPT in medical education among faculty and students at a leading Saudi Arabian university. Methods: A qualitative exploratory study was conducted in April 2023, involving focused meetings with medical faculty and students with varying levels of ChatGPT experience.
A thematic analysis was used to identify key themes and subthemes emerging from the discussions. Results: Participants demonstrated good knowledge of ChatGPT and its functions. The main themes were perceptions of ChatGPT use, potential benefits, and concerns about ChatGPT in research and medical education. The perceived benefits included collecting and summarizing information and saving time and effort. However, concerns and limitations centered around the potential lack of critical thinking in the information provided, the ambiguity of references, limitations of access, trust in the output of ChatGPT, and ethical concerns. Conclusions: This study provides valuable insights into the perceptions and experiences of medical faculty and students regarding the use of newly introduced large language models like ChatGPT in medical education. While the benefits of ChatGPT were recognized, participants also expressed concerns and limitations requiring further studies for effective integration into medical education, exploring the impact of ChatGPT on learning outcomes, student and faculty satisfaction, and the development of critical thinking skills. UR - https://mededu.jmir.org/2025/1/e63400 UR - http://dx.doi.org/10.2196/63400 UR - http://www.ncbi.nlm.nih.gov/pubmed/39977012 ID - info:doi/10.2196/63400 ER - TY - JOUR AU - Potter, Alison AU - Munsch, Chris AU - Watson, Elaine AU - Hopkins, Emily AU - Kitromili, Sofia AU - O'Neill, Cameron Iain AU - Larbie, Judy AU - Niittymaki, Essi AU - Ramsay, Catriona AU - Burke, Joshua AU - Ralph, Neil PY - 2025/2/19 TI - Identifying Research Priorities in Digital Education for Health Care: Umbrella Review and Modified Delphi Method Study JO - J Med Internet Res SP - e66157 VL - 27 KW - digital education KW - health professions education KW - research priorities KW - umbrella review KW - Delphi KW - artificial intelligence KW - AI N2 - Background: In recent years, the use of digital technology in the education of health care professionals has surged, partly driven by the COVID-19 pandemic. However, there is still a need for focused research to establish evidence of its effectiveness. Objective: This study aimed to define the gaps in the evidence for the efficacy of digital education and to identify priority areas where future research has the potential to contribute to our understanding and use of digital education. Methods: We used a 2-stage approach to identify research priorities. First, an umbrella review of the recent literature (published between 2020 and 2023) was performed to identify and build on existing work. Second, expert consensus on the priority research questions was obtained using a modified Delphi method. Results: A total of 8857 potentially relevant papers were identified. Using the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) methodology, we included 217 papers for full review. All papers were either systematic reviews or meta-analyses. A total of 151 research recommendations were extracted from the 217 papers. These were analyzed, recategorized, and consolidated to create a final list of 63 questions. From these, a modified Delphi process with 42 experts was used to produce the top-five rated research priorities: (1) How do we measure the learning transfer from digital education into the clinical setting? (2) How can we optimize the use of artificial intelligence, machine learning, and deep learning to facilitate education and training? 
(3) What are the methodological requirements for high-quality rigorous studies assessing the outcomes of digital health education? (4) How does the design of digital education interventions (eg, format and modality) in health professionals' education and training curriculum affect learning outcomes? and (5) How should learning outcomes in the field of health professions' digital education be defined and standardized? Conclusions: This review provides a prioritized list of research gaps in digital education in health care, which will be of use to researchers, educators, education providers, and funding agencies. Additional proposals are discussed regarding the next steps needed to advance this agenda, aiming to promote meaningful and practical research on the use of digital technologies and drive excellence in health care education. UR - https://www.jmir.org/2025/1/e66157 UR - http://dx.doi.org/10.2196/66157 UR - http://www.ncbi.nlm.nih.gov/pubmed/39969988 ID - info:doi/10.2196/66157 ER - TY - JOUR AU - Chow, L. James C. AU - Li, Kay PY - 2025/2/18 TI - Developing Effective Frameworks for Large Language Model-Based Medical Chatbots: Insights From Radiotherapy Education With ChatGPT JO - JMIR Cancer SP - e66633 VL - 11 KW - artificial intelligence KW - AI KW - AI in medical education KW - radiotherapy chatbot KW - large language models KW - LLMs KW - medical chatbots KW - health care AI KW - ethical AI in health care KW - personalized learning KW - natural language processing KW - NLP KW - radiotherapy education KW - AI-driven learning tools UR - https://cancer.jmir.org/2025/1/e66633 UR - http://dx.doi.org/10.2196/66633 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/66633 ER - TY - JOUR AU - Ichikawa, Tsunagu AU - Olsen, Elizabeth AU - Vinod, Arathi AU - Glenn, Noah AU - Hanna, Karim AU - Lund, C. Gregg AU - Pierce-Talsma, Stacey PY - 2025/2/11 TI - Generative Artificial Intelligence in Medical Education – Policies and Training at US Osteopathic Medical Schools: Descriptive Cross-Sectional Survey JO - JMIR Med Educ SP - e58766 VL - 11 KW - artificial intelligence KW - medical education KW - faculty development KW - policy KW - AI KW - training KW - United States KW - school KW - university KW - college KW - institution KW - osteopathic KW - osteopathy KW - curriculum KW - student KW - faculty KW - administrator KW - survey KW - cross-sectional N2 - Background: Interest has recently increased in generative artificial intelligence (GenAI), a subset of artificial intelligence that can create new content. Although the publicly available GenAI tools are not specifically trained in the medical domain, they have demonstrated proficiency in a wide range of medical assessments. The future integration of GenAI in medicine remains unknown. However, the rapid availability of GenAI with a chat interface and the potential risks and benefits are the focus of great interest. As with any significant medical advancement or change, medical schools must adapt their curricula to equip students with the skills necessary to become successful physicians. Furthermore, medical schools must ensure that faculty members have the skills to harness these new opportunities to increase their effectiveness as educators. How medical schools currently fulfill their responsibilities is unclear. Colleges of Osteopathic Medicine (COMs) in the United States currently train a significant proportion of the total number of medical students.
These COMs are in academic settings ranging from large public research universities to small private institutions. Therefore, studying COMs will offer a representative sample of the current GenAI integration in medical education. Objective: This study aims to describe the policies and training regarding the specific aspect of GenAI in US COMs, targeting students, faculty, and administrators. Methods: Web-based surveys were sent to deans and Student Government Association (SGA) presidents of the main campuses of fully accredited US COMs. The dean survey included questions regarding current and planned policies and training related to GenAI for students, faculty, and administrators. The SGA president survey included only those questions related to current student policies and training. Results: Responses were received from 81% (26/32) of COMs surveyed. This included 47% (15/32) of the deans and 50% (16/32) of the SGA presidents (with 5 COMs represented by both the deans and the SGA presidents). Most COMs did not have a policy on the student use of GenAI, as reported by the dean (14/15, 93%) and the SGA president (14/16, 88%). Of the COMs with no policy, 79% (11/14) had no formal plans for policy development. Only 1 COM had training for students, which focused entirely on the ethics of using GenAI. Most COMs had no formal plans to provide mandatory (11/14, 79%) or elective (11/15, 73%) training. No COM had GenAI policies for faculty or administrators. Eighty percent had no formal plans for policy development. Furthermore, 33.3% (5/15) of COMs had faculty or administrator GenAI training. Except for examination question development, there was no training to increase faculty or administrator capabilities and efficiency or to decrease their workload. Conclusions: The survey revealed that most COMs lack GenAI policies and training for students, faculty, and administrators. At the few institutions with policies or training, the scope was extremely limited. Most institutions without current training or policies had no formal plans for development. The lack of current policies and training initiatives suggests inadequate preparedness for integrating GenAI into the medical school environment, thereby relegating the responsibility for ethical guidance and training to the individual COM member. UR - https://mededu.jmir.org/2025/1/e58766 UR - http://dx.doi.org/10.2196/58766 ID - info:doi/10.2196/58766 ER - TY - JOUR AU - Burisch, Christian AU - Bellary, Abhav AU - Breuckmann, Frank AU - Ehlers, Jan AU - Thal, C. Serge AU - Sellmann, Timur AU - Gödde, Daniel PY - 2025/2/6 TI - ChatGPT-4 Performance on German Continuing Medical Education – Friend or Foe (Trick or Treat)? Protocol for a Randomized Controlled Trial JO - JMIR Res Protoc SP - e63887 VL - 14 KW - ChatGPT KW - artificial intelligence KW - large language model KW - postgraduate education KW - continuing medical education KW - self-assessment program N2 - Background: The increasing development and spread of artificial and assistive intelligence is opening up new areas of application not only in applied medicine but also in related fields such as continuing medical education (CME), which is part of the mandatory training program for medical doctors in Germany. This study aimed to determine whether medical laypersons can successfully conduct training courses specifically for physicians with the help of a large language model (LLM) such as ChatGPT-4.
This study aims to qualitatively and quantitatively investigate the impact of using artificial intelligence (AI; specifically ChatGPT) on the acquisition of credit points in German postgraduate medical education. Objective: Using this approach, we wanted to test further possible applications of AI in the postgraduate medical education setting and obtain results for practical use. Depending on the results, the potential influence of LLMs such as ChatGPT-4 on CME will be discussed, for example, as part of a SWOT (strengths, weaknesses, opportunities, threats) analysis. Methods: We designed a randomized controlled trial in which adult high school students attempt to solve CME tests across six medical specialties in three study arms (18 CME training courses per study arm) under different interventional conditions with varying amounts of permitted use of ChatGPT-4. Sample size calculation was performed including guess probability (20% correct answers, SD=40%; confidence level of 1-α=.95/α=.05; test power of 1-β=.95; P<.05). The study was registered at the Open Science Framework. Results: As of October 2024, the acquisition of data and students to participate in the trial is ongoing. Upon analysis of our acquired data, we predict our findings to be ready for publication as soon as early 2025. Conclusions: We aim to prove that the advances in AI, especially LLMs such as ChatGPT-4, have considerable effects on medical laypersons' ability to successfully pass CME tests. The implications this holds for the concept of continuing medical education, which may require reevaluation, are yet to be contemplated. Trial Registration: OSF Registries 10.17605/OSF.IO/MZNUF; https://osf.io/mznuf International Registered Report Identifier (IRRID): PRR1-10.2196/63887 UR - https://www.researchprotocols.org/2025/1/e63887 UR - http://dx.doi.org/10.2196/63887 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/63887 ER - TY - JOUR AU - Gazquez-Garcia, Javier AU - Sánchez-Bocanegra, Luis Carlos AU - Sevillano, Luis Jose PY - 2025/2/5 TI - AI in the Health Sector: Systematic Review of Key Skills for Future Health Professionals JO - JMIR Med Educ SP - e58161 VL - 11 KW - artificial intelligence KW - healthcare competencies KW - systematic review KW - healthcare education KW - AI regulation N2 - Background: Technological advancements have significantly reshaped health care, introducing digital solutions that enhance diagnostics and patient care. Artificial intelligence (AI) stands out, offering unprecedented capabilities in data analysis, diagnostic support, and personalized medicine. However, effectively integrating AI into health care necessitates specialized competencies among professionals, an area still in its infancy in terms of comprehensive literature and formalized training programs. Objective: This systematic review aims to consolidate the essential skills and knowledge health care professionals need to integrate AI into their clinical practice effectively, according to the published literature. Methods: We conducted a systematic review, across the PubMed, Scopus, and Web of Science databases, of peer-reviewed literature that directly explored the required skills for health care professionals to integrate AI into their practice, published in English or Spanish from 2018 onward.
Studies that did not refer to specific skills or training in digital health, or that did not directly contribute to understanding the competencies necessary to integrate AI into health care practice, were excluded. Bias in the examined works was evaluated following Cochrane's domain-based recommendations. Results: The initial database search yielded a total of 2457 articles. After deleting duplicates and screening titles and abstracts, 37 articles were selected for full-text review. Out of these, only 7 met all the inclusion criteria for this systematic review. The review identified a diverse range of skills and competencies that we categorized into 14 key areas based on their frequency of appearance in the selected studies, including AI fundamentals, data analytics and management, and ethical considerations. Conclusions: Despite the broadening of search criteria to capture the evolving nature of AI in health care, the review underscores a significant gap in focused studies on the required competencies. Moreover, the review highlights the critical role of regulatory bodies such as the US Food and Drug Administration in facilitating the adoption of AI technologies by establishing trust and standardizing algorithms. Key areas were identified for developing competencies among health care professionals for the implementation of AI, including AI fundamentals knowledge (more focused on assessing the accuracy, reliability, and validity of AI algorithms than on more technical abilities such as programming or mathematics), data analysis skills (including data acquisition, cleaning, visualization, management, and governance), and ethical and legal considerations. In an AI-enhanced health care landscape, the ability to humanize patient care through effective communication is paramount. This balance ensures that while AI streamlines tasks and potentially increases patient interaction time, health care professionals maintain a focus on compassionate care, thereby leveraging AI to enhance, rather than detract from, the patient experience. UR - https://mededu.jmir.org/2025/1/e58161 UR - http://dx.doi.org/10.2196/58161 ID - info:doi/10.2196/58161 ER - TY - JOUR AU - Elhassan, Elwaleed Safia AU - Sajid, Raihan Muhammad AU - Syed, Mariam Amina AU - Fathima, Afreen Sidrah AU - Khan, Shehroz Bushra AU - Tamim, Hala PY - 2025/1/30 TI - Assessing Familiarity, Usage Patterns, and Attitudes of Medical Students Toward ChatGPT and Other Chat-Based AI Apps in Medical Education: Cross-Sectional Questionnaire Study JO - JMIR Med Educ SP - e63065 VL - 11 KW - ChatGPT KW - artificial intelligence KW - large language model KW - medical students KW - ethics KW - chat-based KW - AI apps KW - medical education KW - social media KW - attitude KW - AI N2 - Background: There has been a rise in the popularity of ChatGPT and other chat-based artificial intelligence (AI) apps in medical education. Despite data being available from other parts of the world, there is a significant lack of information on this topic in medical education and research, particularly in Saudi Arabia. Objective: The primary objective of the study was to examine the familiarity, usage patterns, and attitudes of Alfaisal University medical students toward ChatGPT and other chat-based AI apps in medical education. Methods: This was a cross-sectional study conducted from October 8, 2023, through November 22, 2023.
A questionnaire was distributed through social media channels to medical students at Alfaisal University who were 18 years or older. The questionnaire exclusively targeted current Alfaisal University medical students in years 1 through 6, of both genders. The study was approved by the Alfaisal University Institutional Review Board. A χ² test was conducted to assess the relationships between gender, year of study, familiarity, and reasons for usage. Results: A total of 293 responses were received, of which 95 (32.4%) were from men and 198 (67.6%) were from women. There were 236 (80.5%) responses from preclinical students and 57 (19.5%) from clinical students. Overall, males (n=93, 97.9%) showed greater familiarity with ChatGPT than females (n=180, 90.09%; P=.03). Males also used Google Bard and Microsoft Bing ChatGPT more than females (P<.001). Clinical-year students used ChatGPT significantly more for general writing purposes than preclinical students (P=.005). Additionally, 136 (46.4%) students believed that using ChatGPT and other chat-based AI apps for coursework was ethical, 86 (29.4%) were neutral, and 71 (24.2%) considered it unethical (all Ps>.05). Conclusions: Familiarity with and usage of ChatGPT and other chat-based AI apps were common among the students of Alfaisal University. The usage patterns of these apps differ between males and females and between preclinical and clinical-year students. UR - https://mededu.jmir.org/2025/1/e63065 UR - http://dx.doi.org/10.2196/63065 ID - info:doi/10.2196/63065 ER - TY - JOUR AU - Li, Rui AU - Wu, Tong PY - 2025/1/30 TI - Evolution of Artificial Intelligence in Medical Education From 2000 to 2024: Bibliometric Analysis JO - Interact J Med Res SP - e63775 VL - 14 KW - artificial intelligence KW - medical education KW - bibliometric KW - citation trends KW - academic pattern KW - VOSviewer KW - Citespace KW - AI N2 - Background: Incorporating artificial intelligence (AI) into medical education has gained significant attention for its potential to enhance teaching and learning outcomes. However, the field lacks a comprehensive study depicting the academic performance and status of AI in the medical education domain. Objective: This study aims to analyze the social patterns, productive contributors, knowledge structure, and clusters in this field since the start of the 21st century. Methods: Documents were retrieved from the Web of Science Core Collection database from 2000 to 2024. VOSviewer, InCites, and CiteSpace were used to analyze the bibliometric metrics, which were categorized by country, institution, authors, journals, and keywords. The variables analyzed encompassed counts, citations, H-index, impact factor, and collaboration metrics. Results: Altogether, 7534 publications were initially retrieved and 2775 were included for analysis. The annual counts and citations of papers have exhibited exponential trends since 2018. The United States emerged as the leading contributor due to its high productivity and recognition levels. Stanford University, Johns Hopkins University, National University of Singapore, Mayo Clinic, University of Arizona, and University of Toronto were representative institutions in their respective fields. Cureus, JMIR Medical Education, Medical Teacher, and BMC Medical Education ranked as the top four most productive journals. The resulting heat map highlighted several high-frequency keywords, including performance, education, AI, and model.
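As an illustrative aside on the χ² comparisons reported in the Elhassan et al abstract above (chatbot familiarity by gender), the brief sketch below shows how such a 2x2 test can be set up. The counts are taken from that abstract; the library choice, variable names, and correction settings are assumptions, not the authors' analysis code.

```python
# Minimal sketch, assuming a 2x2 chi-square test of ChatGPT familiarity by gender.
# Counts come from the Elhassan et al abstract (93/95 males and 180/198 females familiar);
# the authors' actual software and test options are not described in the abstract.
from scipy.stats import chi2_contingency

table = [
    [93, 95 - 93],     # males: familiar vs not familiar
    [180, 198 - 180],  # females: familiar vs not familiar
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, df={dof}, P={p:.3f}")
```

Small numerical differences from the reported P=.03 are expected, since the abstract does not say whether a continuity correction or an exact test was used.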
The citation burst time of terms revealed that AI technologies shifted from image processing (2000), augmented reality (2013), and virtual reality (2016) to decision-making (2020) and model (2021). Keywords such as mortality and robotic surgery persisted into 2023, suggesting ongoing recognition of and interest in these areas. Conclusions: This study provides valuable insights and guidance for researchers who are interested in educational technology, as well as recommendations for pioneering institutions and journal submissions. Along with the rapid growth of AI, medical education is expected to gain much more benefits. UR - https://www.i-jmr.org/2025/1/e63775 UR - http://dx.doi.org/10.2196/63775 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/63775 ER - TY - JOUR AU - Taira, Kazuya AU - Itaya, Takahiro AU - Yada, Shuntaro AU - Hiyama, Kirara AU - Hanada, Ayame PY - 2025/1/22 TI - Impact of Attached File Formats on the Performance of ChatGPT-4 on the Japanese National Nursing Examination: Evaluation Study JO - JMIR Nursing SP - e67197 VL - 8 KW - nursing examination KW - machine learning KW - ML KW - artificial intelligence KW - AI KW - large language models KW - ChatGPT KW - generative AI N2 - Abstract: This research letter discusses the impact of different file formats on ChatGPT-4's performance on the Japanese National Nursing Examination, highlighting the need for standardized reporting protocols to enhance the integration of artificial intelligence in nursing education and practice. UR - https://nursing.jmir.org/2025/1/e67197 UR - http://dx.doi.org/10.2196/67197 ID - info:doi/10.2196/67197 ER - TY - JOUR AU - Wei, Boxiong PY - 2025/1/16 TI - Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis JO - JMIR Med Educ SP - e64284 VL - 11 KW - large language models KW - LLM KW - artificial intelligence KW - AI KW - GPT-4 KW - radiology exams KW - medical education KW - diagnostics KW - medical training KW - radiology KW - ultrasound N2 - Background: Artificial intelligence advancements have enabled large language models to significantly impact radiology education and diagnostic accuracy. Objective: This study evaluates the performance of mainstream large language models, including GPT-4, Claude, Bard, Tongyi Qianwen, and Gemini Pro, in radiology board exams. Methods: A comparative analysis of 150 multiple-choice questions from radiology board exams without images was conducted. Models were assessed on their accuracy for text-based questions and were categorized by cognitive levels and medical specialties using χ² tests and ANOVA. Results: GPT-4 achieved the highest accuracy (83.3%, 125/150), significantly outperforming all other models. Specifically, Claude achieved an accuracy of 62% (93/150; P<.001), Bard 54.7% (82/150; P<.001), Tongyi Qianwen 70.7% (106/150; P=.009), and Gemini Pro 55.3% (83/150; P<.001). The odds ratios compared to GPT-4 were 0.33 (95% CI 0.18-0.60) for Claude, 0.24 (95% CI 0.13-0.44) for Bard, and 0.25 (95% CI 0.14-0.45) for Gemini Pro. Tongyi Qianwen performed relatively well with an accuracy of 70.7% (106/150; P=.02) and had an odds ratio of 0.48 (95% CI 0.27-0.87) compared to GPT-4. Performance varied across question types and specialties, with GPT-4 excelling in both lower-order and higher-order questions, while Claude and Bard struggled with complex diagnostic questions. Conclusions: GPT-4 and Tongyi Qianwen show promise in medical education and training.
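The odds ratios in the Wei abstract above can be recovered from the reported counts; the sketch below does so for Claude versus GPT-4. The counts come from the abstract, while the use of statsmodels and a Wald-type confidence interval are assumptions rather than the author's stated method.

```python
# Minimal sketch, assuming a 2x2 odds ratio comparison (Claude vs GPT-4) built from the
# counts in the Wei abstract (93/150 and 125/150 correct). The CI method used here is an
# assumption; the paper's exact procedure is not given in the abstract.
import numpy as np
from statsmodels.stats.contingency_tables import Table2x2

table = np.array([
    [93, 150 - 93],    # Claude: correct, incorrect
    [125, 150 - 125],  # GPT-4: correct, incorrect
])

result = Table2x2(table)
print(f"OR = {result.oddsratio:.2f}")        # about 0.33, matching the abstract
print("95% CI:", result.oddsratio_confint()) # close to, but not necessarily identical to, 0.18-0.60
```

The reported interval of 0.18-0.60 may reflect a different CI construction, so minor differences from this sketch are expected.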
The study emphasizes the need for domain-specific training datasets to enhance large language models' effectiveness in specialized fields like radiology. UR - https://mededu.jmir.org/2025/1/e64284 UR - http://dx.doi.org/10.2196/64284 ID - info:doi/10.2196/64284 ER - TY - JOUR AU - Kim, JaeYong AU - Vajravelu, Narayan Bathri PY - 2025/1/16 TI - Assessing the Current Limitations of Large Language Models in Advancing Health Care Education JO - JMIR Form Res SP - e51319 VL - 9 KW - large language model KW - generative pretrained transformer KW - health care education KW - health care delivery KW - artificial intelligence KW - LLM KW - ChatGPT KW - AI UR - https://formative.jmir.org/2025/1/e51319 UR - http://dx.doi.org/10.2196/51319 ID - info:doi/10.2196/51319 ER - TY - JOUR AU - Kaewboonlert, Naritsaret AU - Poontananggul, Jiraphon AU - Pongsuwan, Natthipong AU - Bhakdisongkhram, Gun PY - 2025/1/13 TI - Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study JO - JMIR Med Educ SP - e58898 VL - 11 KW - accuracy KW - performance KW - artificial intelligence KW - AI KW - ChatGPT KW - large language model KW - LLM KW - difficulty index KW - basic medical science examination KW - cross-sectional study KW - medical education KW - datasets KW - assessment KW - medical science KW - tool KW - Google N2 - Background: Artificial intelligence (AI) has become widely applied across many fields, including medical education. The validity of AI-generated content and answers depends on the training datasets and the optimization of each model. The accuracy of large language models (LLMs) in basic medical examinations and the factors related to their accuracy have also been explored. Objective: We evaluated factors associated with the accuracy of LLMs (GPT-3.5, GPT-4, Google Bard, and Microsoft Bing) in answering multiple-choice questions from basic medical science examinations. Methods: We used questions that were closely aligned with the content and topic distribution of Thailand's Step 1 National Medical Licensing Examination. Variables such as the difficulty index, discrimination index, and question characteristics were collected. These questions were then simultaneously input into ChatGPT (with GPT-3.5 and GPT-4), Microsoft Bing, and Google Bard, and their responses were recorded. The accuracy of these LLMs and the associated factors were analyzed using multivariable logistic regression. This analysis aimed to assess the effect of various factors on model accuracy, with results reported as odds ratios (ORs). Results: The study revealed that GPT-4 was the top-performing model, with an overall accuracy of 89.07% (95% CI 84.76%-92.41%), significantly outperforming the others (P<.001). Microsoft Bing followed with an accuracy of 83.69% (95% CI 78.85%-87.80%), GPT-3.5 at 67.02% (95% CI 61.20%-72.48%), and Google Bard at 63.83% (95% CI 57.92%-69.44%). The multivariable logistic regression analysis showed a correlation between question difficulty and model performance, with GPT-4 demonstrating the strongest association. Interestingly, no significant correlation was found between model accuracy and question length, negative wording, clinical scenarios, or the discrimination index for most models, except for Google Bard, which showed varying correlations. Conclusions: The GPT-4 and Microsoft Bing models demonstrated equal and superior accuracy compared to GPT-3.5 and Google Bard in the domain of basic medical science.
The accuracy of these models was significantly influenced by the item's difficulty index, indicating that the LLMs are more accurate when answering easier questions. This suggests that the more accurate models, such as GPT-4 and Bing, can be valuable tools for understanding and learning basic medical science concepts. UR - https://mededu.jmir.org/2025/1/e58898 UR - http://dx.doi.org/10.2196/58898 ID - info:doi/10.2196/58898 ER - TY - JOUR AU - Rjoop, Anwar AU - Al-Qudah, Mohammad AU - Alkhasawneh, Raja AU - Bataineh, Nesreen AU - Abdaljaleel, Maram AU - Rjoub, A. Moayad AU - Alkhateeb, Mustafa AU - Abdelraheem, Mohammad AU - Al-Omari, Salem AU - Bani-Mari, Omar AU - Alkabalan, Anas AU - Altulaih, Saoud AU - Rjoub, Iyad AU - Alshimi, Rula PY - 2025/1/10 TI - Awareness and Attitude Toward Artificial Intelligence Among Medical Students and Pathology Trainees: Survey Study JO - JMIR Med Educ SP - e62669 VL - 11 KW - artificial intelligence KW - AI KW - deep learning KW - medical schools KW - pathology KW - Jordan KW - medical education KW - awareness KW - attitude KW - medical students KW - pathology trainees KW - national survey study KW - medical practice KW - training KW - web-based survey KW - survey KW - questionnaire N2 - Background: Artificial intelligence (AI) is set to shape the future of medical practice. The perspective and understanding of medical students are critical for guiding the development of educational curricula and training. Objective: This study aims to assess and compare medical AI-related attitudes among medical students in general medicine and in one of the visually oriented fields (pathology), along with illuminating the anticipated role of AI in the rapidly evolving landscape of AI-enhanced health care. Methods: This was a cross-sectional study that used a web-based survey composed of a closed-ended questionnaire. The survey addressed medical students at all educational levels across the 5 public medical schools, along with pathology residents in 4 residency programs in Jordan. Results: A total of 394 respondents participated (328 medical students and 66 pathology residents). The majority of respondents (272/394, 69%) were already aware of AI and deep learning in medicine, mainly relying on websites for information on AI, while only 14% (56/394) were aware of AI through medical schools. There was a statistically significant difference in awareness between respondents who consider themselves tech experts and those who do not (P=.03). More than half of the respondents believed that AI could be used to diagnose diseases automatically (213/394, 54.1% agreement), with medical students agreeing more than pathology residents (P=.04). However, more than one-third expressed fear about recent AI developments (167/394, 42.4% agreed). Two-thirds of respondents disagreed that their medical schools had educated them about AI and its potential use (261/394, 66.2% disagreed), while 46.2% (182/394) expressed interest in learning about AI in medicine. In terms of pathology-specific questions, 75.4% (297/394) agreed that AI could be used to identify pathologies in slide examinations automatically. There was a significant difference between medical students and pathology residents in their agreement (P=.001). Overall, medical students and pathology trainees had similar responses. Conclusions: AI education should be introduced into medical school curricula to improve medical students' understanding and attitudes.
Students agreed that they need to learn about AI's applications, potential hazards, and legal and ethical implications. This is the first study to analyze medical students' views and awareness of AI in Jordan, as well as the first to include pathology residents' perspectives. The findings are consistent with earlier international research: these attitudes are similar in low-income and industrialized countries, highlighting the need for a global strategy to introduce AI instruction to medical students everywhere in this era of rapidly expanding technology. UR - https://mededu.jmir.org/2025/1/e62669 UR - http://dx.doi.org/10.2196/62669 ID - info:doi/10.2196/62669 ER - TY - JOUR AU - Zhu, Shiben AU - Hu, Wanqin AU - Yang, Zhi AU - Yan, Jiani AU - Zhang, Fang PY - 2025/1/10 TI - Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study JO - JMIR Med Inform SP - e63731 VL - 13 KW - large language models KW - LLMs KW - Chinese National Nursing Licensing Examination KW - ChatGPT KW - Qwen-2.5 KW - multiple-choice questions N2 - Background: Large language models (LLMs) have been proposed as valuable tools in medical education and practice. The Chinese National Nursing Licensing Examination (CNNLE) presents unique challenges for LLMs due to its requirement for both deep domain-specific nursing knowledge and the ability to make complex clinical decisions, which differentiates it from more general medical examinations. However, their potential application in the CNNLE remains unexplored. Objective: This study aims to evaluate the accuracy of 7 LLMs, including GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5, on the CNNLE, focusing on their ability to handle domain-specific nursing knowledge and clinical decision-making. We also explore whether combining their outputs using machine learning techniques can improve their overall accuracy. Methods: This retrospective cross-sectional study analyzed all 1200 multiple-choice questions from the CNNLE conducted between 2019 and 2023. Seven LLMs were evaluated on these multiple-choice questions, and 9 machine learning models, including Logistic Regression, Support Vector Machine, Multilayer Perceptron, k-nearest neighbors, Random Forest, LightGBM, AdaBoost, XGBoost, and CatBoost, were used to optimize overall performance through ensemble techniques. Results: Qwen-2.5 achieved the highest overall accuracy of 88.9%, followed by GPT-4o (80.7%), ERNIE Bot-3.5 (78.1%), GPT-4.0 (70.3%), SPARK (65.0%), and GPT-3.5 (49.5%). Qwen-2.5 demonstrated superior accuracy in the Practical Skills section compared with the Professional Practice section across most years. It also performed well in brief clinical case summaries and questions involving shared clinical scenarios. When the outputs of the 7 LLMs were combined using 9 machine learning models, XGBoost yielded the best performance, increasing accuracy to 90.8%. XGBoost also achieved an area under the curve of 0.961, sensitivity of 0.905, specificity of 0.978, F1-score of 0.901, positive predictive value of 0.901, and negative predictive value of 0.977. Conclusions: This study is the first to evaluate the performance of 7 LLMs on the CNNLE and to show that integrating their outputs via machine learning significantly boosted accuracy, reaching 90.8%.
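The ensemble step in the Zhu et al abstract above (combining the answers of 7 LLMs with machine learning, with XGBoost performing best) can be pictured with a small stacking sketch. Everything below, including the toy data, the encoding of each LLM's chosen option as a feature, and the hyperparameters, is assumed for illustration and is not the authors' pipeline.

```python
# Minimal stacking sketch, assuming each LLM's selected option (A-D, encoded 0-3) is a
# feature and the correct option is the target, with XGBoost as the meta-classifier.
# Toy data only; not the Zhu et al dataset or code.
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
n_questions, n_llms = 1200, 7

# Each column holds one simulated LLM's chosen option for every question.
X = rng.integers(0, 4, size=(n_questions, n_llms))
# Toy ground truth: mostly agrees with the first "LLM", with 20% of answers flipped.
y = X[:, 0].copy()
flip = rng.random(n_questions) < 0.2
y[flip] = rng.integers(0, 4, size=flip.sum())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
meta = XGBClassifier(n_estimators=200, max_depth=3, eval_metric="mlogloss")
meta.fit(X_train, y_train)
print("held-out accuracy of the ensemble:", round(meta.score(X_test, y_test), 3))
```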
These findings demonstrate the transformative potential of LLMs in revolutionizing health care education and call for further research to refine their capabilities and expand their impact on examination preparation and professional training. UR - https://medinform.jmir.org/2025/1/e63731 UR - http://dx.doi.org/10.2196/63731 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/63731 ER - TY - JOUR AU - Zhang, Yong AU - Lu, Xiao AU - Luo, Yan AU - Zhu, Ying AU - Ling, Wenwu PY - 2025/1/9 TI - Performance of Artificial Intelligence Chatbots on Ultrasound Examinations: Cross-Sectional Comparative Analysis JO - JMIR Med Inform SP - e63924 VL - 13 KW - chatbots KW - ChatGPT KW - ERNIE Bot KW - performance KW - accuracy rates KW - ultrasound KW - language KW - examination N2 - Background: Artificial intelligence chatbots are being increasingly used for medical inquiries, particularly in the field of ultrasound medicine. However, their performance varies and is influenced by factors such as language, question type, and topic. Objective: This study aimed to evaluate the performance of ChatGPT and ERNIE Bot in answering ultrasound-related medical examination questions, providing insights for users and developers. Methods: We curated 554 questions from ultrasound medicine examinations, covering various question types and topics. The questions were posed in both English and Chinese. Objective questions were scored based on accuracy rates, whereas subjective questions were rated by 5 experienced doctors using a Likert scale. The data were analyzed in Excel. Results: Of the 554 questions included in this study, single-choice questions comprised the largest share (354/554, 64%), followed by short answers (69/554, 12%) and noun explanations (63/554, 11%). The accuracy rates for objective questions ranged from 8.33% to 80%, with true or false questions scoring highest. Subjective questions received acceptability rates ranging from 47.62% to 75.36%. ERNIE Bot was superior to ChatGPT in many aspects (P<.05). Both models showed a performance decline in English, but ERNIE Bot's decline was less significant. The models performed better in terms of basic knowledge, ultrasound methods, and diseases than in terms of ultrasound signs and diagnosis. Conclusions: Chatbots can provide valuable ultrasound-related answers, but performance differs by model and is influenced by language, question type, and topic. In general, ERNIE Bot outperforms ChatGPT. Users and developers should understand model performance characteristics and select appropriate models for different questions and languages to optimize chatbot use. UR - https://medinform.jmir.org/2025/1/e63924 UR - http://dx.doi.org/10.2196/63924 ID - info:doi/10.2196/63924 ER - TY - JOUR AU - Bland, Tyler PY - 2025/1/6 TI - Enhancing Medical Student Engagement Through Cinematic Clinical Narratives: Multimodal Generative AI-Based Mixed Methods Study JO - JMIR Med Educ SP - e63865 VL - 11 KW - artificial intelligence KW - cinematic clinical narratives KW - cinemeducation KW - medical education KW - narrative learning KW - AI KW - medical student KW - pharmacology KW - preclinical education KW - long-term retention KW - AI tools KW - GPT-4 KW - image KW - applicability N2 - Background: Medical students often struggle to engage with and retain complex pharmacology topics during their preclinical education. Traditional teaching methods can lead to passive learning and poor long-term retention of critical concepts.
Objective: This study aims to enhance the teaching of clinical pharmacology in medical school by using a multimodal generative artificial intelligence (genAI) approach to create compelling, cinematic clinical narratives (CCNs). Methods: We transformed a standard clinical case into an engaging, interactive multimedia experience called "Shattered Slippers." This CCN used various genAI tools for content creation: GPT-4 for developing the storyline, Leonardo.ai and Stable Diffusion for generating images, Eleven Labs for creating audio narrations, and Suno for composing a theme song. The CCN integrated narrative styles and pop culture references to enhance student engagement. It was applied in teaching first-year medical students about immune system pharmacology. Student responses were assessed through the Situational Interest Survey for Multimedia and examination performance. The target audience comprised first-year medical students (n=40), 18 of whom responded to the Situational Interest Survey for Multimedia (n=18). Results: The study revealed a marked preference for the genAI-enhanced CCNs over traditional teaching methods. Key findings include the majority of surveyed students preferring the CCN over traditional clinical cases (14/18), as well as high average scores for triggered situational interest (mean 4.58, SD 0.53), maintained interest (mean 4.40, SD 0.53), maintained-feeling interest (mean 4.38, SD 0.51), and maintained-value interest (mean 4.42, SD 0.54). Students achieved an average score of 88% on examination questions related to the CCN material, indicating successful learning and retention. Qualitative feedback highlighted increased engagement, improved recall, and appreciation for the narrative style and pop culture references. Conclusions: This study demonstrates the potential of using a multimodal genAI-driven approach to create CCNs in medical education. The "Shattered Slippers" case effectively enhanced student engagement and promoted knowledge retention in complex pharmacological topics. This innovative method suggests a novel direction for curriculum development that could improve learning outcomes and student satisfaction in medical education. Future research should explore the long-term retention of knowledge and the applicability of learned material in clinical settings, as well as the potential for broader implementation of this approach across various medical education contexts. UR - https://mededu.jmir.org/2025/1/e63865 UR - http://dx.doi.org/10.2196/63865 ID - info:doi/10.2196/63865 ER - TY - JOUR AU - Wang, Heng AU - Zheng, Danni AU - Wang, Mengying AU - Ji, Hong AU - Han, Jiangli AU - Wang, Yan AU - Shen, Ning AU - Qiao, Jie PY - 2025/1/3 TI - Artificial Intelligence-Powered Training Database for Clinical Thinking: App Development Study JO - JMIR Form Res SP - e58426 VL - 9 KW - artificial intelligence KW - clinical thinking ability KW - virtual medical records KW - distance education KW - medical education KW - online learning N2 - Background: With the development of artificial intelligence (AI), medicine has entered the era of intelligent medicine, and various aspects, such as medical education and talent cultivation, are also being redefined. The cultivation of clinical thinking abilities poses a formidable challenge even for seasoned clinical educators, as offline training modalities often fall short in bridging the divide between current practice and the desired ideal.
Consequently, there is a pressing need to rapidly develop a web-based database tailored to help physicians learn and hone their clinical reasoning skills. Objective: This study aimed to introduce an app named "XueYiKu," which includes consultations, physical examinations, auxiliary examinations, and diagnosis, incorporating AI and actual complete hospital medical records to build an online-learning platform using human-computer interaction. Methods: The "XueYiKu" app was designed as a contactless, self-service, trial-and-error system application based on actual complete hospital medical records and natural language processing technology to comprehensively assess the "clinical competence" of residents at different stages. Case extraction was performed at a hospital's case data center, and the best-matching cases were differentiated through natural language processing, word segmentation, synonym conversion, and sorting. More than 400 teaching cases covering 65 kinds of diseases were released for students to learn, and the subjects covered internal medicine, surgery, gynecology and obstetrics, and pediatrics. The difficulty of learning cases was divided into four levels in ascending order. Moreover, the learning and teaching effects were evaluated using 6 dimensions covering systematicness, agility, logic, knowledge expansion, multidimensional evaluation indicators, and preciseness. Results: From the app's first launch on the Android platform in May 2019 to the last version updated in May 2023, the total number of teacher and student users was 6209 and 1180, respectively. The top 3 subjects most frequently learned were respirology (n=606, 24.1%), general surgery (n=506, 20.1%), and urinary surgery (n=390, 15.5%). For diseases, pneumonia was the most frequently learned, followed by cholecystolithiasis (n=216, 14.1%), benign prostate hyperplasia (n=196, 12.8%), and bladder tumor (n=193, 12.6%). Among 479 students, roughly a third (n=168, 35.1%) scored in the 60 to 80 range, and half of them scored over 80 points (n=238, 49.7%). The app enabled medical students' learning to become more active and self-motivated, with a variety of formats, and provided real-time feedback through assessments on the platform. The learning effect was satisfactory overall and provided an important precedent for establishing scientific models and methods for assessing clinical thinking skills in the future. Conclusions: The integration of AI and medical education will undoubtedly assist in the restructuring of education processes; promote the evolution of the education ecosystem; and provide new convenient ways for independent learning, interactive communication, and educational resource sharing.
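The case-matching step described in the XueYiKu abstract above (word segmentation, synonym conversion, and sorting of best-matching records) can be pictured with a generic retrieval sketch. The libraries (jieba, scikit-learn), the toy case texts, and the similarity measure are all assumptions for illustration, not the app's actual implementation.

```python
# Minimal sketch, assuming Chinese word segmentation plus TF-IDF cosine similarity to rank
# stored cases against a query. Toy data and library choices only; not the XueYiKu code.
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cases = [
    "发热咳嗽三天，胸片提示右下肺炎",        # pneumonia-like presentation
    "右上腹痛伴恶心，超声提示胆囊结石",      # cholecystolithiasis-like presentation
    "排尿困难，考虑良性前列腺增生",          # benign prostate hyperplasia-like presentation
]
query = "咳嗽伴发热，怀疑肺部感染"

vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None)
matrix = vectorizer.fit_transform(cases + [query])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

# Sort case indices by similarity to the query, best match first.
for idx in scores.argsort()[::-1]:
    print(f"case {idx}: similarity = {scores[idx]:.2f}")
```

A production system would likely add the synonym-conversion layer the abstract mentions (mapping lay terms to standard clinical vocabulary) before vectorization; that step is omitted here for brevity.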
UR - https://formative.jmir.org/2025/1/e58426 UR - http://dx.doi.org/10.2196/58426 ID - info:doi/10.2196/58426 ER - TY - JOUR AU - Wang, Chenxu AU - Li, Shuhan AU - Lin, Nuoxi AU - Zhang, Xinyu AU - Han, Ying AU - Wang, Xiandi AU - Liu, Di AU - Tan, Xiaomei AU - Pu, Dan AU - Li, Kang AU - Qian, Guangwu AU - Yin, Rong PY - 2025/1/1 TI - Application of Large Language Models in Medical Training Evaluation–Using ChatGPT as a Standardized Patient: Multimetric Assessment JO - J Med Internet Res SP - e59435 VL - 27 KW - ChatGPT KW - artificial intelligence KW - standardized patient KW - health care KW - prompt engineering KW - accuracy KW - large language models KW - performance evaluation KW - medical training KW - inflammatory bowel disease N2 - Background: With the increasing interest in the application of large language models (LLMs) in the medical field, the feasibility of their potential use as standardized patients in medical assessment has rarely been evaluated. Specifically, we delved into the potential of using ChatGPT, a representative LLM, in transforming medical education by serving as a cost-effective alternative to standardized patients, specifically for history-taking tasks. Objective: The study aims to explore ChatGPT's viability and performance as a standardized patient, using prompt engineering to refine its accuracy and use in medical assessments. Methods: A 2-phase experiment was conducted. The first phase assessed feasibility by simulating conversations about inflammatory bowel disease (IBD) across 3 quality groups (good, medium, and bad). Responses were categorized based on their relevance and accuracy. Each group consisted of 30 runs, with responses scored to determine whether they were related to the inquiries. For the second phase, we evaluated ChatGPT's performance against specific criteria, focusing on its anthropomorphism, clinical accuracy, and adaptability. Adjustments were made to prompts based on ChatGPT's response shortcomings, with a comparative analysis of ChatGPT's performance between original and revised prompts. A total of 300 runs were conducted and compared against standard reference scores. Finally, the generalizability of the revised prompt was tested using other scripts for another 60 runs, together with the exploration of the impact of the used language on the performance of the chatbot. Results: The feasibility test confirmed ChatGPT's ability to simulate a standardized patient effectively, differentiating among poor, medium, and good medical inquiries with varying degrees of accuracy. Score differences between the poor (74.7, SD 5.44) and medium (82.67, SD 5.30) inquiry groups (P<.001) and between the poor and good (85, SD 3.27) inquiry groups (P<.001) were significant at a significance level (α) of .05, while the score differences between the medium and good inquiry groups were not statistically significant (P=.16). The revised prompt significantly improved ChatGPT's realism, clinical accuracy, and adaptability, leading to a marked reduction in scoring discrepancies. The scoring accuracy of ChatGPT improved by a factor of 4.926 compared with the unrevised prompts. The score difference percentage dropped from 29.83% to 6.06%, with the SD dropping from 0.55 to 0.068. The performance of the chatbot on a separate script was acceptable, with an average score difference percentage of 3.21%. Moreover, the performance differences between test groups using various language combinations were found to be nonsignificant.
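For the group comparisons in the Wang et al abstract above, the means, SDs, and group sizes (30 runs per group) are all reported, so a summary-statistics comparison can be sketched as below. The abstract does not state which test the authors used, so the two-sample t test here is an assumption.

```python
# Minimal sketch, assuming a two-sample t test from summary statistics for the "poor" vs
# "medium" inquiry groups (means and SDs from the Wang et al abstract, n=30 runs each).
# The authors' actual test and software are not stated in the abstract.
from scipy.stats import ttest_ind_from_stats

result = ttest_ind_from_stats(
    mean1=74.7, std1=5.44, nobs1=30,   # poor-quality inquiry group
    mean2=82.67, std2=5.30, nobs2=30,  # medium-quality inquiry group
)
print(f"t = {result.statistic:.2f}, P = {result.pvalue:.1e}")  # strongly significant, consistent with P<.001
```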
Conclusions: ChatGPT, as a representative LLM, is a viable tool for simulating standardized patients in medical assessments, with the potential to enhance medical training. By incorporating proper prompts, ChatGPT's scoring accuracy and response realism improved significantly, approaching the feasibility of actual clinical use. Also, the language used had no significant influence on the chatbot's output. UR - https://www.jmir.org/2025/1/e59435 UR - http://dx.doi.org/10.2196/59435 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/59435 ER - TY - JOUR AU - Miyazaki, Yuki AU - Hata, Masahiro AU - Omori, Hisaki AU - Hirashima, Atsuya AU - Nakagawa, Yuta AU - Eto, Mitsuhiro AU - Takahashi, Shun AU - Ikeda, Manabu PY - 2024/12/24 TI - Performance of ChatGPT-4o on the Japanese Medical Licensing Examination: Evaluation of Accuracy in Text-Only and Image-Based Questions JO - JMIR Med Educ SP - e63129 VL - 10 KW - medical education KW - artificial intelligence KW - clinical decision-making KW - GPT-4o KW - medical licensing examination KW - Japan KW - images KW - accuracy KW - AI technology KW - application KW - decision-making KW - image-based KW - reliability KW - ChatGPT UR - https://mededu.jmir.org/2024/1/e63129 UR - http://dx.doi.org/10.2196/63129 ID - info:doi/10.2196/63129 ER - TY - JOUR AU - Ogundiya, Oluwadamilola AU - Rahman, Jasmine Thahmina AU - Valnarov-Boulter, Ioan AU - Young, Michael Tim PY - 2024/12/19 TI - Looking Back on Digital Medical Education Over the Last 25 Years and Looking to the Future: Narrative Review JO - J Med Internet Res SP - e60312 VL - 26 KW - digital health KW - digital medical education KW - health education KW - medical education KW - mobile phone KW - artificial intelligence KW - AI N2 - Background: The last 25 years have seen enormous progression in digital technologies across the whole of the health service, including health education. The rapid evolution and use of web-based and digital techniques have been significantly transforming this field since the beginning of the new millennium. These advancements continue to progress swiftly, even more so after the COVID-19 pandemic. Objective: This narrative review aims to outline and discuss the developments that have taken place in digital medical education across the defined time frame. In addition, evidence for potential opportunities and challenges facing digital medical education in the near future was collated for analysis. Methods: Literature reviews were conducted using PubMed, Web of Science Core Collection, Scopus, Google Scholar, and Embase. The participants and learners in this study included medical students, physicians in training or continuing professional development, nurses, paramedics, and patients. Results: Evidence of the significant steps in the development of digital medical education in the past 25 years was presented and analyzed in terms of application, impact, and implications for the future. The results were grouped into the following themes for discussion: learning management systems; telemedicine (in digital medical education); mobile health; big data analytics; the metaverse, augmented reality, and virtual reality; the COVID-19 pandemic; artificial intelligence; and ethics and cybersecurity. Conclusions: Major changes and developments in digital medical education have occurred from around the start of the new millennium.
Key steps in this journey include technical developments in teleconferencing and learning management systems, along with a marked increase in mobile device use for accessing learning over this time. While the pace of evolution in digital medical education accelerated during the COVID-19 pandemic, further rapid progress has continued since the resolution of the pandemic. Many of these technologies, such as augmented reality, virtual reality, and artificial intelligence, are now widely used in health education and other fields and offer significant future potential. The opportunities these technologies offer must be balanced against the associated challenges in areas such as cybersecurity, the integrity of web-based assessments, ethics, and issues of digital privacy to ensure that digital medical education continues to thrive in the future. UR - https://www.jmir.org/2024/1/e60312 UR - http://dx.doi.org/10.2196/60312 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/60312 ER - TY - JOUR AU - Roos, Jonas AU - Martin, Ron AU - Kaczmarczyk, Robert PY - 2024/12/17 TI - Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study JO - JMIR Form Res SP - e57592 VL - 8 KW - medical education KW - visual question answering KW - image analysis KW - large language model KW - LLM KW - student KW - performance KW - comparative KW - case study KW - artificial intelligence KW - AI KW - ChatGPT KW - effectiveness KW - diagnostic KW - training KW - accuracy KW - utility KW - image-based KW - question KW - image KW - AMBOSS KW - English KW - German KW - question and answer KW - Python KW - AI in health care KW - health care N2 - Background: The rapid development of large language models (LLMs) such as OpenAI's ChatGPT has significantly impacted medical research and education. These models have shown potential in fields ranging from radiological imaging interpretation to medical licensing examination assistance. Recently, LLMs have been enhanced with image recognition capabilities. Objective: This study aims to critically examine the effectiveness of these LLMs in medical diagnostics and training by assessing their accuracy and utility in answering image-based questions from medical licensing examinations. Methods: This study analyzed 1070 image-based multiple-choice questions from the AMBOSS learning platform, divided into 605 in English and 465 in German. Customized prompts in both languages directed the models to interpret medical images and provide the most likely diagnosis. Student performance data were obtained from AMBOSS, including metrics such as the "student passed mean" and "majority vote." Statistical analysis was conducted using Python (Python Software Foundation), with key libraries for data manipulation and visualization. Results: GPT-4 1106 Vision Preview (OpenAI) outperformed Bard Gemini Pro (Google), correctly answering 56.9% (609/1070) of questions compared to Bard's 44.6% (477/1070), a statistically significant difference (χ²₁=32.1, P<.001). However, GPT-4 1106 left 16.1% (172/1070) of questions unanswered, significantly higher than Bard's 4.1% (44/1070; χ²₁=83.1, P<.001). When considering only answered questions, GPT-4 1106's accuracy increased to 67.8% (609/898), surpassing both Bard (477/1026, 46.5%; χ²₁=87.7, P<.001) and the student passed mean of 63% (674/1070, SE 1.48%; χ²₁=4.8, P=.03).
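A brief arithmetic aside on the two denominators used in the Roos et al results above (all questions versus answered questions only), using the GPT-4 1106 counts reported in the abstract; the helper function itself is hypothetical, not the study's analysis code.

```python
# Minimal sketch: overall accuracy vs accuracy on answered questions only, using the
# GPT-4 1106 counts reported in the Roos et al abstract (609 correct, 172 unanswered,
# 1070 questions in total).
def accuracy(correct: int, total: int) -> float:
    return correct / total

correct, unanswered, total = 609, 172, 1070

print(f"accuracy over all questions:      {accuracy(correct, total):.1%}")               # ~56.9%
print(f"accuracy over answered questions: {accuracy(correct, total - unanswered):.1%}")  # ~67.8%
```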
Language-specific analysis revealed both models performed better in German than English, with GPT-4 1106 showing greater accuracy in German (282/465, 60.65% vs 327/605, 54.1%; χ²₁=4.4, P=.04) and Bard Gemini Pro exhibiting a similar trend (255/465, 54.8% vs 222/605, 36.7%; χ²₁=34.3, P<.001). The student majority vote achieved an overall accuracy of 94.5% (1011/1070), significantly outperforming both artificial intelligence models (GPT-4 1106: χ²₁=408.5, P<.001; Bard Gemini Pro: χ²₁=626.6, P<.001). Conclusions: Our study shows that GPT-4 1106 Vision Preview and Bard Gemini Pro have potential in medical visual question-answering tasks and to serve as a support for students. However, their performance varies depending on the language used, with a preference for German. They also have limitations in responding to non-English content. The accuracy rates, particularly when compared to student responses, highlight the potential of these models in medical education, yet the need for further optimization and understanding of their limitations in diverse linguistic contexts remains critical. UR - https://formative.jmir.org/2024/1/e57592 UR - http://dx.doi.org/10.2196/57592 ID - info:doi/10.2196/57592 ER - TY - JOUR AU - Dzuali, Fiatsogbe AU - Seiger, Kira AU - Novoa, Roberto AU - Aleshin, Maria AU - Teng, Joyce AU - Lester, Jenna AU - Daneshjou, Roxana PY - 2024/12/10 TI - ChatGPT May Improve Access to Language-Concordant Care for Patients With Non-English Language Preferences JO - JMIR Med Educ SP - e51435 VL - 10 KW - ChatGPT KW - artificial intelligence KW - language KW - translation KW - health care disparity KW - natural language model KW - survey KW - patient education KW - preference KW - human language KW - language-concordant care UR - https://mededu.jmir.org/2024/1/e51435 UR - http://dx.doi.org/10.2196/51435 ID - info:doi/10.2196/51435 ER - TY - JOUR AU - Jin, Kyung Hye AU - Kim, EunYoung PY - 2024/12/4 TI - Performance of GPT-3.5 and GPT-4 on the Korean Pharmacist Licensing Examination: Comparison Study JO - JMIR Med Educ SP - e57451 VL - 10 KW - GPT-3.5 KW - GPT-4 KW - Korean KW - Korean Pharmacist Licensing Examination KW - KPLE N2 - Background: ChatGPT, a recently developed artificial intelligence chatbot and a notable large language model, has demonstrated improved performance on medical field examinations. However, there is currently little research on its efficacy in languages other than English or in pharmacy-related examinations. Objective: This study aimed to evaluate the performance of GPT models on the Korean Pharmacist Licensing Examination (KPLE). Methods: We evaluated the percentage of correct answers provided by 2 different versions of ChatGPT (GPT-3.5 and GPT-4) for all multiple-choice single-answer KPLE questions, excluding image-based questions. In total, 320, 317, and 323 questions from the 2021, 2022, and 2023 KPLEs, respectively, were included in the final analysis, which consisted of 4 units: Biopharmacy, Industrial Pharmacy, Clinical and Practical Pharmacy, and Medical Health Legislation. Results: The 3-year average percentage of correct answers was 86.5% (830/960) for GPT-4 and 60.7% (583/960) for GPT-3.5. GPT model accuracy was highest in Biopharmacy (GPT-3.5 77/96, 80.2% in 2022; GPT-4 87/90, 96.7% in 2021) and lowest in Medical Health Legislation (GPT-3.5 8/20, 40% in 2022; GPT-4 12/20, 60% in 2022). Additionally, when comparing the performance of artificial intelligence with that of human participants, pharmacy students outperformed GPT-3.5 but not GPT-4.
Conclusions: In the last 3 years, GPT models have performed very close to, or exceeded, the passing threshold for the KPLE. This study demonstrates the potential of large language models in the pharmacy domain; however, extensive research is needed to evaluate their reliability and ensure their secure application in pharmacy contexts due to several inherent challenges. Addressing these limitations could make GPT models more effective auxiliary tools for pharmacy education. UR - https://mededu.jmir.org/2024/1/e57451 UR - http://dx.doi.org/10.2196/57451 ID - info:doi/10.2196/57451 ER - TY - JOUR AU - Luo, Yuan AU - Miao, Yiqun AU - Zhao, Yuhan AU - Li, Jiawei AU - Chen, Yuling AU - Yue, Yuexue AU - Wu, Ying PY - 2024/12/2 TI - Comparing the Accuracy of Two Generated Large Language Models in Identifying Health-Related Rumors or Misconceptions and the Applicability in Health Science Popularization: Proof-of-Concept Study JO - JMIR Form Res SP - e63188 VL - 8 KW - rumor KW - misconception KW - health science popularization KW - health education KW - large language model KW - LLM KW - applicability KW - accuracy KW - effectiveness KW - health related KW - education KW - health science KW - proof of concept N2 - Background: Health-related rumors and misconceptions are spreading at an alarming rate, fueled by the rapid development of the internet and the exponential growth of social media platforms. This phenomenon has become a pressing global concern, as the dissemination of false information can have severe consequences, including widespread panic, social instability, and even public health crises. Objective: The aim of the study is to compare the accuracy of rumor identification and the effectiveness of health science popularization between 2 generated large language models in Chinese (GPT-4 by OpenAI and Enhanced Representation through Knowledge Integration Bot [ERNIE Bot] 4.0 by Baidu). Methods: In total, 20 health rumors and misconceptions, along with 10 health truths, were randomly inputted into GPT-4 and ERNIE Bot 4.0. We prompted them to determine whether the statements were rumors or misconceptions and provide explanations for their judgment. Further, we asked them to generate a health science popularization essay. We evaluated the outcomes in terms of accuracy, effectiveness, readability, and applicability. Accuracy was assessed by the rate of correctly identifying health-related rumors, misconceptions, and truths. Effectiveness was determined by the accuracy of the generated explanation, which was assessed collaboratively by 2 research team members with a PhD in nursing. Readability was calculated by the readability formula of Chinese health education materials. Applicability was evaluated by the Chinese Suitability Assessment of Materials. Results: GPT-4 and ERNIE Bot 4.0 correctly identified all health rumors and misconceptions (100% accuracy rate). For truths, the accuracy rate was 70% (7/10) and 100% (10/10), respectively. Both mostly provided widely recognized viewpoints without obvious errors. The average readability score for the health essays was 2.92 (SD 0.85) for GPT-4 and 3.02 (SD 0.84) for ERNIE Bot 4.0 (P=.65). For applicability, significant differences between the 2 models were observed in the total score and in all dimensions except content and cultural appropriateness (P<.05). Conclusions: ERNIE Bot 4.0 demonstrated similar accuracy to GPT-4 in identifying Chinese rumors. Both provided widely accepted views, despite some inaccuracies.
These insights can enhance public understanding and help correct misconceptions. For health essays, educators can learn from the readable language styles of GLLMs. Finally, ERNIE Bot 4.0 aligns with Chinese expression habits, making it a good choice for a better Chinese reading experience. UR - https://formative.jmir.org/2024/1/e63188 UR - http://dx.doi.org/10.2196/63188 ID - info:doi/10.2196/63188 ER - TY - JOUR AU - Ehrett, Carl AU - Hegde, Sudeep AU - Andre, Kwame AU - Liu, Dixizi AU - Wilson, Timothy PY - 2024/11/19 TI - Leveraging Open-Source Large Language Models for Data Augmentation in Hospital Staff Surveys: Mixed Methods Study JO - JMIR Med Educ SP - e51433 VL - 10 KW - data augmentation KW - large language models KW - medical education KW - natural language processing KW - data security KW - ethics KW - AI KW - artificial intelligence KW - data privacy KW - medical staff N2 - Background: Generative large language models (LLMs) have the potential to revolutionize medical education by generating tailored learning materials, enhancing teaching efficiency, and improving learner engagement. However, the application of LLMs in health care settings, particularly for augmenting small datasets in text classification tasks, remains underexplored, particularly for cost- and privacy-conscious applications that do not permit the use of third-party services such as OpenAI's ChatGPT. Objective: This study aims to explore the use of open-source LLMs, such as Large Language Model Meta AI (LLaMA) and Alpaca models, for data augmentation in a specific text classification task related to hospital staff surveys. Methods: The surveys were designed to elicit narratives of everyday adaptation by frontline radiology staff during the initial phase of the COVID-19 pandemic. A 2-step process of data augmentation and text classification was conducted. The study generated synthetic data similar to the survey reports using 4 generative LLMs for data augmentation. A different set of 3 classifier LLMs was then used to classify the augmented text for thematic categories. The study evaluated performance on the classification task. Results: The overall best-performing combination of LLM, temperature, classifier, and number of synthetic data cases was augmentation with LLaMA 7B at temperature 0.7 with 100 augments, using Robustly Optimized BERT Pretraining Approach (RoBERTa) for the classification task, achieving an average area under the receiver operating characteristic (AUC) curve of 0.87 (SD 0.02; ie, 1 SD). The results demonstrate that open-source LLMs can enhance text classifiers' performance for small datasets in health care contexts, providing promising pathways for improving medical education processes and patient care practices. Conclusions: The study demonstrates the value of data augmentation with open-source LLMs, highlights the importance of privacy and ethical considerations when using LLMs, and suggests future directions for research in this field.
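The classification half of the Ehrett et al pipeline above (a RoBERTa classifier trained on a small survey dataset padded out with LLM-generated augments) can be pictured with the brief sketch below. The toy texts, labels, model checkpoint, and hyperparameters are assumptions for illustration only, not the study's data or code.

```python
# Minimal sketch, assuming a RoBERTa sequence classifier fine-tuned on a tiny set of
# survey-style snippets (stand-ins for original responses plus synthetic augments).
# Everything here is illustrative; it is not the Ehrett et al dataset or training setup.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer, Trainer,
                          TrainingArguments)

texts = [
    "We improvised a new handoff routine so imaging requests would not back up.",
    "Staff informally rotated coverage when colleagues had to quarantine.",
    "Protocols stayed exactly as written; nothing about the workflow changed.",
    "The usual scheduling process continued without modification.",
]
labels = [1, 1, 0, 0]  # hypothetical theme: 1 = describes an everyday adaptation

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = Dataset.from_dict({"text": texts, "label": labels}).map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
args = TrainingArguments(output_dir="roberta-augmented-demo", num_train_epochs=1,
                         per_device_train_batch_size=2, logging_steps=1)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```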
UR - https://mededu.jmir.org/2024/1/e51433 UR - http://dx.doi.org/10.2196/51433 ID - info:doi/10.2196/51433 ER - TY - JOUR AU - Zhou, You AU - Li, Si-Jia AU - Tang, Xing-Yi AU - He, Yi-Chen AU - Ma, Hao-Ming AU - Wang, Ao-Qi AU - Pei, Run-Yuan AU - Piao, Mei-Hua PY - 2024/11/19 TI - Using ChatGPT in Nursing: Scoping Review of Current Opinions JO - JMIR Med Educ SP - e54297 VL - 10 KW - ChatGPT KW - large language model KW - nursing KW - artificial intelligence KW - scoping review KW - generative AI KW - nursing education N2 - Background: Since the release of ChatGPT in November 2022, this emerging technology has garnered a lot of attention in various fields, and nursing is no exception. However, to date, no study has comprehensively summarized the status and opinions of using ChatGPT across different nursing fields. Objective: We aim to synthesize the status and opinions of using ChatGPT according to different nursing fields, as well as assess ChatGPT's strengths, weaknesses, and the potential impacts it may cause. Methods: This scoping review was conducted following the framework of Arksey and O'Malley and guided by the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews). A comprehensive literature search was conducted in 4 web-based databases (PubMed, Embase, Web of Science, and CINAHL) to identify studies reporting the opinions of using ChatGPT in nursing fields from 2022 to September 3, 2023. The references of the included studies were screened manually to further identify relevant studies. Two authors independently conducted study screening, eligibility assessment, and data extraction. Results: A total of 30 studies were included. The United States (7 studies), Canada (5 studies), and China (4 studies) were the countries with the most publications. In terms of fields of concern, studies mainly focused on "ChatGPT and nursing education" (20 studies), "ChatGPT and nursing practice" (10 studies), and "ChatGPT and nursing research, writing, and examination" (6 studies). Six studies addressed the use of ChatGPT in multiple nursing fields. Conclusions: As an emerging artificial intelligence technology, ChatGPT has great potential to revolutionize nursing education, nursing practice, and nursing research. However, researchers, institutions, and administrations still need to critically examine its accuracy, safety, and privacy, as well as academic misconduct and potential ethical issues that it may lead to before applying ChatGPT to practice. UR - https://mededu.jmir.org/2024/1/e54297 UR - http://dx.doi.org/10.2196/54297 ID - info:doi/10.2196/54297 ER - TY - JOUR AU - Ros-Arlanzón, Pablo AU - Perez-Sempere, Angel PY - 2024/11/14 TI - Evaluating AI Competence in Specialized Medicine: Comparative Analysis of ChatGPT and Neurologists in a Neurology Specialist Examination in Spain JO - JMIR Med Educ SP - e56762 VL - 10 KW - artificial intelligence KW - ChatGPT KW - clinical decision-making KW - medical education KW - medical knowledge assessment KW - OpenAI N2 - Background: With the rapid advancement of artificial intelligence (AI) in various fields, evaluating its application in specialized medical contexts becomes crucial. ChatGPT, a large language model developed by OpenAI, has shown potential in diverse applications, including medicine.
Objective: This study aims to compare the performance of ChatGPT with that of attending neurologists in a real neurology specialist examination conducted in the Valencian Community, Spain, assessing the AI's capabilities and limitations in medical knowledge. Methods: We conducted a comparative analysis using the 2022 neurology specialist examination results from 120 neurologists and responses generated by ChatGPT versions 3.5 and 4. The examination consisted of 80 multiple-choice questions, with a focus on clinical neurology and health legislation. Questions were classified according to Bloom's Taxonomy. Statistical analysis of performance, including the κ coefficient for response consistency, was performed. Results: Human participants exhibited a median score of 5.91 (IQR 4.93-6.76), with 32 neurologists failing to pass. ChatGPT-3.5 ranked 116th out of 122, answering 54.5% of questions correctly (score 3.94). ChatGPT-4 showed marked improvement, ranking 17th with 81.8% of correct answers (score 7.57), surpassing several human specialists. No significant variations were observed in the performance on lower-order questions versus higher-order questions. Additionally, ChatGPT-4 demonstrated increased interrater reliability, as reflected by a higher κ coefficient of 0.73, compared to ChatGPT-3.5's coefficient of 0.69. Conclusions: This study underscores the evolving capabilities of AI in medical knowledge assessment, particularly in specialized fields. ChatGPT-4's performance, outperforming the median score of human participants in a rigorous neurology examination, represents a significant milestone in AI development, suggesting its potential as an effective tool in specialized medical education and assessment. UR - https://mededu.jmir.org/2024/1/e56762 UR - http://dx.doi.org/10.2196/56762 ID - info:doi/10.2196/56762 ER - TY - JOUR AU - Ming, Shuai AU - Yao, Xi AU - Guo, Xiaohong AU - Guo, Qingge AU - Xie, Kunpeng AU - Chen, Dandan AU - Lei, Bo PY - 2024/11/14 TI - Performance of ChatGPT in Ophthalmic Registration and Clinical Diagnosis: Cross-Sectional Study JO - J Med Internet Res SP - e60226 VL - 26 KW - artificial intelligence KW - chatbot KW - ChatGPT KW - ophthalmic registration KW - clinical diagnosis KW - AI KW - cross-sectional study KW - eye disease KW - eye disorder KW - ophthalmology KW - health care KW - outpatient registration KW - clinical KW - decision-making KW - generative AI KW - vision impairment N2 - Background: Artificial intelligence (AI) chatbots such as ChatGPT are expected to impact vision health care significantly. Their potential to optimize the consultation process and their diagnostic capabilities across a range of ophthalmic subspecialties have yet to be fully explored. Objective: This study aims to investigate the performance of AI chatbots in recommending ophthalmic outpatient registration and diagnosing eye diseases within clinical case profiles. Methods: This cross-sectional study used clinical cases from Chinese Standardized Resident Training-Ophthalmology (2nd Edition). For each case, 2 profiles were created: patient with history (Hx) and patient with history and examination (Hx+Ex). These profiles served as independent queries for GPT-3.5 and GPT-4.0 (accessed from March 5 to 18, 2024). Similarly, 3 ophthalmic residents were posed the same profiles in a questionnaire format. The accuracy of recommending ophthalmic subspecialty registration was primarily evaluated using Hx profiles.
The accuracy of the top-ranked diagnosis and the accuracy of the diagnosis within the top 3 suggestions (do-not-miss diagnosis) were assessed using Hx+Ex profiles. The gold standard for judgment was the published, official diagnosis. Characteristics of incorrect diagnoses by ChatGPT were also analyzed. Results: A total of 208 clinical profiles from 12 ophthalmic subspecialties were analyzed (104 Hx and 104 Hx+Ex profiles). For Hx profiles, GPT-3.5, GPT-4.0, and residents showed comparable accuracy in registration suggestions (66/104, 63.5%; 81/104, 77.9%; and 72/104, 69.2%, respectively; P=.07), with ocular trauma, retinal diseases, and strabismus and amblyopia achieving the top 3 accuracies. For Hx+Ex profiles, both GPT-4.0 and residents demonstrated higher diagnostic accuracy than GPT-3.5 (62/104, 59.6% and 63/104, 60.6% vs 41/104, 39.4%; P=.003 and P=.001, respectively). Accuracy for do-not-miss diagnoses also improved (79/104, 76% and 68/104, 65.4% vs 51/104, 49%; P<.001 and P=.02, respectively). The highest diagnostic accuracies were observed in glaucoma; lens diseases; and eyelid, lacrimal, and orbital diseases. GPT-4.0 recorded fewer incorrect top-3 diagnoses (25/42, 60% vs 53/63, 84%; P=.005) and more partially correct diagnoses (21/42, 50% vs 7/63 11%; P<.001) than GPT-3.5, while GPT-3.5 had more completely incorrect (27/63, 43% vs 7/42, 17%; P=.005) and less precise diagnoses (22/63, 35% vs 5/42, 12%; P=.009). Conclusions: GPT-3.5 and GPT-4.0 showed intermediate performance in recommending ophthalmic subspecialties for registration. While GPT-3.5 underperformed, GPT-4.0 approached and numerically surpassed residents in differential diagnosis. AI chatbots show promise in facilitating ophthalmic patient registration. However, their integration into diagnostic decision-making requires more validation. UR - https://www.jmir.org/2024/1/e60226 UR - http://dx.doi.org/10.2196/60226 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/60226 ER - TY - JOUR AU - Bicknell, T. Brenton AU - Butler, Danner AU - Whalen, Sydney AU - Ricks, James AU - Dixon, J. Cory AU - Clark, B. Abigail AU - Spaedy, Olivia AU - Skelton, Adam AU - Edupuganti, Neel AU - Dzubinski, Lance AU - Tate, Hudson AU - Dyess, Garrett AU - Lindeman, Brenessa AU - Lehmann, Soleymani Lisa PY - 2024/11/6 TI - ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis JO - JMIR Med Educ SP - e63430 VL - 10 KW - large language model KW - ChatGPT KW - medical education KW - USMLE KW - AI in medical education KW - medical student resources KW - educational technology KW - artificial intelligence in medicine KW - clinical skills KW - LLM KW - medical licensing examination KW - medical students KW - United States Medical Licensing Examination KW - ChatGPT 4 Omni KW - ChatGPT 4 KW - ChatGPT 3.5 N2 - Background: Recent studies, including those by the National Board of Medical Examiners, have highlighted the remarkable capabilities of recent large language models (LLMs) such as ChatGPT in passing the United States Medical Licensing Examination (USMLE). However, there is a gap in detailed analysis of LLM performance in specific medical content areas, thus limiting an assessment of their potential utility in medical education. Objective: This study aimed to assess and compare the accuracy of successive ChatGPT versions (GPT-3.5, GPT-4, and GPT-4 Omni) in USMLE disciplines, clinical clerkships, and the clinical skills of diagnostics and management. 
Methods: This study used 750 clinical vignette-based multiple-choice questions to characterize the performance of successive ChatGPT versions (ChatGPT 3.5 [GPT-3.5], ChatGPT 4 [GPT-4], and ChatGPT 4 Omni [GPT-4o]) across USMLE disciplines, clinical clerkships, and in clinical skills (diagnostics and management). Accuracy was assessed using a standardized protocol, with statistical analyses conducted to compare the models? performances. Results: GPT-4o achieved the highest accuracy across 750 multiple-choice questions at 90.4%, outperforming GPT-4 and GPT-3.5, which scored 81.1% and 60.0%, respectively. GPT-4o?s highest performances were in social sciences (95.5%), behavioral and neuroscience (94.2%), and pharmacology (93.2%). In clinical skills, GPT-4o?s diagnostic accuracy was 92.7% and management accuracy was 88.8%, significantly higher than its predecessors. Notably, both GPT-4o and GPT-4 significantly outperformed the medical student average accuracy of 59.3% (95% CI 58.3?60.3). Conclusions: GPT-4o?s performance in USMLE disciplines, clinical clerkships, and clinical skills indicates substantial improvements over its predecessors, suggesting significant potential for the use of this technology as an educational aid for medical students. These findings underscore the need for careful consideration when integrating LLMs into medical education, emphasizing the importance of structured curricula to guide their appropriate use and the need for ongoing critical analyses to ensure their reliability and effectiveness. UR - https://mededu.jmir.org/2024/1/e63430 UR - http://dx.doi.org/10.2196/63430 ID - info:doi/10.2196/63430 ER - TY - JOUR AU - Alli, Rabia Sauliha AU - Hossain, Qahh?r Soaad AU - Das, Sunit AU - Upshur, Ross PY - 2024/11/4 TI - The Potential of Artificial Intelligence Tools for Reducing Uncertainty in Medicine and Directions for Medical Education JO - JMIR Med Educ SP - e51446 VL - 10 KW - artificial intelligence KW - machine learning KW - uncertainty KW - clinical decision-making KW - medical education KW - generative AI KW - generative artificial intelligence UR - https://mededu.jmir.org/2024/1/e51446 UR - http://dx.doi.org/10.2196/51446 ID - info:doi/10.2196/51446 ER - TY - JOUR AU - Tao, Wenjuan AU - Yang, Jinming AU - Qu, Xing PY - 2024/10/28 TI - Utilization of, Perceptions on, and Intention to Use AI Chatbots Among Medical Students in China: National Cross-Sectional Study JO - JMIR Med Educ SP - e57132 VL - 10 KW - medical education KW - artificial intelligence KW - UTAUT model KW - utilization KW - medical students KW - cross-sectional study KW - AI chatbots KW - China KW - acceptance KW - electronic survey KW - social media KW - medical information KW - risk KW - training KW - support N2 - Background: Artificial intelligence (AI) chatbots are poised to have a profound impact on medical education. Medical students, as early adopters of technology and future health care providers, play a crucial role in shaping the future of health care. However, little is known about the utilization of, perceptions on, and intention to use AI chatbots among medical students in China. Objective: This study aims to explore the utilization of, perceptions on, and intention to use generative AI chatbots among medical students in China, using the Unified Theory of Acceptance and Use of Technology (UTAUT) framework. By conducting a national cross-sectional survey, we sought to identify the key determinants that influence medical students? 
acceptance of AI chatbots, thereby providing a basis for enhancing their integration into medical education. Understanding these factors is crucial for educators, policy makers, and technology developers to design and implement effective AI-driven educational tools that align with the needs and expectations of future health care professionals. Methods: A web-based electronic survey questionnaire was developed and distributed via social media to medical students across the country. The UTAUT was used as a theoretical framework to design the questionnaire and analyze the data. The relationship between behavioral intention to use AI chatbots and UTAUT predictors was examined using multivariable regression. Results: A total of 693 participants were from 57 universities covering 21 provinces or municipalities in China. Only a minority (199/693, 28.72%) reported using AI chatbots for studying, with ChatGPT (129/693, 18.61%) being the most commonly used. Most of the participants used AI chatbots for quickly obtaining medical information and knowledge (631/693, 91.05%) and increasing learning efficiency (594/693, 85.71%). Utilization behavior, social influence, facilitating conditions, perceived risk, and personal innovativeness showed significant positive associations with the behavioral intention to use AI chatbots (all P values were <.05). Conclusions: Chinese medical students hold positive perceptions toward and high intentions to use AI chatbots, but there are gaps between intention and actual adoption. This highlights the need for strategies to improve access, training, and support and provide peer usage examples to fully harness the potential benefits of chatbot technology. UR - https://mededu.jmir.org/2024/1/e57132 UR - http://dx.doi.org/10.2196/57132 ID - info:doi/10.2196/57132 ER - TY - JOUR AU - Wang, Shuang AU - Yang, Liuying AU - Li, Min AU - Zhang, Xinghe AU - Tai, Xiantao PY - 2024/10/10 TI - Medical Education and Artificial Intelligence: Web of Science?Based Bibliometric Analysis (2013-2022) JO - JMIR Med Educ SP - e51411 VL - 10 KW - artificial intelligence KW - medical education KW - bibliometric analysis KW - CiteSpace KW - VOSviewer N2 - Background: Incremental advancements in artificial intelligence (AI) technology have facilitated its integration into various disciplines. In particular, the infusion of AI into medical education has emerged as a significant trend, with noteworthy research findings. Consequently, a comprehensive review and analysis of the current research landscape of AI in medical education is warranted. Objective: This study aims to conduct a bibliometric analysis of pertinent papers, spanning the years 2013?2022, using CiteSpace and VOSviewer. The study visually represents the existing research status and trends of AI in medical education. Methods: Articles related to AI and medical education, published between 2013 and 2022, were systematically searched in the Web of Science core database. Two reviewers scrutinized the initially retrieved papers, based on their titles and abstracts, to eliminate papers unrelated to the topic. The selected papers were then analyzed and visualized for country, institution, author, reference, and keywords using CiteSpace and VOSviewer. Results: A total of 195 papers pertaining to AI in medical education were identified from 2013 to 2022. The annual publications demonstrated an increasing trend over time. 
The United States emerged as the most active country in this research arena, and Harvard Medical School and the University of Toronto were the most active institutions. Prolific authors in this field included Vincent Bissonnette, Charlotte Blacketer, Rolando F Del Maestro, Nicole Ledows, Nykan Mirchi, Alexander Winkler-Schwartz, and Recai Yilamaz. The paper with the highest citation was ?Medical Students? Attitude Towards Artificial Intelligence: A Multicentre Survey.? Keyword analysis revealed that ?radiology,? ?medical physics,? ?ehealth,? ?surgery,? and ?specialty? were the primary focus, whereas ?big data? and ?management? emerged as research frontiers. Conclusions: The study underscores the promising potential of AI in medical education research. Current research directions encompass radiology, medical information management, and other aspects. Technological progress is expected to broaden these directions further. There is an urgent need to bolster interregional collaboration and enhance research quality. These findings offer valuable insights for researchers to identify perspectives and guide future research directions. UR - https://mededu.jmir.org/2024/1/e51411 UR - http://dx.doi.org/10.2196/51411 ID - info:doi/10.2196/51411 ER - TY - JOUR AU - Miao, Jing AU - Thongprayoon, Charat AU - Garcia Valencia, Oscar AU - Craici, M. Iasmina AU - Cheungpasitporn, Wisit PY - 2024/10/10 TI - Navigating Nephrology's Decline Through a GPT-4 Analysis of Internal Medicine Specialties in the United States: Qualitative Study JO - JMIR Med Educ SP - e57157 VL - 10 KW - artificial intelligence KW - ChatGPT KW - nephrology fellowship training KW - fellowship matching KW - medical education KW - AI KW - nephrology KW - fellowship KW - United States KW - factor KW - chatbots KW - intellectual KW - complexity KW - work-life balance KW - procedural involvement KW - opportunity KW - career demand KW - financial compensation N2 - Background: The 2024 Nephrology fellowship match data show the declining interest in nephrology in the United States, with an 11% drop in candidates and a mere 66% (321/488) of positions filled. Objective: The study aims to discern the factors influencing this trend using ChatGPT, a leading chatbot model, for insights into the comparative appeal of nephrology versus other internal medicine specialties. Methods: Using the GPT-4 model, the study compared nephrology with 13 other internal medicine specialties, evaluating each on 7 criteria including intellectual complexity, work-life balance, procedural involvement, research opportunities, patient relationships, career demand, and financial compensation. Each criterion was assigned scores from 1 to 10, with the cumulative score determining the ranking. The approach included counteracting potential bias by instructing GPT-4 to favor other specialties over nephrology in reverse scenarios. Results: GPT-4 ranked nephrology only above sleep medicine. While nephrology scored higher than hospice and palliative medicine, it fell short in key criteria such as work-life balance, patient relationships, and career demand. When examining the percentage of filled positions in the 2024 appointment year match, nephrology?s filled rate was 66%, only higher than the 45% (155/348) filled rate of geriatric medicine. Nephrology?s score decreased by 4%?14% in 5 criteria including intellectual challenge and complexity, procedural involvement, career opportunity and demand, research and academic opportunities, and financial compensation. 
Conclusions: ChatGPT does not favor nephrology over most internal medicine specialties, highlighting its diminishing appeal as a career choice. This trend raises significant concerns, especially considering the overall physician shortage, and prompts a reevaluation of factors affecting specialty choice among medical residents. UR - https://mededu.jmir.org/2024/1/e57157 UR - http://dx.doi.org/10.2196/57157 ID - info:doi/10.2196/57157 ER - TY - JOUR AU - Goodings, James Anthony AU - Kajitani, Sten AU - Chhor, Allison AU - Albakri, Ahmad AU - Pastrak, Mila AU - Kodancha, Megha AU - Ives, Rowan AU - Lee, Bin Yoo AU - Kajitani, Kari PY - 2024/10/8 TI - Assessment of ChatGPT-4 in Family Medicine Board Examinations Using Advanced AI Learning and Analytical Methods: Observational Study JO - JMIR Med Educ SP - e56128 VL - 10 KW - ChatGPT-4 KW - Family Medicine Board Examination KW - artificial intelligence in medical education KW - AI performance assessment KW - prompt engineering KW - ChatGPT KW - artificial intelligence KW - AI KW - medical education KW - assessment KW - observational KW - analytical method KW - data analysis KW - examination N2 - Background: This research explores the capabilities of ChatGPT-4 in passing the American Board of Family Medicine (ABFM) Certification Examination. Addressing a gap in existing literature, where earlier artificial intelligence (AI) models showed limitations in medical board examinations, this study evaluates the enhanced features and potential of ChatGPT-4, especially in document analysis and information synthesis. Objective: The primary goal is to assess whether ChatGPT-4, when provided with extensive preparation resources and when using sophisticated data analysis, can achieve a score equal to or above the passing threshold for the Family Medicine Board Examinations. Methods: In this study, ChatGPT-4 was embedded in a specialized subenvironment, ?AI Family Medicine Board Exam Taker,? designed to closely mimic the conditions of the ABFM Certification Examination. This subenvironment enabled the AI to access and analyze a range of relevant study materials, including a primary medical textbook and supplementary web-based resources. The AI was presented with a series of ABFM-type examination questions, reflecting the breadth and complexity typical of the examination. Emphasis was placed on assessing the AI?s ability to interpret and respond to these questions accurately, leveraging its advanced data processing and analysis capabilities within this controlled subenvironment. Results: In our study, ChatGPT-4?s performance was quantitatively assessed on 300 practice ABFM examination questions. The AI achieved a correct response rate of 88.67% (95% CI 85.08%-92.25%) for the Custom Robot version and 87.33% (95% CI 83.57%-91.10%) for the Regular version. Statistical analysis, including the McNemar test (P=.45), indicated no significant difference in accuracy between the 2 versions. In addition, the chi-square test for error-type distribution (P=.32) revealed no significant variation in the pattern of errors across versions. These results highlight ChatGPT-4?s capacity for high-level performance and consistency in responding to complex medical examination questions under controlled conditions. Conclusions: The study demonstrates that ChatGPT-4, particularly when equipped with specialized preparation and when operating in a tailored subenvironment, shows promising potential in handling the intricacies of medical board examinations. 
While its performance is comparable with the expected standards for passing the ABFM Certification Examination, further enhancements in AI technology and tailored training methods could push these capabilities to new heights. This exploration opens avenues for integrating AI tools such as ChatGPT-4 in medical education and assessment, emphasizing the importance of continuous advancement and specialized training in medical applications of AI. UR - https://mededu.jmir.org/2024/1/e56128 UR - http://dx.doi.org/10.2196/56128 ID - info:doi/10.2196/56128 ER - TY - JOUR AU - Choi, K. Yong AU - Lin, Shih-Yin AU - Fick, Marie Donna AU - Shulman, W. Richard AU - Lee, Sangil AU - Shrestha, Priyanka AU - Santoso, Kate PY - 2024/10/1 TI - Optimizing ChatGPT?s Interpretation and Reporting of Delirium Assessment Outcomes: Exploratory Study JO - JMIR Form Res SP - e51383 VL - 8 KW - generative artificial intelligence KW - generative AI KW - large language models KW - ChatGPT KW - delirium detection KW - Sour Seven Questionnaire KW - prompt engineering KW - clinical vignettes KW - medical education KW - caregiver education N2 - Background: Generative artificial intelligence (AI) and large language models, such as OpenAI?s ChatGPT, have shown promising potential in supporting medical education and clinical decision-making, given their vast knowledge base and natural language processing capabilities. As a general purpose AI system, ChatGPT can complete a wide range of tasks, including differential diagnosis without additional training. However, the specific application of ChatGPT in learning and applying a series of specialized, context-specific tasks mimicking the workflow of a human assessor, such as administering a standardized assessment questionnaire, followed by inputting assessment results in a standardized form, and interpretating assessment results strictly following credible, published scoring criteria, have not been thoroughly studied. Objective: This exploratory study aims to evaluate and optimize ChatGPT?s capabilities in administering and interpreting the Sour Seven Questionnaire, an informant-based delirium assessment tool. Specifically, the objectives were to train ChatGPT-3.5 and ChatGPT-4 to understand and correctly apply the Sour Seven Questionnaire to clinical vignettes using prompt engineering, assess the performance of these AI models in identifying and scoring delirium symptoms against scores from human experts, and refine and enhance the models? interpretation and reporting accuracy through iterative prompt optimization. Methods: We used prompt engineering to train ChatGPT-3.5 and ChatGPT-4 models on the Sour Seven Questionnaire, a tool for assessing delirium through caregiver input. Prompt engineering is a methodology used to enhance the AI?s processing of inputs by meticulously structuring the prompts to improve accuracy and consistency in outputs. In this study, prompt engineering involved creating specific, structured commands that guided the AI models in understanding and applying the assessment tool?s criteria accurately to clinical vignettes. This approach also included designing prompts to explicitly instruct the AI on how to format its responses, ensuring they were consistent with clinical documentation standards. Results: Both ChatGPT models demonstrated promising proficiency in applying the Sour Seven Questionnaire to the vignettes, despite initial inconsistencies and errors. Performance notably improved through iterative prompt engineering, enhancing the models? 
capacity to detect delirium symptoms and assign scores. Prompt optimizations included adjusting the scoring methodology to accept only definitive ?Yes? or ?No? responses, revising the evaluation prompt to mandate responses in a tabular format, and guiding the models to adhere to the 2 recommended actions specified in the Sour Seven Questionnaire. Conclusions: Our findings provide preliminary evidence supporting the potential utility of AI models such as ChatGPT in administering standardized clinical assessment tools. The results highlight the significance of context-specific training and prompt engineering in harnessing the full potential of these AI models for health care applications. Despite the encouraging results, broader generalizability and further validation in real-world settings warrant additional research. UR - https://formative.jmir.org/2024/1/e51383 UR - http://dx.doi.org/10.2196/51383 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/51383 ER - TY - JOUR AU - Claman, Daniel AU - Sezgin, Emre PY - 2024/9/27 TI - Artificial Intelligence in Dental Education: Opportunities and Challenges of Large Language Models and Multimodal Foundation Models JO - JMIR Med Educ SP - e52346 VL - 10 KW - artificial intelligence KW - large language models KW - dental education KW - GPT KW - ChatGPT KW - periodontal health KW - AI KW - LLM KW - LLMs KW - chatbot KW - natural language KW - generative pretrained transformer KW - innovation KW - technology KW - large language model UR - https://mededu.jmir.org/2024/1/e52346 UR - http://dx.doi.org/10.2196/52346 ID - info:doi/10.2196/52346 ER - TY - JOUR AU - Yamamoto, Akira AU - Koda, Masahide AU - Ogawa, Hiroko AU - Miyoshi, Tomoko AU - Maeda, Yoshinobu AU - Otsuka, Fumio AU - Ino, Hideo PY - 2024/9/23 TI - Enhancing Medical Interview Skills Through AI-Simulated Patient Interactions: Nonrandomized Controlled Trial JO - JMIR Med Educ SP - e58753 VL - 10 KW - medical interview KW - generative pretrained transformer KW - large language model KW - simulation-based learning KW - OSCE KW - artificial intelligence KW - medical education KW - simulated patients KW - nonrandomized controlled trial N2 - Background: Medical interviewing is a critical skill in clinical practice, yet opportunities for practical training are limited in Japanese medical schools, necessitating urgent measures. Given advancements in artificial intelligence (AI) technology, its application in the medical field is expanding. However, reports on its application in medical interviews in medical education are scarce. Objective: This study aimed to investigate whether medical students? interview skills could be improved by engaging with AI-simulated patients using large language models, including the provision of feedback. Methods: This nonrandomized controlled trial was conducted with fourth-year medical students in Japan. A simulation program using large language models was provided to 35 students in the intervention group in 2023, while 110 students from 2022 who did not participate in the intervention were selected as the control group. The primary outcome was the score on the Pre-Clinical Clerkship Objective Structured Clinical Examination (pre-CC OSCE), a national standardized clinical skills examination, in medical interviewing. Secondary outcomes included surveys such as the Simulation-Based Training Quality Assurance Tool (SBT-QA10), administered at the start and end of the study. 
Results: The AI intervention group showed significantly higher scores on medical interviews than the control group (AI group vs control group: mean 28.1, SD 1.6 vs 27.1, SD 2.2; P=.01). There was a trend of inverse correlation between the SBT-QA10 and pre-CC OSCE scores (regression coefficient ?2.0 to ?2.1). No significant safety concerns were observed. Conclusions: Education through medical interviews using AI-simulated patients has demonstrated safety and a certain level of educational effectiveness. However, at present, the educational effects of this platform on nonverbal communication skills are limited, suggesting that it should be used as a supplementary tool to traditional simulation education. UR - https://mededu.jmir.org/2024/1/e58753 UR - http://dx.doi.org/10.2196/58753 UR - http://www.ncbi.nlm.nih.gov/pubmed/39312284 ID - info:doi/10.2196/58753 ER - TY - JOUR AU - Yoon, Soo-Hyuk AU - Oh, Kyeong Seok AU - Lim, Gun Byung AU - Lee, Ho-Jin PY - 2024/9/16 TI - Performance of ChatGPT in the In-Training Examination for Anesthesiology and Pain Medicine Residents in South Korea: Observational Study JO - JMIR Med Educ SP - e56859 VL - 10 KW - AI tools KW - problem solving KW - anesthesiology KW - artificial intelligence KW - pain medicine KW - ChatGPT KW - health care KW - medical education KW - South Korea N2 - Background: ChatGPT has been tested in health care, including the US Medical Licensing Examination and specialty exams, showing near-passing results. Its performance in the field of anesthesiology has been assessed using English board examination questions; however, its effectiveness in Korea remains unexplored. Objective: This study investigated the problem-solving performance of ChatGPT in the fields of anesthesiology and pain medicine in the Korean language context, highlighted advancements in artificial intelligence (AI), and explored its potential applications in medical education. Methods: We investigated the performance (number of correct answers/number of questions) of GPT-4, GPT-3.5, and CLOVA X in the fields of anesthesiology and pain medicine, using in-training examinations that have been administered to Korean anesthesiology residents over the past 5 years, with an annual composition of 100 questions. Questions containing images, diagrams, or photographs were excluded from the analysis. Furthermore, to assess the performance differences of the GPT across different languages, we conducted a comparative analysis of the GPT-4?s problem-solving proficiency using both the original Korean texts and their English translations. Results: A total of 398 questions were analyzed. GPT-4 (67.8%) demonstrated a significantly better overall performance than GPT-3.5 (37.2%) and CLOVA-X (36.7%). However, GPT-3.5 and CLOVA X did not show significant differences in their overall performance. Additionally, the GPT-4 showed superior performance on questions translated into English, indicating a language processing discrepancy (English: 75.4% vs Korean: 67.8%; difference 7.5%; 95% CI 3.1%-11.9%; P=.001). Conclusions: This study underscores the potential of AI tools, such as ChatGPT, in medical education and practice but emphasizes the need for cautious application and further refinement, especially in non-English medical contexts. The findings suggest that although AI advancements are promising, they require careful evaluation and development to ensure acceptable performance across diverse linguistic and professional settings. 
UR - https://mededu.jmir.org/2024/1/e56859 UR - http://dx.doi.org/10.2196/56859 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/56859 ER - TY - JOUR AU - Holderried, Friederike AU - Stegemann-Philipps, Christian AU - Herrmann-Werner, Anne AU - Festl-Wietek, Teresa AU - Holderried, Martin AU - Eickhoff, Carsten AU - Mahling, Moritz PY - 2024/8/16 TI - A Language Model?Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study JO - JMIR Med Educ SP - e59213 VL - 10 KW - virtual patients communication KW - communication skills KW - technology enhanced education KW - TEL KW - medical education KW - ChatGPT KW - GPT: LLM KW - LLMs KW - NLP KW - natural language processing KW - machine learning KW - artificial intelligence KW - language model KW - language models KW - communication KW - relationship KW - relationships KW - chatbot KW - chatbots KW - conversational agent KW - conversational agents KW - history KW - histories KW - simulated KW - student KW - students KW - interaction KW - interactions N2 - Background: Although history taking is fundamental for diagnosing medical conditions, teaching and providing feedback on the skill can be challenging due to resource constraints. Virtual simulated patients and web-based chatbots have thus emerged as educational tools, with recent advancements in artificial intelligence (AI) such as large language models (LLMs) enhancing their realism and potential to provide feedback. Objective: In our study, we aimed to evaluate the effectiveness of a Generative Pretrained Transformer (GPT) 4 model to provide structured feedback on medical students? performance in history taking with a simulated patient. Methods: We conducted a prospective study involving medical students performing history taking with a GPT-powered chatbot. To that end, we designed a chatbot to simulate patients? responses and provide immediate feedback on the comprehensiveness of the students? history taking. Students? interactions with the chatbot were analyzed, and feedback from the chatbot was compared with feedback from a human rater. We measured interrater reliability and performed a descriptive analysis to assess the quality of feedback. Results: Most of the study?s participants were in their third year of medical school. A total of 1894 question-answer pairs from 106 conversations were included in our analysis. GPT-4?s role-play and responses were medically plausible in more than 99% of cases. Interrater reliability between GPT-4 and the human rater showed ?almost perfect? agreement (Cohen ?=0.832). Less agreement (?<0.6) detected for 8 out of 45 feedback categories highlighted topics about which the model?s assessments were overly specific or diverged from human judgement. Conclusions: The GPT model was effective in providing structured feedback on history-taking dialogs provided by medical students. Although we unraveled some limitations regarding the specificity of feedback for certain feedback categories, the overall high agreement with human raters suggests that LLMs can be a valuable tool for medical education. Our findings, thus, advocate the careful integration of AI-driven feedback mechanisms in medical training and highlight important aspects when LLMs are used in that context. 
UR - https://mededu.jmir.org/2024/1/e59213 UR - http://dx.doi.org/10.2196/59213 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/59213 ER - TY - JOUR AU - Ming, Shuai AU - Guo, Qingge AU - Cheng, Wenjun AU - Lei, Bo PY - 2024/8/13 TI - Influence of Model Evolution and System Roles on ChatGPT?s Performance in Chinese Medical Licensing Exams: Comparative Study JO - JMIR Med Educ SP - e52784 VL - 10 KW - ChatGPT KW - Chinese National Medical Licensing Examination KW - large language models KW - medical education KW - system role KW - LLM KW - LLMs KW - language model KW - language models KW - artificial intelligence KW - chatbot KW - chatbots KW - conversational agent KW - conversational agents KW - exam KW - exams KW - examination KW - examinations KW - OpenAI KW - answer KW - answers KW - response KW - responses KW - accuracy KW - performance KW - China KW - Chinese N2 - Background: With the increasing application of large language models like ChatGPT in various industries, its potential in the medical domain, especially in standardized examinations, has become a focal point of research. Objective: The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE). Methods: The CNMLE 2022 question set, consisting of 500 single-answer multiple choices questions, were reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the version of GPT-3.5 and 4.0, the prompt?s designation of system roles tailored to medical subspecialties, and repetition for coherence. A passing accuracy threshold was established as 60%. The ?2 tests and ? values were employed to evaluate the model?s accuracy and consistency. Results: GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with ? values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%?3.7%) and GPT-3.5 (1.3%?4.5%), and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response. Conclusions: GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role insignificantly enhanced the model?s reliability and answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study. UR - https://mededu.jmir.org/2024/1/e52784 UR - http://dx.doi.org/10.2196/52784 ID - info:doi/10.2196/52784 ER - TY - JOUR AU - Cherrez-Ojeda, Ivan AU - Gallardo-Bastidas, C. Juan AU - Robles-Velasco, Karla AU - Osorio, F. María AU - Velez Leon, Maria Eleonor AU - Leon Velastegui, Manuel AU - Pauletto, Patrícia AU - Aguilar-Díaz, C. F. AU - Squassi, Aldo AU - González Eras, Patricia Susana AU - Cordero Carrasco, Erita AU - Chavez Gonzalez, Leonor Karol AU - Calderon, C. Juan AU - Bousquet, Jean AU - Bedbrook, Anna AU - Faytong-Haro, Marco PY - 2024/8/13 TI - Understanding Health Care Students? 
Perceptions, Beliefs, and Attitudes Toward AI-Powered Language Models: Cross-Sectional Study JO - JMIR Med Educ SP - e51757 VL - 10 KW - artificial intelligence KW - ChatGPT KW - education KW - health care KW - students N2 - Background: ChatGPT was not intended for use in health care, but it has potential benefits that depend on end-user understanding and acceptability, which is where health care students become crucial. There is still a limited amount of research in this area. Objective: The primary aim of our study was to assess the frequency of ChatGPT use, the perceived level of knowledge, the perceived risks associated with its use, and the ethical issues, as well as attitudes toward the use of ChatGPT in the context of education in the field of health. In addition, we aimed to examine whether there were differences across groups based on demographic variables. The second part of the study aimed to assess the association between the frequency of use, the level of perceived knowledge, the level of risk perception, and the level of perception of ethics as predictive factors for participants? attitudes toward the use of ChatGPT. Methods: A cross-sectional survey was conducted from May to June 2023 encompassing students of medicine, nursing, dentistry, nutrition, and laboratory science across the Americas. The study used descriptive analysis, chi-square tests, and ANOVA to assess statistical significance across different categories. The study used several ordinal logistic regression models to analyze the impact of predictive factors (frequency of use, perception of knowledge, perception of risk, and ethics perception scores) on attitude as the dependent variable. The models were adjusted for gender, institution type, major, and country. Stata was used to conduct all the analyses. Results: Of 2661 health care students, 42.99% (n=1144) were unaware of ChatGPT. The median score of knowledge was ?minimal? (median 2.00, IQR 1.00-3.00). Most respondents (median 2.61, IQR 2.11-3.11) regarded ChatGPT as neither ethical nor unethical. Most participants (median 3.89, IQR 3.44-4.34) ?somewhat agreed? that ChatGPT (1) benefits health care settings, (2) provides trustworthy data, (3) is a helpful tool for clinical and educational medical information access, and (4) makes the work easier. In total, 70% (7/10) of people used it for homework. As the perceived knowledge of ChatGPT increased, there was a stronger tendency with regard to having a favorable attitude toward ChatGPT. Higher ethical consideration perception ratings increased the likelihood of considering ChatGPT as a source of trustworthy health care information (odds ratio [OR] 1.620, 95% CI 1.498-1.752), beneficial in medical issues (OR 1.495, 95% CI 1.452-1.539), and useful for medical literature (OR 1.494, 95% CI 1.426-1.564; P<.001 for all results). Conclusions: Over 40% of American health care students (1144/2661, 42.99%) were unaware of ChatGPT despite its extensive use in the health field. Our data revealed the positive attitudes toward ChatGPT and the desire to learn more about it. Medical educators must explore how chatbots may be included in undergraduate health care education programs. 
UR - https://mededu.jmir.org/2024/1/e51757 UR - http://dx.doi.org/10.2196/51757 UR - http://www.ncbi.nlm.nih.gov/pubmed/39137029 ID - info:doi/10.2196/51757 ER - TY - JOUR AU - Takahashi, Hiromizu AU - Shikino, Kiyoshi AU - Kondo, Takeshi AU - Komori, Akira AU - Yamada, Yuji AU - Saita, Mizue AU - Naito, Toshio PY - 2024/8/13 TI - Educational Utility of Clinical Vignettes Generated in Japanese by ChatGPT-4: Mixed Methods Study JO - JMIR Med Educ SP - e59133 VL - 10 KW - generative AI KW - ChatGPT-4 KW - medical case generation KW - medical education KW - clinical vignettes KW - AI KW - artificial intelligence KW - Japanese KW - Japan N2 - Background: Evaluating the accuracy and educational utility of artificial intelligence?generated medical cases, especially those produced by large language models such as ChatGPT-4 (developed by OpenAI), is crucial yet underexplored. Objective: This study aimed to assess the educational utility of ChatGPT-4?generated clinical vignettes and their applicability in educational settings. Methods: Using a convergent mixed methods design, a web-based survey was conducted from January 8 to 28, 2024, to evaluate 18 medical cases generated by ChatGPT-4 in Japanese. In the survey, 6 main question items were used to evaluate the quality of the generated clinical vignettes and their educational utility, which are information quality, information accuracy, educational usefulness, clinical match, terminology accuracy (TA), and diagnosis difficulty. Feedback was solicited from physicians specializing in general internal medicine or general medicine and experienced in medical education. Chi-square and Mann-Whitney U tests were performed to identify differences among cases, and linear regression was used to examine trends associated with physicians? experience. Thematic analysis of qualitative feedback was performed to identify areas for improvement and confirm the educational utility of the cases. Results: Of the 73 invited participants, 71 (97%) responded. The respondents, primarily male (64/71, 90%), spanned a broad range of practice years (from 1976 to 2017) and represented diverse hospital sizes throughout Japan. The majority deemed the information quality (mean 0.77, 95% CI 0.75-0.79) and information accuracy (mean 0.68, 95% CI 0.65-0.71) to be satisfactory, with these responses being based on binary data. The average scores assigned were 3.55 (95% CI 3.49-3.60) for educational usefulness, 3.70 (95% CI 3.65-3.75) for clinical match, 3.49 (95% CI 3.44-3.55) for TA, and 2.34 (95% CI 2.28-2.40) for diagnosis difficulty, based on a 5-point Likert scale. Statistical analysis showed significant variability in content quality and relevance across the cases (P<.001 after Bonferroni correction). Participants suggested improvements in generating physical findings, using natural language, and enhancing medical TA. The thematic analysis highlighted the need for clearer documentation, clinical information consistency, content relevance, and patient-centered case presentations. Conclusions: ChatGPT-4?generated medical cases written in Japanese possess considerable potential as resources in medical education, with recognized adequacy in quality and accuracy. Nevertheless, there is a notable need for enhancements in the precision and realism of case details. This study emphasizes ChatGPT-4?s value as an adjunctive educational tool in the medical field, requiring expert oversight for optimal application. 
UR - https://mededu.jmir.org/2024/1/e59133 UR - http://dx.doi.org/10.2196/59133 UR - http://www.ncbi.nlm.nih.gov/pubmed/39137031 ID - info:doi/10.2196/59133 ER - TY - JOUR AU - McBee, C. Joseph AU - Han, Y. Daniel AU - Liu, Li AU - Ma, Leah AU - Adjeroh, A. Donald AU - Xu, Dong AU - Hu, Gangqing PY - 2024/8/7 TI - Assessing ChatGPT?s Competency in Addressing Interdisciplinary Inquiries on Chatbot Uses in Sports Rehabilitation: Simulation Study JO - JMIR Med Educ SP - e51157 VL - 10 KW - ChatGPT KW - chatbots KW - multirole-playing KW - interdisciplinary inquiry KW - medical education KW - sports medicine N2 - Background: ChatGPT showcases exceptional conversational capabilities and extensive cross-disciplinary knowledge. In addition, it can perform multiple roles in a single chat session. This unique multirole-playing feature positions ChatGPT as a promising tool for exploring interdisciplinary subjects. Objective: The aim of this study was to evaluate ChatGPT?s competency in addressing interdisciplinary inquiries based on a case study exploring the opportunities and challenges of chatbot uses in sports rehabilitation. Methods: We developed a model termed PanelGPT to assess ChatGPT?s competency in addressing interdisciplinary topics through simulated panel discussions. Taking chatbot uses in sports rehabilitation as an example of an interdisciplinary topic, we prompted ChatGPT through PanelGPT to role-play a physiotherapist, psychologist, nutritionist, artificial intelligence expert, and athlete in a simulated panel discussion. During the simulation, we posed questions to the panel while ChatGPT acted as both the panelists for responses and the moderator for steering the discussion. We performed the simulation using ChatGPT-4 and evaluated the responses by referring to the literature and our human expertise. Results: By tackling questions related to chatbot uses in sports rehabilitation with respect to patient education, physiotherapy, physiology, nutrition, and ethical considerations, responses from the ChatGPT-simulated panel discussion reasonably pointed to various benefits such as 24/7 support, personalized advice, automated tracking, and reminders. ChatGPT also correctly emphasized the importance of patient education, and identified challenges such as limited interaction modes, inaccuracies in emotion-related advice, assurance of data privacy and security, transparency in data handling, and fairness in model training. It also stressed that chatbots are to assist as a copilot, not to replace human health care professionals in the rehabilitation process. Conclusions: ChatGPT exhibits strong competency in addressing interdisciplinary inquiry by simulating multiple experts from complementary backgrounds, with significant implications in assisting medical education. UR - https://mededu.jmir.org/2024/1/e51157 UR - http://dx.doi.org/10.2196/51157 UR - http://www.ncbi.nlm.nih.gov/pubmed/39042885 ID - info:doi/10.2196/51157 ER - TY - JOUR AU - Aljamaan, Fadi AU - Temsah, Mohamad-Hani AU - Altamimi, Ibraheem AU - Al-Eyadhy, Ayman AU - Jamal, Amr AU - Alhasan, Khalid AU - Mesallam, A. Tamer AU - Farahat, Mohamed AU - Malki, H. 
Khalid PY - 2024/7/31 TI - Reference Hallucination Score for Medical Artificial Intelligence Chatbots: Development and Usability Study JO - JMIR Med Inform SP - e54345 VL - 12 KW - artificial intelligence (AI) chatbots KW - reference hallucination KW - bibliographic verification KW - ChatGPT KW - Perplexity KW - SciSpace KW - Elicit KW - Bing N2 - Background: Artificial intelligence (AI) chatbots have recently gained use in medical practice by health care practitioners. Interestingly, the output of these AI chatbots was found to have varying degrees of hallucination in content and references. Such hallucinations generate doubts about their output and their implementation. Objective: The aim of our study was to propose a reference hallucination score (RHS) to evaluate the authenticity of AI chatbots? citations. Methods: Six AI chatbots were challenged with the same 10 medical prompts, requesting 10 references per prompt. The RHS is composed of 6 bibliographic items and the reference?s relevance to prompts? keywords. RHS was calculated for each reference, prompt, and type of prompt (basic vs complex). The average RHS was calculated for each AI chatbot and compared across the different types of prompts and AI chatbots. Results: Bard failed to generate any references. ChatGPT 3.5 and Bing generated the highest RHS (score=11), while Elicit and SciSpace generated the lowest RHS (score=1), and Perplexity generated a middle RHS (score=7). The highest degree of hallucination was observed for reference relevancy to the prompt keywords (308/500, 61.6%), while the lowest was for reference titles (169/500, 33.8%). ChatGPT and Bing had comparable RHS (? coefficient=?0.069; P=.32), while Perplexity had significantly lower RHS than ChatGPT (? coefficient=?0.345; P<.001). AI chatbots generally had significantly higher RHS when prompted with scenarios or complex format prompts (? coefficient=0.486; P<.001). Conclusions: The variation in RHS underscores the necessity for a robust reference evaluation tool to improve the authenticity of AI chatbots. Further, the variations highlight the importance of verifying their output and citations. Elicit and SciSpace had negligible hallucination, while ChatGPT and Bing had critical hallucination levels. The proposed AI chatbots? RHS could contribute to ongoing efforts to enhance AI?s general reliability in medical research. UR - https://medinform.jmir.org/2024/1/e54345 UR - http://dx.doi.org/10.2196/54345 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/54345 ER - TY - JOUR AU - Zhui, Li AU - Yhap, Nina AU - Liping, Liu AU - Zhengjie, Wang AU - Zhonghao, Xiong AU - Xiaoshu, Yuan AU - Hong, Cui AU - Xuexiu, Liu AU - Wei, Ren PY - 2024/7/25 TI - Impact of Large Language Models on Medical Education and Teaching Adaptations JO - JMIR Med Inform SP - e55933 VL - 12 KW - large language models KW - medical education KW - opportunities KW - challenges KW - critical thinking KW - educator UR - https://medinform.jmir.org/2024/1/e55933 UR - http://dx.doi.org/10.2196/55933 ID - info:doi/10.2196/55933 ER - TY - JOUR AU - Burke, B. Harry AU - Hoang, Albert AU - Lopreiato, O. 
Joseph AU - King, Heidi AU - Hemmer, Paul AU - Montgomery, Michael AU - Gagarin, Viktoria PY - 2024/7/25 TI - Assessing the Ability of a Large Language Model to Score Free-Text Medical Student Clinical Notes: Quantitative Study JO - JMIR Med Educ SP - e56342 VL - 10 KW - medical education KW - generative artificial intelligence KW - natural language processing KW - ChatGPT KW - generative pretrained transformer KW - standardized patients KW - clinical notes KW - free-text notes KW - history and physical examination KW - large language model KW - LLM KW - medical student KW - medical students KW - clinical information KW - artificial intelligence KW - AI KW - patients KW - patient KW - medicine N2 - Background: Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback regarding the quality of their free-text clinical notes. Objective: The goal of this study was to assess the ability of ChatGPT 3.5, a large language model, to score medical students? free-text history and physical notes. Methods: This is a single-institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free-text history and physical note of their interaction. The students? notes were scored independently by the standardized patients and ChatGPT using a prespecified scoring rubric that consisted of 85 case elements. The measure of accuracy was percent correct. Results: The study population consisted of 168 first-year medical students. There was a total of 14,280 scores. The ChatGPT incorrect scoring rate was 1.0%, and the standardized patient incorrect scoring rate was 7.2%. The ChatGPT error rate was 86%, lower than the standardized patient error rate. The ChatGPT mean incorrect scoring rate of 12 (SD 11) was significantly lower than the standardized patient mean incorrect scoring rate of 85 (SD 74; P=.002). Conclusions: ChatGPT demonstrated a significantly lower error rate compared to standardized patients. This is the first study to assess the ability of a generative pretrained transformer (GPT) program to score medical students? standardized patient-based free-text clinical notes. It is expected that, in the near future, large language models will provide real-time feedback to practicing physicians regarding their free-text notes. GPT artificial intelligence programs represent an important advance in medical education and medical practice. 
UR - https://mededu.jmir.org/2024/1/e56342 UR - http://dx.doi.org/10.2196/56342 ID - info:doi/10.2196/56342 ER - TY - JOUR AU - Noroozi, Mohammad AU - St John, Ace AU - Masino, Caterina AU - Laplante, Simon AU - Hunter, Jaryd AU - Brudno, Michael AU - Madani, Amin AU - Kersten-Oertel, Marta PY - 2024/7/25 TI - Education in Laparoscopic Cholecystectomy: Design and Feasibility Study of the LapBot Safe Chole Mobile Game JO - JMIR Form Res SP - e52878 VL - 8 KW - gamification KW - serious games KW - surgery KW - education KW - laparoscopic cholecystectomy KW - artificial intelligence KW - AI KW - laparoscope KW - gallbladder KW - cholecystectomy KW - mobile game KW - gamify KW - educational game KW - interactive KW - decision-making KW - mobile phone N2 - Background:  Major bile duct injuries during laparoscopic cholecystectomy (LC), often stemming from errors in surgical judgment and visual misperception of critical anatomy, significantly impact morbidity, mortality, disability, and health care costs. Objective:  To enhance safe LC learning, we developed an educational mobile game, LapBot Safe Chole, which uses an artificial intelligence (AI) model to provide real-time coaching and feedback, improving intraoperative decision-making. Methods:  LapBot Safe Chole offers a free, accessible simulated learning experience with real-time AI feedback. Players engage with intraoperative LC scenarios (short video clips) and identify ideal dissection zones. After the response, users receive an accuracy score from a validated AI algorithm. The game consists of 5 levels of increasing difficulty based on the Parkland grading scale for cholecystitis. Results:  Beta testing (n=29) showed score improvements with each round, with attendings and senior trainees achieving top scores faster than junior residents. Learning curves and progression distinguished candidates, with a significant association between user level and scores (P=.003). Players found LapBot enjoyable and educational. Conclusions:  LapBot Safe Chole effectively integrates safe LC principles into a fun, accessible, and educational game using AI-generated feedback. Initial beta testing supports the validity of the assessment scores and suggests high adoption and engagement potential among surgical trainees. UR - https://formative.jmir.org/2024/1/e52878 UR - http://dx.doi.org/10.2196/52878 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/52878 ER - TY - JOUR AU - Cherif, Hela AU - Moussa, Chirine AU - Missaoui, Mouhaymen Abdel AU - Salouage, Issam AU - Mokaddem, Salma AU - Dhahri, Besma PY - 2024/7/23 TI - Appraisal of ChatGPT?s Aptitude for Medical Education: Comparative Analysis With Third-Year Medical Students in a Pulmonology Examination JO - JMIR Med Educ SP - e52818 VL - 10 KW - medical education KW - ChatGPT KW - GPT KW - artificial intelligence KW - natural language processing KW - NLP KW - pulmonary medicine KW - pulmonary KW - lung KW - lungs KW - respiratory KW - respiration KW - pneumology KW - comparative analysis KW - large language models KW - LLMs KW - LLM KW - language model KW - generative AI KW - generative artificial intelligence KW - generative KW - exams KW - exam KW - examinations KW - examination N2 - Background: The rapid evolution of ChatGPT has generated substantial interest and led to extensive discussions in both public and academic domains, particularly in the context of medical education. 
Objective: This study aimed to evaluate ChatGPT?s performance in a pulmonology examination through a comparative analysis with that of third-year medical students. Methods: In this cross-sectional study, we conducted a comparative analysis with 2 distinct groups. The first group comprised 244 third-year medical students who had previously taken our institution?s 2020 pulmonology examination, which was conducted in French. The second group involved ChatGPT-3.5 in 2 separate sets of conversations: without contextualization (V1) and with contextualization (V2). In both V1 and V2, ChatGPT received the same set of questions administered to the students. Results: V1 demonstrated exceptional proficiency in radiology, microbiology, and thoracic surgery, surpassing the majority of medical students in these domains. However, it faced challenges in pathology, pharmacology, and clinical pneumology. In contrast, V2 consistently delivered more accurate responses across various question categories, regardless of the specialization. ChatGPT exhibited suboptimal performance in multiple choice questions compared to medical students. V2 excelled in responding to structured open-ended questions. Both ChatGPT conversations, particularly V2, outperformed students in addressing questions of low and intermediate difficulty. Interestingly, students showcased enhanced proficiency when confronted with highly challenging questions. V1 fell short of passing the examination. Conversely, V2 successfully achieved examination success, outperforming 139 (62.1%) medical students. Conclusions: While ChatGPT has access to a comprehensive web-based data set, its performance closely mirrors that of an average medical student. Outcomes are influenced by question format, item complexity, and contextual nuances. The model faces challenges in medical contexts requiring information synthesis, advanced analytical aptitude, and clinical judgment, as well as in non-English language assessments and when confronted with data outside mainstream internet sources. UR - https://mededu.jmir.org/2024/1/e52818 UR - http://dx.doi.org/10.2196/52818 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/52818 ER - TY - JOUR AU - Laymouna, Moustafa AU - Ma, Yuanchao AU - Lessard, David AU - Schuster, Tibor AU - Engler, Kim AU - Lebouché, Bertrand PY - 2024/7/23 TI - Roles, Users, Benefits, and Limitations of Chatbots in Health Care: Rapid Review JO - J Med Internet Res SP - e56930 VL - 26 KW - chatbot KW - conversational agent KW - conversational assistant KW - user-computer interface KW - digital health KW - mobile health KW - electronic health KW - telehealth KW - artificial intelligence KW - AI KW - health information technology N2 - Background: Chatbots, or conversational agents, have emerged as significant tools in health care, driven by advancements in artificial intelligence and digital technology. These programs are designed to simulate human conversations, addressing various health care needs. However, no comprehensive synthesis of health care chatbots? roles, users, benefits, and limitations is available to inform future research and application in the field. Objective: This review aims to describe health care chatbots? characteristics, focusing on their diverse roles in the health care pathway, user groups, benefits, and limitations. 
Methods: A rapid review of published literature from 2017 to 2023 was performed with a search strategy developed in collaboration with a health sciences librarian and implemented in the MEDLINE and Embase databases. Primary research studies reporting on chatbot roles or benefits in health care were included. Two reviewers dual-screened the search results. Extracted data on chatbot roles, users, benefits, and limitations were subjected to content analysis. Results: The review categorized chatbot roles into 2 themes: delivery of remote health services, including patient support, care management, education, skills building, and health behavior promotion, and provision of administrative assistance to health care providers. User groups spanned across patients with chronic conditions as well as patients with cancer; individuals focused on lifestyle improvements; and various demographic groups such as women, families, and older adults. Professionals and students in health care also emerged as significant users, alongside groups seeking mental health support, behavioral change, and educational enhancement. The benefits of health care chatbots were also classified into 2 themes: improvement of health care quality and efficiency and cost-effectiveness in health care delivery. The identified limitations encompassed ethical challenges, medicolegal and safety concerns, technical difficulties, user experience issues, and societal and economic impacts. Conclusions: Health care chatbots offer a wide spectrum of applications, potentially impacting various aspects of health care. While they are promising tools for improving health care efficiency and quality, their integration into the health care system must be approached with consideration of their limitations to ensure optimal, safe, and equitable use. UR - https://www.jmir.org/2024/1/e56930 UR - http://dx.doi.org/10.2196/56930 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/56930 ER - TY - JOUR AU - Tolentino, Raymond AU - Baradaran, Ashkan AU - Gore, Genevieve AU - Pluye, Pierre AU - Abbasgholizadeh-Rahimi, Samira PY - 2024/7/18 TI - Curriculum Frameworks and Educational Programs in AI for Medical Students, Residents, and Practicing Physicians: Scoping Review JO - JMIR Med Educ SP - e54793 VL - 10 KW - artificial intelligence KW - machine learning KW - curriculum KW - framework KW - medical education KW - review N2 - Background: The successful integration of artificial intelligence (AI) into clinical practice is contingent upon physicians' comprehension of AI principles and its applications. Therefore, it is essential for medical education curricula to incorporate AI topics and concepts, providing future physicians with the foundational knowledge and skills needed. However, there is a knowledge gap in the current understanding and availability of structured AI curriculum frameworks tailored for medical education, which serve as vital guides for instructing and facilitating the learning process. Objective: The overall aim of this study is to synthesize knowledge from the literature on curriculum frameworks and current educational programs that focus on the teaching and learning of AI for medical students, residents, and practicing physicians. Methods: We followed a validated framework and the Joanna Briggs Institute methodological guidance for scoping reviews.
An information specialist performed a comprehensive search from 2000 to May 2023 in the following bibliographic databases: MEDLINE (Ovid), Embase (Ovid), CENTRAL (Cochrane Library), CINAHL (EBSCOhost), and Scopus as well as the gray literature. Papers were limited to English and French languages. This review included papers that describe curriculum frameworks for teaching and learning AI in medicine, irrespective of country. All types of papers and study designs were included, except conference abstracts and protocols. Two reviewers independently screened the titles and abstracts, read the full texts, and extracted data using a validated data extraction form. Disagreements were resolved by consensus, and if this was not possible, the opinion of a third reviewer was sought. We adhered to the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) checklist for reporting the results. Results: Of the 5104 papers screened, 21 papers relevant to our eligibility criteria were identified. In total, 90% (19/21) of the papers altogether described 30 current or previously offered educational programs, and 10% (2/21) of the papers described elements of a curriculum framework. One framework describes a general approach to integrating AI curricula throughout the medical learning continuum and another describes a core curriculum for AI in ophthalmology. No papers described a theory, pedagogy, or framework that guided the educational programs. Conclusions: This review synthesizes recent advancements in AI curriculum frameworks and educational programs within the domain of medical education. To build on this foundation, future researchers are encouraged to engage in a multidisciplinary approach to curriculum redesign. In addition, it is encouraged to initiate dialogues on the integration of AI into medical curriculum planning and to investigate the development, deployment, and appraisal of these innovative educational programs. International Registered Report Identifier (IRRID): RR2-10.11124/JBIES-22-00374 UR - https://mededu.jmir.org/2024/1/e54793 UR - http://dx.doi.org/10.2196/54793 UR - http://www.ncbi.nlm.nih.gov/pubmed/39023999 ID - info:doi/10.2196/54793 ER - TY - JOUR AU - Jo, Eunbeen AU - Song, Sanghoun AU - Kim, Jong-Ho AU - Lim, Subin AU - Kim, Hyeon Ju AU - Cha, Jung-Joon AU - Kim, Young-Min AU - Joo, Joon Hyung PY - 2024/7/8 TI - Assessing GPT-4's Performance in Delivering Medical Advice: Comparative Analysis With Human Experts JO - JMIR Med Educ SP - e51282 VL - 10 KW - GPT-4 KW - medical advice KW - ChatGPT KW - cardiology KW - cardiologist KW - heart KW - advice KW - recommendation KW - recommendations KW - linguistic KW - linguistics KW - artificial intelligence KW - NLP KW - natural language processing KW - chatbot KW - chatbots KW - conversational agent KW - conversational agents KW - response KW - responses N2 - Background: Accurate medical advice is paramount in ensuring optimal patient care, and misinformation can lead to misguided decisions with potentially detrimental health outcomes. The emergence of large language models (LLMs) such as OpenAI's GPT-4 has spurred interest in their potential health care applications, particularly in automated medical consultation. Yet, rigorous investigations comparing their performance to human experts remain sparse.
Objective: This study aims to compare the medical accuracy of GPT-4 with human experts in providing medical advice using real-world user-generated queries, with a specific focus on cardiology. It also sought to analyze the performance of GPT-4 and human experts in specific question categories, including drug or medication information and preliminary diagnoses. Methods: We collected 251 pairs of cardiology-specific questions from general users and answers from human experts via an internet portal. GPT-4 was tasked with generating responses to the same questions. Three independent cardiologists (SL, JHK, and JJC) evaluated the answers provided by both human experts and GPT-4. Using a computer interface, each evaluator compared the pairs and determined which answer was superior, and they quantitatively measured the clarity and complexity of the questions as well as the accuracy and appropriateness of the responses, applying a 3-tiered grading scale (low, medium, and high). Furthermore, a linguistic analysis was conducted to compare the length and vocabulary diversity of the responses using word count and type-token ratio. Results: GPT-4 and human experts displayed comparable efficacy in medical accuracy ("GPT-4 is better" at 132/251, 52.6% vs "Human expert is better" at 119/251, 47.4%). In accuracy level categorization, humans had more high-accuracy responses than GPT-4 (50/237, 21.1% vs 30/238, 12.6%) but also a greater proportion of low-accuracy responses (11/237, 4.6% vs 1/238, 0.4%; P=.001). GPT-4 responses were generally longer and used a less diverse vocabulary than those of human experts, potentially enhancing their comprehensibility for general users (sentence count: mean 10.9, SD 4.2 vs mean 5.9, SD 3.7; P<.001; type-token ratio: mean 0.69, SD 0.07 vs mean 0.79, SD 0.09; P<.001). Nevertheless, human experts outperformed GPT-4 in specific question categories, notably those related to drug or medication information and preliminary diagnoses. These findings highlight the limitations of GPT-4 in providing advice based on clinical experience. Conclusions: GPT-4 has shown promising potential in automated medical consultation, with comparable medical accuracy to human experts. However, challenges remain, particularly in the realm of nuanced clinical judgment. Future improvements in LLMs may require the integration of specific clinical reasoning pathways and regulatory oversight for safe use. Further research is needed to understand the full potential of LLMs across various medical specialties and conditions. UR - https://mededu.jmir.org/2024/1/e51282 UR - http://dx.doi.org/10.2196/51282 ID - info:doi/10.2196/51282 ER - TY - JOUR AU - Hassanipour, Soheil AU - Nayak, Sandeep AU - Bozorgi, Ali AU - Keivanlou, Mohammad-Hossein AU - Dave, Tirth AU - Alotaibi, Abdulhadi AU - Joukar, Farahnaz AU - Mellatdoust, Parinaz AU - Bakhshi, Arash AU - Kuriyakose, Dona AU - Polisetty, D.
Lakshmi AU - Chimpiri, Mallika AU - Amini-Salehi, Ehsan PY - 2024/7/8 TI - The Ability of ChatGPT in Paraphrasing Texts and Reducing Plagiarism: A Descriptive Analysis JO - JMIR Med Educ SP - e53308 VL - 10 KW - ChatGPT KW - paraphrasing KW - text generation KW - prompts KW - academic journals KW - plagiarize KW - plagiarism KW - paraphrase KW - wording KW - LLM KW - LLMs KW - language model KW - language models KW - prompt KW - generative KW - artificial intelligence KW - NLP KW - natural language processing KW - rephrase KW - plagiarizing KW - honesty KW - integrity KW - texts KW - text KW - textual KW - generation KW - large language model KW - large language models N2 - Background: The introduction of ChatGPT by OpenAI has garnered significant attention. Among its capabilities, paraphrasing stands out. Objective: This study aims to investigate the satisfactory levels of plagiarism in the paraphrased text produced by this chatbot. Methods: Three texts of varying lengths were presented to ChatGPT. ChatGPT was then instructed to paraphrase the provided texts using five different prompts. In the subsequent stage of the study, the texts were divided into separate paragraphs, and ChatGPT was requested to paraphrase each paragraph individually. Lastly, in the third stage, ChatGPT was asked to paraphrase the texts it had previously generated. Results: The average plagiarism rate in the texts generated by ChatGPT was 45% (SD 10%). ChatGPT exhibited a substantial reduction in plagiarism for the provided texts (mean difference −0.51, 95% CI −0.54 to −0.48; P<.001). Furthermore, when comparing the second attempt with the initial attempt, a significant decrease in the plagiarism rate was observed (mean difference −0.06, 95% CI −0.08 to −0.03; P<.001). The number of paragraphs in the texts demonstrated a noteworthy association with the percentage of plagiarism, with texts consisting of a single paragraph exhibiting the lowest plagiarism rate (P<.001). Conclusion: Although ChatGPT demonstrates a notable reduction of plagiarism within texts, the existing levels of plagiarism remain relatively high. This underscores a crucial caution for researchers when incorporating this chatbot into their work. UR - https://mededu.jmir.org/2024/1/e53308 UR - http://dx.doi.org/10.2196/53308 ID - info:doi/10.2196/53308 ER - TY - JOUR AU - Shikino, Kiyoshi AU - Shimizu, Taro AU - Otsuka, Yuki AU - Tago, Masaki AU - Takahashi, Hiromizu AU - Watari, Takashi AU - Sasaki, Yosuke AU - Iizuka, Gemmei AU - Tamura, Hiroki AU - Nakashima, Koichi AU - Kunitomo, Kotaro AU - Suzuki, Morika AU - Aoyama, Sayaka AU - Kosaka, Shintaro AU - Kawahigashi, Teiko AU - Matsumoto, Tomohiro AU - Orihara, Fumina AU - Morikawa, Toru AU - Nishizawa, Toshinori AU - Hoshina, Yoji AU - Yamamoto, Yu AU - Matsuo, Yuichiro AU - Unoki, Yuto AU - Kimura, Hirofumi AU - Tokushima, Midori AU - Watanuki, Satoshi AU - Saito, Takuma AU - Otsuka, Fumio AU - Tokuda, Yasuharu PY - 2024/6/21 TI - Evaluation of ChatGPT-Generated Differential Diagnosis for Common Diseases With Atypical Presentation: Descriptive Research JO - JMIR Med Educ SP - e58758 VL - 10 KW - atypical presentation KW - ChatGPT KW - common disease KW - diagnostic accuracy KW - diagnosis KW - patient safety N2 - Background: The persistence of diagnostic errors, despite advances in medical knowledge and diagnostics, highlights the importance of understanding atypical disease presentations and their contribution to mortality and morbidity.
Artificial intelligence (AI), particularly generative pre-trained transformers like GPT-4, holds promise for improving diagnostic accuracy, but requires further exploration in handling atypical presentations. Objective: This study aimed to assess the diagnostic accuracy of ChatGPT in generating differential diagnoses for atypical presentations of common diseases, with a focus on the model's reliance on patient history during the diagnostic process. Methods: We used 25 clinical vignettes from the Journal of Generalist Medicine characterizing atypical manifestations of common diseases. Two general medicine physicians categorized the cases based on atypicality. ChatGPT was then used to generate differential diagnoses based on the clinical information provided. The concordance between AI-generated and final diagnoses was measured, with a focus on the top-ranked disease (top 1) and the top 5 differential diagnoses (top 5). Results: ChatGPT's diagnostic accuracy decreased with an increase in atypical presentation. For category 1 (C1) cases, the concordance rates were 17% (n=1) for the top 1 and 67% (n=4) for the top 5. Categories 3 (C3) and 4 (C4) showed a 0% concordance for top 1 and markedly lower rates for the top 5, indicating difficulties in handling highly atypical cases. The χ² test revealed no significant difference in the top 1 differential diagnosis accuracy between less atypical (C1+C2) and more atypical (C3+C4) groups (χ²₁=2.07; n=25; P=.13). However, a significant difference was found in the top 5 analyses, with less atypical cases showing higher accuracy (χ²₁=4.01; n=25; P=.048). Conclusions: ChatGPT-4 demonstrates potential as an auxiliary tool for diagnosing typical and mildly atypical presentations of common diseases. However, its performance declines with greater atypicality. The study findings underscore the need for AI systems to encompass a broader range of linguistic capabilities, cultural understanding, and diverse clinical scenarios to improve diagnostic utility in real-world settings. UR - https://mededu.jmir.org/2024/1/e58758 UR - http://dx.doi.org/10.2196/58758 ID - info:doi/10.2196/58758 ER - TY - JOUR AU - Zhang, Fang AU - Liu, Xiaoliu AU - Wu, Wenyan AU - Zhu, Shiben PY - 2024/6/13 TI - Evolution of Chatbots in Nursing Education: Narrative Review JO - JMIR Med Educ SP - e54987 VL - 10 KW - nursing education KW - chatbots KW - artificial intelligence KW - narrative review KW - ChatGPT N2 - Background: The integration of chatbots in nursing education is a rapidly evolving area with potential transformative impacts. This narrative review aims to synthesize and analyze the existing literature on chatbots in nursing education. Objective: This study aims to comprehensively examine the temporal trends, international distribution, study designs, and implications of chatbots in nursing education. Methods: A comprehensive search was conducted across 3 databases (PubMed, Web of Science, and Embase) following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram. Results: A total of 40 articles met the eligibility criteria, with a notable increase of publications in 2023 (n=28, 70%). Temporal analysis revealed a notable surge in publications from 2021 to 2023, emphasizing the growing scholarly interest. Geographically, Taiwan province made substantial contributions (n=8, 20%), followed by the United States (n=6, 15%) and South Korea (n=4, 10%).
Study designs varied, with reviews (n=8, 20%) and editorials (n=7, 18%) being predominant, showcasing the richness of research in this domain. Conclusions: Integrating chatbots into nursing education presents a promising yet relatively unexplored avenue. This review highlights the urgent need for original research, emphasizing the importance of ethical considerations. UR - https://mededu.jmir.org/2024/1/e54987 UR - http://dx.doi.org/10.2196/54987 ID - info:doi/10.2196/54987 ER - TY - JOUR AU - Srinivasan, Muthuvenkatachalam AU - Venugopal, Ambili AU - Venkatesan, Latha AU - Kumar, Rajesh PY - 2024/6/13 TI - Navigating the Pedagogical Landscape: Exploring the Implications of AI and Chatbots in Nursing Education JO - JMIR Nursing SP - e52105 VL - 7 KW - AI KW - artificial intelligence KW - ChatGPT KW - chatbots KW - nursing education KW - education KW - chatbot KW - nursing KW - ethical KW - ethics KW - ethical consideration KW - accessible KW - learning KW - efficiency KW - student KW - student engagement KW - student learning UR - https://nursing.jmir.org/2024/1/e52105 UR - http://dx.doi.org/10.2196/52105 UR - http://www.ncbi.nlm.nih.gov/pubmed/38870516 ID - info:doi/10.2196/52105 ER - TY - JOUR AU - Moldt, Julia-Astrid AU - Festl-Wietek, Teresa AU - Fuhl, Wolfgang AU - Zabel, Susanne AU - Claassen, Manfred AU - Wagner, Samuel AU - Nieselt, Kay AU - Herrmann-Werner, Anne PY - 2024/6/12 TI - Assessing AI Awareness and Identifying Essential Competencies: Insights From Key Stakeholders in Integrating AI Into Medical Education JO - JMIR Med Educ SP - e58355 VL - 10 KW - AI in medicine KW - artificial intelligence KW - medical education KW - medical students KW - qualitative approach KW - qualitative analysis KW - needs assessment N2 - Background: The increasing importance of artificial intelligence (AI) in health care has generated a growing need for health care professionals to possess a comprehensive understanding of AI technologies, requiring an adaptation in medical education. Objective: This paper explores stakeholder perceptions and expectations regarding AI in medicine and examines their potential impact on the medical curriculum. This study project aims to assess the AI experiences and awareness of different stakeholders and identify essential AI-related topics in medical education to define necessary competencies for students. Methods: The empirical data were collected as part of the TüKITZMed project between August 2022 and March 2023, using a semistructured qualitative interview. These interviews were administered to a diverse group of stakeholders to explore their experiences and perspectives of AI in medicine. A qualitative content analysis of the collected data was conducted using MAXQDA software. Results: Semistructured interviews were conducted with 38 participants (6 lecturers, 9 clinicians, 10 students, 6 AI experts, and 7 institutional stakeholders). The qualitative content analysis revealed 6 primary categories with a total of 24 subcategories to answer the research questions. The evaluation of the stakeholders' statements revealed several commonalities and differences regarding their understanding of AI. Crucial identified AI themes based on the main categories were as follows: possible curriculum contents, skills, and competencies; programming skills; curriculum scope; and curriculum structure. Conclusions: The analysis emphasizes integrating AI into medical curricula to ensure students' proficiency in clinical applications.
Standardized AI comprehension is crucial for defining and teaching relevant content. Considering diverse perspectives in implementation is essential to comprehensively define AI in the medical context, addressing gaps and facilitating effective solutions for future AI use in medical studies. The results provide insights into potential curriculum content and structure, including aspects of AI in medicine. UR - https://mededu.jmir.org/2024/1/e58355 UR - http://dx.doi.org/10.2196/58355 ID - info:doi/10.2196/58355 ER - TY - JOUR AU - Arango-Ibanez, Pablo Juan AU - Posso-Nuñez, Alejandro Jose AU - Díaz-Solórzano, Pablo Juan AU - Cruz-Suárez, Gustavo PY - 2024/5/24 TI - Evidence-Based Learning Strategies in Medicine Using AI JO - JMIR Med Educ SP - e54507 VL - 10 KW - artificial intelligence KW - large language models KW - ChatGPT KW - active recall KW - memory cues KW - LLMs KW - evidence-based KW - learning strategy KW - medicine KW - AI KW - medical education KW - knowledge KW - relevance UR - https://mededu.jmir.org/2024/1/e54507 UR - http://dx.doi.org/10.2196/54507 ID - info:doi/10.2196/54507 ER - TY - JOUR AU - Takagi, Soshi AU - Koda, Masahide AU - Watari, Takashi PY - 2024/5/23 TI - The Performance of ChatGPT-4V in Interpreting Images and Tables in the Japanese Medical Licensing Exam JO - JMIR Med Educ SP - e54283 VL - 10 KW - ChatGPT KW - medical licensing examination KW - generative artificial intelligence KW - medical education KW - large language model KW - images KW - tables KW - artificial intelligence KW - AI KW - Japanese KW - reliability KW - medical application KW - medical applications KW - diagnostic KW - diagnostics KW - online data KW - web-based data UR - https://mededu.jmir.org/2024/1/e54283 UR - http://dx.doi.org/10.2196/54283 ID - info:doi/10.2196/54283 ER - TY - JOUR AU - Wang, Shangqiguo AU - Mo, Changgeng AU - Chen, Yuan AU - Dai, Xiaolu AU - Wang, Huiyi AU - Shen, Xiaoli PY - 2024/4/26 TI - Exploring the Performance of ChatGPT-4 in the Taiwan Audiologist Qualification Examination: Preliminary Observational Study Highlighting the Potential of AI Chatbots in Hearing Care JO - JMIR Med Educ SP - e55595 VL - 10 KW - ChatGPT KW - medical education KW - artificial intelligence KW - AI KW - audiology KW - hearing care KW - natural language processing KW - large language model KW - Taiwan KW - hearing KW - hearing specialist KW - audiologist KW - examination KW - information accuracy KW - educational technology KW - healthcare services KW - chatbot KW - health care services N2 - Background: Artificial intelligence (AI) chatbots, such as ChatGPT-4, have shown immense potential for application across various aspects of medicine, including medical education, clinical practice, and research. Objective: This study aimed to evaluate the performance of ChatGPT-4 in the 2023 Taiwan Audiologist Qualification Examination, thereby preliminarily exploring the potential utility of AI chatbots in the fields of audiology and hearing care services. Methods: ChatGPT-4 was tasked to provide answers and reasoning for the 2023 Taiwan Audiologist Qualification Examination. The examination encompassed six subjects: (1) basic auditory science, (2) behavioral audiology, (3) electrophysiological audiology, (4) principles and practice of hearing devices, (5) health and rehabilitation of the auditory and balance systems, and (6) auditory and speech communication disorders (including professional ethics).
Each subject included 50 multiple-choice questions, with the exception of behavioral audiology, which had 49 questions, amounting to a total of 299 questions. Results: The correct answer rates across the 6 subjects were as follows: 88% for basic auditory science, 63% for behavioral audiology, 58% for electrophysiological audiology, 72% for principles and practice of hearing devices, 80% for health and rehabilitation of the auditory and balance systems, and 86% for auditory and speech communication disorders (including professional ethics). The overall accuracy rate for the 299 questions was 75%, which surpasses the examination's passing criteria of an average 60% accuracy rate across all subjects. A comprehensive review of ChatGPT-4's responses indicated that incorrect answers were predominantly due to information errors. Conclusions: ChatGPT-4 demonstrated a robust performance in the Taiwan Audiologist Qualification Examination, showcasing effective logical reasoning skills. Our results suggest that with enhanced information accuracy, ChatGPT-4's performance could be further improved. This study indicates significant potential for the application of AI chatbots in audiology and hearing care services. UR - https://mededu.jmir.org/2024/1/e55595 UR - http://dx.doi.org/10.2196/55595 ID - info:doi/10.2196/55595 ER - TY - JOUR AU - Choudhury, Avishek AU - Chaudhry, Zaira PY - 2024/4/25 TI - Large Language Models and User Trust: Consequence of Self-Referential Learning Loop and the Deskilling of Health Care Professionals JO - J Med Internet Res SP - e56764 VL - 26 KW - trust KW - ChatGPT KW - human factors KW - healthcare KW - LLMs KW - large language models KW - LLM user trust KW - AI accountability KW - artificial intelligence KW - AI technology KW - technologies KW - effectiveness KW - policy KW - medical student KW - medical students KW - risk factor KW - quality of care KW - healthcare professional KW - healthcare professionals KW - human element UR - https://www.jmir.org/2024/1/e56764 UR - http://dx.doi.org/10.2196/56764 UR - http://www.ncbi.nlm.nih.gov/pubmed/38662419 ID - info:doi/10.2196/56764 ER - TY - JOUR AU - Wu, Yijun AU - Zheng, Yue AU - Feng, Baijie AU - Yang, Yuqi AU - Kang, Kai AU - Zhao, Ailin PY - 2024/4/10 TI - Embracing ChatGPT for Medical Education: Exploring Its Impact on Doctors and Medical Students JO - JMIR Med Educ SP - e52483 VL - 10 KW - artificial intelligence KW - AI KW - ChatGPT KW - medical education KW - doctors KW - medical students UR - https://mededu.jmir.org/2024/1/e52483 UR - http://dx.doi.org/10.2196/52483 UR - http://www.ncbi.nlm.nih.gov/pubmed/38598263 ID - info:doi/10.2196/52483 ER - TY - JOUR AU - Fukuzawa, Fumitoshi AU - Yanagita, Yasutaka AU - Yokokawa, Daiki AU - Uchida, Shun AU - Yamashita, Shiho AU - Li, Yu AU - Shikino, Kiyoshi AU - Tsukamoto, Tomoko AU - Noda, Kazutaka AU - Uehara, Takanori AU - Ikusaka, Masatomi PY - 2024/4/8 TI - Importance of Patient History in Artificial Intelligence–Assisted Medical Diagnosis: Comparison Study JO - JMIR Med Educ SP - e52674 VL - 10 KW - medical diagnosis KW - ChatGPT KW - AI in medicine KW - diagnostic accuracy KW - patient history KW - medical history KW - artificial intelligence KW - AI KW - physical examination KW - physical examinations KW - laboratory investigation KW - laboratory investigations KW - mHealth KW - accuracy KW - public health KW - United States KW - AI diagnosis KW - treatment KW - male KW - female KW - child KW - children KW - youth KW - adolescent KW - adolescents KW - teen KW - teens
KW - teenager KW - teenagers KW - older adult KW - older adults KW - elder KW - elderly KW - older person KW - older people KW - investigative KW - mobile health KW - digital health N2 - Background: Medical history contributes approximately 80% to a diagnosis, although physical examinations and laboratory investigations increase a physician's confidence in the medical diagnosis. The concept of artificial intelligence (AI) was first proposed more than 70 years ago. Recently, its role in various fields of medicine has grown remarkably. However, no studies have evaluated the importance of patient history in AI-assisted medical diagnosis. Objective: This study explored the contribution of patient history to AI-assisted medical diagnoses and assessed the accuracy of ChatGPT in reaching a clinical diagnosis based on the medical history provided. Methods: Using clinical vignettes of 30 cases identified in The BMJ, we evaluated the accuracy of diagnoses generated by ChatGPT. We compared the diagnoses made by ChatGPT based solely on medical history with the correct diagnoses. We also compared the diagnoses made by ChatGPT after incorporating additional physical examination findings and laboratory data alongside history with the correct diagnoses. Results: ChatGPT accurately diagnosed 76.6% (23/30) of the cases with only the medical history, consistent with previous research targeting physicians. We also found that this rate was 93.3% (28/30) when additional information was included. Conclusions: Although adding additional information improves diagnostic accuracy, patient history remains a significant factor in AI-assisted medical diagnosis. Thus, when using AI in medical diagnosis, it is crucial to include pertinent and correct patient histories for an accurate diagnosis. Our findings emphasize the continued significance of patient history in clinical diagnoses in this age and highlight the need for its integration into AI-assisted medical diagnosis systems. UR - https://mededu.jmir.org/2024/1/e52674 UR - http://dx.doi.org/10.2196/52674 ID - info:doi/10.2196/52674 ER - TY - JOUR AU - Noda, Masao AU - Ueno, Takayoshi AU - Koshu, Ryota AU - Takaso, Yuji AU - Shimada, Dias Mari AU - Saito, Chizu AU - Sugimoto, Hisashi AU - Fushiki, Hiroaki AU - Ito, Makoto AU - Nomura, Akihiro AU - Yoshizaki, Tomokazu PY - 2024/3/28 TI - Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study JO - JMIR Med Educ SP - e57054 VL - 10 KW - artificial intelligence KW - GPT-4v KW - large language model KW - otolaryngology KW - GPT KW - ChatGPT KW - LLM KW - LLMs KW - language model KW - language models KW - head KW - respiratory KW - ENT: ear KW - nose KW - throat KW - neck KW - NLP KW - natural language processing KW - image KW - images KW - exam KW - exams KW - examination KW - examinations KW - answer KW - answers KW - answering KW - response KW - responses N2 - Background: Artificial intelligence models can learn from medical literature and clinical cases and generate answers that rival human experts. However, challenges remain in the analysis of complex data containing images and diagrams. Objective: This study aims to assess the answering capabilities and accuracy of ChatGPT-4 Vision (GPT-4V) for a set of 100 questions, including image-based questions, from the 2023 otolaryngology board certification examination.
Methods: Answers to 100 questions from the 2023 otolaryngology board certification examination, including image-based questions, were generated using GPT-4V. The accuracy rate was evaluated using different prompts, and the presence of images, clinical area of the questions, and variations in the answer content were examined. Results: The accuracy rate for text-only input was, on average, 24.7% but improved to 47.3% with the addition of English translation and prompts (P<.001). The average nonresponse rate for text-only input was 46.3%; this decreased to 2.7% with the addition of English translation and prompts (P<.001). The accuracy rate was lower for image-based questions than for text-only questions across all types of input, with a relatively high nonresponse rate. General questions and questions from the fields of head and neck allergies and nasal allergies had relatively high accuracy rates, which increased with the addition of translation and prompts. In terms of content, questions related to anatomy had the highest accuracy rate. For all content types, the addition of translation and prompts increased the accuracy rate. For image-based questions specifically, the average correct answer rate with text-only input was 30.4%, and that with text-plus-image input was 41.3% (P=.02). Conclusions: Examination of artificial intelligence's answering capabilities for the otolaryngology board certification examination improves our understanding of its potential and limitations in this field. Although improvement was noted with the addition of translation and prompts, the accuracy rate for image-based questions was lower than that for text-based questions, suggesting room for improvement in GPT-4V at this stage. Furthermore, text-plus-image input yielded a higher correct answer rate on image-based questions than text-only input. Our findings imply the usefulness and potential of GPT-4V in medicine; however, future consideration of safe use methods is needed. UR - https://mededu.jmir.org/2024/1/e57054 UR - http://dx.doi.org/10.2196/57054 UR - http://www.ncbi.nlm.nih.gov/pubmed/38546736 ID - info:doi/10.2196/57054 ER - TY - JOUR AU - Gandhi, P. Aravind AU - Joesph, Karen Felista AU - Rajagopal, Vineeth AU - Aparnavi, P. AU - Katkuri, Sushma AU - Dayama, Sonal AU - Satapathy, Prakasini AU - Khatib, Nazli Mahalaqua AU - Gaidhane, Shilpa AU - Zahiruddin, Syed Quazi AU - Behera, Ashish PY - 2024/3/25 TI - Performance of ChatGPT on the India Undergraduate Community Medicine Examination: Cross-Sectional Study JO - JMIR Form Res SP - e49964 VL - 8 KW - artificial intelligence KW - ChatGPT KW - community medicine KW - India KW - large language model KW - medical education KW - digitalization N2 - Background: Medical students may increasingly use large language models (LLMs) in their learning. ChatGPT is an LLM at the forefront of this new development in medical education with the capacity to respond to multidisciplinary questions. Objective: The aim of this study was to evaluate the ability of ChatGPT 3.5 to complete the Indian undergraduate medical examination in the subject of community medicine. We further compared ChatGPT scores with the scores obtained by the students. Methods: The study was conducted at a publicly funded medical college in Hyderabad, India.
The study was based on the internal assessment examination conducted in January 2023 for students in the Bachelor of Medicine and Bachelor of Surgery Final Year–Part I program; the examination of focus included 40 questions (divided between two papers) from the community medicine subject syllabus. Each paper had three sections with different weightage of marks for each section: section one had two long essay-type questions worth 15 marks each, section two had 8 short essay-type questions worth 5 marks each, and section three had 10 short-answer questions worth 3 marks each. The same questions were administered as prompts to ChatGPT 3.5 and the responses were recorded. Apart from scoring ChatGPT responses, two independent evaluators explored the responses to each question to further analyze their quality with regard to three subdomains: relevancy, coherence, and completeness. Each question was scored in these subdomains on a Likert scale of 1-5. The average of the two evaluators was taken as the subdomain score of the question. The proportion of questions with a score of at least 50% of the maximum score (5) in each subdomain was calculated. Results: ChatGPT 3.5 scored 72.3% on paper 1 and 61% on paper 2. The mean score of the 94 students was 43% on paper 1 and 45% on paper 2. The responses of ChatGPT 3.5 were also rated to be satisfactorily relevant, coherent, and complete for most of the questions (>80%). Conclusions: ChatGPT 3.5 appears to have substantial and sufficient knowledge to understand and answer the Indian medical undergraduate examination in the subject of community medicine. ChatGPT may be introduced to students to enable the self-directed learning of community medicine in pilot mode. However, faculty oversight will be required as ChatGPT is still in the initial stages of development, and thus its potential and reliability of medical content from the Indian context need to be further explored comprehensively. UR - https://formative.jmir.org/2024/1/e49964 UR - http://dx.doi.org/10.2196/49964 UR - http://www.ncbi.nlm.nih.gov/pubmed/38526538 ID - info:doi/10.2196/49964 ER - TY - JOUR AU - Magalhães Araujo, Sabrina AU - Cruz-Correia, Ricardo PY - 2024/3/20 TI - Incorporating ChatGPT in Medical Informatics Education: Mixed Methods Study on Student Perceptions and Experiential Integration Proposals JO - JMIR Med Educ SP - e51151 VL - 10 KW - education KW - medical informatics KW - artificial intelligence KW - AI KW - generative language model KW - ChatGPT N2 - Background: The integration of artificial intelligence (AI) technologies, such as ChatGPT, in the educational landscape has the potential to enhance the learning experience of medical informatics students and prepare them for using AI in professional settings. The incorporation of AI in classes aims to develop critical thinking by encouraging students to interact with ChatGPT and critically analyze the responses generated by the chatbot. This approach also helps students develop important skills in the field of biomedical and health informatics to enhance their interaction with AI tools. Objective: The aim of the study is to explore the perceptions of students regarding the use of ChatGPT as a learning tool in their educational context and provide professors with examples of prompts for incorporating ChatGPT into their teaching and learning activities, thereby enhancing the educational experience for students in medical informatics courses.
Methods: This study used a mixed methods approach to gain insights from students regarding the use of ChatGPT in education. To accomplish this, a structured questionnaire was applied to evaluate students' familiarity with ChatGPT, gauge their perceptions of its use, and understand their attitudes toward its use in academic and learning tasks. Learning outcomes of 2 courses were analyzed to propose ChatGPT's incorporation in master's programs in medicine and medical informatics. Results: The majority of students expressed satisfaction with the use of ChatGPT in education, finding it beneficial for various purposes, including generating academic content, brainstorming ideas, and rewriting text. While some participants raised concerns about potential biases and the need for informed use, the overall perception was positive. Additionally, the study proposed integrating ChatGPT into 2 specific courses in the master's programs in medicine and medical informatics. The incorporation of ChatGPT was envisioned to enhance student learning experiences and assist in project planning, programming code generation, examination preparation, workflow exploration, and technical interview preparation, thus advancing medical informatics education. In medical teaching, it will be used as an assistant for simplifying the explanation of concepts and solving complex problems, as well as for generating clinical narratives and patient simulators. Conclusions: The study's valuable insights into medical faculty students' perspectives and integration proposals for ChatGPT serve as an informative guide for professors aiming to enhance medical informatics education. The research delves into the potential of ChatGPT, emphasizes the necessity of collaboration in academic environments, identifies subject areas with discernible benefits, and underscores its transformative role in fostering innovative and engaging learning experiences. The envisaged proposals hold promise in empowering future health care professionals to work in the rapidly evolving era of digital health care. UR - https://mededu.jmir.org/2024/1/e51151 UR - http://dx.doi.org/10.2196/51151 UR - http://www.ncbi.nlm.nih.gov/pubmed/38506920 ID - info:doi/10.2196/51151 ER - TY - JOUR AU - Nakao, Takahiro AU - Miki, Soichiro AU - Nakamura, Yuta AU - Kikuchi, Tomohiro AU - Nomura, Yukihiro AU - Hanaoka, Shouhei AU - Yoshikawa, Takeharu AU - Abe, Osamu PY - 2024/3/12 TI - Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study JO - JMIR Med Educ SP - e54393 VL - 10 KW - AI KW - artificial intelligence KW - LLM KW - large language model KW - language model KW - language models KW - ChatGPT KW - GPT-4 KW - GPT-4V KW - generative pretrained transformer KW - image KW - images KW - imaging KW - response KW - responses KW - exam KW - examination KW - exams KW - examinations KW - answer KW - answers KW - NLP KW - natural language processing KW - chatbot KW - chatbots KW - conversational agent KW - conversational agents KW - medical education N2 - Background: Previous research applying large language models (LLMs) to medicine was focused on text-based information. Recently, multimodal variants of LLMs acquired the capability of recognizing images.
Objective: We aim to evaluate the image recognition capability of generative pretrained transformer (GPT)-4V, a recent multimodal LLM developed by OpenAI, in the medical field by testing how visual information affects its performance to answer questions in the 117th Japanese National Medical Licensing Examination. Methods: We focused on 108 questions that had 1 or more images as part of a question and presented GPT-4V with the same questions under two conditions: (1) with both the question text and associated images and (2) with the question text only. We then compared the difference in accuracy between the 2 conditions using the exact McNemar test. Results: Among the 108 questions with images, GPT-4V's accuracy was 68% (73/108) when presented with images and 72% (78/108) when presented without images (P=.36). For the 2 question categories, clinical and general, the accuracies with and those without images were 71% (70/98) versus 78% (76/98; P=.21) and 30% (3/10) versus 20% (2/10; P≥.99), respectively. Conclusions: The additional information from the images did not significantly improve the performance of GPT-4V in the Japanese National Medical Licensing Examination. UR - https://mededu.jmir.org/2024/1/e54393 UR - http://dx.doi.org/10.2196/54393 UR - http://www.ncbi.nlm.nih.gov/pubmed/38470459 ID - info:doi/10.2196/54393 ER - TY - JOUR AU - Willms, Amanda AU - Liu, Sam PY - 2024/2/29 TI - Exploring the Feasibility of Using ChatGPT to Create Just-in-Time Adaptive Physical Activity mHealth Intervention Content: Case Study JO - JMIR Med Educ SP - e51426 VL - 10 KW - ChatGPT KW - digital health KW - mobile health KW - mHealth KW - physical activity KW - application KW - mobile app KW - mobile apps KW - content creation KW - behavior change KW - app design N2 - Background: Achieving physical activity (PA) guidelines' recommendation of 150 minutes of moderate-to-vigorous PA per week has been shown to reduce the risk of many chronic conditions. Despite the overwhelming evidence in this field, PA levels remain low globally. By creating engaging mobile health (mHealth) interventions through strategies such as just-in-time adaptive interventions (JITAIs) that are tailored to an individual's dynamic state, there is potential to increase PA levels. However, generating personalized content can take a long time due to various versions of content required for the personalization algorithms. ChatGPT presents an incredible opportunity to rapidly produce tailored content; however, there is a lack of studies exploring its feasibility. Objective: This study aimed to (1) explore the feasibility of using ChatGPT to create content for a PA JITAI mobile app and (2) describe lessons learned and future recommendations for using ChatGPT in the development of mHealth JITAI content. Methods: During phase 1, we used Pathverse, a no-code app builder, and ChatGPT to develop a JITAI app to help parents support their child's PA levels. The intervention was developed based on the Multi-Process Action Control (M-PAC) framework, and the necessary behavior change techniques targeting the M-PAC constructs were implemented in the app design to help parents support their child's PA. The acceptability of using ChatGPT for this purpose was discussed to determine its feasibility. In phase 2, we summarized the lessons we learned during the JITAI content development process using ChatGPT and generated recommendations to inform future similar use cases.
Results: In phase 1, by using specific prompts, we efficiently generated content for 13 lessons relating to increasing parental support for their child's PA following the M-PAC framework. It was determined that using ChatGPT for this case study to develop PA content for a JITAI was acceptable. In phase 2, we summarized our recommendations into the following six steps when using ChatGPT to create content for mHealth behavior interventions: (1) determine target behavior, (2) ground the intervention in behavior change theory, (3) design the intervention structure, (4) input intervention structure and behavior change constructs into ChatGPT, (5) revise the ChatGPT response, and (6) customize the response to be used in the intervention. Conclusions: ChatGPT offers a remarkable opportunity for rapid content creation in the context of an mHealth JITAI. Although our case study demonstrated that ChatGPT was acceptable, it is essential to approach its use, along with other language models, with caution. Before delivering content to population groups, expert review is crucial to ensure accuracy and relevancy. Future research and application of these guidelines are imperative as we deepen our understanding of ChatGPT and its interactions with human input. UR - https://mededu.jmir.org/2024/1/e51426 UR - http://dx.doi.org/10.2196/51426 UR - http://www.ncbi.nlm.nih.gov/pubmed/38421689 ID - info:doi/10.2196/51426 ER - TY - JOUR AU - Farhat, Faiza AU - Chaudhry, Moalla Beenish AU - Nadeem, Mohammad AU - Sohail, Saquib Shahab AU - Madsen, Øivind Dag PY - 2024/2/21 TI - Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard JO - JMIR Med Educ SP - e51523 VL - 10 KW - accuracy KW - AI model KW - artificial intelligence KW - Bard KW - ChatGPT KW - educational task KW - GPT-4 KW - Generative Pre-trained Transformers KW - large language models KW - medical education, medical exam KW - natural language processing KW - performance KW - premedical exams KW - suitability N2 - Background: Large language models (LLMs) have revolutionized natural language processing with their ability to generate human-like text through extensive training on large data sets. These models, including Generative Pre-trained Transformers (GPT)-3.5 (OpenAI), GPT-4 (OpenAI), and Bard (Google LLC), find applications beyond natural language processing, attracting interest from academia and industry. Students are actively leveraging LLMs to enhance learning experiences and prepare for high-stakes exams, such as the National Eligibility cum Entrance Test (NEET) in India. Objective: This comparative analysis aims to evaluate the performance of GPT-3.5, GPT-4, and Bard in answering NEET-2023 questions. Methods: In this paper, we evaluated the performance of the 3 mainstream LLMs, namely GPT-3.5, GPT-4, and Google Bard, in answering questions related to the NEET-2023 exam. The questions of the NEET were provided to these artificial intelligence models, and the responses were recorded and compared against the correct answers from the official answer key. Consensus was used to evaluate the performance of all 3 models. Results: It was evident that GPT-4 passed the entrance test with flying colors (300/700, 42.9%), showcasing exceptional performance. On the other hand, GPT-3.5 managed to meet the qualifying criteria, but with a substantially lower score (145/700, 20.7%). However, Bard (115/700, 16.4%) failed to meet the qualifying criteria and did not pass the test.
GPT-4 demonstrated consistent superiority over Bard and GPT-3.5 in all 3 subjects. Specifically, GPT-4 achieved accuracy rates of 73% (29/40) in physics, 44% (16/36) in chemistry, and 51% (50/99) in biology. Conversely, GPT-3.5 attained an accuracy rate of 45% (18/40) in physics, 33% (13/26) in chemistry, and 34% (34/99) in biology. The accuracy consensus metric showed that the matching responses between GPT-4 and Bard, as well as GPT-4 and GPT-3.5, had higher incidences of being correct, at 0.56 and 0.57, respectively, compared to the matching responses between Bard and GPT-3.5, which stood at 0.42. When all 3 models were considered together, their matching responses reached the highest accuracy consensus of 0.59. Conclusions: The study's findings provide valuable insights into the performance of GPT-3.5, GPT-4, and Bard in answering NEET-2023 questions. GPT-4 emerged as the most accurate model, highlighting its potential for educational applications. Cross-checking responses across models may result in confusion as the compared models (as duos or a trio) tend to agree on only a little over half of the correct responses. Using GPT-4 as one of the compared models will result in higher accuracy consensus. The results underscore the suitability of LLMs for high-stakes exams and their positive impact on education. Additionally, the study establishes a benchmark for evaluating and enhancing LLMs' performance in educational tasks, promoting responsible and informed use of these models in diverse learning environments. UR - https://mededu.jmir.org/2024/1/e51523 UR - http://dx.doi.org/10.2196/51523 UR - http://www.ncbi.nlm.nih.gov/pubmed/38381486 ID - info:doi/10.2196/51523 ER - TY - JOUR AU - Abid, Areeba AU - Murugan, Avinash AU - Banerjee, Imon AU - Purkayastha, Saptarshi AU - Trivedi, Hari AU - Gichoya, Judy PY - 2024/2/20 TI - AI Education for Fourth-Year Medical Students: Two-Year Experience of a Web-Based, Self-Guided Curriculum and Mixed Methods Study JO - JMIR Med Educ SP - e46500 VL - 10 KW - medical education KW - machine learning KW - artificial intelligence KW - elective curriculum KW - medical student KW - student KW - students KW - elective KW - electives KW - curricula KW - curriculum KW - lesson plan KW - lesson plans KW - educators KW - educator KW - teacher KW - teachers KW - teaching KW - computer programming KW - programming KW - coding KW - programmer KW - programmers KW - self guided KW - self directed N2 - Background: Artificial intelligence (AI) and machine learning (ML) are poised to have a substantial impact in the health care space. While a plethora of web-based resources exist to teach programming skills and ML model development, there are few introductory curricula specifically tailored to medical students without a background in data science or programming. Programs that do exist are often restricted to a specific specialty. Objective: We hypothesized that a 1-month elective for fourth-year medical students, composed of high-quality existing web-based resources and a project-based structure, would empower students to learn about the impact of AI and ML in their chosen specialty and begin contributing to innovation in their field of interest. This study aims to evaluate the success of this elective in improving self-reported confidence scores in AI and ML. The authors also share our curriculum with other educators who may be interested in its adoption.
Methods: This elective was offered in 2 tracks: technical (for students who were already competent programmers) and nontechnical (with no technical prerequisites, focusing on building a conceptual understanding of AI and ML). Students established a conceptual foundation of knowledge using curated web-based resources and relevant research papers, and were then tasked with completing 3 projects in their chosen specialty: a data set analysis, a literature review, and an AI project proposal. The project-based nature of the elective was designed to be self-guided and flexible to each student's interest area and career goals. Students' success was measured by self-reported confidence in AI and ML skills in pre and postsurveys. Qualitative feedback on students' experiences was also collected. Results: This web-based, self-directed elective was offered on a pass-or-fail basis each month to fourth-year students at Emory University School of Medicine beginning in May 2021. As of June 2022, a total of 19 students had successfully completed the elective, representing a wide range of chosen specialties: diagnostic radiology (n=3), general surgery (n=1), internal medicine (n=5), neurology (n=2), obstetrics and gynecology (n=1), ophthalmology (n=1), orthopedic surgery (n=1), otolaryngology (n=2), pathology (n=2), and pediatrics (n=1). Students' self-reported confidence scores for AI and ML rose by 66% after this 1-month elective. In qualitative surveys, students overwhelmingly reported enthusiasm and satisfaction with the course and commented that the self-direction and flexibility and the project-based design of the course were essential. Conclusions: Course participants were successful in diving deep into applications of AI in their widely ranging specialties, produced substantial project deliverables, and generally reported satisfaction with their elective experience. The authors are hopeful that a brief, 1-month investment in AI and ML education during medical school will empower this next generation of physicians to pave the way for AI and ML innovation in health care. UR - https://mededu.jmir.org/2024/1/e46500 UR - http://dx.doi.org/10.2196/46500 UR - http://www.ncbi.nlm.nih.gov/pubmed/38376896 ID - info:doi/10.2196/46500 ER - TY - JOUR AU - Weidener, Lukas AU - Fischer, Michael PY - 2024/2/9 TI - Proposing a Principle-Based Approach for Teaching AI Ethics in Medical Education JO - JMIR Med Educ SP - e55368 VL - 10 KW - artificial intelligence KW - AI KW - ethics KW - artificial intelligence ethics KW - AI ethics KW - medical education KW - medicine KW - medical artificial intelligence ethics KW - medical AI ethics KW - medical ethics KW - public health ethics UR - https://mededu.jmir.org/2024/1/e55368 UR - http://dx.doi.org/10.2196/55368 UR - http://www.ncbi.nlm.nih.gov/pubmed/38285931 ID - info:doi/10.2196/55368 ER - TY - JOUR AU - Gray, Megan AU - Baird, Austin AU - Sawyer, Taylor AU - James, Jasmine AU - DeBroux, Thea AU - Bartlett, Michelle AU - Krick, Jeanne AU - Umoren, Rachel PY - 2024/2/1 TI - Increasing Realism and Variety of Virtual Patient Dialogues for Prenatal Counseling Education Through a Novel Application of ChatGPT: Exploratory Observational Study JO - JMIR Med Educ SP - e50705 VL - 10 KW - prenatal counseling KW - virtual health KW - virtual patient KW - simulation KW - neonatology KW - ChatGPT KW - AI KW - artificial intelligence N2 - Background: Using virtual patients, facilitated by natural language processing, provides a valuable educational experience for learners.
Generating a large, varied sample of realistic and appropriate responses for virtual patients is challenging. Artificial intelligence (AI) programs can be a viable source for these responses, but their utility for this purpose has not been explored. Objective: In this study, we explored the effectiveness of generative AI (ChatGPT) in developing realistic virtual standardized patient dialogues to teach prenatal counseling skills. Methods: ChatGPT was prompted to generate a list of common areas of concern and questions that families expecting preterm delivery at 24 weeks gestation might ask during prenatal counseling. ChatGPT was then prompted to generate 2 role-plays with dialogues between a parent expecting a potential preterm delivery at 24 weeks and their counseling physician using each of the example questions. The prompt was repeated for 2 unique role-plays: one parent was characterized as anxious and the other as having low trust in the medical system. Role-play scripts were exported verbatim and independently reviewed by 2 neonatologists with experience in prenatal counseling, using a scale of 1-5 on realism, appropriateness, and utility for virtual standardized patient responses. Results: ChatGPT generated 7 areas of concern, with 35 example questions used to generate role-plays. The 35 role-play transcripts generated 176 unique parent responses (median 5, IQR 4-6, per role-play) with 268 unique sentences. Expert review identified 117 (65%) of the 176 responses as indicating an emotion, either directly or indirectly. Approximately half (98/176, 56%) of the responses had 2 or more sentences, and half (88/176, 50%) included at least 1 question. More than half (104/176, 58%) of the responses from role-played parent characters described a feeling, such as being scared, worried, or concerned. The role-plays of parents with low trust in the medical system generated many unique sentences (n=50). Most of the sentences in the responses were found to be reasonably realistic (214/268, 80%), appropriate for variable prenatal counseling conversation paths (233/268, 87%), and usable without more than a minimal modification in a virtual patient program (169/268, 63%). Conclusions: Generative AI programs, such as ChatGPT, may provide a viable source of training materials to expand virtual patient programs, with careful attention to the concerns and questions of patients and families. Given the potential for unrealistic or inappropriate statements and questions, an expert should review AI chat outputs before deploying them in an educational program. UR - https://mededu.jmir.org/2024/1/e50705 UR - http://dx.doi.org/10.2196/50705 UR - http://www.ncbi.nlm.nih.gov/pubmed/38300696 ID - info:doi/10.2196/50705 ER - TY - JOUR AU - Haddad, Firas AU - Saade, S. Joanna PY - 2024/1/18 TI - Performance of ChatGPT on Ophthalmology-Related Questions Across Various Examination Levels: Observational Study JO - JMIR Med Educ SP - e50842 VL - 10 KW - ChatGPT KW - artificial intelligence KW - AI KW - board examinations KW - ophthalmology KW - testing N2 - Background: ChatGPT and language learning models have gained attention recently for their ability to answer questions on various examinations across various disciplines. The question of whether ChatGPT could be used to aid in medical education is yet to be answered, particularly in the field of ophthalmology. 
Objective: The aim of this study is to assess the ability of ChatGPT-3.5 (GPT-3.5) and ChatGPT-4.0 (GPT-4.0) to answer ophthalmology-related questions across different levels of ophthalmology training. Methods: Questions from the United States Medical Licensing Examination (USMLE) steps 1 (n=44), 2 (n=60), and 3 (n=28) were extracted from AMBOSS, and 248 questions (64 easy, 122 medium, and 62 difficult questions) were extracted from the book, Ophthalmology Board Review Q&A, for the Ophthalmic Knowledge Assessment Program and the Board of Ophthalmology (OB) Written Qualifying Examination (WQE). Questions were prompted identically and inputted to GPT-3.5 and GPT-4.0. Results: GPT-3.5 achieved a total of 55% (n=210) of correct answers, while GPT-4.0 achieved a total of 70% (n=270) of correct answers. GPT-3.5 answered 75% (n=33) of questions correctly in USMLE step 1, 73.33% (n=44) in USMLE step 2, 60.71% (n=17) in USMLE step 3, and 46.77% (n=116) in the OB-WQE. GPT-4.0 answered 70.45% (n=31) of questions correctly in USMLE step 1, 90.32% (n=56) in USMLE step 2, 96.43% (n=27) in USMLE step 3, and 62.90% (n=156) in the OB-WQE. GPT-3.5 performed more poorly as examination levels advanced (P<.001), while GPT-4.0 performed better on USMLE steps 2 and 3 and worse on USMLE step 1 and the OB-WQE (P<.001). The coefficient of correlation (r) between ChatGPT answering correctly and human users answering correctly was 0.21 (P=.01) for GPT-3.5 as compared to −0.31 (P<.001) for GPT-4.0. GPT-3.5 performed similarly across difficulty levels, while GPT-4.0 performed more poorly with an increase in the difficulty level. Both GPT models performed significantly better on certain topics than on others. Conclusions: ChatGPT is far from being considered a part of mainstream medical education. Future models with higher accuracy are needed for the platform to be effective in medical education. UR - https://mededu.jmir.org/2024/1/e50842 UR - http://dx.doi.org/10.2196/50842 UR - http://www.ncbi.nlm.nih.gov/pubmed/38236632 ID - info:doi/10.2196/50842 ER - TY - JOUR AU - Kuo, I-Hsien Nicholas AU - Perez-Concha, Oscar AU - Hanly, Mark AU - Mnatzaganian, Emmanuel AU - Hao, Brandon AU - Di Sipio, Marcus AU - Yu, Guolin AU - Vanjara, Jash AU - Valerie, Cerelia Ivy AU - de Oliveira Costa, Juliana AU - Churches, Timothy AU - Lujic, Sanja AU - Hegarty, Jo AU - Jorm, Louisa AU - Barbieri, Sebastiano PY - 2024/1/16 TI - Enriching Data Science and Health Care Education: Application and Impact of Synthetic Data Sets Through the Health Gym Project JO - JMIR Med Educ SP - e51388 VL - 10 KW - medical education KW - generative model KW - generative adversarial networks KW - privacy KW - antiretroviral therapy (ART) KW - human immunodeficiency virus (HIV) KW - data science KW - educational purposes KW - accessibility KW - data privacy KW - data sets KW - sepsis KW - hypotension KW - HIV KW - science education KW - health care AI UR - https://mededu.jmir.org/2024/1/e51388 UR - http://dx.doi.org/10.2196/51388 UR - http://www.ncbi.nlm.nih.gov/pubmed/38227356 ID - info:doi/10.2196/51388 ER - TY - JOUR AU - Knoedler, Leonard AU - Alfertshofer, Michael AU - Knoedler, Samuel AU - Hoch, C. Cosima AU - Funk, F. Paul AU - Cotofana, Sebastian AU - Maheta, Bhagvat AU - Frank, Konstantin AU - Brébant, Vanessa AU - Prantl, Lukas AU - Lamby, Philipp PY - 2024/1/5 TI - Pure Wisdom or Potemkin Villages? 
A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis JO - JMIR Med Educ SP - e51148 VL - 10 KW - ChatGPT KW - United States Medical Licensing Examination KW - artificial intelligence KW - USMLE KW - USMLE Step 1 KW - OpenAI KW - medical education KW - clinical decision-making N2 - Background: The United States Medical Licensing Examination (USMLE) has been critical in medical education since 1992, testing various aspects of a medical student's knowledge and skills through different steps, based on their training level. Artificial intelligence (AI) tools, including chatbots like ChatGPT, are emerging technologies with potential applications in medicine. However, comprehensive studies analyzing ChatGPT's performance on USMLE Step 3 in large-scale scenarios and comparing different versions of ChatGPT are limited. Objective: This paper aimed to analyze ChatGPT's performance on USMLE Step 3 practice test questions to better elucidate the strengths and weaknesses of AI use in medical education and deduce evidence-based strategies to counteract AI cheating. Methods: A total of 2069 USMLE Step 3 practice questions were extracted from the AMBOSS study platform. After excluding 229 image-based questions, a total of 1840 text-based questions were further categorized and entered into ChatGPT 3.5, while a subset of 229 questions was entered into ChatGPT 4. Responses were recorded, and the accuracy of ChatGPT answers as well as its performance in different test question categories and for different difficulty levels were compared between both versions. Results: Overall, ChatGPT 4 demonstrated a statistically significant superior performance compared to ChatGPT 3.5, achieving an accuracy of 84.7% (194/229) and 56.9% (1047/1840), respectively. A noteworthy correlation was observed between the length of test questions and the performance of ChatGPT 3.5 (ρ=−0.069; P=.003), which was absent in ChatGPT 4 (P=.87). Additionally, the difficulty of test questions, as categorized by AMBOSS hammer ratings, showed a statistically significant correlation with performance for both ChatGPT versions, with ρ=−0.289 for ChatGPT 3.5 and ρ=−0.344 for ChatGPT 4. ChatGPT 4 surpassed ChatGPT 3.5 in all levels of test question difficulty, except for the 2 highest difficulty tiers (4 and 5 hammers), where statistical significance was not reached. Conclusions: In this study, ChatGPT 4 demonstrated remarkable proficiency in taking the USMLE Step 3, with an accuracy rate of 84.7% (194/229), outshining ChatGPT 3.5 with an accuracy rate of 56.9% (1047/1840). Although ChatGPT 4 performed exceptionally, it encountered difficulties in questions requiring the application of theoretical concepts, particularly in cardiology and neurology. These insights are pivotal for the development of examination strategies that are resilient to AI and underline the promising role of AI in the realm of medical education and diagnostics. 
UR - https://mededu.jmir.org/2024/1/e51148 UR - http://dx.doi.org/10.2196/51148 UR - http://www.ncbi.nlm.nih.gov/pubmed/38180782 ID - info:doi/10.2196/51148 ER - TY - JOUR AU - Watari, Takashi AU - Takagi, Soshi AU - Sakaguchi, Kota AU - Nishizaki, Yuji AU - Shimizu, Taro AU - Yamamoto, Yu AU - Tokuda, Yasuharu PY - 2023/12/6 TI - Performance Comparison of ChatGPT-4 and Japanese Medical Residents in the General Medicine In-Training Examination: Comparison Study JO - JMIR Med Educ SP - e52202 VL - 9 KW - ChatGPT KW - artificial intelligence KW - medical education KW - clinical training KW - non-English language KW - ChatGPT-4 KW - Japan KW - Japanese KW - Asia KW - Asian KW - exam KW - examination KW - exams KW - examinations KW - NLP KW - natural language processing KW - LLM KW - language model KW - language models KW - performance KW - response KW - responses KW - answer KW - answers KW - chatbot KW - chatbots KW - conversational agent KW - conversational agents KW - reasoning KW - clinical KW - GM-ITE KW - self-assessment KW - residency programs N2 - Background: The reliability of GPT-4, a state-of-the-art expansive language model specializing in clinical reasoning and medical knowledge, remains largely unverified across non-English languages. Objective: This study aims to compare fundamental clinical competencies between Japanese residents and GPT-4 by using the General Medicine In-Training Examination (GM-ITE). Methods: We used the GPT-4 model provided by OpenAI and the GM-ITE examination questions for the years 2020, 2021, and 2022 to conduct a comparative analysis. This analysis focused on evaluating the performance of individuals who were concluding their second year of residency in comparison to that of GPT-4. Given the current abilities of GPT-4, our study included only single-choice exam questions, excluding those involving audio, video, or image data. The assessment included 4 categories: general theory (professionalism and medical interviewing), symptomatology and clinical reasoning, physical examinations and clinical procedures, and specific diseases. Additionally, we categorized the questions into 7 specialty fields and 3 levels of difficulty, which were determined based on residents' correct response rates. Results: Upon examination of 137 GM-ITE questions in Japanese, GPT-4 scores were significantly higher than the mean scores of residents (residents: 55.8%, GPT-4: 70.1%; P<.001). In terms of specific disciplines, GPT-4 scored 23.5 points higher in the "specific diseases," 30.9 points higher in "obstetrics and gynecology," and 26.1 points higher in "internal medicine." In contrast, GPT-4 scores in "medical interviewing and professionalism," "general practice," and "psychiatry" were lower than those of the residents, although this discrepancy was not statistically significant. Upon analyzing scores based on question difficulty, GPT-4 scores were 17.2 points lower for easy problems (P=.007) but were 25.4 and 24.4 points higher for normal and difficult problems, respectively (P<.001). In year-on-year comparisons, GPT-4 scores were 21.7 and 21.5 points higher in the 2020 (P=.01) and 2022 (P=.003) examinations, respectively, but only 3.5 points higher in the 2021 examinations (no significant difference). Conclusions: In the Japanese language, GPT-4 also outperformed the average medical residents in the GM-ITE test, originally designed for them. 
Specifically, GPT-4 demonstrated a tendency to score higher on difficult questions with low resident correct response rates and those demanding a more comprehensive understanding of diseases. However, GPT-4 scored comparatively lower on questions that residents could readily answer, such as those testing attitudes toward patients and professionalism, as well as those necessitating an understanding of context and communication. These findings highlight the strengths and limitations of artificial intelligence applications in medical education and practice. UR - https://mededu.jmir.org/2023/1/e52202 UR - http://dx.doi.org/10.2196/52202 UR - http://www.ncbi.nlm.nih.gov/pubmed/38055323 ID - info:doi/10.2196/52202 ER - TY - JOUR AU - Shimizu, Ikuo AU - Kasai, Hajime AU - Shikino, Kiyoshi AU - Araki, Nobuyuki AU - Takahashi, Zaiya AU - Onodera, Misaki AU - Kimura, Yasuhiko AU - Tsukamoto, Tomoko AU - Yamauchi, Kazuyo AU - Asahina, Mayumi AU - Ito, Shoichi AU - Kawakami, Eiryo PY - 2023/11/30 TI - Developing Medical Education Curriculum Reform Strategies to Address the Impact of Generative AI: Qualitative Study JO - JMIR Med Educ SP - e53466 VL - 9 KW - artificial intelligence KW - curriculum reform KW - generative artificial intelligence KW - large language models KW - medical education KW - qualitative analysis KW - strengths-weaknesses-opportunities-threats (SWOT) framework N2 - Background: Generative artificial intelligence (GAI), represented by large language models, has the potential to transform health care and medical education. In particular, GAI's impact on higher education has the potential to change students' learning experience as well as faculty's teaching. However, concerns have been raised about ethical considerations and decreased reliability of the existing examinations. Furthermore, in medical education, curriculum reform is required to adapt to the revolutionary changes brought about by the integration of GAI into medical practice and research. Objective: This study analyzes the impact of GAI on medical education curricula and explores strategies for adaptation. Methods: The study was conducted in the context of faculty development at a medical school in Japan. A workshop involving faculty and students was organized, and participants were divided into groups to address two research questions: (1) How does GAI affect undergraduate medical education curricula? and (2) How should medical school curricula be reformed to address the impact of GAI? The strength, weakness, opportunity, and threat (SWOT) framework was used, and cross-SWOT matrix analysis was used to devise strategies. Further, 4 researchers conducted content analysis on the data generated during the workshop discussions. Results: The data were collected from 8 groups comprising 55 participants. Further, 5 themes about the impact of GAI on medical education curricula emerged: improvement of teaching and learning, improved access to information, inhibition of existing learning processes, problems in GAI, and changes in physicians' professionality. Positive impacts included enhanced teaching and learning efficiency and improved access to information, whereas negative impacts included concerns about reduced independent thinking and the adaptability of existing assessment methods. Further, GAI was perceived to change the nature of physicians' expertise. Three themes emerged from the cross-SWOT analysis for curriculum reform: (1) learning about GAI, (2) learning with GAI, and (3) learning aside from GAI. 
Participants recommended incorporating GAI literacy, ethical considerations, and compliance into the curriculum. Learning with GAI involved improving learning efficiency, supporting information gathering and dissemination, and facilitating patient involvement. Learning aside from GAI emphasized maintaining GAI-free learning processes, fostering higher cognitive domains of learning, and introducing more communication exercises. Conclusions: This study highlights the profound impact of GAI on medical education curricula and provides insights into curriculum reform strategies. Participants recognized the need for GAI literacy, ethical education, and adaptive learning. Further, GAI was recognized as a tool that can enhance efficiency and involve patients in education. The study also suggests that medical education should focus on competencies that GAI hardly replaces, such as clinical experience and communication. Notably, involving both faculty and students in curriculum reform discussions fosters a sense of ownership and ensures broader perspectives are encompassed. UR - https://mededu.jmir.org/2023/1/e53466 UR - http://dx.doi.org/10.2196/53466 UR - http://www.ncbi.nlm.nih.gov/pubmed/38032695 ID - info:doi/10.2196/53466 ER - TY - JOUR AU - Surapaneni, Mohan Krishna PY - 2023/11/7 TI - Assessing the Performance of ChatGPT in Medical Biochemistry Using Clinical Case Vignettes: Observational Study JO - JMIR Med Educ SP - e47191 VL - 9 KW - ChatGPT KW - artificial intelligence KW - medical education KW - medical Biochemistry KW - biochemistry KW - chatbot KW - case study KW - case scenario KW - medical exam KW - medical examination KW - computer generated N2 - Background: ChatGPT has gained global attention recently owing to its high performance in generating a wide range of information and retrieving any kind of data instantaneously. ChatGPT has also been tested for the United States Medical Licensing Examination (USMLE) and has successfully cleared it. Thus, its usability in medical education is now one of the key discussions worldwide. Objective: The objective of this study is to evaluate the performance of ChatGPT in medical biochemistry using clinical case vignettes. Methods: The performance of ChatGPT was evaluated in medical biochemistry using 10 clinical case vignettes. Clinical case vignettes were randomly selected and inputted in ChatGPT along with the response options. We tested the responses for each clinical case twice. The answers generated by ChatGPT were saved and checked using our reference material. Results: ChatGPT generated correct answers for 4 questions on the first attempt. For the other cases, there were differences in responses generated by ChatGPT in the first and second attempts. In the second attempt, ChatGPT provided correct answers for 6 questions and incorrect answers for 4 questions out of the 10 cases that were used. But, to our surprise, for case 3, different answers were obtained with multiple attempts. We believe this to have happened owing to the complexity of the case, which involved addressing various critical medical aspects related to amino acid metabolism in a balanced approach. Conclusions: According to the findings of our study, ChatGPT may not be considered an accurate information provider for application in medical education to improve learning and assessment. However, our study was limited by a small sample size (10 clinical case vignettes) and the use of the publicly available version of ChatGPT (version 3.5). 
Although artificial intelligence (AI) has the capability to transform medical education, we emphasize that the data produced by such AI systems must be validated for correctness and dependability before being implemented in practice. UR - https://mededu.jmir.org/2023/1/e47191 UR - http://dx.doi.org/10.2196/47191 UR - http://www.ncbi.nlm.nih.gov/pubmed/37934568 ID - info:doi/10.2196/47191 ER - TY - JOUR AU - Ito, Naoki AU - Kadomatsu, Sakina AU - Fujisawa, Mineto AU - Fukaguchi, Kiyomitsu AU - Ishizawa, Ryo AU - Kanda, Naoki AU - Kasugai, Daisuke AU - Nakajima, Mikio AU - Goto, Tadahiro AU - Tsugawa, Yusuke PY - 2023/11/2 TI - The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study JO - JMIR Med Educ SP - e47532 VL - 9 KW - GPT-4 KW - racial and ethnic bias KW - typical clinical vignettes KW - diagnosis KW - triage KW - artificial intelligence KW - AI KW - race KW - clinical vignettes KW - physician KW - efficiency KW - decision-making KW - bias KW - GPT N2 - Background: Whether GPT-4, the conversational artificial intelligence, can accurately diagnose and triage health conditions and whether it presents racial and ethnic biases in its decisions remain unclear. Objective: We aim to assess the accuracy of GPT-4 in the diagnosis and triage of health conditions and whether its performance varies by patient race and ethnicity. Methods: We compared the performance of GPT-4 and physicians, using 45 typical clinical vignettes, each with a correct diagnosis and triage level, in February and March 2023. For each of the 45 clinical vignettes, GPT-4 and 3 board-certified physicians provided the most likely primary diagnosis and triage level (emergency, nonemergency, or self-care). Independent reviewers evaluated the diagnoses as "correct" or "incorrect." Physician diagnosis was defined as the consensus of the 3 physicians. We evaluated whether the performance of GPT-4 varies by patient race and ethnicity, by adding the information on patient race and ethnicity to the clinical vignettes. Results: The accuracy of diagnosis was comparable between GPT-4 and physicians (the percentage of correct diagnosis was 97.8% (44/45; 95% CI 88.2%-99.9%) for GPT-4 and 91.1% (41/45; 95% CI 78.8%-97.5%) for physicians; P=.38). GPT-4 provided appropriate reasoning for 97.8% (44/45) of the vignettes. The appropriateness of triage was comparable between GPT-4 and physicians (GPT-4: 30/45, 66.7%; 95% CI 51.0%-80.0%; physicians: 30/45, 66.7%; 95% CI 51.0%-80.0%; P=.99). The performance of GPT-4 in diagnosing health conditions did not vary among different races and ethnicities (Black, White, Asian, and Hispanic), with an accuracy of 100% (95% CI 78.2%-100%). P values, compared to the GPT-4 output without incorporating race and ethnicity information, were all .99. The accuracy of triage was not significantly different even if patients' race and ethnicity information was added. The accuracy of triage was 62.2% (95% CI 46.5%-76.2%; P=.50) for Black patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for White patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for Asian patients, and 62.2% (95% CI 46.5%-76.2%; P=.69) for Hispanic patients. P values were calculated by comparing the outputs with and without conditioning on race and ethnicity. Conclusions: GPT-4's ability to diagnose and triage typical clinical vignettes was comparable to that of board-certified physicians. The performance of GPT-4 did not vary by patient race and ethnicity. 
These findings should be informative for health systems looking to introduce conversational artificial intelligence to improve the efficiency of patient diagnosis and triage. UR - https://mededu.jmir.org/2023/1/e47532 UR - http://dx.doi.org/10.2196/47532 UR - http://www.ncbi.nlm.nih.gov/pubmed/37917120 ID - info:doi/10.2196/47532 ER - TY - JOUR AU - Baglivo, Francesco AU - De Angelis, Luigi AU - Casigliani, Virginia AU - Arzilli, Guglielmo AU - Privitera, Pierpaolo Gaetano AU - Rizzo, Caterina PY - 2023/11/1 TI - Exploring the Possible Use of AI Chatbots in Public Health Education: Feasibility Study JO - JMIR Med Educ SP - e51421 VL - 9 KW - artificial intelligence KW - chatbots KW - medical education KW - vaccination KW - public health KW - medical students KW - large language model KW - generative AI KW - ChatGPT KW - Google Bard KW - AI chatbot KW - health education KW - health care KW - medical training KW - educational support tool KW - chatbot model N2 - Background: Artificial intelligence (AI) is a rapidly developing field with the potential to transform various aspects of health care and public health, including medical training. During the "Hygiene and Public Health" course for fifth-year medical students, a practical training session was conducted on vaccination using AI chatbots as an educational supportive tool. Before receiving specific training on vaccination, the students were given a web-based test extracted from the Italian National Medical Residency Test. After completing the test, a critical correction of each question was performed assisted by AI chatbots. Objective: The main aim of this study was to identify whether AI chatbots can be considered educational support tools for training in public health. The secondary objective was to assess the performance of different AI chatbots on complex multiple-choice medical questions in the Italian language. Methods: A test composed of 15 multiple-choice questions on vaccination was extracted from the Italian National Medical Residency Test using targeted keywords and administered to medical students via Google Forms and to different AI chatbot models (Bing Chat, ChatGPT, Chatsonic, Google Bard, and YouChat). The correction of the test was conducted in the classroom, focusing on the critical evaluation of the explanations provided by the chatbot. A Mann-Whitney U test was conducted to compare the performances of medical students and AI chatbots. Student feedback was collected anonymously at the end of the training experience. Results: In total, 36 medical students and 5 AI chatbot models completed the test. The students achieved an average score of 8.22 (SD 2.65) out of 15, while the AI chatbots scored an average of 12.22 (SD 2.77). The results indicated a statistically significant difference in performance between the 2 groups (U=49.5, P<.001), with a large effect size (r=0.69). When divided by question type (direct, scenario-based, and negative), significant differences were observed in direct (P<.001) and scenario-based (P<.001) questions, but not in negative questions (P=.48). The students reported a high level of satisfaction (7.9/10) with the educational experience, expressing a strong desire to repeat the experience (7.6/10). Conclusions: This study demonstrated the efficacy of AI chatbots in answering complex medical questions related to vaccination and providing valuable educational support. Their performance significantly surpassed that of medical students in direct and scenario-based questions. 
The responsible and critical use of AI chatbots can enhance medical education, making it an essential aspect to integrate into the educational system. UR - https://mededu.jmir.org/2023/1/e51421 UR - http://dx.doi.org/10.2196/51421 UR - http://www.ncbi.nlm.nih.gov/pubmed/37910155 ID - info:doi/10.2196/51421 ER - TY - JOUR AU - Preiksaitis, Carl AU - Rose, Christian PY - 2023/10/20 TI - Opportunities, Challenges, and Future Directions of Generative Artificial Intelligence in Medical Education: Scoping Review JO - JMIR Med Educ SP - e48785 VL - 9 KW - medical education KW - artificial intelligence KW - ChatGPT KW - Bard KW - AI KW - educator KW - scoping KW - review KW - learner KW - generative N2 - Background: Generative artificial intelligence (AI) technologies are increasingly being utilized across various fields, with considerable interest and concern regarding their potential application in medical education. These technologies, such as ChatGPT and Bard, can generate new content and have a wide range of possible applications. Objective: This study aimed to synthesize the potential opportunities and limitations of generative AI in medical education. It sought to identify prevalent themes within recent literature regarding potential applications and challenges of generative AI in medical education and use these to guide future areas for exploration. Methods: We conducted a scoping review, following the framework by Arksey and O'Malley, of English language articles published from 2022 onward that discussed generative AI in the context of medical education. A literature search was performed using PubMed, Web of Science, and Google Scholar databases. We screened articles for inclusion, extracted data from relevant studies, and completed a quantitative and qualitative synthesis of the data. Results: Thematic analysis revealed diverse potential applications for generative AI in medical education, including self-directed learning, simulation scenarios, and writing assistance. However, the literature also highlighted significant challenges, such as issues with academic integrity, data accuracy, and potential detriments to learning. Based on these themes and the current state of the literature, we propose the following 3 key areas for investigation: developing learners' skills to evaluate AI critically, rethinking assessment methodology, and studying human-AI interactions. Conclusions: The integration of generative AI in medical education presents exciting opportunities, alongside considerable challenges. There is a need to develop new skills and competencies related to AI as well as thoughtful, nuanced approaches to examine the growing use of generative AI in medical education. UR - https://mededu.jmir.org/2023/1/e48785/ UR - http://dx.doi.org/10.2196/48785 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/48785 ER - TY - JOUR AU - Chen, Yanhua AU - Wu, Ziye AU - Wang, Peicheng AU - Xie, Linbo AU - Yan, Mengsha AU - Jiang, Maoqing AU - Yang, Zhenghan AU - Zheng, Jianjun AU - Zhang, Jingfeng AU - Zhu, Jiming PY - 2023/10/19 TI - Radiology Residents' 
Perceptions of Artificial Intelligence: Nationwide Cross-Sectional Survey Study JO - J Med Internet Res SP - e48249 VL - 25 KW - artificial intelligence KW - technology acceptance KW - radiology KW - residency KW - perceptions KW - health care services KW - resident KW - residents KW - perception KW - adoption KW - readiness KW - acceptance KW - cross sectional KW - survey N2 - Background: Artificial intelligence (AI) is transforming various fields, with health care, especially diagnostic specialties such as radiology, being a key but controversial battleground. However, there is limited research systematically examining the response of "human intelligence" to AI. Objective: This study aims to comprehend radiologists' perceptions regarding AI, including their views on its potential to replace them, its usefulness, and their willingness to accept it. We examine the influence of various factors, encompassing demographic characteristics, working status, psychosocial aspects, personal experience, and contextual factors. Methods: Between December 1, 2020, and April 30, 2021, a cross-sectional survey was completed by 3666 radiology residents in China. We used multivariable logistic regression models to examine factors and associations, reporting odds ratios (ORs) and 95% CIs. Results: In summary, radiology residents generally hold a positive attitude toward AI, with 29.90% (1096/3666) agreeing that AI may reduce the demand for radiologists, 72.80% (2669/3666) believing AI improves disease diagnosis, and 78.18% (2866/3666) feeling that radiologists should embrace AI. Several associated factors, including age, gender, education, region, eye strain, working hours, time spent on medical images, resilience, burnout, AI experience, and perceptions of residency support and stress, significantly influence AI attitudes. For instance, burnout symptoms were associated with greater concerns about AI replacement (OR 1.89; P<.001), less favorable views on AI usefulness (OR 0.77; P=.005), and reduced willingness to use AI (OR 0.71; P<.001). Moreover, after adjusting for all other factors, perceived AI replacement (OR 0.81; P<.001) and AI usefulness (OR 5.97; P<.001) were shown to significantly impact the intention to use AI. Conclusions: This study profiles radiology residents who are accepting of AI. Our comprehensive findings provide insights for a multidimensional approach to help physicians adapt to AI. Targeted policies, such as digital health care initiatives and medical education, can be developed accordingly. UR - https://www.jmir.org/2023/1/e48249 UR - http://dx.doi.org/10.2196/48249 UR - http://www.ncbi.nlm.nih.gov/pubmed/37856181 ID - info:doi/10.2196/48249 ER - TY - JOUR AU - Hu, Je-Ming AU - Liu, Feng-Cheng AU - Chu, Chi-Ming AU - Chang, Yu-Tien PY - 2023/10/18 TI - Health Care Trainees' and Professionals' Perceptions of ChatGPT in Improving Medical Knowledge Training: Rapid Survey Study JO - J Med Internet Res SP - e49385 VL - 25 KW - ChatGPT KW - large language model KW - medicine KW - perception evaluation KW - internet survey KW - structural equation modeling KW - SEM N2 - Background: ChatGPT is a powerful pretrained large language model. It has both demonstrated potential and raised concerns related to knowledge translation and knowledge transfer. To apply and improve knowledge transfer in the real world, it is essential to assess the perceptions and acceptance of the users of ChatGPT-assisted training. 
Objective: We aimed to investigate the perceptions of health care trainees and professionals on ChatGPT-assisted training, using biomedical informatics as an example. Methods: We used purposeful sampling to include all health care undergraduate trainees and graduate professionals (n=195) from January to May 2023 in the School of Public Health at the National Defense Medical Center in Taiwan. Subjects were asked to watch a 2-minute video introducing 5 scenarios about ChatGPT-assisted training in biomedical informatics and then answer a self-designed online (web- and mobile-based) questionnaire according to the Kirkpatrick model. The survey responses were used to develop 4 constructs: "perceived knowledge acquisition," "perceived training motivation," "perceived training satisfaction," and "perceived training effectiveness." The study used structural equation modeling (SEM) to evaluate and test the structural model and hypotheses. Results: The online questionnaire response rate was 152 of 195 (78%); 88 of 152 participants (58%) were undergraduate trainees and 90 of 152 participants (59%) were women. The ages ranged from 18 to 53 years (mean 23.3, SD 6.0 years). There was no statistical difference in perceptions of training evaluation between men and women. Most participants were enthusiastic about the ChatGPT-assisted training, while the graduate professionals were more enthusiastic than undergraduate trainees. Nevertheless, some concerns were raised about potential cheating on training assessment. The average scores for knowledge acquisition, training motivation, training satisfaction, and training effectiveness were 3.84 (SD 0.80), 3.76 (SD 0.93), 3.75 (SD 0.87), and 3.72 (SD 0.91), respectively (Likert scale 1-5: strongly disagree to strongly agree). Knowledge acquisition had the highest score and training effectiveness the lowest. In the SEM results, training effectiveness was influenced predominantly by knowledge acquisition and partially met the hypotheses in the research framework. Knowledge acquisition had a direct effect on training effectiveness, training satisfaction, and training motivation, with β coefficients of .80, .87, and .97, respectively (all P<.001). Conclusions: Most health care trainees and professionals perceived ChatGPT-assisted training as an aid in knowledge transfer. However, to improve training effectiveness, it should be combined with empirical experts for proper guidance and dual interaction. In a future study, we recommend using a larger sample size for evaluation of internet-connected large language models in medical knowledge transfer. UR - https://www.jmir.org/2023/1/e49385 UR - http://dx.doi.org/10.2196/49385 UR - http://www.ncbi.nlm.nih.gov/pubmed/37851495 ID - info:doi/10.2196/49385 ER - TY - JOUR AU - Khlaif, N. Zuheir AU - Mousa, Allam AU - Hattab, Kamal Muayad AU - Itmazi, Jamil AU - Hassan, A. Amjad AU - Sanmugam, Mageswaran AU - Ayyoub, Abedalkarim PY - 2023/9/14 TI - The Potential and Concerns of Using AI in Scientific Research: ChatGPT Performance Evaluation JO - JMIR Med Educ SP - e47049 VL - 9 KW - artificial intelligence KW - AI KW - ChatGPT KW - scientific research KW - research ethics N2 - Background: Artificial intelligence (AI) has many applications in various aspects of our daily life, including health, criminal, education, civil, business, and liability law. One aspect of AI that has gained significant attention is natural language processing (NLP), which refers to the ability of computers to understand and generate human language. 
Objective: This study aims to examine the potential for, and concerns of, using AI in scientific research. For this purpose, research articles were generated with ChatGPT, the quality of the resulting reports was analyzed, and the application's impact on the research framework, data analysis, and the literature review was assessed. The study also explored concerns around ownership and the integrity of research when using AI-generated text. Methods: A total of 4 articles were generated using ChatGPT, and thereafter evaluated by 23 reviewers. The researchers developed an evaluation form to assess the quality of the articles generated. Additionally, 50 abstracts were generated using ChatGPT and their quality was evaluated. The data were subjected to ANOVA and thematic analysis to analyze the qualitative data provided by the reviewers. Results: When using detailed prompts and providing the context of the study, ChatGPT would generate high-quality research that could be published in high-impact journals. However, ChatGPT had a minor impact on developing the research framework and data analysis. The primary area needing improvement was the development of the literature review. Moreover, reviewers expressed concerns around ownership and the integrity of the research when using AI-generated text. Nonetheless, ChatGPT has a strong potential to increase human productivity in research and can be used in academic writing. Conclusions: AI-generated text has the potential to improve the quality of high-impact research articles. The findings of this study suggest that decision makers and researchers should focus more on the methodology part of the research, which includes research design, developing research tools, and analyzing data in depth, to draw strong theoretical and practical implications, thereby establishing a revolution in scientific research in the era of AI. The practical implications of this study can be used in different fields such as medical education to deliver materials to develop the basic competencies for both medicine students and faculty members. UR - https://mededu.jmir.org/2023/1/e47049 UR - http://dx.doi.org/10.2196/47049 UR - http://www.ncbi.nlm.nih.gov/pubmed/37707884 ID - info:doi/10.2196/47049 ER - TY - JOUR AU - Sallam, Malik AU - Salim, A. Nesreen AU - Barakat, Muna AU - Al-Mahzoum, Kholoud AU - Al-Tammemi, B. Ala'a AU - Malaeb, Diana AU - Hallit, Rabih AU - Hallit, Souheil PY - 2023/9/5 TI - Assessing Health Students' Attitudes and Usage of ChatGPT in Jordan: Validation Study JO - JMIR Med Educ SP - e48254 VL - 9 KW - artificial intelligence KW - machine learning KW - education KW - technology KW - healthcare KW - survey KW - opinion KW - knowledge KW - practices KW - KAP N2 - Background: ChatGPT is a conversational large language model that has the potential to revolutionize knowledge acquisition. However, the impact of this technology on the quality of education is still unknown considering the risks and concerns surrounding ChatGPT use. Therefore, it is necessary to assess the usability and acceptability of this promising tool. As an innovative technology, the intention to use ChatGPT can be studied in the context of the technology acceptance model (TAM). Objective: This study aimed to develop and validate a TAM-based survey instrument called TAME-ChatGPT (Technology Acceptance Model Edited to Assess ChatGPT Adoption) that could be employed to examine the successful integration and use of ChatGPT in health care education. 
Methods: The survey tool was created based on the TAM framework. It comprised 13 items for participants who heard of ChatGPT but did not use it and 23 items for participants who used ChatGPT. Using a convenience sampling approach, the survey link was circulated electronically among university students between February and March 2023. Exploratory factor analysis (EFA) was used to assess the construct validity of the survey instrument. Results: The final sample comprised 458 respondents, the majority of whom were undergraduate students (n=442, 96.5%). Only 109 (23.8%) respondents had heard of ChatGPT prior to participation and only 55 (11.3%) self-reported ChatGPT use before the study. EFA on the attitude and usage scales showed significant Bartlett tests of sphericity scores (P<.001) and adequate Kaiser-Meyer-Olkin measures (0.823 for the attitude scale and 0.702 for the usage scale), confirming the factorability of the correlation matrices. The EFA showed that 3 constructs explained a cumulative total of 69.3% variance in the attitude scale, and these subscales represented perceived risks, attitude to technology/social influence, and anxiety. For the ChatGPT usage scale, EFA showed that 4 constructs explained a cumulative total of 72% variance in the data and comprised the perceived usefulness, perceived risks, perceived ease of use, and behavior/cognitive factors. All the ChatGPT attitude and usage subscales showed good reliability with Cronbach α values >.78 for all the deduced subscales. Conclusions: The TAME-ChatGPT demonstrated good reliability, validity, and usefulness in assessing health care students' attitudes toward ChatGPT. The findings highlighted the importance of considering risk perceptions, usefulness, ease of use, attitudes toward technology, and behavioral factors when adopting ChatGPT as a tool in health care education. This information can aid the stakeholders in creating strategies to support the optimal and ethical use of ChatGPT and to identify the potential challenges hindering its successful implementation. Future research is recommended to guide the effective adoption of ChatGPT in health care education. UR - https://mededu.jmir.org/2023/1/e48254 UR - http://dx.doi.org/10.2196/48254 UR - http://www.ncbi.nlm.nih.gov/pubmed/37578934 ID - info:doi/10.2196/48254 ER - TY - JOUR AU - Roos, Jonas AU - Kasapovic, Adnan AU - Jansen, Tom AU - Kaczmarczyk, Robert PY - 2023/9/4 TI - Artificial Intelligence in Medical Education: Comparative Analysis of ChatGPT, Bing, and Medical Students in Germany JO - JMIR Med Educ SP - e46482 VL - 9 KW - medical education KW - state examinations KW - exams KW - large language models KW - artificial intelligence KW - ChatGPT N2 - Background: Large language models (LLMs) have demonstrated significant potential in diverse domains, including medicine. Nonetheless, there is a scarcity of studies examining their performance in medical examinations, especially those conducted in languages other than English, and in direct comparison with medical students. Analyzing the performance of LLMs in state medical examinations can provide insights into their capabilities and limitations and evaluate their potential role in medical education and examination preparation. Objective: This study aimed to assess and compare the performance of 3 LLMs, GPT-4, Bing, and GPT-3.5-Turbo, in the German Medical State Examinations of 2022 and to evaluate their performance relative to that of medical students. 
Methods: The LLMs were assessed on a total of 630 questions from the spring and fall German Medical State Examinations of 2022. The performance was evaluated with and without media-related questions. Statistical analyses included 1-way ANOVA and independent samples t tests for pairwise comparisons. The relative strength of the LLMs in comparison with that of the students was also evaluated. Results: GPT-4 achieved the highest overall performance, correctly answering 88.1% of questions, closely followed by Bing (86.0%) and GPT-3.5-Turbo (65.7%). The students had an average correct answer rate of 74.6%. Both GPT-4 and Bing significantly outperformed the students in both examinations. When media questions were excluded, Bing achieved the highest performance of 90.7%, closely followed by GPT-4 (90.4%), while GPT-3.5-Turbo lagged (68.2%). There was a significant decline in the performance of GPT-4 and Bing in the fall 2022 examination, which was attributed to a higher proportion of media-related questions and a potential increase in question difficulty. Conclusions: LLMs, particularly GPT-4 and Bing, demonstrate potential as valuable tools in medical education and for pretesting examination questions. Their high performance, even relative to that of medical students, indicates promising avenues for further development and integration into the educational and clinical landscape. UR - https://mededu.jmir.org/2023/1/e46482 UR - http://dx.doi.org/10.2196/46482 UR - http://www.ncbi.nlm.nih.gov/pubmed/37665620 ID - info:doi/10.2196/46482 ER - TY - JOUR AU - Leung, I. Tiffany AU - Sagar, Ankita AU - Shroff, Swati AU - Henry, L. Tracey PY - 2023/8/23 TI - Can AI Mitigate Bias in Writing Letters of Recommendation? JO - JMIR Med Educ SP - e51494 VL - 9 KW - sponsorship KW - implicit bias KW - gender bias KW - bias KW - letters of recommendation KW - artificial intelligence KW - large language models KW - medical education KW - career advancement KW - tenure and promotion KW - promotion KW - leadership UR - https://mededu.jmir.org/2023/1/e51494 UR - http://dx.doi.org/10.2196/51494 UR - http://www.ncbi.nlm.nih.gov/pubmed/37610808 ID - info:doi/10.2196/51494 ER - TY - JOUR AU - Safranek, W. Conrad AU - Sidamon-Eristoff, Elizabeth Anne AU - Gilson, Aidan AU - Chartash, David PY - 2023/8/14 TI - The Role of Large Language Models in Medical Education: Applications and Implications JO - JMIR Med Educ SP - e50945 VL - 9 KW - large language models KW - ChatGPT KW - medical education KW - LLM KW - artificial intelligence in health care KW - AI KW - autoethnography UR - https://mededu.jmir.org/2023/1/e50945 UR - http://dx.doi.org/10.2196/50945 UR - http://www.ncbi.nlm.nih.gov/pubmed/37578830 ID - info:doi/10.2196/50945 ER - TY - JOUR AU - Gilson, Aidan AU - Safranek, W. Conrad AU - Huang, Thomas AU - Socrates, Vimig AU - Chi, Ling AU - Taylor, Andrew Richard AU - Chartash, David PY - 2023/7/13 TI - Authors' Reply to: Variability in Large Language Models' 
Responses to Medical Licensing and Certification Examinations JO - JMIR Med Educ SP - e50336 VL - 9 KW - natural language processing KW - NLP KW - MedQA KW - generative pre-trained transformer KW - GPT KW - medical education KW - chatbot KW - artificial intelligence KW - AI KW - education technology KW - ChatGPT KW - conversational agent KW - machine learning KW - large language models KW - knowledge assessment UR - https://mededu.jmir.org/2023/1/e50336 UR - http://dx.doi.org/10.2196/50336 UR - http://www.ncbi.nlm.nih.gov/pubmed/37440299 ID - info:doi/10.2196/50336 ER - TY - JOUR AU - Epstein, H. Richard AU - Dexter, Franklin PY - 2023/7/13 TI - Variability in Large Language Models' Responses to Medical Licensing and Certification Examinations. Comment on "How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment" JO - JMIR Med Educ SP - e48305 VL - 9 KW - natural language processing KW - NLP KW - MedQA KW - generative pre-trained transformer KW - GPT KW - medical education KW - chatbot KW - artificial intelligence KW - AI KW - education technology KW - ChatGPT KW - Google Bard KW - conversational agent KW - machine learning KW - large language models KW - knowledge assessment UR - https://mededu.jmir.org/2023/1/e48305 UR - http://dx.doi.org/10.2196/48305 UR - http://www.ncbi.nlm.nih.gov/pubmed/37440293 ID - info:doi/10.2196/48305 ER - TY - JOUR AU - Abd-alrazaq, Alaa AU - AlSaad, Rawan AU - Alhuwail, Dari AU - Ahmed, Arfan AU - Healy, Mark Padraig AU - Latifi, Syed AU - Aziz, Sarah AU - Damseh, Rafat AU - Alabed Alrazak, Sadam AU - Sheikh, Javaid PY - 2023/6/1 TI - Large Language Models in Medical Education: Opportunities, Challenges, and Future Directions JO - JMIR Med Educ SP - e48291 VL - 9 KW - large language models KW - artificial intelligence KW - medical education KW - ChatGPT KW - GPT-4 KW - generative AI KW - students KW - educators UR - https://mededu.jmir.org/2023/1/e48291 UR - http://dx.doi.org/10.2196/48291 UR - http://www.ncbi.nlm.nih.gov/pubmed/37261894 ID - info:doi/10.2196/48291 ER - TY - JOUR AU - Thirunavukarasu, James Arun AU - Hassan, Refaat AU - Mahmood, Shathar AU - Sanghera, Rohan AU - Barzangi, Kara AU - El Mukashfi, Mohanned AU - Shah, Sachin PY - 2023/4/21 TI - Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care JO - JMIR Med Educ SP - e46599 VL - 9 KW - ChatGPT KW - large language model KW - natural language processing KW - decision support techniques KW - artificial intelligence KW - AI KW - deep learning KW - primary care KW - general practice KW - family medicine KW - chatbot N2 - Background: Large language models exhibiting human-level performance in specialized tasks are emerging; examples include Generative Pretrained Transformer 3.5, which underlies the processing of ChatGPT. Rigorous trials are required to understand the capabilities of emerging technology, so that innovation can be directed to benefit patients and practitioners. Objective: Here, we evaluated the strengths and weaknesses of ChatGPT in primary care using the Membership of the Royal College of General Practitioners Applied Knowledge Test (AKT) as a medium. Methods: AKT questions were sourced from a web-based question bank and 2 AKT practice papers. 
In total, 674 unique AKT questions were inputted to ChatGPT, with the model's answers recorded and compared to correct answers provided by the Royal College of General Practitioners. Each question was inputted twice in separate ChatGPT sessions, with answers on repeated trials compared to gauge consistency. Subject difficulty was gauged by referring to examiners' reports from 2018 to 2022. Novel explanations from ChatGPT (defined as information provided that was not inputted within the question or multiple answer choices) were recorded. Performance was analyzed with respect to subject, difficulty, question source, and novel model outputs to explore ChatGPT's strengths and weaknesses. Results: Average overall performance of ChatGPT was 60.17%, which is below the mean passing mark in the last 2 years (70.42%). Accuracy differed between sources (P=.04 and .06). ChatGPT's performance varied with subject category (P=.02 and .02), but variation did not correlate with difficulty (Spearman ρ=−0.241 and −0.238; P=.19 and .20). The proclivity of ChatGPT to provide novel explanations did not affect accuracy (P>.99 and .23). Conclusions: Large language models are approaching human expert-level performance, although further development is required to match the performance of qualified primary care physicians in the AKT. Validated high-performance models may serve as assistants or autonomous clinical tools to ameliorate the general practice workforce crisis. UR - https://mededu.jmir.org/2023/1/e46599 UR - http://dx.doi.org/10.2196/46599 UR - http://www.ncbi.nlm.nih.gov/pubmed/37083633 ID - info:doi/10.2196/46599 ER - TY - JOUR AU - Sabry Abdel-Messih, Mary AU - Kamel Boulos, N. Maged PY - 2023/3/8 TI - ChatGPT in Clinical Toxicology JO - JMIR Med Educ SP - e46876 VL - 9 KW - ChatGPT KW - clinical toxicology KW - organophosphates KW - artificial intelligence KW - AI KW - medical education UR - https://mededu.jmir.org/2023/1/e46876 UR - http://dx.doi.org/10.2196/46876 UR - http://www.ncbi.nlm.nih.gov/pubmed/36867743 ID - info:doi/10.2196/46876 ER - TY - JOUR AU - Eysenbach, Gunther PY - 2023/3/6 TI - The Role of ChatGPT, Generative Language Models, and Artificial Intelligence in Medical Education: A Conversation With ChatGPT and a Call for Papers JO - JMIR Med Educ SP - e46885 VL - 9 KW - artificial intelligence KW - AI KW - ChatGPT KW - generative language model KW - medical education KW - interview KW - future of education UR - https://mededu.jmir.org/2023/1/e46885 UR - http://dx.doi.org/10.2196/46885 UR - http://www.ncbi.nlm.nih.gov/pubmed/36863937 ID - info:doi/10.2196/46885 ER - TY - JOUR AU - Gilson, Aidan AU - Safranek, W. Conrad AU - Huang, Thomas AU - Socrates, Vimig AU - Chi, Ling AU - Taylor, Andrew Richard AU - Chartash, David PY - 2023/2/8 TI - How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment JO - JMIR Med Educ SP - e45312 VL - 9 KW - natural language processing KW - NLP KW - MedQA KW - generative pre-trained transformer KW - GPT KW - medical education KW - chatbot KW - artificial intelligence KW - education technology KW - ChatGPT KW - conversational agent KW - machine learning KW - USMLE N2 - Background: Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input. 
Objective: This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 exams, as well as to analyze responses for user interpretability. Methods: We used 2 sets of multiple-choice questions to evaluate ChatGPT's performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and the performance on an exam relative to the user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT's performance was compared to 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question. Results: Of the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. We found that logical justification for ChatGPT's answer selection was present in 100% of outputs of the NBME data sets. Internal information to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively. Conclusions: ChatGPT marks a significant improvement in natural language processing models on the tasks of medical question answering. By performing at a greater than 60% threshold on the NBME-Free-Step-1 data set, we show that the model achieves the equivalent of a passing score for a third-year medical student. Additionally, we highlight ChatGPT's capacity to provide logic and informational context across the majority of answers. These facts taken together make a compelling case for the potential applications of ChatGPT as an interactive medical education tool to support learning. UR - https://mededu.jmir.org/2023/1/e45312 UR - http://dx.doi.org/10.2196/45312 UR - http://www.ncbi.nlm.nih.gov/pubmed/36753318 ID - info:doi/10.2196/45312 ER - TY - JOUR AU - Grunhut, Joel AU - Marques, Oge AU - Wyatt, M. Adam T. 
PY - 2022/6/7 TI - Needs, Challenges, and Applications of Artificial Intelligence in Medical Education Curriculum JO - JMIR Med Educ SP - e35587 VL - 8 IS - 2 KW - artificial intelligence KW - AI KW - medical education KW - medical student UR - https://mededu.jmir.org/2022/2/e35587 UR - http://dx.doi.org/10.2196/35587 UR - http://www.ncbi.nlm.nih.gov/pubmed/35671077 ID - info:doi/10.2196/35587 ER - TY - JOUR AU - Gray, Kathleen AU - Slavotinek, John AU - Dimaguila, Luis Gerardo AU - Choo, Dawn PY - 2022/4/4 TI - Artificial Intelligence Education for the Health Workforce: Expert Survey of Approaches and Needs JO - JMIR Med Educ SP - e35223 VL - 8 IS - 2 KW - artificial intelligence KW - curriculum KW - ethics KW - human-computer interaction KW - interprofessional education KW - machine learning KW - natural language processing KW - professional development KW - robotics N2 - Background: The preparation of the current and future health workforce for the possibility of using artificial intelligence (AI) in health care is a growing concern as AI applications emerge in various care settings and specializations. At present, there is no obvious consensus among educators about what needs to be learned or how this learning may be supported or assessed. Objective: Our study aims to explore health care education experts' ideas and plans for preparing the health workforce to work with AI and identify critical gaps in curriculum and educational resources across a national health care system. Methods: A survey canvassed expert views on AI education for the health workforce in terms of educational strategies, subject matter priorities, meaningful learning activities, desired attitudes, and skills. A total of 39 senior people from different health workforce subgroups across Australia provided ratings and free-text responses in late 2020. Results: The responses highlighted the importance of education on ethical implications, suitability of large data sets for use in AI clinical applications, principles of machine learning, and specific diagnosis and treatment applications of AI as well as alterations to cognitive load during clinical work and the interaction between humans and machines in clinical settings. Respondents also outlined barriers to implementation, such as lack of governance structures and processes, resource constraints, and cultural adjustment. Conclusions: Further work around the world of the kind reported in this survey can assist educators and education authorities who are responsible for preparing the health workforce to minimize the risks and realize the benefits of implementing AI in health care. UR - https://mededu.jmir.org/2022/2/e35223 UR - http://dx.doi.org/10.2196/35223 UR - http://www.ncbi.nlm.nih.gov/pubmed/35249885 ID - info:doi/10.2196/35223 ER - TY - JOUR AU - Teng, Minnie AU - Singla, Rohit AU - Yau, Olivia AU - Lamoureux, Daniel AU - Gupta, Aurinjoy AU - Hu, Zoe AU - Hu, Ricky AU - Aissiou, Amira AU - Eaton, Shane AU - Hamm, Camille AU - Hu, Sophie AU - Kelly, Dayton AU - MacMillan, M. Kathleen AU - Malik, Shamir AU - Mazzoli, Vienna AU - Teng, Yu-Wen AU - Laricheva, Maria AU - Jarus, Tal AU - Field, S. Thalia PY - 2022/1/31 TI - Health Care Students' 
Perspectives on Artificial Intelligence: Countrywide Survey in Canada JO - JMIR Med Educ SP - e33390 VL - 8 IS - 1 KW - medical education KW - artificial intelligence KW - allied health education KW - medical students KW - health care students KW - medical curriculum KW - education N2 - Background: Artificial intelligence (AI) is no longer a futuristic concept; it is increasingly being integrated into health care. As studies on attitudes toward AI have primarily focused on physicians, there is a need to assess the perspectives of students across health care disciplines to inform future curriculum development. Objective: This study aims to explore and identify gaps in the knowledge that Canadian health care students have regarding AI, capture how health care students in different fields differ in their knowledge and perspectives on AI, and present student-identified ways that AI literacy may be incorporated into the health care curriculum. Methods: The survey was developed from a narrative literature review of topics in attitudinal surveys on AI. The final survey comprised 15 items, including multiple-choice questions, pick-group-rank questions, 11-point Likert scale items, slider scale questions, and narrative questions. We used snowball and convenience sampling methods by distributing an email with a description and a link to the web-based survey to representatives from 18 Canadian schools. Results: A total of 2167 students across 10 different health professions from 18 universities across Canada responded to the survey. Overall, 78.77% (1707/2167) predicted that AI technology would affect their careers within the coming decade and 74.5% (1595/2167) reported a positive outlook toward the emerging role of AI in their respective fields. Attitudes toward AI varied by discipline. Students, even those opposed to AI, identified the need to incorporate a basic understanding of AI into their curricula. Conclusions: We performed a nationwide survey of health care students across 10 different health professions in Canada. The findings would inform student-identified topics within AI and their preferred delivery formats, which would advance education across different health care professions. UR - https://mededu.jmir.org/2022/1/e33390 UR - http://dx.doi.org/10.2196/33390 UR - http://www.ncbi.nlm.nih.gov/pubmed/35099397 ID - info:doi/10.2196/33390 ER - TY - JOUR AU - Charow, Rebecca AU - Jeyakumar, Tharshini AU - Younus, Sarah AU - Dolatabadi, Elham AU - Salhia, Mohammad AU - Al-Mouaswas, Dalia AU - Anderson, Melanie AU - Balakumar, Sarmini AU - Clare, Megan AU - Dhalla, Azra AU - Gillan, Caitlin AU - Haghzare, Shabnam AU - Jackson, Ethan AU - Lalani, Nadim AU - Mattson, Jane AU - Peteanu, Wanda AU - Tripp, Tim AU - Waldorf, Jacqueline AU - Williams, Spencer AU - Tavares, Walter AU - Wiljer, David PY - 2021/12/13 TI - Artificial Intelligence Education Programs for Health Care Professionals: Scoping Review JO - JMIR Med Educ SP - e31043 VL - 7 IS - 4 KW - machine learning KW - deep learning KW - health care providers KW - education KW - learning KW - patient care N2 - Background: As the adoption of artificial intelligence (AI) in health care increases, it will become increasingly crucial to involve health care professionals (HCPs) in developing, validating, and implementing AI-enabled technologies. However, because of a lack of AI literacy, most HCPs are not adequately prepared for this revolution. This is a significant barrier to adopting and implementing AI that will affect patients. 
In addition, the limited existing AI education programs face barriers to development and implementation at various levels of medical education. Objective: With a view to informing future AI education programs for HCPs, this scoping review aims to provide an overview of current and past AI education programs, focusing on their curricular content, modes of delivery, critical implementation factors for education delivery, and the outcomes used to assess the programs' effectiveness. Methods: After the creation of a search strategy and keyword searches, a 2-stage screening process was conducted by 2 independent reviewers to determine study eligibility. When consensus was not reached, the conflict was resolved by consulting a third reviewer. This process consisted of a title and abstract scan and a full-text review. Articles were included if they discussed an actual or potential training program or educational intervention (including the desired content to be covered), focused on AI, and were designed or intended for HCPs (at any stage of their career). Results: Of the 10,094 unique citations scanned, 41 (0.41%) studies relevant to our eligibility criteria were identified. Among the 41 included studies, 10 (24%) described 13 unique programs, and 31 (76%) discussed recommended curricular content. The curricular content of the unique programs ranged from AI use and AI interpretation to cultivating the skills needed to explain results derived from AI algorithms. The curricular topics were categorized into three main domains: cognitive, psychomotor, and affective. Conclusions: This review provides an overview of the current landscape of AI in medical education and highlights the skills and competencies required by HCPs to effectively use AI in enhancing the quality of care and optimizing patient outcomes. Future education efforts should focus on the development of regulatory strategies, a multidisciplinary approach to curriculum redesign, a competency-based curriculum, and patient-clinician interaction. UR - https://mededu.jmir.org/2021/4/e31043 UR - http://dx.doi.org/10.2196/31043 UR - http://www.ncbi.nlm.nih.gov/pubmed/34898458 ID - info:doi/10.2196/31043 ER - TY - JOUR AU - Sapci, Hasan A. AU - Sapci, Aylin H. PY - 2020/6/30 TI - Artificial Intelligence Education and Tools for Medical and Health Informatics Students: Systematic Review JO - JMIR Med Educ SP - e19285 VL - 6 IS - 1 KW - artificial intelligence KW - education KW - machine learning KW - deep learning KW - medical education KW - health informatics KW - systematic review N2 - Background: The use of artificial intelligence (AI) in medicine will generate numerous application possibilities to improve patient care, provide real-time data analytics, and enable continuous patient monitoring. Clinicians and health informaticians should become familiar with machine learning and deep learning. Additionally, they should have a strong background in data analytics and data visualization to use, evaluate, and develop AI applications in clinical practice. Objective: The main objective of this study was to evaluate the current state of AI training and the use of AI tools to enhance the learning experience. Methods: A comprehensive systematic review was conducted to analyze the use of AI in medical and health informatics education, and to evaluate existing AI training practices. PRISMA-P (Preferred Reporting Items for Systematic Reviews and Meta-Analysis Protocols) guidelines were followed.
The studies that focused on the use of AI tools to enhance medical education and the studies that investigated teaching AI as a new competency were categorized separately to evaluate recent developments. Results: This systematic review revealed that recent publications recommend the integration of AI training into medical and health informatics curricula. Conclusions: To the best of our knowledge, this is the first systematic review exploring the current state of AI education in both medicine and health informatics. Since AI curricula have not been standardized and competencies have not been determined, a framework for specialized AI training in medical and health informatics education is proposed. UR - http://mededu.jmir.org/2020/1/e19285/ UR - http://dx.doi.org/10.2196/19285 UR - http://www.ncbi.nlm.nih.gov/pubmed/32602844 ID - info:doi/10.2196/19285 ER - TY - JOUR AU - Paranjape, Ketan AU - Schinkel, Michiel AU - Nannan Panday, Rishi AU - Car, Josip AU - Nanayakkara, Prabath PY - 2019/12/3 TI - Introducing Artificial Intelligence Training in Medical Education JO - JMIR Med Educ SP - e16048 VL - 5 IS - 2 KW - algorithm KW - artificial intelligence KW - black box KW - deep learning KW - machine learning KW - medical education KW - continuing education KW - data sciences KW - curriculum UR - http://mededu.jmir.org/2019/2/e16048/ UR - http://dx.doi.org/10.2196/16048 UR - http://www.ncbi.nlm.nih.gov/pubmed/31793895 ID - info:doi/10.2196/16048 ER - TY - JOUR AU - Chan, Siang Kai AU - Zary, Nabil PY - 2019/6/15 TI - Applications and Challenges of Implementing Artificial Intelligence in Medical Education: Integrative Review JO - JMIR Med Educ SP - e13930 VL - 5 IS - 1 KW - medical education KW - evaluation of AIED systems KW - real world applications of AIED systems KW - artificial intelligence N2 - Background: Since the advent of artificial intelligence (AI) in 1955, the applications of AI have increased over the years within a rapidly changing digital landscape where public expectations are on the rise, fed by social media, industry leaders, and medical practitioners. However, there has been little interest in AI in medical education until the last two decades, with only a recent increase in the number of publications and citations in the field. To our knowledge, thus far, a limited number of articles have discussed or reviewed the current use of AI in medical education. Objective: This study aims to review the current applications of AI in medical education as well as the challenges of implementing AI in medical education. Methods: Medline (Ovid), EBSCOhost Education Resources Information Center (ERIC) and Education Source, and Web of Science were searched with explicit inclusion and exclusion criteria. The full text of the selected articles was analyzed using the Extension of Technology Acceptance Model and the Diffusion of Innovations theory. Data were subsequently pooled together and analyzed quantitatively. Results: A total of 37 articles were identified. Three primary uses of AI in medical education were identified: learning support (n=32), assessment of students' learning (n=4), and curriculum review (n=1). The main reasons for using AI are its ability to provide feedback, offer a guided learning pathway, and decrease costs. Subgroup analysis revealed that medical undergraduates are the primary target audience for AI use.
In addition, 34 articles described the challenges of AI implementation in medical education; two main challenges were identified: difficulty in assessing the effectiveness of AI in medical education and technical difficulties encountered while developing AI applications. Conclusions: The primary use of AI in medical education was for learning support, mainly because of its ability to provide individualized feedback. Little emphasis was placed on curriculum review and assessment of students' learning due to the lack of digitalization and the sensitive nature of examinations, respectively. The manipulation of big data also underscores the need to ensure data integrity. Methodological improvements are required to increase AI adoption by addressing the technical difficulties of creating an AI application and by using novel methods to assess the effectiveness of AI. To better integrate AI into the medical profession, measures should be taken to introduce AI into the medical school curriculum so that medical professionals can better understand AI algorithms and maximize the use of AI. UR - http://mededu.jmir.org/2019/1/e13930/ UR - http://dx.doi.org/10.2196/13930 UR - http://www.ncbi.nlm.nih.gov/pubmed/31199295 ID - info:doi/10.2196/13930 ER -