TY - JOUR AU - Montagna, Marco AU - Chiabrando, Filippo AU - De Lorenzo, Rebecca AU - Rovere Querini, Patrizia PY - 2025/3/18 TI - Impact of Clinical Decision Support Systems on Medical Students' Case-Solving Performance: Comparison Study with a Focus Group JO - JMIR Med Educ SP - e55709 VL - 11 KW - chatGPT KW - chatbot KW - machine learning KW - ML KW - artificial intelligence KW - AI KW - algorithm KW - predictive model KW - predictive analytics KW - predictive system KW - practical model KW - deep learning KW - large language models KW - LLMs KW - medical education KW - medical teaching KW - teaching environment KW - clinical decision support systems KW - CDSS KW - decision support KW - decision support tool KW - clinical decision-making KW - innovative teaching N2 - Background: Health care practitioners use clinical decision support systems (CDSS) as an aid in the crucial task of clinical reasoning and decision-making. Traditional CDSS are online repositories (ORs) and clinical practice guidelines (CPG). Recently, large language models (LLMs) such as ChatGPT have emerged as potential alternatives. They have proven to be powerful, innovative tools, yet they are not devoid of worrisome risks. Objective: This study aims to explore how medical students perform in an evaluated clinical case through the use of different CDSS tools. Methods: The authors randomly divided medical students into 3 groups, CPG, n=6 (38%); OR, n=5 (31%); and ChatGPT, n=5 (31%); and assigned each group a different type of CDSS for guidance in answering prespecified questions, assessing how students' speed and ability at resolving the same clinical case varied accordingly. External reviewers evaluated all answers based on accuracy and completeness metrics (score: 1-5). The authors analyzed and categorized group scores according to the skill investigated: differential diagnosis, diagnostic workup, and clinical decision-making. Results: Answering time showed a trend for the ChatGPT group to be the fastest. The mean scores for completeness were as follows: CPG 4.0, OR 3.7, and ChatGPT 3.8 (P=.49). The mean scores for accuracy were as follows: CPG 4.0, OR 3.3, and ChatGPT 3.7 (P=.02). Aggregating scores according to the 3 students' skill domains, trends in differences among the groups emerge more clearly, with the CPG group performing best in nearly all domains and maintaining almost perfect alignment between its completeness and accuracy. Conclusions: This hands-on session provided valuable insights into the potential perks and associated pitfalls of LLMs in medical education and practice. It suggested the critical need to include teachings in medical degree courses on how to properly take advantage of LLMs, as the potential for misuse is evident and real.
UR - https://mededu.jmir.org/2025/1/e55709 UR - http://dx.doi.org/10.2196/55709 ID - info:doi/10.2196/55709 ER - TY - JOUR AU - Monzon, Noahlana AU - Hays, Alan Franklin PY - 2025/3/11 TI - Leveraging Generative Artificial Intelligence to Improve Motivation and Retrieval in Higher Education Learners JO - JMIR Med Educ SP - e59210 VL - 11 KW - educational technology KW - retrieval practice KW - flipped classroom KW - cognitive engagement KW - personalized learning KW - generative artificial intelligence KW - higher education KW - university education KW - learners KW - instructors KW - curriculum structure KW - learning KW - technologies KW - innovation KW - academic misconduct KW - gamification KW - self-directed KW - socio-economic disparities KW - interactive approach KW - medical education KW - chatGPT KW - machine learning KW - AI KW - large language models UR - https://mededu.jmir.org/2025/1/e59210 UR - http://dx.doi.org/10.2196/59210 ID - info:doi/10.2196/59210 ER - TY - JOUR AU - Doru, Berin AU - Maier, Christoph AU - Busse, Sophie Johanna AU - Lücke, Thomas AU - Schönhoff, Judith AU - Enax-Krumova, Elena AU - Hessler, Steffen AU - Berger, Maria AU - Tokic, Marianne PY - 2025/3/3 TI - Detecting Artificial Intelligence-Generated Versus Human-Written Medical Student Essays: Semirandomized Controlled Study JO - JMIR Med Educ SP - e62779 VL - 11 KW - artificial intelligence KW - ChatGPT KW - large language models KW - textual analysis KW - writing style KW - AI KW - chatbot KW - LLMs KW - detection KW - authorship KW - medical student KW - linguistic quality KW - decision-making KW - logical coherence N2 - Background: Large language models, exemplified by ChatGPT, have reached a level of sophistication that makes distinguishing between human- and artificial intelligence (AI)-generated texts increasingly challenging. This has raised concerns in academia, particularly in medicine, where the accuracy and authenticity of written work are paramount. Objective: This semirandomized controlled study aims to examine the ability of 2 blinded expert groups with different levels of content familiarity (medical professionals and humanities scholars with expertise in textual analysis) to distinguish between longer scientific texts in German written by medical students and those generated by ChatGPT. Additionally, the study sought to analyze the reasoning behind their identification choices, particularly the role of content familiarity and linguistic features. Methods: Between May and August 2023, a total of 35 experts (medical: n=22; humanities: n=13) were each presented with 2 pairs of texts on different medical topics. Each pair had similar content and structure: 1 text was written by a medical student, and the other was generated by ChatGPT (version 3.5, March 2023). Experts were asked to identify the AI-generated text and justify their choice. These justifications were analyzed through a multistage, interdisciplinary qualitative analysis to identify relevant textual features. Before unblinding, experts rated each text on 6 characteristics: linguistic fluency and spelling/grammatical accuracy, scientific quality, logical coherence, expression of knowledge limitations, formulation of future research questions, and citation quality. Univariate tests and multivariate logistic regression analyses were used to examine associations between participants' characteristics, their stated reasons for author identification, and the likelihood of correctly determining a text's authorship.
Results: Overall, in 48 out of 69 (70%) decision rounds, participants accurately identified the AI-generated texts, with minimal difference between groups (medical: 31/43, 72%; humanities: 17/26, 65%; odds ratio [OR] 1.37, 95% CI 0.5-3.9). While content errors had little impact on identification accuracy, stylistic features, particularly redundancy (OR 6.90, 95% CI 1.01-47.1), repetition (OR 8.05, 95% CI 1.25-51.7), and thread/coherence (OR 6.62, 95% CI 1.25-35.2), played a crucial role in participants' decisions to identify a text as AI-generated. Conclusions: The findings suggest that both medical and humanities experts were able to identify ChatGPT-generated texts in medical contexts, with their decisions largely based on linguistic attributes. The accuracy of identification appears to be independent of experts' familiarity with the text content. As the decision-making process primarily relies on linguistic attributes, such as stylistic features and text coherence, further quasi-experimental studies using texts from other academic disciplines should be conducted to determine whether instructions based on these features can enhance lecturers' ability to distinguish between student-authored and AI-generated work. UR - https://mededu.jmir.org/2025/1/e62779 UR - http://dx.doi.org/10.2196/62779 UR - http://www.ncbi.nlm.nih.gov/pubmed/40053752 ID - info:doi/10.2196/62779 ER - TY - JOUR AU - Abouammoh, Noura AU - Alhasan, Khalid AU - Aljamaan, Fadi AU - Raina, Rupesh AU - Malki, H. Khalid AU - Altamimi, Ibraheem AU - Muaygil, Ruaim AU - Wahabi, Hayfaa AU - Jamal, Amr AU - Alhaboob, Ali AU - Assiri, Assad Rasha AU - Al-Tawfiq, A. Jaffar AU - Al-Eyadhy, Ayman AU - Soliman, Mona AU - Temsah, Mohamad-Hani PY - 2025/2/20 TI - Perceptions and Earliest Experiences of Medical Students and Faculty With ChatGPT in Medical Education: Qualitative Study JO - JMIR Med Educ SP - e63400 VL - 11 KW - ChatGPT KW - medical education KW - Saudi Arabia KW - perceptions KW - knowledge KW - medical students KW - faculty KW - chatbot KW - qualitative study KW - artificial intelligence KW - AI KW - AI-based tools KW - universities KW - thematic analysis KW - learning KW - satisfaction N2 - Background: With the rapid development of artificial intelligence technologies, there is a growing interest in the potential use of artificial intelligence-based tools like ChatGPT in medical education. However, there is limited research on the initial perceptions and experiences of faculty and students with ChatGPT, particularly in Saudi Arabia. Objective: This study aimed to explore the earliest knowledge, perceived benefits, concerns, and limitations of using ChatGPT in medical education among faculty and students at a leading Saudi Arabian university. Methods: A qualitative exploratory study was conducted in April 2023, involving focused meetings with medical faculty and students with varying levels of ChatGPT experience. A thematic analysis was used to identify key themes and subthemes emerging from the discussions. Results: Participants demonstrated good knowledge of ChatGPT and its functions. The main themes were perceptions of ChatGPT use, potential benefits, and concerns about ChatGPT in research and medical education. The perceived benefits included collecting and summarizing information and saving time and effort.
However, concerns and limitations centered around the potential lack of critical thinking in the information provided, the ambiguity of references, limitations of access, trust in the output of ChatGPT, and ethical concerns. Conclusions: This study provides valuable insights into the perceptions and experiences of medical faculty and students regarding the use of newly introduced large language models like ChatGPT in medical education. While the benefits of ChatGPT were recognized, participants also expressed concerns and limitations requiring further studies for effective integration into medical education, exploring the impact of ChatGPT on learning outcomes, student and faculty satisfaction, and the development of critical thinking skills. UR - https://mededu.jmir.org/2025/1/e63400 UR - http://dx.doi.org/10.2196/63400 UR - http://www.ncbi.nlm.nih.gov/pubmed/39977012 ID - info:doi/10.2196/63400 ER - TY - JOUR AU - Ichikawa, Tsunagu AU - Olsen, Elizabeth AU - Vinod, Arathi AU - Glenn, Noah AU - Hanna, Karim AU - Lund, C. Gregg AU - Pierce-Talsma, Stacey PY - 2025/2/11 TI - Generative Artificial Intelligence in Medical Education–Policies and Training at US Osteopathic Medical Schools: Descriptive Cross-Sectional Survey JO - JMIR Med Educ SP - e58766 VL - 11 KW - artificial intelligence KW - medical education KW - faculty development KW - policy KW - AI KW - training KW - United States KW - school KW - university KW - college KW - institution KW - osteopathic KW - osteopathy KW - curriculum KW - student KW - faculty KW - administrator KW - survey KW - cross-sectional N2 - Background: Interest has recently increased in generative artificial intelligence (GenAI), a subset of artificial intelligence that can create new content. Although the publicly available GenAI tools are not specifically trained in the medical domain, they have demonstrated proficiency in a wide range of medical assessments. The future integration of GenAI in medicine remains unknown. However, the rapid availability of GenAI with a chat interface and the potential risks and benefits are the focus of great interest. As with any significant medical advancement or change, medical schools must adapt their curricula to equip students with the skills necessary to become successful physicians. Furthermore, medical schools must ensure that faculty members have the skills to harness these new opportunities to increase their effectiveness as educators. How medical schools currently fulfill their responsibilities is unclear. Colleges of Osteopathic Medicine (COMs) in the United States currently train a significant proportion of the total number of medical students. These COMs are in academic settings ranging from large public research universities to small private institutions. Therefore, studying COMs will offer a representative sample of the current GenAI integration in medical education. Objective: This study aims to describe the policies and training regarding the specific aspect of GenAI in US COMs, targeting students, faculty, and administrators. Methods: Web-based surveys were sent to deans and Student Government Association (SGA) presidents of the main campuses of fully accredited US COMs. The dean survey included questions regarding current and planned policies and training related to GenAI for students, faculty, and administrators. The SGA president survey included only those questions related to current student policies and training. Results: Responses were received from 81% (26/32) of COMs surveyed.
This included 47% (15/32) of the deans and 50% (16/32) of the SGA presidents (with 5 COMs represented by both the deans and the SGA presidents). Most COMs did not have a policy on the student use of GenAI, as reported by the dean (14/15, 93%) and the SGA president (14/16, 88%). Of the COMs with no policy, 79% (11/14) had no formal plans for policy development. Only 1 COM had training for students, which focused entirely on the ethics of using GenAI. Most COMs had no formal plans to provide mandatory (11/14, 79%) or elective (11/15, 73%) training. No COM had GenAI policies for faculty or administrators. Eighty percent had no formal plans for policy development. Furthermore, 33.3% (5/15) of COMs had faculty or administrator GenAI training. Except for examination question development, there was no training to increase faculty or administrator capabilities and efficiency or to decrease their workload. Conclusions: The survey revealed that most COMs lack GenAI policies and training for students, faculty, and administrators. The few institutions with policies or training were extremely limited in scope. Most institutions without current training or policies had no formal plans for development. The lack of current policies and training initiatives suggests inadequate preparedness for integrating GenAI into the medical school environment, thereby relegating the responsibility for ethical guidance and training to the individual COM member. UR - https://mededu.jmir.org/2025/1/e58766 UR - http://dx.doi.org/10.2196/58766 ID - info:doi/10.2196/58766 ER - TY - JOUR AU - Elhassan, Elwaleed Safia AU - Sajid, Raihan Muhammad AU - Syed, Mariam Amina AU - Fathima, Afreen Sidrah AU - Khan, Shehroz Bushra AU - Tamim, Hala PY - 2025/1/30 TI - Assessing Familiarity, Usage Patterns, and Attitudes of Medical Students Toward ChatGPT and Other Chat-Based AI Apps in Medical Education: Cross-Sectional Questionnaire Study JO - JMIR Med Educ SP - e63065 VL - 11 KW - ChatGPT KW - artificial intelligence KW - large language model KW - medical students KW - ethics KW - chat-based KW - AI apps KW - medical education KW - social media KW - attitude KW - AI N2 - Background: There has been a rise in the popularity of ChatGPT and other chat-based artificial intelligence (AI) apps in medical education. Despite data being available from other parts of the world, there is a significant lack of information on this topic in medical education and research, particularly in Saudi Arabia. Objective: The primary objective of the study was to examine the familiarity, usage patterns, and attitudes of Alfaisal University medical students toward ChatGPT and other chat-based AI apps in medical education. Methods: This was a cross-sectional study conducted from October 8, 2023, through November 22, 2023. A questionnaire was distributed through social media channels to medical students at Alfaisal University who were 18 years or older. Current Alfaisal University medical students in years 1 through 6, of both genders, were exclusively targeted by the questionnaire. The study was approved by the Alfaisal University Institutional Review Board. A χ2 test was conducted to assess the relationships between gender, year of study, familiarity, and reasons for usage. Results: A total of 293 responses were received, of which 95 (32.4%) were from men and 198 (67.6%) were from women. There were 236 (80.5%) responses from preclinical students and 57 (19.5%) from clinical students.
Overall, males (n=93, 97.9%) showed more familiarity with ChatGPT compared to females (n=180, 90.9%; P=.03). Males also used Google Bard and Microsoft Bing ChatGPT more than females (P<.001). Clinical-year students used ChatGPT significantly more for general writing purposes compared to preclinical students (P=.005). Additionally, 136 (46.4%) students believed that using ChatGPT and other chat-based AI apps for coursework was ethical, 86 (29.4%) were neutral, and 71 (24.2%) considered it unethical (all Ps>.05). Conclusions: Familiarity with and usage of ChatGPT and other chat-based AI apps were common among the students of Alfaisal University. The usage patterns of these apps differ between males and females and between preclinical and clinical-year students. UR - https://mededu.jmir.org/2025/1/e63065 UR - http://dx.doi.org/10.2196/63065 ID - info:doi/10.2196/63065 ER - TY - JOUR AU - Kaewboonlert, Naritsaret AU - Poontananggul, Jiraphon AU - Pongsuwan, Natthipong AU - Bhakdisongkhram, Gun PY - 2025/1/13 TI - Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study JO - JMIR Med Educ SP - e58898 VL - 11 KW - accuracy KW - performance KW - artificial intelligence KW - AI KW - ChatGPT KW - large language model KW - LLM KW - difficulty index KW - basic medical science examination KW - cross-sectional study KW - medical education KW - datasets KW - assessment KW - medical science KW - tool KW - Google N2 - Background: Artificial intelligence (AI) has become widely applied across many fields, including medical education. Content validation and its answers are based on training datasets and the optimization of each model. The accuracy of large language models (LLMs) in basic medical examinations and factors related to their accuracy have also been explored. Objective: We evaluated factors associated with the accuracy of LLMs (GPT-3.5, GPT-4, Google Bard, and Microsoft Bing) in answering multiple-choice questions from basic medical science examinations. Methods: We used questions that were closely aligned with the content and topic distribution of Thailand's Step 1 National Medical Licensing Examination. Variables such as the difficulty index, discrimination index, and question characteristics were collected. These questions were then simultaneously input into ChatGPT (with GPT-3.5 and GPT-4), Microsoft Bing, and Google Bard, and their responses were recorded. The accuracy of these LLMs and the associated factors were analyzed using multivariable logistic regression. This analysis aimed to assess the effect of various factors on model accuracy, with results reported as odds ratios (ORs). Results: The study revealed that GPT-4 was the top-performing model, with an overall accuracy of 89.07% (95% CI 84.76%-92.41%), significantly outperforming the others (P<.001). Microsoft Bing followed with an accuracy of 83.69% (95% CI 78.85%-87.80%), GPT-3.5 at 67.02% (95% CI 61.20%-72.48%), and Google Bard at 63.83% (95% CI 57.92%-69.44%). The multivariable logistic regression analysis showed a correlation between question difficulty and model performance, with GPT-4 demonstrating the strongest association. Interestingly, no significant correlation was found between model accuracy and question length, negative wording, clinical scenarios, or the discrimination index for most models, except for Google Bard, which showed varying correlations.
Conclusions: The GPT-4 and Microsoft Bing models demonstrated equal and superior accuracy compared to GPT-3.5 and Google Bard in the domain of basic medical science. The accuracy of these models was significantly influenced by the item's difficulty index, indicating that the LLMs are more accurate when answering easier questions. This suggests that the more accurate models, such as GPT-4 and Bing, can be valuable tools for understanding and learning basic medical science concepts. UR - https://mededu.jmir.org/2025/1/e58898 UR - http://dx.doi.org/10.2196/58898 ID - info:doi/10.2196/58898 ER - TY - JOUR AU - Dzuali, Fiatsogbe AU - Seiger, Kira AU - Novoa, Roberto AU - Aleshin, Maria AU - Teng, Joyce AU - Lester, Jenna AU - Daneshjou, Roxana PY - 2024/12/10 TI - ChatGPT May Improve Access to Language-Concordant Care for Patients With Non-English Language Preferences JO - JMIR Med Educ SP - e51435 VL - 10 KW - ChatGPT KW - artificial intelligence KW - language KW - translation KW - health care disparity KW - natural language model KW - survey KW - patient education KW - preference KW - human language KW - language-concordant care UR - https://mededu.jmir.org/2024/1/e51435 UR - http://dx.doi.org/10.2196/51435 ID - info:doi/10.2196/51435 ER - TY - JOUR AU - Huang, Ting-Yun AU - Hsieh, Hsing Pei AU - Chang, Yung-Chun PY - 2024/11/21 TI - Performance Comparison of Junior Residents and ChatGPT in the Objective Structured Clinical Examination (OSCE) for Medical History Taking and Documentation of Medical Records: Development and Usability Study JO - JMIR Med Educ SP - e59902 VL - 10 KW - large language model KW - medical history taking KW - clinical documentation KW - simulation-based evaluation KW - OSCE standards KW - LLM N2 - Background: This study explores the cutting-edge abilities of large language models (LLMs) such as ChatGPT in medical history taking and medical record documentation, with a focus on their practical effectiveness in clinical settings, an area vital for the progress of medical artificial intelligence. Objective: Our aim was to assess the capability of ChatGPT versions 3.5 and 4.0 in performing medical history taking and medical record documentation in simulated clinical environments. The study compared the performance of nonmedical individuals using ChatGPT with that of junior medical residents. Methods: A simulation involving standardized patients was designed to mimic authentic medical history-taking interactions. Five nonmedical participants used ChatGPT versions 3.5 and 4.0 to conduct medical histories and document medical records, mirroring the tasks performed by 5 junior residents in identical scenarios. A total of 10 diverse scenarios were examined. Results: Evaluation of the medical documentation created by laypersons with ChatGPT assistance and those created by junior residents was conducted by 2 senior emergency physicians using audio recordings and the final medical records. The assessment used the Objective Structured Clinical Examination benchmarks in Taiwan as a reference. ChatGPT-4.0 exhibited substantial enhancements over its predecessor and met or exceeded the performance of human counterparts in terms of both checklist and global assessment scores. Although the overall quality of human consultations remained higher, ChatGPT-4.0's proficiency in medical documentation was notably promising.
Conclusions: The performance of ChatGPT 4.0 was on par with that of human participants in Objective Structured Clinical Examination evaluations, signifying its potential in medical history and medical record documentation. Despite this, the superiority of human consultations in terms of quality was evident. The study underscores both the promise and the current limitations of LLMs in the realm of clinical practice. UR - https://mededu.jmir.org/2024/1/e59902 UR - http://dx.doi.org/10.2196/59902 ID - info:doi/10.2196/59902 ER - TY - JOUR AU - Ehrett, Carl AU - Hegde, Sudeep AU - Andre, Kwame AU - Liu, Dixizi AU - Wilson, Timothy PY - 2024/11/19 TI - Leveraging Open-Source Large Language Models for Data Augmentation in Hospital Staff Surveys: Mixed Methods Study JO - JMIR Med Educ SP - e51433 VL - 10 KW - data augmentation KW - large language models KW - medical education KW - natural language processing KW - data security KW - ethics KW - AI KW - artificial intelligence KW - data privacy KW - medical staff N2 - Background: Generative large language models (LLMs) have the potential to revolutionize medical education by generating tailored learning materials, enhancing teaching efficiency, and improving learner engagement. However, the application of LLMs in health care settings, particularly for augmenting small datasets in text classification tasks, remains underexplored, especially for cost- and privacy-conscious applications that do not permit the use of third-party services such as OpenAI's ChatGPT. Objective: This study aims to explore the use of open-source LLMs, such as Large Language Model Meta AI (LLaMA) and Alpaca models, for data augmentation in a specific text classification task related to hospital staff surveys. Methods: The surveys were designed to elicit narratives of everyday adaptation by frontline radiology staff during the initial phase of the COVID-19 pandemic. A 2-step process of data augmentation and text classification was conducted. The study generated synthetic data similar to the survey reports using 4 generative LLMs for data augmentation. A different set of 3 classifier LLMs was then used to classify the augmented text for thematic categories. The study evaluated performance on the classification task. Results: The overall best-performing combination of LLMs, temperature, classifier, and number of synthetic data cases is via augmentation with LLaMA 7B at temperature 0.7 with 100 augments, using Robustly Optimized BERT Pretraining Approach (RoBERTa) for the classification task, achieving an average area under the receiver operating characteristic (AUC) curve of 0.87 (SD 0.02; ie, 1 SD). The results demonstrate that open-source LLMs can enhance text classifiers' performance for small datasets in health care contexts, providing promising pathways for improving medical education processes and patient care practices. Conclusions: The study demonstrates the value of data augmentation with open-source LLMs, highlights the importance of privacy and ethical considerations when using LLMs, and suggests future directions for research in this field.
UR - https://mededu.jmir.org/2024/1/e51433 UR - http://dx.doi.org/10.2196/51433 ID - info:doi/10.2196/51433 ER - TY - JOUR AU - Zhou, You AU - Li, Si-Jia AU - Tang, Xing-Yi AU - He, Yi-Chen AU - Ma, Hao-Ming AU - Wang, Ao-Qi AU - Pei, Run-Yuan AU - Piao, Mei-Hua PY - 2024/11/19 TI - Using ChatGPT in Nursing: Scoping Review of Current Opinions JO - JMIR Med Educ SP - e54297 VL - 10 KW - ChatGPT KW - large language model KW - nursing KW - artificial intelligence KW - scoping review KW - generative AI KW - nursing education N2 - Background: Since the release of ChatGPT in November 2022, this emerging technology has garnered a lot of attention in various fields, and nursing is no exception. However, to date, no study has comprehensively summarized the status and opinions of using ChatGPT across different nursing fields. Objective: We aim to synthesize the status and opinions of using ChatGPT according to different nursing fields, as well as assess ChatGPT's strengths, weaknesses, and the potential impacts it may cause. Methods: This scoping review was conducted following the framework of Arksey and O'Malley and guided by the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews). A comprehensive literature search was conducted in 4 web-based databases (PubMed, Embase, Web of Science, and CINAHL) to identify studies reporting the opinions of using ChatGPT in nursing fields from 2022 to September 3, 2023. The references of the included studies were screened manually to further identify relevant studies. Two authors conducted study screening, eligibility assessments, and data extraction independently. Results: A total of 30 studies were included. The United States (7 studies), Canada (5 studies), and China (4 studies) were the countries with the most publications. In terms of fields of concern, studies mainly focused on "ChatGPT and nursing education" (20 studies), "ChatGPT and nursing practice" (10 studies), and "ChatGPT and nursing research, writing, and examination" (6 studies). Six studies addressed the use of ChatGPT in multiple nursing fields. Conclusions: As an emerging artificial intelligence technology, ChatGPT has great potential to revolutionize nursing education, nursing practice, and nursing research. However, researchers, institutions, and administrations still need to critically examine its accuracy, safety, and privacy, as well as academic misconduct and potential ethical issues that it may lead to before applying ChatGPT to practice. UR - https://mededu.jmir.org/2024/1/e54297 UR - http://dx.doi.org/10.2196/54297 ID - info:doi/10.2196/54297 ER - TY - JOUR AU - Ros-Arlanzón, Pablo AU - Perez-Sempere, Angel PY - 2024/11/14 TI - Evaluating AI Competence in Specialized Medicine: Comparative Analysis of ChatGPT and Neurologists in a Neurology Specialist Examination in Spain JO - JMIR Med Educ SP - e56762 VL - 10 KW - artificial intelligence KW - ChatGPT KW - clinical decision-making KW - medical education KW - medical knowledge assessment KW - OpenAI N2 - Background: With the rapid advancement of artificial intelligence (AI) in various fields, evaluating its application in specialized medical contexts becomes crucial. ChatGPT, a large language model developed by OpenAI, has shown potential in diverse applications, including medicine.
Objective: This study aims to compare the performance of ChatGPT with that of attending neurologists in a real neurology specialist examination conducted in the Valencian Community, Spain, assessing the AI's capabilities and limitations in medical knowledge. Methods: We conducted a comparative analysis using the 2022 neurology specialist examination results from 120 neurologists and responses generated by ChatGPT versions 3.5 and 4. The examination consisted of 80 multiple-choice questions, with a focus on clinical neurology and health legislation. Questions were classified according to Bloom's Taxonomy. Statistical analysis of performance, including the κ coefficient for response consistency, was performed. Results: Human participants exhibited a median score of 5.91 (IQR 4.93-6.76), with 32 neurologists failing to pass. ChatGPT-3.5 ranked 116th out of 122, answering 54.5% of questions correctly (score 3.94). ChatGPT-4 showed marked improvement, ranking 17th with 81.8% of correct answers (score 7.57), surpassing several human specialists. No significant variations were observed in the performance on lower-order questions versus higher-order questions. Additionally, ChatGPT-4 demonstrated increased interrater reliability, as reflected by a higher κ coefficient of 0.73, compared to ChatGPT-3.5's coefficient of 0.69. Conclusions: This study underscores the evolving capabilities of AI in medical knowledge assessment, particularly in specialized fields. ChatGPT-4's performance, outperforming the median score of human participants in a rigorous neurology examination, represents a significant milestone in AI development, suggesting its potential as an effective tool in specialized medical education and assessment. UR - https://mededu.jmir.org/2024/1/e56762 UR - http://dx.doi.org/10.2196/56762 ID - info:doi/10.2196/56762 ER - TY - JOUR AU - Goodings, James Anthony AU - Kajitani, Sten AU - Chhor, Allison AU - Albakri, Ahmad AU - Pastrak, Mila AU - Kodancha, Megha AU - Ives, Rowan AU - Lee, Bin Yoo AU - Kajitani, Kari PY - 2024/10/8 TI - Assessment of ChatGPT-4 in Family Medicine Board Examinations Using Advanced AI Learning and Analytical Methods: Observational Study JO - JMIR Med Educ SP - e56128 VL - 10 KW - ChatGPT-4 KW - Family Medicine Board Examination KW - artificial intelligence in medical education KW - AI performance assessment KW - prompt engineering KW - ChatGPT KW - artificial intelligence KW - AI KW - medical education KW - assessment KW - observational KW - analytical method KW - data analysis KW - examination N2 - Background: This research explores the capabilities of ChatGPT-4 in passing the American Board of Family Medicine (ABFM) Certification Examination. Addressing a gap in existing literature, where earlier artificial intelligence (AI) models showed limitations in medical board examinations, this study evaluates the enhanced features and potential of ChatGPT-4, especially in document analysis and information synthesis. Objective: The primary goal is to assess whether ChatGPT-4, when provided with extensive preparation resources and when using sophisticated data analysis, can achieve a score equal to or above the passing threshold for the Family Medicine Board Examinations. Methods: In this study, ChatGPT-4 was embedded in a specialized subenvironment, "AI Family Medicine Board Exam Taker," designed to closely mimic the conditions of the ABFM Certification Examination.
This subenvironment enabled the AI to access and analyze a range of relevant study materials, including a primary medical textbook and supplementary web-based resources. The AI was presented with a series of ABFM-type examination questions, reflecting the breadth and complexity typical of the examination. Emphasis was placed on assessing the AI's ability to interpret and respond to these questions accurately, leveraging its advanced data processing and analysis capabilities within this controlled subenvironment. Results: In our study, ChatGPT-4's performance was quantitatively assessed on 300 practice ABFM examination questions. The AI achieved a correct response rate of 88.67% (95% CI 85.08%-92.25%) for the Custom Robot version and 87.33% (95% CI 83.57%-91.10%) for the Regular version. Statistical analysis, including the McNemar test (P=.45), indicated no significant difference in accuracy between the 2 versions. In addition, the chi-square test for error-type distribution (P=.32) revealed no significant variation in the pattern of errors across versions. These results highlight ChatGPT-4's capacity for high-level performance and consistency in responding to complex medical examination questions under controlled conditions. Conclusions: The study demonstrates that ChatGPT-4, particularly when equipped with specialized preparation and when operating in a tailored subenvironment, shows promising potential in handling the intricacies of medical board examinations. While its performance is comparable with the expected standards for passing the ABFM Certification Examination, further enhancements in AI technology and tailored training methods could push these capabilities to new heights. This exploration opens avenues for integrating AI tools such as ChatGPT-4 in medical education and assessment, emphasizing the importance of continuous advancement and specialized training in medical applications of AI. UR - https://mededu.jmir.org/2024/1/e56128 UR - http://dx.doi.org/10.2196/56128 ID - info:doi/10.2196/56128 ER - TY - JOUR AU - Wu, Zelin AU - Gan, Wenyi AU - Xue, Zhaowen AU - Ni, Zhengxin AU - Zheng, Xiaofei AU - Zhang, Yiyi PY - 2024/10/3 TI - Performance of ChatGPT on Nursing Licensure Examinations in the United States and China: Cross-Sectional Study JO - JMIR Med Educ SP - e52746 VL - 10 KW - artificial intelligence KW - ChatGPT KW - nursing licensure examination KW - nursing KW - LLMs KW - large language models KW - nursing education KW - AI KW - nursing student KW - large language model KW - licensing KW - observation KW - observational study KW - China KW - USA KW - United States of America KW - auxiliary tool KW - accuracy rate KW - theoretical N2 - Background: The creation of large language models (LLMs) such as ChatGPT is an important step in the development of artificial intelligence, which shows great potential in medical education due to its powerful language understanding and generative capabilities. The purpose of this study was to quantitatively evaluate and comprehensively analyze ChatGPT's performance in handling questions for the National Nursing Licensure Examination (NNLE) in China and the United States, including the National Council Licensure Examination for Registered Nurses (NCLEX-RN) and the NNLE. Objective: This study aims to examine how well LLMs respond to the NCLEX-RN and the NNLE multiple-choice questions (MCQs) in various language inputs.
The study also aims to evaluate whether LLMs can be used as multilingual learning assistance for nursing and to assess whether they possess a repository of professional knowledge applicable to clinical nursing practice. Methods: First, we compiled 150 NCLEX-RN Practical MCQs, 240 NNLE Theoretical MCQs, and 240 NNLE Practical MCQs. Then, the translation function of ChatGPT 3.5 was used to translate NCLEX-RN questions from English to Chinese and NNLE questions from Chinese to English. Finally, the original version and the translated version of the MCQs were inputted into ChatGPT 4.0, ChatGPT 3.5, and Google Bard. Different LLMs were compared according to the accuracy rate, and the differences between language inputs were compared. Results: The accuracy rates of ChatGPT 4.0 for NCLEX-RN practical questions and Chinese-translated NCLEX-RN practical questions were 88.7% (133/150) and 79.3% (119/150), respectively. Despite the statistical significance of the difference (P=.03), the correct rate was generally satisfactory. Around 71.9% (169/235) of NNLE Theoretical MCQs and 69.1% (161/233) of NNLE Practical MCQs were correctly answered by ChatGPT 4.0. The accuracy of ChatGPT 4.0 in processing NNLE Theoretical MCQs and NNLE Practical MCQs translated into English was 71.5% (168/235; P=.92) and 67.8% (158/233; P=.77), respectively, and there was no statistically significant difference between the results of text input in different languages. ChatGPT 3.5 (NCLEX-RN P=.003, NNLE Theoretical P<.001, NNLE Practical P=.12) and Google Bard (NCLEX-RN P<.001, NNLE Theoretical P<.001, NNLE Practical P<.001) had lower accuracy rates for nursing-related MCQs than ChatGPT 4.0 in English input. For ChatGPT 3.5, English input yielded higher accuracy than Chinese input, and the difference was statistically significant (NCLEX-RN P=.02, NNLE Practical P=.02). Whether submitted in Chinese or English, the MCQs from the NCLEX-RN and NNLE demonstrated that ChatGPT 4.0 had the highest number of unique correct responses and the lowest number of unique incorrect responses among the 3 LLMs. Conclusions: This study, focusing on 618 nursing MCQs including NCLEX-RN and NNLE exams, found that ChatGPT 4.0 outperformed ChatGPT 3.5 and Google Bard in accuracy. It excelled in processing English and Chinese inputs, underscoring its potential as a valuable tool in nursing education and clinical decision-making.
UR - https://mededu.jmir.org/2024/1/e52746 UR - http://dx.doi.org/10.2196/52746 ID - info:doi/10.2196/52746 ER - TY - JOUR AU - Claman, Daniel AU - Sezgin, Emre PY - 2024/9/27 TI - Artificial Intelligence in Dental Education: Opportunities and Challenges of Large Language Models and Multimodal Foundation Models JO - JMIR Med Educ SP - e52346 VL - 10 KW - artificial intelligence KW - large language models KW - dental education KW - GPT KW - ChatGPT KW - periodontal health KW - AI KW - LLM KW - LLMs KW - chatbot KW - natural language KW - generative pretrained transformer KW - innovation KW - technology KW - large language model UR - https://mededu.jmir.org/2024/1/e52346 UR - http://dx.doi.org/10.2196/52346 ID - info:doi/10.2196/52346 ER - TY - JOUR AU - Yoon, Soo-Hyuk AU - Oh, Kyeong Seok AU - Lim, Gun Byung AU - Lee, Ho-Jin PY - 2024/9/16 TI - Performance of ChatGPT in the In-Training Examination for Anesthesiology and Pain Medicine Residents in South Korea: Observational Study JO - JMIR Med Educ SP - e56859 VL - 10 KW - AI tools KW - problem solving KW - anesthesiology KW - artificial intelligence KW - pain medicine KW - ChatGPT KW - health care KW - medical education KW - South Korea N2 - Background: ChatGPT has been tested in health care, including on the US Medical Licensing Examination and specialty examinations, showing near-passing results. Its performance in the field of anesthesiology has been assessed using English board examination questions; however, its effectiveness in Korea remains unexplored. Objective: This study investigated the problem-solving performance of ChatGPT in the fields of anesthesiology and pain medicine in the Korean language context, highlighted advancements in artificial intelligence (AI), and explored its potential applications in medical education. Methods: We investigated the performance (number of correct answers/number of questions) of GPT-4, GPT-3.5, and CLOVA X in the fields of anesthesiology and pain medicine, using in-training examinations that have been administered to Korean anesthesiology residents over the past 5 years, with an annual composition of 100 questions. Questions containing images, diagrams, or photographs were excluded from the analysis. Furthermore, to assess performance differences across languages, we conducted a comparative analysis of GPT-4's problem-solving proficiency using both the original Korean texts and their English translations. Results: A total of 398 questions were analyzed. GPT-4 (67.8%) demonstrated a significantly better overall performance than GPT-3.5 (37.2%) and CLOVA-X (36.7%). However, GPT-3.5 and CLOVA X did not show significant differences in their overall performance. Additionally, GPT-4 showed superior performance on questions translated into English, indicating a language processing discrepancy (English: 75.4% vs Korean: 67.8%; difference 7.5%; 95% CI 3.1%-11.9%; P=.001). Conclusions: This study underscores the potential of AI tools, such as ChatGPT, in medical education and practice but emphasizes the need for cautious application and further refinement, especially in non-English medical contexts. The findings suggest that although AI advancements are promising, they require careful evaluation and development to ensure acceptable performance across diverse linguistic and professional settings.
UR - https://mededu.jmir.org/2024/1/e56859 UR - http://dx.doi.org/10.2196/56859 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/56859 ER - TY - JOUR AU - Zaghir, Jamil AU - Naguib, Marco AU - Bjelogrlic, Mina AU - Névéol, Aurélie AU - Tannier, Xavier AU - Lovis, Christian PY - 2024/9/10 TI - Prompt Engineering Paradigms for Medical Applications: Scoping Review JO - J Med Internet Res SP - e60501 VL - 26 KW - prompt engineering KW - prompt design KW - prompt learning KW - prompt tuning KW - large language models KW - LLMs KW - scoping review KW - clinical natural language processing KW - natural language processing KW - NLP KW - medical texts KW - medical application KW - medical applications KW - clinical practice KW - privacy KW - medicine KW - computer science KW - medical informatics N2 - Background: Prompt engineering, focusing on crafting effective prompts for large language models (LLMs), has garnered attention for its capability to harness the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and language technicity. Clinical natural language processing applications must navigate complex language and ensure privacy compliance. Prompt engineering offers a novel approach by designing tailored prompts to guide models in exploiting clinically relevant information from complex medical texts. Despite its promise, the efficacy of prompt engineering in the medical domain remains to be fully explored. Objective: The aim of the study is to review research efforts and technical approaches in prompt engineering for medical applications as well as provide an overview of opportunities and challenges for clinical practice. Methods: Databases indexing the fields of medicine, computer science, and medical informatics were queried in order to identify relevant published papers. Since prompt engineering is an emerging field, preprint databases were also considered. Multiple data were extracted, such as the prompt paradigm, the involved LLMs, the languages of the study, the domain of the topic, the baselines, and several learning, design, and architecture strategies specific to prompt engineering. We include studies that apply prompt engineering-based methods to the medical domain, published between 2022 and 2024, and covering multiple prompt paradigms such as prompt learning (PL), prompt tuning (PT), and prompt design (PD). Results: We included 114 recent prompt engineering studies. Among the 3 prompt paradigms, we observed that PD is the most prevalent (78 papers). In 12 papers, the PD, PL, and PT terms were used interchangeably. While ChatGPT is the most commonly used LLM, we identified 7 studies using this LLM on a sensitive clinical data set. Chain-of-thought, present in 17 studies, emerges as the most frequent PD technique. While PL and PT papers typically provide a baseline for evaluating prompt-based approaches, 61% (48/78) of the PD studies do not report any nonprompt-related baseline. Finally, we individually examine each key item of prompt engineering-specific information reported across papers and find that many studies neglect to explicitly mention it, posing a challenge for advancing prompt engineering research. Conclusions: In addition to reporting on trends and the scientific landscape of prompt engineering, we provide reporting guidelines for future studies to help advance research in the medical field.
We also provide tables and figures summarizing the available medical prompt engineering papers and hope that future contributions will leverage these existing works to better advance the field. UR - https://www.jmir.org/2024/1/e60501 UR - http://dx.doi.org/10.2196/60501 UR - http://www.ncbi.nlm.nih.gov/pubmed/39255030 ID - info:doi/10.2196/60501 ER - TY - JOUR AU - Reis, Florian AU - Lenz, Christian AU - Gossen, Manfred AU - Volk, Hans-Dieter AU - Drzeniek, Michael Norman PY - 2024/9/5 TI - Practical Applications of Large Language Models for Health Care Professionals and Scientists JO - JMIR Med Inform SP - e58478 VL - 12 KW - artificial intelligence KW - healthcare KW - chatGPT KW - large language model KW - prompting KW - LLM KW - applications KW - AI KW - scientists KW - physicians KW - health care UR - https://medinform.jmir.org/2024/1/e58478 UR - http://dx.doi.org/10.2196/58478 ID - info:doi/10.2196/58478 ER - TY - JOUR AU - Xu, Tianhui AU - Weng, Huiting AU - Liu, Fang AU - Yang, Li AU - Luo, Yuanyuan AU - Ding, Ziwei AU - Wang, Qin PY - 2024/8/28 TI - Current Status of ChatGPT Use in Medical Education: Potentials, Challenges, and Strategies JO - J Med Internet Res SP - e57896 VL - 26 KW - chat generative pretrained transformer KW - ChatGPT KW - artificial intelligence KW - medical education KW - natural language processing KW - clinical practice UR - https://www.jmir.org/2024/1/e57896 UR - http://dx.doi.org/10.2196/57896 UR - http://www.ncbi.nlm.nih.gov/pubmed/39196640 ID - info:doi/10.2196/57896 ER - TY - JOUR AU - Thomae, V. Anita AU - Witt, M. Claudia AU - Barth, Jürgen PY - 2024/8/22 TI - Integration of ChatGPT Into a Course for Medical Students: Explorative Study on Teaching Scenarios, Students' Perception, and Applications JO - JMIR Med Educ SP - e50545 VL - 10 KW - medical education KW - ChatGPT KW - artificial intelligence KW - information for patients KW - critical appraisal KW - evaluation KW - blended learning KW - AI KW - digital skills KW - teaching N2 - Background: Text-generating artificial intelligence (AI) such as ChatGPT offers many opportunities and challenges in medical education. Acquiring practical skills necessary for using AI in a clinical context is crucial, especially for medical education. Objective: This explorative study aimed to investigate the feasibility of integrating ChatGPT into teaching units and to evaluate the course and the importance of AI-related competencies for medical students. Since a possible application of ChatGPT in the medical field could be the generation of information for patients, we further investigated how such information is perceived by students in terms of persuasiveness and quality. Methods: ChatGPT was integrated into 3 different teaching units of a blended learning course for medical students. Using a mixed methods approach, quantitative and qualitative data were collected. As baseline data, we assessed students' characteristics, including their openness to digital innovation. The students evaluated the integration of ChatGPT into the course and shared their thoughts regarding the future of text-generating AI in medical education. The course was evaluated based on the Kirkpatrick Model, with satisfaction, learning progress, and applicable knowledge considered as key assessment levels.
In ChatGPT-integrating teaching units, students evaluated videos featuring information for patients regarding their persuasiveness on treatment expectations in a self-experience experiment and critically reviewed information for patients written using ChatGPT 3.5 based on different prompts. Results: A total of 52 medical students participated in the study. The comprehensive evaluation of the course revealed elevated levels of satisfaction, learning progress, and applicability specifically in relation to the ChatGPT-integrating teaching units. Furthermore, all evaluation levels demonstrated an association with each other. Higher openness to digital innovation was associated with higher satisfaction and, to a lesser extent, with higher applicability. AI-related competencies in other courses of the medical curriculum were perceived as highly important by medical students. Qualitative analysis highlighted potential use cases of ChatGPT in teaching and learning. In ChatGPT-integrating teaching units, students rated information for patients generated using a basic ChatGPT prompt as "moderate" in terms of comprehensibility, patient safety, and the correct application of communication rules taught during the course. The students' ratings were considerably improved using an extended prompt. The same text, however, showed the smallest increase in treatment expectations when compared with information provided by humans (patient, clinician, and expert) via videos. Conclusions: This study offers valuable insights into integrating the development of AI competencies into a blended learning course. Integration of ChatGPT enhanced learning experiences for medical students. UR - https://mededu.jmir.org/2024/1/e50545 UR - http://dx.doi.org/10.2196/50545 ID - info:doi/10.2196/50545 ER - TY - JOUR AU - Holderried, Friederike AU - Stegemann-Philipps, Christian AU - Herrmann-Werner, Anne AU - Festl-Wietek, Teresa AU - Holderried, Martin AU - Eickhoff, Carsten AU - Mahling, Moritz PY - 2024/8/16 TI - A Language Model-Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study JO - JMIR Med Educ SP - e59213 VL - 10 KW - virtual patients communication KW - communication skills KW - technology enhanced education KW - TEL KW - medical education KW - ChatGPT KW - GPT: LLM KW - LLMs KW - NLP KW - natural language processing KW - machine learning KW - artificial intelligence KW - language model KW - language models KW - communication KW - relationship KW - relationships KW - chatbot KW - chatbots KW - conversational agent KW - conversational agents KW - history KW - histories KW - simulated KW - student KW - students KW - interaction KW - interactions N2 - Background: Although history taking is fundamental for diagnosing medical conditions, teaching and providing feedback on the skill can be challenging due to resource constraints. Virtual simulated patients and web-based chatbots have thus emerged as educational tools, with recent advancements in artificial intelligence (AI) such as large language models (LLMs) enhancing their realism and potential to provide feedback. Objective: In our study, we aimed to evaluate the effectiveness of a Generative Pretrained Transformer (GPT) 4 model to provide structured feedback on medical students' performance in history taking with a simulated patient. Methods: We conducted a prospective study involving medical students performing history taking with a GPT-powered chatbot. To that end, we designed a chatbot to simulate patients'
responses and provide immediate feedback on the comprehensiveness of the students' history taking. Students' interactions with the chatbot were analyzed, and feedback from the chatbot was compared with feedback from a human rater. We measured interrater reliability and performed a descriptive analysis to assess the quality of feedback. Results: Most of the study's participants were in their third year of medical school. A total of 1894 question-answer pairs from 106 conversations were included in our analysis. GPT-4's role-play and responses were medically plausible in more than 99% of cases. Interrater reliability between GPT-4 and the human rater showed "almost perfect" agreement (Cohen κ=0.832). Less agreement (κ<0.6), detected for 8 out of 45 feedback categories, highlighted topics about which the model's assessments were overly specific or diverged from human judgement. Conclusions: The GPT model was effective in providing structured feedback on history-taking dialogs provided by medical students. Although we unraveled some limitations regarding the specificity of feedback for certain feedback categories, the overall high agreement with human raters suggests that LLMs can be a valuable tool for medical education. Our findings, thus, advocate the careful integration of AI-driven feedback mechanisms in medical training and highlight important aspects when LLMs are used in that context. UR - https://mededu.jmir.org/2024/1/e59213 UR - http://dx.doi.org/10.2196/59213 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/59213 ER - TY - JOUR AU - Ming, Shuai AU - Guo, Qingge AU - Cheng, Wenjun AU - Lei, Bo PY - 2024/8/13 TI - Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study JO - JMIR Med Educ SP - e52784 VL - 10 KW - ChatGPT KW - Chinese National Medical Licensing Examination KW - large language models KW - medical education KW - system role KW - LLM KW - LLMs KW - language model KW - language models KW - artificial intelligence KW - chatbot KW - chatbots KW - conversational agent KW - conversational agents KW - exam KW - exams KW - examination KW - examinations KW - OpenAI KW - answer KW - answers KW - response KW - responses KW - accuracy KW - performance KW - China KW - Chinese N2 - Background: With the increasing application of large language models like ChatGPT in various industries, its potential in the medical domain, especially in standardized examinations, has become a focal point of research. Objective: The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE). Methods: The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the version of GPT-3.5 and 4.0, the prompt's designation of system roles tailored to medical subspecialties, and repetition for coherence. A passing accuracy threshold was established as 60%. The χ2 tests and κ values were employed to evaluate the model's accuracy and consistency. Results: GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001).
However, both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%-3.7%) and GPT-3.5 (1.3%-4.5%), and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response. Conclusions: GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role did not significantly enhance the model's reliability and answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study. UR - https://mededu.jmir.org/2024/1/e52784 UR - http://dx.doi.org/10.2196/52784 ID - info:doi/10.2196/52784 ER - TY - JOUR AU - Cherrez-Ojeda, Ivan AU - Gallardo-Bastidas, C. Juan AU - Robles-Velasco, Karla AU - Osorio, F. María AU - Velez Leon, Maria Eleonor AU - Leon Velastegui, Manuel AU - Pauletto, Patrícia AU - Aguilar-Díaz, C. F. AU - Squassi, Aldo AU - González Eras, Patricia Susana AU - Cordero Carrasco, Erita AU - Chavez Gonzalez, Leonor Karol AU - Calderon, C. Juan AU - Bousquet, Jean AU - Bedbrook, Anna AU - Faytong-Haro, Marco PY - 2024/8/13 TI - Understanding Health Care Students' Perceptions, Beliefs, and Attitudes Toward AI-Powered Language Models: Cross-Sectional Study JO - JMIR Med Educ SP - e51757 VL - 10 KW - artificial intelligence KW - ChatGPT KW - education KW - health care KW - students N2 - Background: ChatGPT was not intended for use in health care, but it has potential benefits that depend on end-user understanding and acceptability, which is where health care students become crucial. There is still a limited amount of research in this area. Objective: The primary aim of our study was to assess the frequency of ChatGPT use, the perceived level of knowledge, the perceived risks associated with its use, and the ethical issues, as well as attitudes toward the use of ChatGPT in the context of education in the field of health. In addition, we aimed to examine whether there were differences across groups based on demographic variables. The second part of the study aimed to assess the association between the frequency of use, the level of perceived knowledge, the level of risk perception, and the level of perception of ethics as predictive factors for participants' attitudes toward the use of ChatGPT. Methods: A cross-sectional survey was conducted from May to June 2023 encompassing students of medicine, nursing, dentistry, nutrition, and laboratory science across the Americas. The study used descriptive analysis, chi-square tests, and ANOVA to assess statistical significance across different categories. The study used several ordinal logistic regression models to analyze the impact of predictive factors (frequency of use, perception of knowledge, perception of risk, and ethics perception scores) on attitude as the dependent variable. The models were adjusted for gender, institution type, major, and country. Stata was used to conduct all the analyses. Results: Of 2661 health care students, 42.99% (n=1144) were unaware of ChatGPT. The median score of knowledge was "minimal" (median 2.00, IQR 1.00-3.00). Most respondents (median 2.61, IQR 2.11-3.11) regarded ChatGPT as neither ethical nor unethical.
Most participants (median 3.89, IQR 3.44-4.34) "somewhat agreed" that ChatGPT (1) benefits health care settings, (2) provides trustworthy data, (3) is a helpful tool for clinical and educational medical information access, and (4) makes the work easier. In total, 70% (7/10) of people used it for homework. As perceived knowledge of ChatGPT increased, attitudes toward ChatGPT tended to become more favorable. Higher ethical consideration perception ratings increased the likelihood of considering ChatGPT as a source of trustworthy health care information (odds ratio [OR] 1.620, 95% CI 1.498-1.752), beneficial in medical issues (OR 1.495, 95% CI 1.452-1.539), and useful for medical literature (OR 1.494, 95% CI 1.426-1.564; P<.001 for all results). Conclusions: Over 40% of American health care students (1144/2661, 42.99%) were unaware of ChatGPT despite its extensive use in the health field. Our data revealed positive attitudes toward ChatGPT and the desire to learn more about it. Medical educators must explore how chatbots may be included in undergraduate health care education programs. UR - https://mededu.jmir.org/2024/1/e51757 UR - http://dx.doi.org/10.2196/51757 UR - http://www.ncbi.nlm.nih.gov/pubmed/39137029 ID - info:doi/10.2196/51757 ER - TY - JOUR AU - Takahashi, Hiromizu AU - Shikino, Kiyoshi AU - Kondo, Takeshi AU - Komori, Akira AU - Yamada, Yuji AU - Saita, Mizue AU - Naito, Toshio PY - 2024/8/13 TI - Educational Utility of Clinical Vignettes Generated in Japanese by ChatGPT-4: Mixed Methods Study JO - JMIR Med Educ SP - e59133 VL - 10 KW - generative AI KW - ChatGPT-4 KW - medical case generation KW - medical education KW - clinical vignettes KW - AI KW - artificial intelligence KW - Japanese KW - Japan N2 - Background: Evaluating the accuracy and educational utility of artificial intelligence-generated medical cases, especially those produced by large language models such as ChatGPT-4 (developed by OpenAI), is crucial yet underexplored. Objective: This study aimed to assess the educational utility of ChatGPT-4-generated clinical vignettes and their applicability in educational settings. Methods: Using a convergent mixed methods design, a web-based survey was conducted from January 8 to 28, 2024, to evaluate 18 medical cases generated by ChatGPT-4 in Japanese. In the survey, 6 main question items were used to evaluate the quality of the generated clinical vignettes and their educational utility, namely information quality, information accuracy, educational usefulness, clinical match, terminology accuracy (TA), and diagnosis difficulty. Feedback was solicited from physicians specializing in general internal medicine or general medicine and experienced in medical education. Chi-square and Mann-Whitney U tests were performed to identify differences among cases, and linear regression was used to examine trends associated with physicians' experience. Thematic analysis of qualitative feedback was performed to identify areas for improvement and confirm the educational utility of the cases. Results: Of the 73 invited participants, 71 (97%) responded. The respondents, primarily male (64/71, 90%), spanned a broad range of practice years (from 1976 to 2017) and represented diverse hospital sizes throughout Japan. The majority deemed the information quality (mean 0.77, 95% CI 0.75-0.79) and information accuracy (mean 0.68, 95% CI 0.65-0.71) to be satisfactory, with these responses being based on binary data.
The average scores assigned were 3.55 (95% CI 3.49-3.60) for educational usefulness, 3.70 (95% CI 3.65-3.75) for clinical match, 3.49 (95% CI 3.44-3.55) for TA, and 2.34 (95% CI 2.28-2.40) for diagnosis difficulty, based on a 5-point Likert scale. Statistical analysis showed significant variability in content quality and relevance across the cases (P<.001 after Bonferroni correction). Participants suggested improvements in generating physical findings, using natural language, and enhancing medical TA. The thematic analysis highlighted the need for clearer documentation, clinical information consistency, content relevance, and patient-centered case presentations. Conclusions: ChatGPT-4-generated medical cases written in Japanese possess considerable potential as resources in medical education, with recognized adequacy in quality and accuracy. Nevertheless, there is a notable need for enhancements in the precision and realism of case details. This study emphasizes ChatGPT-4's value as an adjunctive educational tool in the medical field, requiring expert oversight for optimal application. UR - https://mededu.jmir.org/2024/1/e59133 UR - http://dx.doi.org/10.2196/59133 UR - http://www.ncbi.nlm.nih.gov/pubmed/39137031 ID - info:doi/10.2196/59133 ER - TY - JOUR AU - Zhui, Li AU - Fenghe, Li AU - Xuehu, Wang AU - Qining, Fu AU - Wei, Ren PY - 2024/8/1 TI - Ethical Considerations and Fundamental Principles of Large Language Models in Medical Education: Viewpoint JO - J Med Internet Res SP - e60083 VL - 26 KW - medical education KW - artificial intelligence KW - large language models KW - medical ethics KW - AI KW - LLMs KW - ethics KW - academic integrity KW - privacy and data risks KW - data security KW - data protection KW - intellectual property rights KW - educational research UR - https://www.jmir.org/2024/1/e60083 UR - http://dx.doi.org/10.2196/60083 UR - http://www.ncbi.nlm.nih.gov/pubmed/38971715 ID - info:doi/10.2196/60083 ER - TY - JOUR AU - Burke, B. Harry AU - Hoang, Albert AU - Lopreiato, O. Joseph AU - King, Heidi AU - Hemmer, Paul AU - Montgomery, Michael AU - Gagarin, Viktoria PY - 2024/7/25 TI - Assessing the Ability of a Large Language Model to Score Free-Text Medical Student Clinical Notes: Quantitative Study JO - JMIR Med Educ SP - e56342 VL - 10 KW - medical education KW - generative artificial intelligence KW - natural language processing KW - ChatGPT KW - generative pretrained transformer KW - standardized patients KW - clinical notes KW - free-text notes KW - history and physical examination KW - large language model KW - LLM KW - medical student KW - medical students KW - clinical information KW - artificial intelligence KW - AI KW - patients KW - patient KW - medicine N2 - Background: Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback regarding the quality of their free-text clinical notes. Objective: The goal of this study was to assess the ability of ChatGPT 3.5, a large language model, to score medical students' free-text history and physical notes. Methods: This is a single-institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free-text history and physical note of their interaction. The students'
notes were scored independently by the standardized patients and ChatGPT using a prespecified scoring rubric that consisted of 85 case elements. The measure of accuracy was percent correct. Results: The study population consisted of 168 first-year medical students. There was a total of 14,280 scores. The ChatGPT incorrect scoring rate was 1.0%, and the standardized patient incorrect scoring rate was 7.2%. The ChatGPT error rate was 86% lower than the standardized patient error rate. The ChatGPT mean incorrect scoring rate of 12 (SD 11) was significantly lower than the standardized patient mean incorrect scoring rate of 85 (SD 74; P=.002). Conclusions: ChatGPT demonstrated a significantly lower error rate compared to standardized patients. This is the first study to assess the ability of a generative pretrained transformer (GPT) program to score medical students' standardized patient-based free-text clinical notes. It is expected that, in the near future, large language models will provide real-time feedback to practicing physicians regarding their free-text notes. GPT artificial intelligence programs represent an important advance in medical education and medical practice. UR - https://mededu.jmir.org/2024/1/e56342 UR - http://dx.doi.org/10.2196/56342 ID - info:doi/10.2196/56342 ER - TY - JOUR AU - Kamel Boulos, N. Maged AU - Dellavalle, Robert PY - 2024/7/24 TI - NVIDIA's "Chat with RTX" Custom Large Language Model and Personalized AI Chatbot Augments the Value of Electronic Dermatology Reference Material JO - JMIR Dermatol SP - e58396 VL - 7 KW - AI chatbots KW - artificial intelligence KW - AI KW - generative AI KW - large language models KW - dermatology KW - education KW - self-study KW - NVIDIA RTX KW - retrieval-augmented generation KW - RAG UR - https://derma.jmir.org/2024/1/e58396 UR - http://dx.doi.org/10.2196/58396 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/58396 ER - TY - JOUR AU - Cherif, Hela AU - Moussa, Chirine AU - Missaoui, Mouhaymen Abdel AU - Salouage, Issam AU - Mokaddem, Salma AU - Dhahri, Besma PY - 2024/7/23 TI - Appraisal of ChatGPT's Aptitude for Medical Education: Comparative Analysis With Third-Year Medical Students in a Pulmonology Examination JO - JMIR Med Educ SP - e52818 VL - 10 KW - medical education KW - ChatGPT KW - GPT KW - artificial intelligence KW - natural language processing KW - NLP KW - pulmonary medicine KW - pulmonary KW - lung KW - lungs KW - respiratory KW - respiration KW - pneumology KW - comparative analysis KW - large language models KW - LLMs KW - LLM KW - language model KW - generative AI KW - generative artificial intelligence KW - generative KW - exams KW - exam KW - examinations KW - examination N2 - Background: The rapid evolution of ChatGPT has generated substantial interest and led to extensive discussions in both public and academic domains, particularly in the context of medical education. Objective: This study aimed to evaluate ChatGPT's performance in a pulmonology examination through a comparative analysis with that of third-year medical students. Methods: In this cross-sectional study, we conducted a comparative analysis with 2 distinct groups. The first group comprised 244 third-year medical students who had previously taken our institution's 2020 pulmonology examination, which was conducted in French. The second group involved ChatGPT-3.5 in 2 separate sets of conversations: without contextualization (V1) and with contextualization (V2).
In both V1 and V2, ChatGPT received the same set of questions administered to the students. Results: V1 demonstrated exceptional proficiency in radiology, microbiology, and thoracic surgery, surpassing the majority of medical students in these domains. However, it faced challenges in pathology, pharmacology, and clinical pneumology. In contrast, V2 consistently delivered more accurate responses across various question categories, regardless of the specialization. ChatGPT exhibited suboptimal performance in multiple-choice questions compared to medical students. V2 excelled in responding to structured open-ended questions. Both ChatGPT conversations, particularly V2, outperformed students in addressing questions of low and intermediate difficulty. Interestingly, students showcased enhanced proficiency when confronted with highly challenging questions. V1 fell short of passing the examination. Conversely, V2 passed the examination, outperforming 139 (62.1%) medical students. Conclusions: While ChatGPT has access to a comprehensive web-based data set, its performance closely mirrors that of an average medical student. Outcomes are influenced by question format, item complexity, and contextual nuances. The model faces challenges in medical contexts requiring information synthesis, advanced analytical aptitude, and clinical judgment, as well as in non-English language assessments and when confronted with data outside mainstream internet sources. UR - https://mededu.jmir.org/2024/1/e52818 UR - http://dx.doi.org/10.2196/52818 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/52818 ER - TY - JOUR AU - Skryd, Anthony AU - Lawrence, Katharine PY - 2024/5/8 TI - ChatGPT as a Tool for Medical Education and Clinical Decision-Making on the Wards: Case Study JO - JMIR Form Res SP - e51346 VL - 8 KW - ChatGPT KW - medical education KW - large language models KW - LLMs KW - clinical decision-making N2 - Background: Large language models (LLMs) are computational artificial intelligence systems with advanced natural language processing capabilities that have recently been popularized among health care students and educators due to their ability to provide real-time access to a vast amount of medical knowledge. The adoption of LLM technology into medical education and training has varied, and little empirical evidence exists to support its use in clinical teaching environments. Objective: The aim of the study is to identify and qualitatively evaluate potential use cases and limitations of LLM technology for real-time ward-based educational contexts. Methods: A brief, single-site exploratory evaluation of the publicly available ChatGPT-3.5 (OpenAI) was conducted by implementing the tool into the daily attending rounds of a general internal medicine inpatient service at a large urban academic medical center. ChatGPT was integrated into rounds via both structured and organic use, using the web-based "chatbot"-style interface to interact with the LLM through conversational free-text and discrete queries. A qualitative approach using phenomenological inquiry was used to identify key insights related to the use of ChatGPT through analysis of ChatGPT conversation logs and associated shorthand notes from the clinical sessions.
Results: Identified use cases for ChatGPT integration included addressing medical knowledge gaps through discrete medical knowledge inquiries, building differential diagnoses and engaging dual-process thinking, challenging medical axioms, using cognitive aids to support acute care decision-making, and improving complex care management by facilitating conversations with subspecialties. Potential additional uses included engaging in difficult conversations with patients, exploring ethical challenges and general medical ethics teaching, personal continuing medical education resources, developing ward-based teaching tools, supporting and automating clinical documentation, and supporting productivity and task management. LLM biases, misinformation, ethics, and health equity were identified as areas of concern and potential limitations to clinical and training use. A code of conduct on ethical and appropriate use was also developed to guide team usage on the wards. Conclusions: Overall, ChatGPT offers a novel tool to enhance ward-based learning through rapid information querying, second-order content exploration, and engaged team discussion regarding generated responses. More research is needed to fully understand contexts for educational use, particularly regarding the risks and limitations of the tool in clinical settings and its impacts on trainee development. UR - https://formative.jmir.org/2024/1/e51346 UR - http://dx.doi.org/10.2196/51346 UR - http://www.ncbi.nlm.nih.gov/pubmed/38717811 ID - info:doi/10.2196/51346 ER - TY - JOUR AU - Rojas, Marcos AU - Rojas, Marcelo AU - Burgess, Valentina AU - Toro-Pérez, Javier AU - Salehi, Shima PY - 2024/4/29 TI - Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study JO - JMIR Med Educ SP - e55048 VL - 10 KW - artificial intelligence KW - AI KW - generative artificial intelligence KW - medical education KW - ChatGPT KW - EUNACOM KW - medical licensure KW - medical license KW - medical licensing exam N2 - Background: The deployment of OpenAI's ChatGPT-3.5 and its subsequent versions, ChatGPT-4 and ChatGPT-4 With Vision (4V; also known as "GPT-4 Turbo With Vision"), has notably influenced the medical field. Having demonstrated remarkable performance in medical examinations globally, these models show potential for educational applications. However, their effectiveness in non-English contexts, particularly in Chile's medical licensing examinations (a critical step for medical practitioners in Chile), is less explored. This gap highlights the need to evaluate ChatGPT's adaptability to diverse linguistic and cultural contexts. Objective: This study aims to evaluate the performance of ChatGPT versions 3.5, 4, and 4V in the EUNACOM (Examen Único Nacional de Conocimientos de Medicina), a major medical examination in Chile. Methods: Three official practice drills (540 questions) from the University of Chile, mirroring the EUNACOM's structure and difficulty, were used to test ChatGPT versions 3.5, 4, and 4V. The 3 ChatGPT versions were provided 3 attempts for each drill. Responses to questions during each attempt were systematically categorized and analyzed to assess their accuracy rate. Results: All versions of ChatGPT passed the EUNACOM drills. Specifically, versions 4 and 4V outperformed version 3.5, achieving average accuracy rates of 79.32% and 78.83%, respectively, compared to 57.53% for version 3.5 (P<.001).
Version 4V, however, did not outperform version 4 (P=.73), despite the additional visual capabilities. We also evaluated ChatGPT's performance in different medical areas of the EUNACOM and found that versions 4 and 4V consistently outperformed version 3.5. Across the different medical areas, version 3.5 displayed the highest accuracy in psychiatry (69.84%), while versions 4 and 4V achieved the highest accuracy in surgery (90.00% and 86.11%, respectively). Versions 3.5 and 4 had the lowest performance in internal medicine (52.74% and 75.62%, respectively), while version 4V had the lowest performance in public health (74.07%). Conclusions: This study reveals ChatGPT's ability to pass the EUNACOM, with distinct proficiencies across versions 3.5, 4, and 4V. Notably, advancements in artificial intelligence (AI) have not led to significant enhancements in performance on image-based questions. The variations in proficiency across medical fields suggest the need for more nuanced AI training. Additionally, the study underscores the importance of exploring innovative approaches to using AI to augment human cognition and enhance the learning process. Such advancements have the potential to significantly influence medical education, fostering not only knowledge acquisition but also the development of critical thinking and problem-solving skills among health care professionals. UR - https://mededu.jmir.org/2024/1/e55048 UR - http://dx.doi.org/10.2196/55048 ID - info:doi/10.2196/55048 ER - TY - JOUR AU - Noda, Masao AU - Ueno, Takayoshi AU - Koshu, Ryota AU - Takaso, Yuji AU - Shimada, Dias Mari AU - Saito, Chizu AU - Sugimoto, Hisashi AU - Fushiki, Hiroaki AU - Ito, Makoto AU - Nomura, Akihiro AU - Yoshizaki, Tomokazu PY - 2024/3/28 TI - Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study JO - JMIR Med Educ SP - e57054 VL - 10 KW - artificial intelligence KW - GPT-4v KW - large language model KW - otolaryngology KW - GPT KW - ChatGPT KW - LLM KW - LLMs KW - language model KW - language models KW - head KW - respiratory KW - ENT: ear KW - nose KW - throat KW - neck KW - NLP KW - natural language processing KW - image KW - images KW - exam KW - exams KW - examination KW - examinations KW - answer KW - answers KW - answering KW - response KW - responses N2 - Background: Artificial intelligence models can learn from medical literature and clinical cases and generate answers that rival those of human experts. However, challenges remain in the analysis of complex data containing images and diagrams. Objective: This study aims to assess the answering capabilities and accuracy of ChatGPT-4 Vision (GPT-4V) for a set of 100 questions, including image-based questions, from the 2023 otolaryngology board certification examination. Methods: Answers to 100 questions from the 2023 otolaryngology board certification examination, including image-based questions, were generated using GPT-4V. The accuracy rate was evaluated using different prompts, and the presence of images, clinical area of the questions, and variations in the answer content were examined. Results: The accuracy rate for text-only input was, on average, 24.7% but improved to 47.3% with the addition of English translation and prompts (P<.001). The average nonresponse rate for text-only input was 46.3%; this decreased to 2.7% with the addition of English translation and prompts (P<.001).
The accuracy rate was lower for image-based questions than for text-only questions across all types of input, with a relatively high nonresponse rate. General questions and questions from the fields of head and neck allergies and nasal allergies had relatively high accuracy rates, which increased with the addition of translation and prompts. In terms of content, questions related to anatomy had the highest accuracy rate. For all content types, the addition of translation and prompts increased the accuracy rate. For image-based questions, the average correct answer rate was 30.4% with text-only input and 41.3% with text-plus-image input (P=.02). Conclusions: Examination of artificial intelligence's answering capabilities for the otolaryngology board certification examination improves our understanding of its potential and limitations in this field. Although improvement was noted with the addition of translation and prompts, the accuracy rate for image-based questions was lower than that for text-based questions, suggesting room for improvement in GPT-4V at this stage. Furthermore, text-plus-image input yielded a higher rate of correct answers on image-based questions. Our findings imply the usefulness and potential of GPT-4V in medicine; however, future consideration of safe use methods is needed. UR - https://mededu.jmir.org/2024/1/e57054 UR - http://dx.doi.org/10.2196/57054 UR - http://www.ncbi.nlm.nih.gov/pubmed/38546736 ID - info:doi/10.2196/57054 ER - TY - JOUR AU - Gandhi, P. Aravind AU - Joesph, Karen Felista AU - Rajagopal, Vineeth AU - Aparnavi, P. AU - Katkuri, Sushma AU - Dayama, Sonal AU - Satapathy, Prakasini AU - Khatib, Nazli Mahalaqua AU - Gaidhane, Shilpa AU - Zahiruddin, Syed Quazi AU - Behera, Ashish PY - 2024/3/25 TI - Performance of ChatGPT on the India Undergraduate Community Medicine Examination: Cross-Sectional Study JO - JMIR Form Res SP - e49964 VL - 8 KW - artificial intelligence KW - ChatGPT KW - community medicine KW - India KW - large language model KW - medical education KW - digitalization N2 - Background: Medical students may increasingly use large language models (LLMs) in their learning. ChatGPT is an LLM at the forefront of this new development in medical education with the capacity to respond to multidisciplinary questions. Objective: The aim of this study was to evaluate the ability of ChatGPT 3.5 to complete the Indian undergraduate medical examination in the subject of community medicine. We further compared ChatGPT scores with the scores obtained by the students. Methods: The study was conducted at a publicly funded medical college in Hyderabad, India. The study was based on the internal assessment examination conducted in January 2023 for students in the Bachelor of Medicine and Bachelor of Surgery Final Year-Part I program; the examination of focus included 40 questions (divided between two papers) from the community medicine subject syllabus. Each paper had three sections with different weightage of marks for each section: section one had two long essay-type questions worth 15 marks each, section two had 8 short essay-type questions worth 5 marks each, and section three had 10 short-answer questions worth 3 marks each. The same questions were administered as prompts to ChatGPT 3.5 and the responses were recorded. Apart from scoring ChatGPT responses, two independent evaluators explored the responses to each question to further analyze their quality with regard to three subdomains: relevancy, coherence, and completeness.
Each question was scored in these subdomains on a Likert scale of 1-5. The average of the two evaluators was taken as the subdomain score of the question. The proportion of questions with a score of at least 50% of the maximum score (5) in each subdomain was calculated. Results: ChatGPT 3.5 scored 72.3% on paper 1 and 61% on paper 2. The mean score of the 94 students was 43% on paper 1 and 45% on paper 2. The responses of ChatGPT 3.5 were also rated to be satisfactorily relevant, coherent, and complete for most of the questions (>80%). Conclusions: ChatGPT 3.5 appears to have substantial and sufficient knowledge to understand and answer the Indian medical undergraduate examination in the subject of community medicine. ChatGPT may be introduced to students to enable the self-directed learning of community medicine in pilot mode. However, faculty oversight will be required as ChatGPT is still in the initial stages of development, and thus its potential and the reliability of its medical content in the Indian context need to be explored further and comprehensively. UR - https://formative.jmir.org/2024/1/e49964 UR - http://dx.doi.org/10.2196/49964 UR - http://www.ncbi.nlm.nih.gov/pubmed/38526538 ID - info:doi/10.2196/49964 ER - TY - JOUR AU - Magalhães Araujo, Sabrina AU - Cruz-Correia, Ricardo PY - 2024/3/20 TI - Incorporating ChatGPT in Medical Informatics Education: Mixed Methods Study on Student Perceptions and Experiential Integration Proposals JO - JMIR Med Educ SP - e51151 VL - 10 KW - education KW - medical informatics KW - artificial intelligence KW - AI KW - generative language model KW - ChatGPT N2 - Background: The integration of artificial intelligence (AI) technologies, such as ChatGPT, in the educational landscape has the potential to enhance the learning experience of medical informatics students and prepare them for using AI in professional settings. The incorporation of AI in classes aims to develop critical thinking by encouraging students to interact with ChatGPT and critically analyze the responses generated by the chatbot. This approach also helps students develop important skills in the field of biomedical and health informatics to enhance their interaction with AI tools. Objective: The aim of the study is to explore the perceptions of students regarding the use of ChatGPT as a learning tool in their educational context and provide professors with examples of prompts for incorporating ChatGPT into their teaching and learning activities, thereby enhancing the educational experience for students in medical informatics courses. Methods: This study used a mixed methods approach to gain insights from students regarding the use of ChatGPT in education. To accomplish this, a structured questionnaire was applied to evaluate students' familiarity with ChatGPT, gauge their perceptions of its use, and understand their attitudes toward its use in academic and learning tasks. Learning outcomes of 2 courses were analyzed to propose ChatGPT's incorporation in master's programs in medicine and medical informatics. Results: The majority of students expressed satisfaction with the use of ChatGPT in education, finding it beneficial for various purposes, including generating academic content, brainstorming ideas, and rewriting text. While some participants raised concerns about potential biases and the need for informed use, the overall perception was positive. Additionally, the study proposed integrating ChatGPT into 2 specific courses in the master's programs in medicine and medical informatics.
The incorporation of ChatGPT was envisioned to enhance student learning experiences and assist in project planning, programming code generation, examination preparation, workflow exploration, and technical interview preparation, thus advancing medical informatics education. In medical teaching, it will be used as an assistant for simplifying the explanation of concepts and solving complex problems, as well as for generating clinical narratives and patient simulators. Conclusions: The study's valuable insights into medical faculty students' perspectives and integration proposals for ChatGPT serve as an informative guide for professors aiming to enhance medical informatics education. The research delves into the potential of ChatGPT, emphasizes the necessity of collaboration in academic environments, identifies subject areas with discernible benefits, and underscores its transformative role in fostering innovative and engaging learning experiences. The envisaged proposals hold promise in empowering future health care professionals to work in the rapidly evolving era of digital health care. UR - https://mededu.jmir.org/2024/1/e51151 UR - http://dx.doi.org/10.2196/51151 UR - http://www.ncbi.nlm.nih.gov/pubmed/38506920 ID - info:doi/10.2196/51151 ER - TY - JOUR AU - Nakao, Takahiro AU - Miki, Soichiro AU - Nakamura, Yuta AU - Kikuchi, Tomohiro AU - Nomura, Yukihiro AU - Hanaoka, Shouhei AU - Yoshikawa, Takeharu AU - Abe, Osamu PY - 2024/3/12 TI - Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study JO - JMIR Med Educ SP - e54393 VL - 10 KW - AI KW - artificial intelligence KW - LLM KW - large language model KW - language model KW - language models KW - ChatGPT KW - GPT-4 KW - GPT-4V KW - generative pretrained transformer KW - image KW - images KW - imaging KW - response KW - responses KW - exam KW - examination KW - exams KW - examinations KW - answer KW - answers KW - NLP KW - natural language processing KW - chatbot KW - chatbots KW - conversational agent KW - conversational agents KW - medical education N2 - Background: Previous research applying large language models (LLMs) to medicine was focused on text-based information. Recently, multimodal variants of LLMs acquired the capability of recognizing images. Objective: We aim to evaluate the image recognition capability of generative pretrained transformer (GPT)-4V, a recent multimodal LLM developed by OpenAI, in the medical field by testing how visual information affects its performance in answering questions from the 117th Japanese National Medical Licensing Examination. Methods: We focused on 108 questions that had 1 or more images as part of a question and presented GPT-4V with the same questions under two conditions: (1) with both the question text and associated images and (2) with the question text only. We then compared the difference in accuracy between the 2 conditions using the exact McNemar test. Results: Among the 108 questions with images, GPT-4V's accuracy was 68% (73/108) when presented with images and 72% (78/108) when presented without images (P=.36). For the 2 question categories, clinical and general, the accuracies with and those without images were 71% (70/98) versus 78% (76/98; P=.21) and 30% (3/10) versus 20% (2/10; P≥.99), respectively. Conclusions: The additional information from the images did not significantly improve the performance of GPT-4V in the Japanese National Medical Licensing Examination.
UR - https://mededu.jmir.org/2024/1/e54393 UR - http://dx.doi.org/10.2196/54393 UR - http://www.ncbi.nlm.nih.gov/pubmed/38470459 ID - info:doi/10.2196/54393 ER - TY - JOUR AU - Willms, Amanda AU - Liu, Sam PY - 2024/2/29 TI - Exploring the Feasibility of Using ChatGPT to Create Just-in-Time Adaptive Physical Activity mHealth Intervention Content: Case Study JO - JMIR Med Educ SP - e51426 VL - 10 KW - ChatGPT KW - digital health KW - mobile health KW - mHealth KW - physical activity KW - application KW - mobile app KW - mobile apps KW - content creation KW - behavior change KW - app design N2 - Background: Achieving physical activity (PA) guidelines' recommendation of 150 minutes of moderate-to-vigorous PA per week has been shown to reduce the risk of many chronic conditions. Despite the overwhelming evidence in this field, PA levels remain low globally. By creating engaging mobile health (mHealth) interventions through strategies such as just-in-time adaptive interventions (JITAIs) that are tailored to an individual's dynamic state, there is potential to increase PA levels. However, generating personalized content can take a long time due to various versions of content required for the personalization algorithms. ChatGPT presents an incredible opportunity to rapidly produce tailored content; however, there is a lack of studies exploring its feasibility. Objective: This study aimed to (1) explore the feasibility of using ChatGPT to create content for a PA JITAI mobile app and (2) describe lessons learned and future recommendations for using ChatGPT in the development of mHealth JITAI content. Methods: During phase 1, we used Pathverse, a no-code app builder, and ChatGPT to develop a JITAI app to help parents support their child's PA levels. The intervention was developed based on the Multi-Process Action Control (M-PAC) framework, and the necessary behavior change techniques targeting the M-PAC constructs were implemented in the app design to help parents support their child's PA. The acceptability of using ChatGPT for this purpose was discussed to determine its feasibility. In phase 2, we summarized the lessons we learned during the JITAI content development process using ChatGPT and generated recommendations to inform future similar use cases. Results: In phase 1, by using specific prompts, we efficiently generated content for 13 lessons relating to increasing parental support for their child's PA following the M-PAC framework. It was determined that using ChatGPT for this case study to develop PA content for a JITAI was acceptable. In phase 2, we summarized our recommendations into the following six steps when using ChatGPT to create content for mHealth behavior interventions: (1) determine target behavior, (2) ground the intervention in behavior change theory, (3) design the intervention structure, (4) input intervention structure and behavior change constructs into ChatGPT, (5) revise the ChatGPT response, and (6) customize the response to be used in the intervention. Conclusions: ChatGPT offers a remarkable opportunity for rapid content creation in the context of an mHealth JITAI. Although our case study demonstrated that ChatGPT was acceptable, it is essential to approach its use, along with other language models, with caution. Before delivering content to population groups, expert review is crucial to ensure accuracy and relevancy. Future research and application of these guidelines are imperative as we deepen our understanding of ChatGPT and its interactions with human input.
UR - https://mededu.jmir.org/2024/1/e51426 UR - http://dx.doi.org/10.2196/51426 UR - http://www.ncbi.nlm.nih.gov/pubmed/38421689 ID - info:doi/10.2196/51426 ER - TY - JOUR AU - Chen, Chih-Wei AU - Walter, Paul AU - Wei, Cheng-Chung James PY - 2024/2/27 TI - Using ChatGPT-Like Solutions to Bridge the Communication Gap Between Patients With Rheumatoid Arthritis and Health Care Professionals JO - JMIR Med Educ SP - e48989 VL - 10 KW - rheumatoid arthritis KW - ChatGPT KW - artificial intelligence KW - communication gap KW - privacy KW - data management UR - https://mededu.jmir.org/2024/1/e48989 UR - http://dx.doi.org/10.2196/48989 UR - http://www.ncbi.nlm.nih.gov/pubmed/38412022 ID - info:doi/10.2196/48989 ER - TY - JOUR AU - Farhat, Faiza AU - Chaudhry, Moalla Beenish AU - Nadeem, Mohammad AU - Sohail, Saquib Shahab AU - Madsen, Øivind Dag PY - 2024/2/21 TI - Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard JO - JMIR Med Educ SP - e51523 VL - 10 KW - accuracy KW - AI model KW - artificial intelligence KW - Bard KW - ChatGPT KW - educational task KW - GPT-4 KW - Generative Pre-trained Transformers KW - large language models KW - medical education, medical exam KW - natural language processing KW - performance KW - premedical exams KW - suitability N2 - Background: Large language models (LLMs) have revolutionized natural language processing with their ability to generate human-like text through extensive training on large data sets. These models, including Generative Pre-trained Transformers (GPT)-3.5 (OpenAI), GPT-4 (OpenAI), and Bard (Google LLC), find applications beyond natural language processing, attracting interest from academia and industry. Students are actively leveraging LLMs to enhance learning experiences and prepare for high-stakes exams, such as the National Eligibility cum Entrance Test (NEET) in India. Objective: This comparative analysis aims to evaluate the performance of GPT-3.5, GPT-4, and Bard in answering NEET-2023 questions. Methods: In this paper, we evaluated the performance of the 3 mainstream LLMs, namely GPT-3.5, GPT-4, and Google Bard, in answering questions related to the NEET-2023 exam. The questions of the NEET were provided to these artificial intelligence models, and the responses were recorded and compared against the correct answers from the official answer key. Consensus was used to evaluate the performance of all 3 models. Results: It was evident that GPT-4 passed the entrance test with flying colors (300/700, 42.9%), showcasing exceptional performance. On the other hand, GPT-3.5 managed to meet the qualifying criteria, but with a substantially lower score (145/700, 20.7%). However, Bard (115/700, 16.4%) failed to meet the qualifying criteria and did not pass the test. GPT-4 demonstrated consistent superiority over Bard and GPT-3.5 in all 3 subjects. Specifically, GPT-4 achieved accuracy rates of 73% (29/40) in physics, 44% (16/36) in chemistry, and 51% (50/99) in biology. Conversely, GPT-3.5 attained an accuracy rate of 45% (18/40) in physics, 33% (13/26) in chemistry, and 34% (34/99) in biology. The accuracy consensus metric showed that the matching responses between GPT-4 and Bard, as well as GPT-4 and GPT-3.5, had higher incidences of being correct, at 0.56 and 0.57, respectively, compared to the matching responses between Bard and GPT-3.5, which stood at 0.42. When all 3 models were considered together, their matching responses reached the highest accuracy consensus of 0.59. 
Conclusions: The study's findings provide valuable insights into the performance of GPT-3.5, GPT-4, and Bard in answering NEET-2023 questions. GPT-4 emerged as the most accurate model, highlighting its potential for educational applications. Cross-checking responses across models may result in confusion as the compared models (as duos or a trio) tend to agree on only a little over half of the correct responses. Using GPT-4 as one of the compared models will result in a higher accuracy consensus. The results underscore the suitability of LLMs for high-stakes exams and their positive impact on education. Additionally, the study establishes a benchmark for evaluating and enhancing LLMs' performance in educational tasks, promoting responsible and informed use of these models in diverse learning environments. UR - https://mededu.jmir.org/2024/1/e51523 UR - http://dx.doi.org/10.2196/51523 UR - http://www.ncbi.nlm.nih.gov/pubmed/38381486 ID - info:doi/10.2196/51523 ER - TY - JOUR AU - Abdullahi, Tassallah AU - Singh, Ritambhara AU - Eickhoff, Carsten PY - 2024/2/13 TI - Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models JO - JMIR Med Educ SP - e51391 VL - 10 KW - clinical decision support KW - rare diseases KW - complex diseases KW - prompt engineering KW - reliability KW - consistency KW - natural language processing KW - language model KW - Bard KW - ChatGPT 3.5 KW - GPT-4 KW - MedAlpaca KW - medical education KW - complex diagnosis KW - artificial intelligence KW - AI assistance KW - medical training KW - prediction model N2 - Background: Patients with rare and complex diseases often experience delayed diagnoses and misdiagnoses because comprehensive knowledge about these diseases is limited to only a few medical experts. In this context, large language models (LLMs) have emerged as powerful knowledge aggregation tools with applications in clinical decision support and education domains. Objective: This study aims to explore the potential of 3 popular LLMs, namely Bard (Google LLC), ChatGPT-3.5 (OpenAI), and GPT-4 (OpenAI), in medical education to enhance the diagnosis of rare and complex diseases while investigating the impact of prompt engineering on their performance. Methods: We conducted experiments on publicly available complex and rare cases to achieve these objectives. We implemented various prompt strategies to evaluate the performance of these models using both open-ended and multiple-choice prompts. In addition, we used a majority voting strategy to leverage diverse reasoning paths within language models, aiming to enhance their reliability. Furthermore, we compared their performance with the performance of human respondents and MedAlpaca, a generative LLM specifically designed for medical tasks. Results: Notably, all LLMs outperformed the average human consensus and MedAlpaca, with a minimum margin of 5% and 13%, respectively, across all 30 cases from the diagnostic case challenge collection. On the frequently misdiagnosed cases category, Bard tied with MedAlpaca but surpassed the human average consensus by 14%, whereas GPT-4 and ChatGPT-3.5 outperformed MedAlpaca and the human respondents on the moderately often misdiagnosed cases category with minimum accuracy scores of 28% and 11%, respectively. The majority voting strategy, particularly with GPT-4, demonstrated the highest overall score across all cases from the diagnostic complex case collection, surpassing that of other LLMs.
On the Medical Information Mart for Intensive Care-III data sets, Bard and GPT-4 achieved the highest diagnostic accuracy scores, with multiple-choice prompts scoring 93%, whereas ChatGPT-3.5 and MedAlpaca scored 73% and 47%, respectively. Furthermore, our results demonstrate that there is no one-size-fits-all prompting approach for improving the performance of LLMs and that a single strategy does not universally apply to all LLMs. Conclusions: Our findings shed light on the diagnostic capabilities of LLMs and the challenges associated with identifying an optimal prompting strategy that aligns with each language model's characteristics and specific task requirements. The significance of prompt engineering is highlighted, providing valuable insights for researchers and practitioners who use these language models for medical training. Furthermore, this study represents a crucial step toward understanding how LLMs can enhance diagnostic reasoning in rare and complex medical cases, paving the way for developing effective educational tools and accurate diagnostic aids to improve patient care and outcomes. UR - https://mededu.jmir.org/2024/1/e51391 UR - http://dx.doi.org/10.2196/51391 UR - http://www.ncbi.nlm.nih.gov/pubmed/38349725 ID - info:doi/10.2196/51391 ER - TY - JOUR AU - Giunti, Guido AU - Doherty, P. Colin PY - 2024/2/12 TI - Cocreating an Automated mHealth Apps Systematic Review Process With Generative AI: Design Science Research Approach JO - JMIR Med Educ SP - e48949 VL - 10 KW - generative artificial intelligence KW - mHealth KW - ChatGPT KW - evidence-base KW - apps KW - qualitative study KW - design science research KW - eHealth KW - mobile device KW - AI KW - language model KW - mHealth intervention KW - generative AI KW - AI tool KW - software code KW - systematic review N2 - Background: The use of mobile devices for delivering health-related services (mobile health [mHealth]) has rapidly increased, leading to a demand for summarizing the state of the art and practice through systematic reviews. However, the systematic review process is resource intensive and time-consuming. Generative artificial intelligence (AI) has emerged as a potential solution to automate tedious tasks. Objective: This study aimed to explore the feasibility of using generative AI tools to automate time-consuming and resource-intensive tasks in a systematic review process and assess the scope and limitations of using such tools. Methods: We used the design science research methodology. The solution proposed is to use cocreation with a generative AI, such as ChatGPT, to produce software code that automates the process of conducting systematic reviews. Results: A triggering prompt was generated, and assistance from the generative AI was used to guide the steps toward developing, executing, and debugging a Python script. Errors in code were solved through conversational exchange with ChatGPT, and a tentative script was created. The code pulled the mHealth solutions from the Google Play Store and searched their descriptions for keywords that hinted at an evidence base. The results were exported to a CSV file, which was compared to the initial outputs of other similar systematic review processes. Conclusions: This study demonstrates the potential of using generative AI to automate the time-consuming process of conducting systematic reviews of mHealth apps. This approach could be particularly useful for researchers with limited coding skills.
However, the study has limitations related to the design science research methodology, subjectivity bias, and the quality of the search results used to train the language model. UR - https://mededu.jmir.org/2024/1/e48949 UR - http://dx.doi.org/10.2196/48949 UR - http://www.ncbi.nlm.nih.gov/pubmed/38345839 ID - info:doi/10.2196/48949 ER - TY - JOUR AU - Yu, Peng AU - Fang, Changchang AU - Liu, Xiaolin AU - Fu, Wanying AU - Ling, Jitao AU - Yan, Zhiwei AU - Jiang, Yuan AU - Cao, Zhengyu AU - Wu, Maoxiong AU - Chen, Zhiteng AU - Zhu, Wengen AU - Zhang, Yuling AU - Abudukeremu, Ayiguli AU - Wang, Yue AU - Liu, Xiao AU - Wang, Jingfeng PY - 2024/2/9 TI - Performance of ChatGPT on the Chinese Postgraduate Examination for Clinical Medicine: Survey Study JO - JMIR Med Educ SP - e48514 VL - 10 KW - ChatGPT KW - Chinese Postgraduate Examination for Clinical Medicine KW - medical student KW - performance KW - artificial intelligence KW - medical care KW - qualitative feedback KW - medical education KW - clinical decision-making N2 - Background: ChatGPT, an artificial intelligence (AI) based on large-scale language models, has sparked interest in the field of health care. Nonetheless, the capabilities of AI in text comprehension and generation are constrained by the quality and volume of available training data for a specific language, and the performance of AI across different languages requires further investigation. While AI harbors substantial potential in medicine, it is imperative to tackle challenges such as the formulation of clinical care standards; facilitating cultural transitions in medical education and practice; and managing ethical issues including data privacy, consent, and bias. Objective: The study aimed to evaluate ChatGPT's performance in processing Chinese Postgraduate Examination for Clinical Medicine questions, assess its clinical reasoning ability, investigate potential limitations with the Chinese language, and explore its potential as a valuable tool for medical professionals in the Chinese context. Methods: A data set of Chinese Postgraduate Examination for Clinical Medicine questions was used to assess the effectiveness of ChatGPT's (version 3.5) medical knowledge in the Chinese language; the data set comprised 165 medical questions that were divided into three categories: (1) common questions (n=90) assessing basic medical knowledge, (2) case analysis questions (n=45) focusing on clinical decision-making through patient case evaluations, and (3) multichoice questions (n=30) requiring the selection of multiple correct answers. First, we assessed whether ChatGPT could meet the stringent cutoff score defined by the government agency, which requires a performance within the top 20% of candidates. Additionally, in our evaluation of ChatGPT's performance on both original and encoded medical questions, 3 primary indicators were used: accuracy, concordance (which validates the answer), and the frequency of insights. Results: Our evaluation revealed that ChatGPT scored 153.5 out of 300 for original questions in Chinese, meeting the cutoff score set to ensure that at least 20% more candidates pass than the enrollment quota. However, ChatGPT had low accuracy in answering open-ended medical questions, with only 31.5% total accuracy. The accuracy for common questions, multichoice questions, and case analysis questions was 42%, 37%, and 17%, respectively. ChatGPT achieved a 90% concordance across all questions.
Among correct responses, the concordance was 100%, significantly exceeding that of incorrect responses (n=57, 50%; P<.001). ChatGPT provided innovative insights for 80% (n=132) of all questions, with an average of 2.95 insights per accurate response. Conclusions: Although ChatGPT surpassed the passing threshold for the Chinese Postgraduate Examination for Clinical Medicine, its performance in answering open-ended medical questions was suboptimal. Nonetheless, ChatGPT exhibited high internal concordance and the ability to generate multiple insights in the Chinese language. Future research should investigate the language-based discrepancies in ChatGPT's performance within the health care context. UR - https://mededu.jmir.org/2024/1/e48514 UR - http://dx.doi.org/10.2196/48514 UR - http://www.ncbi.nlm.nih.gov/pubmed/38335017 ID - info:doi/10.2196/48514 ER - TY - JOUR AU - Meyer, Annika AU - Riese, Janik AU - Streichert, Thomas PY - 2024/2/8 TI - Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study JO - JMIR Med Educ SP - e50965 VL - 10 KW - ChatGPT KW - artificial intelligence KW - large language model KW - medical exams KW - medical examinations KW - medical education KW - LLM KW - public trust KW - trust KW - medical accuracy KW - licensing exam KW - licensing examination KW - improvement KW - patient care KW - general population KW - licensure examination N2 - Background: The potential of artificial intelligence (AI)-based large language models, such as ChatGPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility, but also by the prospect of democratizing medical knowledge and promoting equitable health care. However, the performance of ChatGPT is substantially influenced by the input language, and given the growing public trust in this AI tool compared to that in traditional sources of information, investigating its medical accuracy across different languages is of particular importance. Objective: This study aimed to compare the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination. Methods: To assess GPT-3.5's and GPT-4's medical proficiency, we used 937 original multiple-choice questions from 3 written German medical licensing examinations in October 2021, April 2022, and October 2022. Results: GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same examinations in October 2021, April 2022, and October 2022, respectively. This represents a substantial improvement of 27% compared to GPT-3.5, which only passed 1 out of the 3 examinations. While GPT-3.5 performed well in psychiatry questions, GPT-4 exhibited strengths in internal medicine and surgery but showed weakness in academic research. Conclusions: The study results highlight ChatGPT's remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While its predecessor (GPT-3.5) was imprecise and inconsistent, GPT-4 demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its results.
As the replacement of search engines by AI tools seems possible in the future, further studies with nonprofessional questions are needed to assess the safety and accuracy of ChatGPT for the general population. UR - https://mededu.jmir.org/2024/1/e50965 UR - http://dx.doi.org/10.2196/50965 UR - http://www.ncbi.nlm.nih.gov/pubmed/38329802 ID - info:doi/10.2196/50965 ER - TY - JOUR AU - Gray, Megan AU - Baird, Austin AU - Sawyer, Taylor AU - James, Jasmine AU - DeBroux, Thea AU - Bartlett, Michelle AU - Krick, Jeanne AU - Umoren, Rachel PY - 2024/2/1 TI - Increasing Realism and Variety of Virtual Patient Dialogues for Prenatal Counseling Education Through a Novel Application of ChatGPT: Exploratory Observational Study JO - JMIR Med Educ SP - e50705 VL - 10 KW - prenatal counseling KW - virtual health KW - virtual patient KW - simulation KW - neonatology KW - ChatGPT KW - AI KW - artificial intelligence N2 - Background: Using virtual patients, facilitated by natural language processing, provides a valuable educational experience for learners. Generating a large, varied sample of realistic and appropriate responses for virtual patients is challenging. Artificial intelligence (AI) programs can be a viable source for these responses, but their utility for this purpose has not been explored. Objective: In this study, we explored the effectiveness of generative AI (ChatGPT) in developing realistic virtual standardized patient dialogues to teach prenatal counseling skills. Methods: ChatGPT was prompted to generate a list of common areas of concern and questions that families expecting preterm delivery at 24 weeks gestation might ask during prenatal counseling. ChatGPT was then prompted to generate 2 role-plays with dialogues between a parent expecting a potential preterm delivery at 24 weeks and their counseling physician using each of the example questions. The prompt was repeated for 2 unique role-plays: one parent was characterized as anxious and the other as having low trust in the medical system. Role-play scripts were exported verbatim and independently reviewed by 2 neonatologists with experience in prenatal counseling, using a scale of 1-5 on realism, appropriateness, and utility for virtual standardized patient responses. Results: ChatGPT generated 7 areas of concern, with 35 example questions used to generate role-plays. The 35 role-play transcripts generated 176 unique parent responses (median 5, IQR 4-6, per role-play) with 268 unique sentences. Expert review identified 117 (65%) of the 176 responses as indicating an emotion, either directly or indirectly. Approximately half (98/176, 56%) of the responses had 2 or more sentences, and half (88/176, 50%) included at least 1 question. More than half (104/176, 58%) of the responses from role-played parent characters described a feeling, such as being scared, worried, or concerned. The role-plays of parents with low trust in the medical system generated many unique sentences (n=50). Most of the sentences in the responses were found to be reasonably realistic (214/268, 80%), appropriate for variable prenatal counseling conversation paths (233/268, 87%), and usable without more than a minimal modification in a virtual patient program (169/268, 63%). Conclusions: Generative AI programs, such as ChatGPT, may provide a viable source of training materials to expand virtual patient programs, with careful attention to the concerns and questions of patients and families. 
Given the potential for unrealistic or inappropriate statements and questions, an expert should review AI chat outputs before deploying them in an educational program. UR - https://mededu.jmir.org/2024/1/e50705 UR - http://dx.doi.org/10.2196/50705 UR - http://www.ncbi.nlm.nih.gov/pubmed/38300696 ID - info:doi/10.2196/50705 ER - TY - JOUR AU - Kavadella, Argyro AU - Dias da Silva, Antonio Marco AU - Kaklamanos, G. Eleftherios AU - Stamatopoulos, Vasileios AU - Giannakopoulos, Kostis PY - 2024/1/31 TI - Evaluation of ChatGPT's Real-Life Implementation in Undergraduate Dental Education: Mixed Methods Study JO - JMIR Med Educ SP - e51344 VL - 10 KW - ChatGPT KW - large language models KW - LLM KW - natural language processing KW - artificial Intelligence KW - dental education KW - higher education KW - learning assignments KW - dental students KW - AI pedagogy KW - dentistry KW - university N2 - Background: The recent artificial intelligence tool ChatGPT seems to offer a range of benefits in academic education while also raising concerns. Relevant literature encompasses issues of plagiarism and academic dishonesty, as well as pedagogy and educational affordances; yet, no real-life implementation of ChatGPT in the educational process has been reported to our knowledge so far. Objective: This mixed methods study aimed to evaluate the implementation of ChatGPT in the educational process, both quantitatively and qualitatively. Methods: In March 2023, a total of 77 second-year dental students of the European University Cyprus were divided into 2 groups and asked to compose a learning assignment on "Radiation Biology and Radiation Protection in the Dental Office," working collaboratively in small subgroups, as part of the educational semester program of the Dentomaxillofacial Radiology module. Careful planning ensured a seamless integration of ChatGPT, addressing potential challenges. One group searched the internet for scientific resources to perform the task and the other group used ChatGPT for this purpose. Both groups developed a PowerPoint (Microsoft Corp) presentation based on their research and presented it in class. The ChatGPT group students additionally registered all interactions with the language model during the prompting process and evaluated the final outcome; they also answered an open-ended evaluation questionnaire, including questions on their learning experience. Finally, all students undertook a knowledge examination on the topic, and the grades between the 2 groups were compared statistically, whereas the free-text comments of the questionnaires were thematically analyzed. Results: Out of the 77 students, 39 were assigned to the ChatGPT group and 38 to the literature research group. Seventy students undertook the multiple-choice question knowledge examination, and examination grades ranged from 5 to 10 on the 0-10 grading scale. The Mann-Whitney U test showed that students of the ChatGPT group performed significantly better (P=.045) than students of the literature research group. The evaluation questionnaires revealed the benefits (human-like interface, immediate response, and wide knowledge base), the limitations (need for rephrasing the prompts to get a relevant answer, general content, false citations, and inability to provide images or videos), and the prospects (in education, clinical practice, continuing education, and research) of ChatGPT.
Conclusions: Students using ChatGPT for their learning assignments performed significantly better in the knowledge examination than their fellow students who used the literature research methodology. Students adapted quickly to the technological environment of the language model, recognized its opportunities and limitations, and used it creatively and efficiently. Implications for practice: the study underscores the adaptability of students to technological innovations including ChatGPT and its potential to enhance educational outcomes. Educators should consider integrating ChatGPT into curriculum design; awareness programs are warranted to educate both students and educators about the limitations of ChatGPT, encouraging critical engagement and responsible use. UR - https://mededu.jmir.org/2024/1/e51344 UR - http://dx.doi.org/10.2196/51344 UR - http://www.ncbi.nlm.nih.gov/pubmed/38111256 ID - info:doi/10.2196/51344 ER - TY - JOUR AU - Haddad, Firas AU - Saade, S. Joanna PY - 2024/1/18 TI - Performance of ChatGPT on Ophthalmology-Related Questions Across Various Examination Levels: Observational Study JO - JMIR Med Educ SP - e50842 VL - 10 KW - ChatGPT KW - artificial intelligence KW - AI KW - board examinations KW - ophthalmology KW - testing N2 - Background: ChatGPT and language learning models have gained attention recently for their ability to answer questions on various examinations across various disciplines. The question of whether ChatGPT could be used to aid in medical education is yet to be answered, particularly in the field of ophthalmology. Objective: The aim of this study is to assess the ability of ChatGPT-3.5 (GPT-3.5) and ChatGPT-4.0 (GPT-4.0) to answer ophthalmology-related questions across different levels of ophthalmology training. Methods: Questions from the United States Medical Licensing Examination (USMLE) steps 1 (n=44), 2 (n=60), and 3 (n=28) were extracted from AMBOSS, and 248 questions (64 easy, 122 medium, and 62 difficult questions) were extracted from the book, Ophthalmology Board Review Q&A, for the Ophthalmic Knowledge Assessment Program and the Board of Ophthalmology (OB) Written Qualifying Examination (WQE). Questions were prompted identically and inputted to GPT-3.5 and GPT-4.0. Results: GPT-3.5 achieved a total of 55% (n=210) of correct answers, while GPT-4.0 achieved a total of 70% (n=270) of correct answers. GPT-3.5 answered 75% (n=33) of questions correctly in USMLE step 1, 73.33% (n=44) in USMLE step 2, 60.71% (n=17) in USMLE step 3, and 46.77% (n=116) in the OB-WQE. GPT-4.0 answered 70.45% (n=31) of questions correctly in USMLE step 1, 90.32% (n=56) in USMLE step 2, 96.43% (n=27) in USMLE step 3, and 62.90% (n=156) in the OB-WQE. GPT-3.5 performed poorer as examination levels advanced (P<.001), while GPT-4.0 performed better on USMLE steps 2 and 3 and worse on USMLE step 1 and the OB-WQE (P<.001). The coefficient of correlation (r) between ChatGPT answering correctly and human users answering correctly was 0.21 (P=.01) for GPT-3.5 as compared to ?0.31 (P<.001) for GPT-4.0. GPT-3.5 performed similarly across difficulty levels, while GPT-4.0 performed more poorly with an increase in the difficulty level. Both GPT models performed significantly better on certain topics than on others. Conclusions: ChatGPT is far from being considered a part of mainstream medical education. Future models with higher accuracy are needed for the platform to be effective in medical education. 
UR - https://mededu.jmir.org/2024/1/e50842 UR - http://dx.doi.org/10.2196/50842 UR - http://www.ncbi.nlm.nih.gov/pubmed/38236632 ID - info:doi/10.2196/50842 ER - TY - JOUR AU - Nguyen, Tina PY - 2024/1/17 TI - ChatGPT in Medical Education: A Precursor for Automation Bias? JO - JMIR Med Educ SP - e50174 VL - 10 KW - ChatGPT KW - artificial intelligence KW - AI KW - medical students KW - residents KW - medical school curriculum KW - medical education KW - automation bias KW - large language models KW - LLMs KW - bias UR - https://mededu.jmir.org/2024/1/e50174 UR - http://dx.doi.org/10.2196/50174 UR - http://www.ncbi.nlm.nih.gov/pubmed/38231545 ID - info:doi/10.2196/50174 ER - TY - JOUR AU - Holderried, Friederike AU - Stegemann?Philipps, Christian AU - Herschbach, Lea AU - Moldt, Julia-Astrid AU - Nevins, Andrew AU - Griewatz, Jan AU - Holderried, Martin AU - Herrmann-Werner, Anne AU - Festl-Wietek, Teresa AU - Mahling, Moritz PY - 2024/1/16 TI - A Generative Pretrained Transformer (GPT)?Powered Chatbot as a Simulated Patient to Practice History Taking: Prospective, Mixed Methods Study JO - JMIR Med Educ SP - e53961 VL - 10 KW - simulated patient KW - GPT KW - generative pretrained transformer KW - ChatGPT KW - history taking KW - medical education KW - documentation KW - history KW - simulated KW - simulation KW - simulations KW - NLP KW - natural language processing KW - artificial intelligence KW - interactive KW - chatbot KW - chatbots KW - conversational agent KW - conversational agents KW - answer KW - answers KW - response KW - responses KW - human computer KW - human machine KW - usability KW - satisfaction N2 - Background: Communication is a core competency of medical professionals and of utmost importance for patient safety. Although medical curricula emphasize communication training, traditional formats, such as real or simulated patient interactions, can present psychological stress and are limited in repetition. The recent emergence of large language models (LLMs), such as generative pretrained transformer (GPT), offers an opportunity to overcome these restrictions Objective: The aim of this study was to explore the feasibility of a GPT-driven chatbot to practice history taking, one of the core competencies of communication. Methods: We developed an interactive chatbot interface using GPT-3.5 and a specific prompt including a chatbot-optimized illness script and a behavioral component. Following a mixed methods approach, we invited medical students to voluntarily practice history taking. To determine whether GPT provides suitable answers as a simulated patient, the conversations were recorded and analyzed using quantitative and qualitative approaches. We analyzed the extent to which the questions and answers aligned with the provided script, as well as the medical plausibility of the answers. Finally, the students filled out the Chatbot Usability Questionnaire (CUQ). Results: A total of 28 students practiced with our chatbot (mean age 23.4, SD 2.9 years). We recorded a total of 826 question-answer pairs (QAPs), with a median of 27.5 QAPs per conversation and 94.7% (n=782) pertaining to history taking. When questions were explicitly covered by the script (n=502, 60.3%), the GPT-provided answers were mostly based on explicit script information (n=471, 94.4%). For questions not covered by the script (n=195, 23.4%), the GPT answers used 56.4% (n=110) fictitious information. Regarding plausibility, 842 (97.9%) of 860 QAPs were rated as plausible. 
Of the 14 (2.1%) implausible answers, GPT provided answers rated as socially desirable, leaving role identity, ignoring script information, illogical reasoning, and calculation error. Despite these results, the CUQ revealed an overall positive user experience (77/100 points). Conclusions: Our data showed that LLMs, such as GPT, can provide a simulated patient experience and yield a good user experience and a majority of plausible answers. Our analysis revealed that GPT-provided answers use either explicit script information or are based on available information, which can be understood as abductive reasoning. Although rare, the GPT-based chatbot provides implausible information in some instances, with the major tendency being socially desirable instead of medically plausible information. UR - https://mededu.jmir.org/2024/1/e53961 UR - http://dx.doi.org/10.2196/53961 UR - http://www.ncbi.nlm.nih.gov/pubmed/38227363 ID - info:doi/10.2196/53961 ER - TY - JOUR AU - Kuo, I-Hsien Nicholas AU - Perez-Concha, Oscar AU - Hanly, Mark AU - Mnatzaganian, Emmanuel AU - Hao, Brandon AU - Di Sipio, Marcus AU - Yu, Guolin AU - Vanjara, Jash AU - Valerie, Cerelia Ivy AU - de Oliveira Costa, Juliana AU - Churches, Timothy AU - Lujic, Sanja AU - Hegarty, Jo AU - Jorm, Louisa AU - Barbieri, Sebastiano PY - 2024/1/16 TI - Enriching Data Science and Health Care Education: Application and Impact of Synthetic Data Sets Through the Health Gym Project JO - JMIR Med Educ SP - e51388 VL - 10 KW - medical education KW - generative model KW - generative adversarial networks KW - privacy KW - antiretroviral therapy (ART) KW - human immunodeficiency virus (HIV) KW - data science KW - educational purposes KW - accessibility KW - data privacy KW - data sets KW - sepsis KW - hypotension KW - HIV KW - science education KW - health care AI UR - https://mededu.jmir.org/2024/1/e51388 UR - http://dx.doi.org/10.2196/51388 UR - http://www.ncbi.nlm.nih.gov/pubmed/38227356 ID - info:doi/10.2196/51388 ER - TY - JOUR AU - Long, Cai AU - Lowe, Kayle AU - Zhang, Jessica AU - Santos, dos André AU - Alanazi, Alaa AU - O'Brien, Daniel AU - Wright, D. Erin AU - Cote, David PY - 2024/1/16 TI - A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology?Head and Neck Surgery Certification Examinations: Performance Study JO - JMIR Med Educ SP - e49970 VL - 10 KW - medical licensing KW - otolaryngology KW - otology KW - laryngology KW - ear KW - nose KW - throat KW - ENT KW - surgery KW - surgical KW - exam KW - exams KW - response KW - responses KW - answer KW - answers KW - chatbot KW - chatbots KW - examination KW - examinations KW - medical education KW - otolaryngology/head and neck surgery KW - OHNS KW - artificial intelligence KW - AI KW - ChatGPT KW - medical examination KW - large language models KW - language model KW - LLM KW - LLMs KW - wide range information KW - patient safety KW - clinical implementation KW - safety KW - machine learning KW - NLP KW - natural language processing N2 - Background: ChatGPT is among the most popular large language models (LLMs), exhibiting proficiency in various standardized tests, including multiple-choice medical board examinations. However, its performance on otolaryngology?head and neck surgery (OHNS) certification examinations and open-ended medical board certification examinations has not been reported. 
Objective: We aimed to evaluate the performance of ChatGPT on OHNS board examinations and propose a novel method to assess an AI model?s performance on open-ended medical board examination questions. Methods: Twenty-one open-ended questions were adopted from the Royal College of Physicians and Surgeons of Canada?s sample examination to query ChatGPT on April 11, 2023, with and without prompts. A new model, named Concordance, Validity, Safety, Competency (CVSC), was developed to evaluate its performance. Results: In an open-ended question assessment, ChatGPT achieved a passing mark (an average of 75% across 3 trials) in the attempts and demonstrated higher accuracy with prompts. The model demonstrated high concordance (92.06%) and satisfactory validity. While demonstrating considerable consistency in regenerating answers, it often provided only partially correct responses. Notably, concerning features such as hallucinations and self-conflicting answers were observed. Conclusions: ChatGPT achieved a passing score in the sample examination and demonstrated the potential to pass the OHNS certification examination of the Royal College of Physicians and Surgeons of Canada. Some concerns remain due to its hallucinations, which could pose risks to patient safety. Further adjustments are necessary to yield safer and more accurate answers for clinical implementation. UR - https://mededu.jmir.org/2024/1/e49970 UR - http://dx.doi.org/10.2196/49970 UR - http://www.ncbi.nlm.nih.gov/pubmed/38227351 ID - info:doi/10.2196/49970 ER - TY - JOUR AU - Al-Worafi, Mohammed Yaser AU - Goh, Wen Khang AU - Hermansyah, Andi AU - Tan, Siang Ching AU - Ming, Chiau Long PY - 2024/1/12 TI - The Use of ChatGPT for Education Modules on Integrated Pharmacotherapy of Infectious Disease: Educators' Perspectives JO - JMIR Med Educ SP - e47339 VL - 10 KW - innovation and technology KW - quality education KW - sustainable communities KW - innovation and infrastructure KW - partnerships for the goals KW - sustainable education KW - social justice KW - ChatGPT KW - artificial intelligence KW - feasibility N2 - Background: Artificial Intelligence (AI) plays an important role in many fields, including medical education, practice, and research. Many medical educators started using ChatGPT at the end of 2022 for many purposes. Objective: The aim of this study was to explore the potential uses, benefits, and risks of using ChatGPT in education modules on integrated pharmacotherapy of infectious disease. Methods: A content analysis was conducted to investigate the applications of ChatGPT in education modules on integrated pharmacotherapy of infectious disease. Questions pertaining to curriculum development, syllabus design, lecture note preparation, and examination construction were posed during data collection. Three experienced professors rated the appropriateness and precision of the answers provided by ChatGPT. The consensus rating was considered. The professors also discussed the prospective applications, benefits, and risks of ChatGPT in this educational setting. Results: ChatGPT demonstrated the ability to contribute to various aspects of curriculum design, with ratings ranging from 50% to 92% for appropriateness and accuracy. However, there were limitations and risks associated with its use, including incomplete syllabi, the absence of essential learning objectives, and the inability to design valid questionnaires and qualitative studies. 
It was suggested that educators use ChatGPT as a resource rather than relying primarily on its output. There are recommendations for effectively incorporating ChatGPT into the curriculum of the education modules on integrated pharmacotherapy of infectious disease. Conclusions: Medical and health sciences educators can use ChatGPT as a guide in many aspects related to the development of the curriculum of the education modules on integrated pharmacotherapy of infectious disease, syllabus design, lecture notes preparation, and examination preparation with caution. UR - https://mededu.jmir.org/2024/1/e47339 UR - http://dx.doi.org/10.2196/47339 UR - http://www.ncbi.nlm.nih.gov/pubmed/38214967 ID - info:doi/10.2196/47339 ER - TY - JOUR AU - Zaleski, L. Amanda AU - Berkowsky, Rachel AU - Craig, Thomas Kelly Jean AU - Pescatello, S. Linda PY - 2024/1/11 TI - Comprehensiveness, Accuracy, and Readability of Exercise Recommendations Provided by an AI-Based Chatbot: Mixed Methods Study JO - JMIR Med Educ SP - e51308 VL - 10 KW - exercise prescription KW - health literacy KW - large language model KW - patient education KW - artificial intelligence KW - AI KW - chatbot N2 - Background: Regular physical activity is critical for health and disease prevention. Yet, health care providers and patients face barriers to implement evidence-based lifestyle recommendations. The potential to augment care with the increased availability of artificial intelligence (AI) technologies is limitless; however, the suitability of AI-generated exercise recommendations has yet to be explored. Objective: The purpose of this study was to assess the comprehensiveness, accuracy, and readability of individualized exercise recommendations generated by a novel AI chatbot. Methods: A coding scheme was developed to score AI-generated exercise recommendations across ten categories informed by gold-standard exercise recommendations, including (1) health condition?specific benefits of exercise, (2) exercise preparticipation health screening, (3) frequency, (4) intensity, (5) time, (6) type, (7) volume, (8) progression, (9) special considerations, and (10) references to the primary literature. The AI chatbot was prompted to provide individualized exercise recommendations for 26 clinical populations using an open-source application programming interface. Two independent reviewers coded AI-generated content for each category and calculated comprehensiveness (%) and factual accuracy (%) on a scale of 0%-100%. Readability was assessed using the Flesch-Kincaid formula. Qualitative analysis identified and categorized themes from AI-generated output. Results: AI-generated exercise recommendations were 41.2% (107/260) comprehensive and 90.7% (146/161) accurate, with the majority (8/15, 53%) of inaccuracy related to the need for exercise preparticipation medical clearance. Average readability level of AI-generated exercise recommendations was at the college level (mean 13.7, SD 1.7), with an average Flesch reading ease score of 31.1 (SD 7.7). Several recurring themes and observations of AI-generated output included concern for liability and safety, preference for aerobic exercise, and potential bias and direct discrimination against certain age-based populations and individuals with disabilities. Conclusions: There were notable gaps in the comprehensiveness, accuracy, and readability of AI-generated exercise recommendations. 
Exercise and health care professionals should be aware of these limitations when using and endorsing AI-based technologies as a tool to support lifestyle change involving exercise. UR - https://mededu.jmir.org/2024/1/e51308 UR - http://dx.doi.org/10.2196/51308 UR - http://www.ncbi.nlm.nih.gov/pubmed/38206661 ID - info:doi/10.2196/51308 ER - TY - JOUR AU - Weidener, Lukas AU - Fischer, Michael PY - 2024/1/5 TI - Artificial Intelligence in Medicine: Cross-Sectional Study Among Medical Students on Application, Education, and Ethical Aspects JO - JMIR Med Educ SP - e51247 VL - 10 KW - artificial intelligence KW - AI technology KW - medicine KW - medical education KW - medical curriculum KW - medical school KW - AI ethics KW - ethics N2 - Background: The use of artificial intelligence (AI) in medicine not only directly impacts the medical profession but is also increasingly associated with various potential ethical aspects. In addition, the expanding use of AI and AI-based applications such as ChatGPT demands a corresponding shift in medical education to adequately prepare future practitioners for the effective use of these tools and address the associated ethical challenges they present. Objective: This study aims to explore how medical students from Germany, Austria, and Switzerland perceive the use of AI in medicine and the teaching of AI and AI ethics in medical education in accordance with their use of AI-based chat applications, such as ChatGPT. Methods: This cross-sectional study, conducted from June 15 to July 15, 2023, surveyed medical students across Germany, Austria, and Switzerland using a web-based survey. This study aimed to assess students? perceptions of AI in medicine and the integration of AI and AI ethics into medical education. The survey, which included 53 items across 6 sections, was developed and pretested. Data analysis used descriptive statistics (median, mode, IQR, total number, and percentages) and either the chi-square or Mann-Whitney U tests, as appropriate. Results: Surveying 487 medical students across Germany, Austria, and Switzerland revealed limited formal education on AI or AI ethics within medical curricula, although 38.8% (189/487) had prior experience with AI-based chat applications, such as ChatGPT. Despite varied prior exposures, 71.7% (349/487) anticipated a positive impact of AI on medicine. There was widespread consensus (385/487, 74.9%) on the need for AI and AI ethics instruction in medical education, although the current offerings were deemed inadequate. Regarding the AI ethics education content, all proposed topics were rated as highly relevant. Conclusions: This study revealed a pronounced discrepancy between the use of AI-based (chat) applications, such as ChatGPT, among medical students in Germany, Austria, and Switzerland and the teaching of AI in medical education. To adequately prepare future medical professionals, there is an urgent need to integrate the teaching of AI and AI ethics into the medical curricula. UR - https://mededu.jmir.org/2024/1/e51247 UR - http://dx.doi.org/10.2196/51247 UR - http://www.ncbi.nlm.nih.gov/pubmed/38180787 ID - info:doi/10.2196/51247 ER - TY - JOUR AU - Knoedler, Leonard AU - Alfertshofer, Michael AU - Knoedler, Samuel AU - Hoch, C. Cosima AU - Funk, F. Paul AU - Cotofana, Sebastian AU - Maheta, Bhagvat AU - Frank, Konstantin AU - Brébant, Vanessa AU - Prantl, Lukas AU - Lamby, Philipp PY - 2024/1/5 TI - Pure Wisdom or Potemkin Villages? 
A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis JO - JMIR Med Educ SP - e51148 VL - 10 KW - ChatGPT KW - United States Medical Licensing Examination KW - artificial intelligence KW - USMLE KW - USMLE Step 1 KW - OpenAI KW - medical education KW - clinical decision-making N2 - Background: The United States Medical Licensing Examination (USMLE) has been critical in medical education since 1992, testing various aspects of a medical student?s knowledge and skills through different steps, based on their training level. Artificial intelligence (AI) tools, including chatbots like ChatGPT, are emerging technologies with potential applications in medicine. However, comprehensive studies analyzing ChatGPT?s performance on USMLE Step 3 in large-scale scenarios and comparing different versions of ChatGPT are limited. Objective: This paper aimed to analyze ChatGPT?s performance on USMLE Step 3 practice test questions to better elucidate the strengths and weaknesses of AI use in medical education and deduce evidence-based strategies to counteract AI cheating. Methods: A total of 2069 USMLE Step 3 practice questions were extracted from the AMBOSS study platform. After including 229 image-based questions, a total of 1840 text-based questions were further categorized and entered into ChatGPT 3.5, while a subset of 229 questions were entered into ChatGPT 4. Responses were recorded, and the accuracy of ChatGPT answers as well as its performance in different test question categories and for different difficulty levels were compared between both versions. Results: Overall, ChatGPT 4 demonstrated a statistically significant superior performance compared to ChatGPT 3.5, achieving an accuracy of 84.7% (194/229) and 56.9% (1047/1840), respectively. A noteworthy correlation was observed between the length of test questions and the performance of ChatGPT 3.5 (?=?0.069; P=.003), which was absent in ChatGPT 4 (P=.87). Additionally, the difficulty of test questions, as categorized by AMBOSS hammer ratings, showed a statistically significant correlation with performance for both ChatGPT versions, with ?=?0.289 for ChatGPT 3.5 and ?=?0.344 for ChatGPT 4. ChatGPT 4 surpassed ChatGPT 3.5 in all levels of test question difficulty, except for the 2 highest difficulty tiers (4 and 5 hammers), where statistical significance was not reached. Conclusions: In this study, ChatGPT 4 demonstrated remarkable proficiency in taking the USMLE Step 3, with an accuracy rate of 84.7% (194/229), outshining ChatGPT 3.5 with an accuracy rate of 56.9% (1047/1840). Although ChatGPT 4 performed exceptionally, it encountered difficulties in questions requiring the application of theoretical concepts, particularly in cardiology and neurology. These insights are pivotal for the development of examination strategies that are resilient to AI and underline the promising role of AI in the realm of medical education and diagnostics. UR - https://mededu.jmir.org/2024/1/e51148 UR - http://dx.doi.org/10.2196/51148 UR - http://www.ncbi.nlm.nih.gov/pubmed/38180782 ID - info:doi/10.2196/51148 ER - TY - JOUR AU - Blease, Charlotte AU - Torous, John AU - McMillan, Brian AU - Hägglund, Maria AU - Mandl, D. 
Kenneth PY - 2024/1/4 TI - Generative Language Models and Open Notes: Exploring the Promise and Limitations JO - JMIR Med Educ SP - e51183 VL - 10 KW - ChatGPT KW - generative language models KW - large language models KW - medical education KW - Open Notes KW - online record access KW - patient-centered care KW - empathy KW - language model KW - documentation KW - communication tool KW - clinical documentation UR - https://mededu.jmir.org/2024/1/e51183 UR - http://dx.doi.org/10.2196/51183 UR - http://www.ncbi.nlm.nih.gov/pubmed/38175688 ID - info:doi/10.2196/51183 ER - TY - JOUR AU - Erren, C. Thomas PY - 2024/1/4 TI - Patients, Doctors, and Chatbots JO - JMIR Med Educ SP - e50869 VL - 10 KW - chatbot KW - ChatGPT KW - medical advice KW - ethics KW - patients KW - doctors UR - https://mededu.jmir.org/2024/1/e50869 UR - http://dx.doi.org/10.2196/50869 UR - http://www.ncbi.nlm.nih.gov/pubmed/38175695 ID - info:doi/10.2196/50869 ER - TY - JOUR AU - Koranteng, Erica AU - Rao, Arya AU - Flores, Efren AU - Lev, Michael AU - Landman, Adam AU - Dreyer, Keith AU - Succi, Marc PY - 2023/12/28 TI - Empathy and Equity: Key Considerations for Large Language Model Adoption in Health Care JO - JMIR Med Educ SP - e51199 VL - 9 KW - ChatGPT KW - AI KW - artificial intelligence KW - large language models KW - LLMs KW - ethics KW - empathy KW - equity KW - bias KW - language model KW - health care application KW - patient care KW - care KW - development KW - framework KW - model KW - ethical implication UR - https://mededu.jmir.org/2023/1/e51199 UR - http://dx.doi.org/10.2196/51199 UR - http://www.ncbi.nlm.nih.gov/pubmed/38153778 ID - info:doi/10.2196/51199 ER - TY - JOUR AU - Liao, Wenxiong AU - Liu, Zhengliang AU - Dai, Haixing AU - Xu, Shaochen AU - Wu, Zihao AU - Zhang, Yiyang AU - Huang, Xiaoke AU - Zhu, Dajiang AU - Cai, Hongmin AU - Li, Quanzheng AU - Liu, Tianming AU - Li, Xiang PY - 2023/12/28 TI - Differentiating ChatGPT-Generated and Human-Written Medical Texts: Quantitative Study JO - JMIR Med Educ SP - e48904 VL - 9 KW - ChatGPT KW - medical ethics KW - linguistic analysis KW - text classification KW - artificial intelligence KW - medical texts KW - machine learning N2 - Background: Large language models, such as ChatGPT, are capable of generating grammatically perfect and human-like text content, and a large number of ChatGPT-generated texts have appeared on the internet. However, medical texts, such as clinical notes and diagnoses, require rigorous validation, and erroneous medical content generated by ChatGPT could potentially lead to disinformation that poses significant harm to health care and the general public. Objective: This study is among the first on responsible artificial intelligence?generated content in medicine. We focus on analyzing the differences between medical texts written by human experts and those generated by ChatGPT and designing machine learning workflows to effectively detect and differentiate medical texts generated by ChatGPT. Methods: We first constructed a suite of data sets containing medical texts written by human experts and generated by ChatGPT. We analyzed the linguistic features of these 2 types of content and uncovered differences in vocabulary, parts-of-speech, dependency, sentiment, perplexity, and other aspects. Finally, we designed and implemented machine learning methods to detect medical text generated by ChatGPT. The data and code used in this paper are published on GitHub. 
Results: Medical texts written by humans were more concrete, more diverse, and typically contained more useful information, while medical texts generated by ChatGPT paid more attention to fluency and logic and usually expressed general terminologies rather than effective information specific to the context of the problem. A bidirectional encoder representations from transformers?based model effectively detected medical texts generated by ChatGPT, and the F1 score exceeded 95%. Conclusions: Although text generated by ChatGPT is grammatically perfect and human-like, the linguistic characteristics of generated medical texts were different from those written by human experts. Medical text generated by ChatGPT could be effectively detected by the proposed machine learning algorithms. This study provides a pathway toward trustworthy and accountable use of large language models in medicine. UR - https://mededu.jmir.org/2023/1/e48904 UR - http://dx.doi.org/10.2196/48904 UR - http://www.ncbi.nlm.nih.gov/pubmed/38153785 ID - info:doi/10.2196/48904 ER - TY - JOUR AU - Knopp, I. Michelle AU - Warm, J. Eric AU - Weber, Danielle AU - Kelleher, Matthew AU - Kinnear, Benjamin AU - Schumacher, J. Daniel AU - Santen, A. Sally AU - Mendonça, Eneida AU - Turner, Laurah PY - 2023/12/25 TI - AI-Enabled Medical Education: Threads of Change, Promising Futures, and Risky Realities Across Four Potential Future Worlds JO - JMIR Med Educ SP - e50373 VL - 9 KW - artificial intelligence KW - medical education KW - scenario planning KW - future of healthcare KW - ethics and AI KW - future KW - scenario KW - ChatGPT KW - generative KW - GPT-4 KW - ethic KW - ethics KW - ethical KW - strategic planning KW - Open-AI KW - OpenAI KW - privacy KW - autonomy KW - autonomous N2 - Background: The rapid trajectory of artificial intelligence (AI) development and advancement is quickly outpacing society's ability to determine its future role. As AI continues to transform various aspects of our lives, one critical question arises for medical education: what will be the nature of education, teaching, and learning in a future world where the acquisition, retention, and application of knowledge in the traditional sense are fundamentally altered by AI? Objective: The purpose of this perspective is to plan for the intersection of health care and medical education in the future. Methods: We used GPT-4 and scenario-based strategic planning techniques to craft 4 hypothetical future worlds influenced by AI's integration into health care and medical education. This method, used by organizations such as Shell and the Accreditation Council for Graduate Medical Education, assesses readiness for alternative futures and effectively manages uncertainty, risk, and opportunity. The detailed scenarios provide insights into potential environments the medical profession may face and lay the foundation for hypothesis generation and idea-building regarding responsible AI implementation. Results: The following 4 worlds were created using OpenAI?s GPT model: AI Harmony, AI conflict, The world of Ecological Balance, and Existential Risk. Risks include disinformation and misinformation, loss of privacy, widening inequity, erosion of human autonomy, and ethical dilemmas. Benefits involve improved efficiency, personalized interventions, enhanced collaboration, early detection, and accelerated research. 
Conclusions: To ensure responsible AI use, the authors suggest focusing on 3 key areas: developing a robust ethical framework, fostering interdisciplinary collaboration, and investing in education and training. A strong ethical framework emphasizes patient safety, privacy, and autonomy while promoting equity and inclusivity. Interdisciplinary collaboration encourages cooperation among various experts in developing and implementing AI technologies, ensuring that they address the complex needs and challenges in health care and medical education. Investing in education and training prepares professionals and trainees with necessary skills and knowledge to effectively use and critically evaluate AI technologies. The integration of AI in health care and medical education presents a critical juncture between transformative advancements and significant risks. By working together to address both immediate and long-term risks and consequences, we can ensure that AI integration leads to a more equitable, sustainable, and prosperous future for both health care and medical education. As we engage with AI technologies, our collective actions will ultimately determine the state of the future of health care and medical education to harness AI's power while ensuring the safety and well-being of humanity. UR - https://mededu.jmir.org/2023/1/e50373 UR - http://dx.doi.org/10.2196/50373 UR - http://www.ncbi.nlm.nih.gov/pubmed/38145471 ID - info:doi/10.2196/50373 ER - TY - JOUR AU - Alkhaaldi, I. Saif M. AU - Kassab, H. Carl AU - Dimassi, Zakia AU - Oyoun Alsoud, Leen AU - Al Fahim, Maha AU - Al Hageh, Cynthia AU - Ibrahim, Halah PY - 2023/12/22 TI - Medical Student Experiences and Perceptions of ChatGPT and Artificial Intelligence: Cross-Sectional Study JO - JMIR Med Educ SP - e51302 VL - 9 KW - medical education KW - ChatGPT KW - artificial intelligence KW - large language models KW - LLMs KW - AI KW - medical student KW - medical students KW - cross-sectional study KW - training KW - technology KW - medicine KW - health care professionals KW - risk KW - education N2 - Background: Artificial intelligence (AI) has the potential to revolutionize the way medicine is learned, taught, and practiced, and medical education must prepare learners for these inevitable changes. Academic medicine has, however, been slow to embrace recent AI advances. Since its launch in November 2022, ChatGPT has emerged as a fast and user-friendly large language model that can assist health care professionals, medical educators, students, trainees, and patients. While many studies focus on the technology?s capabilities, potential, and risks, there is a gap in studying the perspective of end users. Objective: The aim of this study was to gauge the experiences and perspectives of graduating medical students on ChatGPT and AI in their training and future careers. Methods: A cross-sectional web-based survey of recently graduated medical students was conducted in an international academic medical center between May 5, 2023, and June 13, 2023. Descriptive statistics were used to tabulate variable frequencies. Results: Of 325 applicants to the residency programs, 265 completed the survey (an 81.5% response rate). The vast majority of respondents denied using ChatGPT in medical school, with 20.4% (n=54) using it to help complete written assessments and only 9.4% using the technology in their clinical work (n=25). 
More students planned to use it during residency, primarily for exploring new medical topics and research (n=168, 63.4%) and exam preparation (n=151, 57%). Male students were significantly more likely to believe that AI will improve diagnostic accuracy (n=47, 51.7% vs n=69, 39.7%; P=.001), reduce medical error (n=53, 58.2% vs n=71, 40.8%; P=.002), and improve patient care (n=60, 65.9% vs n=95, 54.6%; P=.007). Previous experience with AI was significantly associated with positive AI perception in terms of improving patient care, decreasing medical errors and misdiagnoses, and increasing the accuracy of diagnoses (P=.001, P<.001, P=.008, respectively). Conclusions: The surveyed medical students had minimal formal and informal experience with AI tools and limited perceptions of the potential uses of AI in health care but had overall positive views of ChatGPT and AI and were optimistic about the future of AI in medical education and health care. Structured curricula and formal policies and guidelines are needed to adequately prepare medical learners for the forthcoming integration of AI in medicine. UR - https://mededu.jmir.org/2023/1/e51302 UR - http://dx.doi.org/10.2196/51302 UR - http://www.ncbi.nlm.nih.gov/pubmed/38133911 ID - info:doi/10.2196/51302 ER - TY - JOUR AU - Tangadulrat, Pasin AU - Sono, Supinya AU - Tangtrakulwanich, Boonsin PY - 2023/12/22 TI - Using ChatGPT for Clinical Practice and Medical Education: Cross-Sectional Survey of Medical Students? and Physicians? Perceptions JO - JMIR Med Educ SP - e50658 VL - 9 KW - ChatGPT KW - AI KW - artificial intelligence KW - medical education KW - medical students KW - student KW - students KW - intern KW - interns KW - resident KW - residents KW - knee osteoarthritis KW - survey KW - surveys KW - questionnaire KW - questionnaires KW - chatbot KW - chatbots KW - conversational agent KW - conversational agents KW - attitude KW - attitudes KW - opinion KW - opinions KW - perception KW - perceptions KW - perspective KW - perspectives KW - acceptance N2 - Background: ChatGPT is a well-known large language model?based chatbot. It could be used in the medical field in many aspects. However, some physicians are still unfamiliar with ChatGPT and are concerned about its benefits and risks. Objective: We aim to evaluate the perception of physicians and medical students toward using ChatGPT in the medical field. Methods: A web-based questionnaire was sent to medical students, interns, residents, and attending staff with questions regarding their perception toward using ChatGPT in clinical practice and medical education. Participants were also asked to rate their perception of ChatGPT?s generated response about knee osteoarthritis. Results: Participants included 124 medical students, 46 interns, 37 residents, and 32 attending staff. After reading ChatGPT?s response, 132 of the 239 (55.2%) participants had a positive rating about using ChatGPT for clinical practice. The proportion of positive answers was significantly lower in graduated physicians (48/115, 42%) compared with medical students (84/124, 68%; P<.001). Participants listed a lack of a patient-specific treatment plan, updated evidence, and a language barrier as ChatGPT?s pitfalls. Regarding using ChatGPT for medical education, the proportion of positive responses was also significantly lower in graduate physicians (71/115, 62%) compared to medical students (103/124, 83.1%; P<.001). 
Participants were concerned that ChatGPT?s response was too superficial, might lack scientific evidence, and might need expert verification. Conclusions: Medical students generally had a positive perception of using ChatGPT for guiding treatment and medical education, whereas graduated doctors were more cautious in this regard. Nonetheless, both medical students and graduated doctors positively perceived using ChatGPT for creating patient educational materials. UR - https://mededu.jmir.org/2023/1/e50658 UR - http://dx.doi.org/10.2196/50658 UR - http://www.ncbi.nlm.nih.gov/pubmed/38133908 ID - info:doi/10.2196/50658 ER - TY - JOUR AU - Buhr, Raphael Christoph AU - Smith, Harry AU - Huppertz, Tilman AU - Bahr-Hamm, Katharina AU - Matthias, Christoph AU - Blaikie, Andrew AU - Kelsey, Tom AU - Kuhn, Sebastian AU - Eckrich, Jonas PY - 2023/12/5 TI - ChatGPT Versus Consultants: Blinded Evaluation on Answering Otorhinolaryngology Case?Based Questions JO - JMIR Med Educ SP - e49183 VL - 9 KW - large language models KW - LLMs KW - LLM KW - artificial intelligence KW - AI KW - ChatGPT KW - otorhinolaryngology KW - ORL KW - digital health KW - chatbots KW - global health KW - low- and middle-income countries KW - telemedicine KW - telehealth KW - language model KW - chatbot N2 - Background: Large language models (LLMs), such as ChatGPT (Open AI), are increasingly used in medicine and supplement standard search engines as information sources. This leads to more ?consultations? of LLMs about personal medical symptoms. Objective: This study aims to evaluate ChatGPT?s performance in answering clinical case?based questions in otorhinolaryngology (ORL) in comparison to ORL consultants? answers. Methods: We used 41 case-based questions from established ORL study books and past German state examinations for doctors. The questions were answered by both ORL consultants and ChatGPT 3. ORL consultants rated all responses, except their own, on medical adequacy, conciseness, coherence, and comprehensibility using a 6-point Likert scale. They also identified (in a blinded setting) if the answer was created by an ORL consultant or ChatGPT. Additionally, the character count was compared. Due to the rapidly evolving pace of technology, a comparison between responses generated by ChatGPT 3 and ChatGPT 4 was included to give an insight into the evolving potential of LLMs. Results: Ratings in all categories were significantly higher for ORL consultants (P<.001). Although inferior to the scores of the ORL consultants, ChatGPT?s scores were relatively higher in semantic categories (conciseness, coherence, and comprehensibility) compared to medical adequacy. ORL consultants identified ChatGPT as the source correctly in 98.4% (121/123) of cases. ChatGPT?s answers had a significantly higher character count compared to ORL consultants (P<.001). Comparison between responses generated by ChatGPT 3 and ChatGPT 4 showed a slight improvement in medical accuracy as well as a better coherence of the answers provided. Contrarily, neither the conciseness (P=.06) nor the comprehensibility (P=.08) improved significantly despite the significant increase in the mean amount of characters by 52.5% (n= (1470-964)/964; P<.001). Conclusions: While ChatGPT provided longer answers to medical problems, medical adequacy and conciseness were significantly lower compared to ORL consultants? answers. LLMs have potential as augmentative tools for medical care, but their ?consultation? 
for medical problems carries a high risk of misinformation as their high semantic quality may mask contextual deficits. UR - https://mededu.jmir.org/2023/1/e49183 UR - http://dx.doi.org/10.2196/49183 UR - http://www.ncbi.nlm.nih.gov/pubmed/38051578 ID - info:doi/10.2196/49183 ER - TY - JOUR AU - Spallek, Sophia AU - Birrell, Louise AU - Kershaw, Stephanie AU - Devine, Krogh Emma AU - Thornton, Louise PY - 2023/11/30 TI - Can we use ChatGPT for Mental Health and Substance Use Education? Examining Its Quality and Potential Harms JO - JMIR Med Educ SP - e51243 VL - 9 KW - artificial intelligence KW - generative artificial intelligence KW - large language models KW - ChatGPT KW - medical education KW - health education KW - patient education handout KW - preventive health services KW - educational intervention KW - mental health KW - substance use N2 - Background: The use of generative artificial intelligence, more specifically large language models (LLMs), is proliferating, and as such, it is vital to consider both the value and potential harms of its use in medical education. Their efficiency in a variety of writing styles makes LLMs, such as ChatGPT, attractive for tailoring educational materials. However, this technology can feature biases and misinformation, which can be particularly harmful in medical education settings, such as mental health and substance use education. This viewpoint investigates if ChatGPT is sufficient for 2 common health education functions in the field of mental health and substance use: (1) answering users? direct queries and (2) aiding in the development of quality consumer educational health materials. Objective: This viewpoint includes a case study to provide insight into the accessibility, biases, and quality of ChatGPT?s query responses and educational health materials. We aim to provide guidance for the general public and health educators wishing to utilize LLMs. Methods: We collected real world queries from 2 large-scale mental health and substance use portals and engineered a variety of prompts to use on GPT-4 Pro with the Bing BETA internet browsing plug-in. The outputs were evaluated with tools from the Sydney Health Literacy Lab to determine the accessibility, the adherence to Mindframe communication guidelines to identify biases, and author assessments on quality, including tailoring to audiences, duty of care disclaimers, and evidence-based internet references. Results: GPT-4?s outputs had good face validity, but upon detailed analysis were substandard in comparison to expert-developed materials. Without engineered prompting, the reading level, adherence to communication guidelines, and use of evidence-based websites were poor. Therefore, all outputs still required cautious human editing and oversight. Conclusions: GPT-4 is currently not reliable enough for direct-consumer queries, but educators and researchers can use it for creating educational materials with caution. Materials created with LLMs should disclose the use of generative artificial intelligence and be evaluated on their efficacy with the target audience. 
UR - https://mededu.jmir.org/2023/1/e51243 UR - http://dx.doi.org/10.2196/51243 UR - http://www.ncbi.nlm.nih.gov/pubmed/38032714 ID - info:doi/10.2196/51243 ER - TY - JOUR AU - Wong, Shin-Yee Rebecca AU - Ming, Chiau Long AU - Raja Ali, Affendi Raja PY - 2023/11/21 TI - The Intersection of ChatGPT, Clinical Medicine, and Medical Education JO - JMIR Med Educ SP - e47274 VL - 9 KW - ChatGPT KW - clinical research KW - large language model KW - artificial intelligence KW - ethical considerations KW - AI KW - OpenAI UR - https://mededu.jmir.org/2023/1/e47274 UR - http://dx.doi.org/10.2196/47274 UR - http://www.ncbi.nlm.nih.gov/pubmed/37988149 ID - info:doi/10.2196/47274 ER - TY - JOUR AU - Scherr, Riley AU - Halaseh, F. Faris AU - Spina, Aidin AU - Andalib, Saman AU - Rivera, Ronald PY - 2023/11/10 TI - ChatGPT Interactive Medical Simulations for Early Clinical Education: Case Study JO - JMIR Med Educ SP - e49877 VL - 9 KW - ChatGPT KW - medical school simulations KW - preclinical curriculum KW - artificial intelligence KW - AI KW - AI in medical education KW - medical education KW - simulation KW - generative KW - curriculum KW - clinical education KW - simulations N2 - Background: The transition to clinical clerkships can be difficult for medical students, as it requires the synthesis and application of preclinical information into diagnostic and therapeutic decisions. ChatGPT?a generative language model with many medical applications due to its creativity, memory, and accuracy?can help students in this transition. Objective: This paper models ChatGPT 3.5?s ability to perform interactive clinical simulations and shows this tool?s benefit to medical education. Methods: Simulation starting prompts were refined using ChatGPT 3.5 in Google Chrome. Starting prompts were selected based on assessment format, stepwise progression of simulation events and questions, free-response question type, responsiveness to user inputs, postscenario feedback, and medical accuracy of the feedback. The chosen scenarios were advanced cardiac life support and medical intensive care (for sepsis and pneumonia). Results: Two starting prompts were chosen. Prompt 1 was developed through 3 test simulations and used successfully in 2 simulations. Prompt 2 was developed through 10 additional test simulations and used successfully in 1 simulation. Conclusions: ChatGPT is capable of creating simulations for early clinical education. These simulations let students practice novel parts of the clinical curriculum, such as forming independent diagnostic and therapeutic impressions over an entire patient encounter. Furthermore, the simulations can adapt to user inputs in a way that replicates real life more accurately than premade question bank clinical vignettes. Finally, ChatGPT can create potentially unlimited free simulations with specific feedback, which increases access for medical students with lower socioeconomic status and underresourced medical schools. However, no tool is perfect, and ChatGPT is no exception; there are concerns about simulation accuracy and replicability that need to be addressed to further optimize ChatGPT?s performance as an educational resource. 
UR - https://mededu.jmir.org/2023/1/e49877 UR - http://dx.doi.org/10.2196/49877 UR - http://www.ncbi.nlm.nih.gov/pubmed/37948112 ID - info:doi/10.2196/49877 ER - TY - JOUR AU - Abuyaman, Omar PY - 2023/11/10 TI - Strengths and Weaknesses of ChatGPT Models for Scientific Writing About Medical Vitamin B12: Mixed Methods Study JO - JMIR Form Res SP - e49459 VL - 7 KW - AI KW - ChatGPT KW - GPT-4 KW - GPT-3.5 KW - vitamin B12 KW - artificial intelligence KW - language editing KW - wide range information KW - AI solutions KW - scientific content N2 - Background: ChatGPT is a large language model developed by OpenAI designed to generate human-like responses to prompts. Objective: This study aims to evaluate the ability of GPT-4 to generate scientific content and assist in scientific writing using medical vitamin B12 as the topic. Furthermore, the study will compare the performance of GPT-4 to its predecessor, GPT-3.5. Methods: The study examined responses from GPT-4 and GPT-3.5 to vitamin B12?related prompts, focusing on their quality and characteristics and comparing them to established scientific literature. Results: The results indicated that GPT-4 can potentially streamline scientific writing through its ability to edit language and write abstracts, keywords, and abbreviation lists. However, significant limitations of ChatGPT were revealed, including its inability to identify and address bias, inability to include recent information, lack of transparency, and inclusion of inaccurate information. Additionally, it cannot check for plagiarism or provide proper references. The accuracy of GPT-4?s answers was found to be superior to GPT-3.5. Conclusions: ChatGPT can be considered a helpful assistant in the writing process but not a replacement for a scientist?s expertise. Researchers must remain aware of its limitations and use it appropriately. The improvements in consecutive ChatGPT versions suggest the possibility of overcoming some present limitations in the near future. UR - https://formative.jmir.org/2023/1/e49459 UR - http://dx.doi.org/10.2196/49459 UR - http://www.ncbi.nlm.nih.gov/pubmed/37948100 ID - info:doi/10.2196/49459 ER - TY - JOUR AU - Surapaneni, Mohan Krishna PY - 2023/11/7 TI - Assessing the Performance of ChatGPT in Medical Biochemistry Using Clinical Case Vignettes: Observational Study JO - JMIR Med Educ SP - e47191 VL - 9 KW - ChatGPT KW - artificial intelligence KW - medical education KW - medical Biochemistry KW - biochemistry KW - chatbot KW - case study KW - case scenario KW - medical exam KW - medical examination KW - computer generated N2 - Background: ChatGPT has gained global attention recently owing to its high performance in generating a wide range of information and retrieving any kind of data instantaneously. ChatGPT has also been tested for the United States Medical Licensing Examination (USMLE) and has successfully cleared it. Thus, its usability in medical education is now one of the key discussions worldwide. Objective: The objective of this study is to evaluate the performance of ChatGPT in medical biochemistry using clinical case vignettes. Methods: The performance of ChatGPT was evaluated in medical biochemistry using 10 clinical case vignettes. Clinical case vignettes were randomly selected and inputted in ChatGPT along with the response options. We tested the responses for each clinical case twice. The answers generated by ChatGPT were saved and checked using our reference material. 
Results: ChatGPT generated correct answers for 4 questions on the first attempt. For the other cases, there were differences in responses generated by ChatGPT in the first and second attempts. In the second attempt, ChatGPT provided correct answers for 6 questions and incorrect answers for 4 questions out of the 10 cases that were used. But, to our surprise, for case 3, different answers were obtained with multiple attempts. We believe this to have happened owing to the complexity of the case, which involved addressing various critical medical aspects related to amino acid metabolism in a balanced approach. Conclusions: According to the findings of our study, ChatGPT may not be considered an accurate information provider for application in medical education to improve learning and assessment. However, our study was limited by a small sample size (10 clinical case vignettes) and the use of the publicly available version of ChatGPT (version 3.5). Although artificial intelligence (AI) has the capability to transform medical education, we emphasize the validation of such data produced by such AI systems for correctness and dependability before it could be implemented in practice. UR - https://mededu.jmir.org/2023/1/e47191 UR - http://dx.doi.org/10.2196/47191 UR - http://www.ncbi.nlm.nih.gov/pubmed/37934568 ID - info:doi/10.2196/47191 ER - TY - JOUR AU - Ito, Naoki AU - Kadomatsu, Sakina AU - Fujisawa, Mineto AU - Fukaguchi, Kiyomitsu AU - Ishizawa, Ryo AU - Kanda, Naoki AU - Kasugai, Daisuke AU - Nakajima, Mikio AU - Goto, Tadahiro AU - Tsugawa, Yusuke PY - 2023/11/2 TI - The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study JO - JMIR Med Educ SP - e47532 VL - 9 KW - GPT-4 KW - racial and ethnic bias KW - typical clinical vignettes KW - diagnosis KW - triage KW - artificial intelligence KW - AI KW - race KW - clinical vignettes KW - physician KW - efficiency KW - decision-making KW - bias KW - GPT N2 - Background: Whether GPT-4, the conversational artificial intelligence, can accurately diagnose and triage health conditions and whether it presents racial and ethnic biases in its decisions remain unclear. Objective: We aim to assess the accuracy of GPT-4 in the diagnosis and triage of health conditions and whether its performance varies by patient race and ethnicity. Methods: We compared the performance of GPT-4 and physicians, using 45 typical clinical vignettes, each with a correct diagnosis and triage level, in February and March 2023. For each of the 45 clinical vignettes, GPT-4 and 3 board-certified physicians provided the most likely primary diagnosis and triage level (emergency, nonemergency, or self-care). Independent reviewers evaluated the diagnoses as ?correct? or ?incorrect.? Physician diagnosis was defined as the consensus of the 3 physicians. We evaluated whether the performance of GPT-4 varies by patient race and ethnicity, by adding the information on patient race and ethnicity to the clinical vignettes. Results: The accuracy of diagnosis was comparable between GPT-4 and physicians (the percentage of correct diagnosis was 97.8% (44/45; 95% CI 88.2%-99.9%) for GPT-4 and 91.1% (41/45; 95% CI 78.8%-97.5%) for physicians; P=.38). GPT-4 provided appropriate reasoning for 97.8% (44/45) of the vignettes. The appropriateness of triage was comparable between GPT-4 and physicians (GPT-4: 30/45, 66.7%; 95% CI 51.0%-80.0%; physicians: 30/45, 66.7%; 95% CI 51.0%-80.0%; P=.99). 
The performance of GPT-4 in diagnosing health conditions did not vary among different races and ethnicities (Black, White, Asian, and Hispanic), with an accuracy of 100% (95% CI 78.2%-100%). P values, compared to the GPT-4 output without incorporating race and ethnicity information, were all .99. The accuracy of triage was not significantly different even if patients? race and ethnicity information was added. The accuracy of triage was 62.2% (95% CI 46.5%-76.2%; P=.50) for Black patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for White patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for Asian patients, and 62.2% (95% CI 46.5%-76.2%; P=.69) for Hispanic patients. P values were calculated by comparing the outputs with and without conditioning on race and ethnicity. Conclusions: GPT-4?s ability to diagnose and triage typical clinical vignettes was comparable to that of board-certified physicians. The performance of GPT-4 did not vary by patient race and ethnicity. These findings should be informative for health systems looking to introduce conversational artificial intelligence to improve the efficiency of patient diagnosis and triage. UR - https://mededu.jmir.org/2023/1/e47532 UR - http://dx.doi.org/10.2196/47532 UR - http://www.ncbi.nlm.nih.gov/pubmed/37917120 ID - info:doi/10.2196/47532 ER - TY - JOUR AU - Baglivo, Francesco AU - De Angelis, Luigi AU - Casigliani, Virginia AU - Arzilli, Guglielmo AU - Privitera, Pierpaolo Gaetano AU - Rizzo, Caterina PY - 2023/11/1 TI - Exploring the Possible Use of AI Chatbots in Public Health Education: Feasibility Study JO - JMIR Med Educ SP - e51421 VL - 9 KW - artificial intelligence KW - chatbots KW - medical education KW - vaccination KW - public health KW - medical students KW - large language model KW - generative AI KW - ChatGPT KW - Google Bard KW - AI chatbot KW - health education KW - health care KW - medical training KW - educational support tool KW - chatbot model N2 - Background: Artificial intelligence (AI) is a rapidly developing field with the potential to transform various aspects of health care and public health, including medical training. During the ?Hygiene and Public Health? course for fifth-year medical students, a practical training session was conducted on vaccination using AI chatbots as an educational supportive tool. Before receiving specific training on vaccination, the students were given a web-based test extracted from the Italian National Medical Residency Test. After completing the test, a critical correction of each question was performed assisted by AI chatbots. Objective: The main aim of this study was to identify whether AI chatbots can be considered educational support tools for training in public health. The secondary objective was to assess the performance of different AI chatbots on complex multiple-choice medical questions in the Italian language. Methods: A test composed of 15 multiple-choice questions on vaccination was extracted from the Italian National Medical Residency Test using targeted keywords and administered to medical students via Google Forms and to different AI chatbot models (Bing Chat, ChatGPT, Chatsonic, Google Bard, and YouChat). The correction of the test was conducted in the classroom, focusing on the critical evaluation of the explanations provided by the chatbot. A Mann-Whitney U test was conducted to compare the performances of medical students and AI chatbots. Student feedback was collected anonymously at the end of the training experience. 
Results: In total, 36 medical students and 5 AI chatbot models completed the test. The students achieved an average score of 8.22 (SD 2.65) out of 15, while the AI chatbots scored an average of 12.22 (SD 2.77). The results indicated a statistically significant difference in performance between the 2 groups (U=49.5, P<.001), with a large effect size (r=0.69). When divided by question type (direct, scenario-based, and negative), significant differences were observed in direct (P<.001) and scenario-based (P<.001) questions, but not in negative questions (P=.48). The students reported a high level of satisfaction (7.9/10) with the educational experience, expressing a strong desire to repeat the experience (7.6/10). Conclusions: This study demonstrated the efficacy of AI chatbots in answering complex medical questions related to vaccination and providing valuable educational support. Their performance significantly surpassed that of medical students in direct and scenario-based questions. The responsible and critical use of AI chatbots can enhance medical education, making it an essential aspect to integrate into the educational system. UR - https://mededu.jmir.org/2023/1/e51421 UR - http://dx.doi.org/10.2196/51421 UR - http://www.ncbi.nlm.nih.gov/pubmed/37910155 ID - info:doi/10.2196/51421 ER - TY - JOUR AU - Kunitsu, Yuki PY - 2023/10/30 TI - The Potential of GPT-4 as a Support Tool for Pharmacists: Analytical Study Using the Japanese National Examination for Pharmacists JO - JMIR Med Educ SP - e48452 VL - 9 KW - natural language processing KW - generative pretrained transformer KW - GPT-4 KW - ChatGPT KW - artificial intelligence KW - AI KW - chatbot KW - pharmacy KW - pharmacist N2 - Background: The advancement of artificial intelligence (AI), as well as machine learning, has led to its application in various industries, including health care. AI chatbots, such as GPT-4, developed by OpenAI, have demonstrated potential in supporting health care professionals by providing medical information, answering examination questions, and assisting in medical education. However, the applicability of GPT-4 in the field of pharmacy remains unexplored. Objective: This study aimed to evaluate GPT-4's ability to answer questions from the Japanese National Examination for Pharmacists (JNEP) and assess its potential as a support tool for pharmacists in their daily practice. Methods: The question texts and answer choices from the 107th and 108th JNEP, held in February 2022 and February 2023, were input into GPT-4. As GPT-4 cannot process diagrams, questions that included diagram interpretation were not analyzed and were initially given a score of 0. The correct answer rates were calculated and compared with the passing criteria of each examination to evaluate GPT-4's performance. Results: For the 107th and 108th JNEP, GPT-4 achieved an accuracy rate of 64.5% (222/344) and 62.9% (217/345), respectively, for all questions. When considering only the questions that GPT-4 could answer, the accuracy rates increased to 78.2% (222/284) and 75.3% (217/287), respectively. The accuracy rates tended to be lower for physics, chemistry, and calculation questions. Conclusions: Although GPT-4 demonstrated the potential to answer questions from the JNEP and support pharmacists' capabilities, it also showed limitations in handling highly specialized questions, calculation questions, and questions requiring diagram recognition.
Further evaluation is necessary to explore its applicability in real-world clinical settings, considering the complexities of patient scenarios and collaboration with health care professionals. By addressing these limitations, GPT-4 could become a more reliable tool for pharmacists in their daily practice. UR - https://mededu.jmir.org/2023/1/e48452 UR - http://dx.doi.org/10.2196/48452 UR - http://www.ncbi.nlm.nih.gov/pubmed/37837968 ID - info:doi/10.2196/48452 ER - TY - JOUR AU - Preiksaitis, Carl AU - Rose, Christian PY - 2023/10/20 TI - Opportunities, Challenges, and Future Directions of Generative Artificial Intelligence in Medical Education: Scoping Review JO - JMIR Med Educ SP - e48785 VL - 9 KW - medical education KW - artificial intelligence KW - ChatGPT KW - Bard KW - AI KW - educator KW - scoping KW - review KW - learner KW - generative N2 - Background: Generative artificial intelligence (AI) technologies are increasingly being utilized across various fields, with considerable interest and concern regarding their potential application in medical education. These technologies, such as ChatGPT and Bard, can generate new content and have a wide range of possible applications. Objective: This study aimed to synthesize the potential opportunities and limitations of generative AI in medical education. It sought to identify prevalent themes within recent literature regarding potential applications and challenges of generative AI in medical education and use these to guide future areas for exploration. Methods: We conducted a scoping review, following the framework by Arksey and O'Malley, of English language articles published from 2022 onward that discussed generative AI in the context of medical education. A literature search was performed using PubMed, Web of Science, and Google Scholar databases. We screened articles for inclusion, extracted data from relevant studies, and completed a quantitative and qualitative synthesis of the data. Results: Thematic analysis revealed diverse potential applications for generative AI in medical education, including self-directed learning, simulation scenarios, and writing assistance. However, the literature also highlighted significant challenges, such as issues with academic integrity, data accuracy, and potential detriments to learning. Based on these themes and the current state of the literature, we propose the following 3 key areas for investigation: developing learners' skills to evaluate AI critically, rethinking assessment methodology, and studying human-AI interactions. Conclusions: The integration of generative AI in medical education presents exciting opportunities, alongside considerable challenges. There is a need to develop new skills and competencies related to AI as well as thoughtful, nuanced approaches to examine the growing use of generative AI in medical education.
UR - https://mededu.jmir.org/2023/1/e48785/ UR - http://dx.doi.org/10.2196/48785 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/48785 ER - TY - JOUR AU - Yanagita, Yasutaka AU - Yokokawa, Daiki AU - Uchida, Shun AU - Tawara, Junsuke AU - Ikusaka, Masatomi PY - 2023/10/13 TI - Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study JO - JMIR Form Res SP - e48023 VL - 7 KW - artificial intelligence KW - ChatGPT KW - GPT-4 KW - AI KW - National Medical Licensing Examination KW - Japanese KW - NMLE N2 - Background: ChatGPT (OpenAI) has gained considerable attention because of its natural and intuitive responses. ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers, as stated by OpenAI as a limitation. However, considering that ChatGPT is an interactive AI that has been trained to reduce the output of unethical sentences, the reliability of the training data is high and the usefulness of the output content is promising. Fortunately, in March 2023, a new version of ChatGPT, GPT-4, was released, which, according to internal evaluations, was expected to increase the likelihood of producing factual responses by 40% compared with its predecessor, GPT-3.5. The usefulness of this version of ChatGPT in English is widely appreciated. It is also increasingly being evaluated as a system for obtaining medical information in languages other than English. Although it does not reach a passing score on the national medical examination in Chinese, its accuracy is expected to gradually improve. Evaluation of ChatGPT with Japanese input is limited, although there have been reports on the accuracy of ChatGPT's answers to clinical questions regarding the Japanese Society of Hypertension guidelines and on the performance of the National Nursing Examination. Objective: The objective of this study is to evaluate whether ChatGPT can provide accurate diagnoses and medical knowledge for Japanese input. Methods: Questions from the National Medical Licensing Examination (NMLE) in Japan, administered by the Japanese Ministry of Health, Labour and Welfare in 2022, were used. All 400 questions were included. Questions containing figures or tables, which ChatGPT could not recognize, were excluded; only text-based questions were extracted. We instructed GPT-3.5 and GPT-4 to input the Japanese questions as they were and to output the correct answers for each question. The output of ChatGPT was verified by 2 general practice physicians. In case of discrepancies, they were checked by another physician to make a final decision. The overall performance was evaluated by calculating the percentage of correct answers output by GPT-3.5 and GPT-4. Results: Of the 400 questions, 292 were analyzed. Questions containing charts, which are not supported by ChatGPT, were excluded. The correct response rate for GPT-4 was 81.5% (237/292), which was significantly higher than the rate for GPT-3.5, 42.8% (125/292). Moreover, GPT-4 surpassed the passing standard (>72%) for the NMLE, indicating its potential as a diagnostic and therapeutic decision aid for physicians. Conclusions: GPT-4 reached the passing standard for the NMLE in Japan when questions were entered in Japanese, although the evaluation was limited to text-based questions. As the accelerated progress in the past few months has shown, the performance of the AI will improve as the large language model continues to learn more, and it may well become a decision support system for medical professionals by providing more accurate information.
UR - https://formative.jmir.org/2023/1/e48023 UR - http://dx.doi.org/10.2196/48023 UR - http://www.ncbi.nlm.nih.gov/pubmed/37831496 ID - info:doi/10.2196/48023 ER - TY - JOUR AU - Flores-Cohaila, A. Javier AU - García-Vicente, Abigaíl AU - Vizcarra-Jiménez, F. Sonia AU - De la Cruz-Galán, P. Janith AU - Gutiérrez-Arratia, D. Jesús AU - Quiroga Torres, Geraldine Blanca AU - Taype-Rondan, Alvaro PY - 2023/9/28 TI - Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study JO - JMIR Med Educ SP - e48039 VL - 9 KW - medical education KW - generative pre-trained transformer KW - ChatGPT KW - licensing examination KW - assessment KW - Peru KW - Examen Nacional de Medicina KW - ENAM KW - learning model KW - artificial intelligence KW - AI KW - medical examination N2 - Background: ChatGPT has shown impressive performance in national medical licensing examinations, such as the United States Medical Licensing Examination (USMLE), even passing it with expert-level performance. However, there is a lack of research on its performance in low-income countries' national licensing medical examinations. In Peru, where almost one out of three examinees fails the national licensing medical examination, ChatGPT has the potential to enhance medical education. Objective: We aimed to assess the accuracy of ChatGPT using GPT-3.5 and GPT-4 on the Peruvian National Licensing Medical Examination (Examen Nacional de Medicina [ENAM]). Additionally, we sought to identify factors associated with incorrect answers provided by ChatGPT. Methods: We used the ENAM 2022 data set, which consisted of 180 multiple-choice questions, to evaluate the performance of ChatGPT. Various prompts were used, and accuracy was evaluated. The performance of ChatGPT was compared to that of a sample of 1025 examinees. Factors such as question type, Peruvian-specific knowledge, discrimination, difficulty, quality of questions, and subject were analyzed to determine their influence on incorrect answers. Questions that received incorrect answers underwent a three-step process involving different prompts to explore the potential impact of adding roles and context on ChatGPT's accuracy. Results: GPT-4 achieved an accuracy of 86% on the ENAM, followed by GPT-3.5 with 77%. The accuracy obtained by the 1025 examinees was 55%. There was a fair agreement (κ=0.38) between GPT-3.5 and GPT-4. Moderate-to-high-difficulty questions were associated with incorrect answers in the crude and adjusted model for GPT-3.5 (odds ratio [OR] 6.6, 95% CI 2.73-15.95) and GPT-4 (OR 33.23, 95% CI 4.3-257.12). After reinputting questions that received incorrect answers, GPT-3.5 went from 41 (100%) to 12 (29%) incorrect answers, and GPT-4 from 25 (100%) to 4 (16%). Conclusions: Our study found that ChatGPT (GPT-3.5 and GPT-4) can achieve expert-level performance on the ENAM, outperforming most of our examinees. We found fair agreement between both GPT-3.5 and GPT-4. Incorrect answers were associated with the difficulty of questions, which may resemble human performance. Furthermore, by reinputting questions that initially received incorrect answers with different prompts containing additional roles and context, ChatGPT achieved improved accuracy.
UR - https://mededu.jmir.org/2023/1/e48039 UR - http://dx.doi.org/10.2196/48039 UR - http://www.ncbi.nlm.nih.gov/pubmed/37768724 ID - info:doi/10.2196/48039 ER - TY - JOUR AU - Huang, ST Ryan AU - Lu, Qi Kevin Jia AU - Meaney, Christopher AU - Kemppainen, Joel AU - Punnett, Angela AU - Leung, Fok-Han PY - 2023/9/19 TI - Assessment of Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: Comparative Study JO - JMIR Med Educ SP - e50514 VL - 9 KW - medical education KW - medical knowledge exam KW - artificial intelligence KW - AI KW - natural language processing KW - NLP KW - large language model KW - LLM KW - machine learning KW - ChatGPT KW - GPT-3.5 KW - GPT-4 KW - education KW - language model KW - education examination KW - testing KW - utility KW - family medicine KW - medical residents KW - test KW - community N2 - Background: Large language model (LLM)-based chatbots are evolving at an unprecedented pace with the release of ChatGPT, specifically GPT-3.5, and its successor, GPT-4. Their capabilities in general-purpose tasks and language generation have advanced to the point of performing excellently on various educational examination benchmarks, including medical knowledge tests. Comparing the performance of these 2 LLM models to that of Family Medicine residents on a multiple-choice medical knowledge test can provide insights into their potential as medical education tools. Objective: This study aimed to quantitatively and qualitatively compare the performance of GPT-3.5, GPT-4, and Family Medicine residents in a multiple-choice medical knowledge test appropriate for the level of a Family Medicine resident. Methods: An official University of Toronto Department of Family and Community Medicine Progress Test consisting of multiple-choice questions was inputted into GPT-3.5 and GPT-4. The artificial intelligence chatbot's responses were manually reviewed to determine the selected answer, response length, response time, provision of a rationale for the outputted response, and the root cause of all incorrect responses (classified into arithmetic, logical, and information errors). The performance of the artificial intelligence chatbots was compared against a cohort of Family Medicine residents who concurrently attempted the test. Results: GPT-4 performed significantly better compared to GPT-3.5 (difference 25.0%, 95% CI 16.3%-32.8%; McNemar test: P<.001); it correctly answered 89/108 (82.4%) questions, while GPT-3.5 answered 62/108 (57.4%) questions correctly. Further, GPT-4 scored higher across all 11 categories of Family Medicine knowledge. In 86.1% (n=93) of the responses, GPT-4 provided a rationale for why other multiple-choice options were not chosen compared to the 16.7% (n=18) achieved by GPT-3.5. Qualitatively, for both GPT-3.5 and GPT-4 responses, logical errors were the most common, while arithmetic errors were the least common. The average performance of Family Medicine residents was 56.9% (95% CI 56.2%-57.6%). The performance of GPT-3.5 was similar to that of the average Family Medicine resident (P=.16), while the performance of GPT-4 exceeded that of the top-performing Family Medicine resident (P<.001). Conclusions: GPT-4 significantly outperforms both GPT-3.5 and Family Medicine residents on a multiple-choice medical knowledge test designed for Family Medicine residents. GPT-4 provides a logical rationale for its response choice, ruling out other answer choices efficiently and with concise justification.
Its high degree of accuracy and advanced reasoning capabilities facilitate its potential applications in medical education, including the creation of exam questions and scenarios as well as serving as a resource for medical knowledge or information on community services. UR - https://mededu.jmir.org/2023/1/e50514 UR - http://dx.doi.org/10.2196/50514 UR - http://www.ncbi.nlm.nih.gov/pubmed/37725411 ID - info:doi/10.2196/50514 ER - TY - JOUR AU - Khlaif, N. Zuheir AU - Mousa, Allam AU - Hattab, Kamal Muayad AU - Itmazi, Jamil AU - Hassan, A. Amjad AU - Sanmugam, Mageswaran AU - Ayyoub, Abedalkarim PY - 2023/9/14 TI - The Potential and Concerns of Using AI in Scientific Research: ChatGPT Performance Evaluation JO - JMIR Med Educ SP - e47049 VL - 9 KW - artificial intelligence KW - AI KW - ChatGPT KW - scientific research KW - research ethics N2 - Background: Artificial intelligence (AI) has many applications in various aspects of our daily life, including health, criminal, education, civil, business, and liability law. One aspect of AI that has gained significant attention is natural language processing (NLP), which refers to the ability of computers to understand and generate human language. Objective: This study aims to examine the potential for, and concerns of, using AI in scientific research. For this purpose, high-impact research articles were generated by analyzing the quality of reports generated by ChatGPT and assessing the application's impact on the research framework, data analysis, and the literature review. The study also explored concerns around ownership and the integrity of research when using AI-generated text. Methods: A total of 4 articles were generated using ChatGPT, and thereafter evaluated by 23 reviewers. The researchers developed an evaluation form to assess the quality of the articles generated. Additionally, 50 abstracts were generated using ChatGPT and their quality was evaluated. The data were subjected to ANOVA and thematic analysis to analyze the qualitative data provided by the reviewers. Results: When using detailed prompts and providing the context of the study, ChatGPT would generate high-quality research that could be published in high-impact journals. However, ChatGPT had a minor impact on developing the research framework and data analysis. The primary area needing improvement was the development of the literature review. Moreover, reviewers expressed concerns around ownership and the integrity of the research when using AI-generated text. Nonetheless, ChatGPT has a strong potential to increase human productivity in research and can be used in academic writing. Conclusions: AI-generated text has the potential to improve the quality of high-impact research articles. The findings of this study suggest that decision makers and researchers should focus more on the methodology part of the research, which includes research design, developing research tools, and analyzing data in depth, to draw strong theoretical and practical implications, thereby establishing a revolution in scientific research in the era of AI. The practical implications of this study can be used in different fields such as medical education to deliver materials to develop the basic competencies for both medicine students and faculty members. UR - https://mededu.jmir.org/2023/1/e47049 UR - http://dx.doi.org/10.2196/47049 UR - http://www.ncbi.nlm.nih.gov/pubmed/37707884 ID - info:doi/10.2196/47049 ER - TY - JOUR AU - Sallam, Malik AU - Salim, A.
Nesreen AU - Barakat, Muna AU - Al-Mahzoum, Kholoud AU - Al-Tammemi, B. Ala'a AU - Malaeb, Diana AU - Hallit, Rabih AU - Hallit, Souheil PY - 2023/9/5 TI - Assessing Health Students' Attitudes and Usage of ChatGPT in Jordan: Validation Study JO - JMIR Med Educ SP - e48254 VL - 9 KW - artificial intelligence KW - machine learning KW - education KW - technology KW - healthcare KW - survey KW - opinion KW - knowledge KW - practices KW - KAP N2 - Background: ChatGPT is a conversational large language model that has the potential to revolutionize knowledge acquisition. However, the impact of this technology on the quality of education is still unknown considering the risks and concerns surrounding ChatGPT use. Therefore, it is necessary to assess the usability and acceptability of this promising tool. As an innovative technology, the intention to use ChatGPT can be studied in the context of the technology acceptance model (TAM). Objective: This study aimed to develop and validate a TAM-based survey instrument called TAME-ChatGPT (Technology Acceptance Model Edited to Assess ChatGPT Adoption) that could be employed to examine the successful integration and use of ChatGPT in health care education. Methods: The survey tool was created based on the TAM framework. It comprised 13 items for participants who heard of ChatGPT but did not use it and 23 items for participants who used ChatGPT. Using a convenience sampling approach, the survey link was circulated electronically among university students between February and March 2023. Exploratory factor analysis (EFA) was used to assess the construct validity of the survey instrument. Results: The final sample comprised 458 respondents, the majority of whom were undergraduate students (n=442, 96.5%). Only 109 (23.8%) respondents had heard of ChatGPT prior to participation and only 55 (11.3%) self-reported ChatGPT use before the study. EFA on the attitude and usage scales showed significant Bartlett tests of sphericity scores (P<.001) and adequate Kaiser-Meyer-Olkin measures (0.823 for the attitude scale and 0.702 for the usage scale), confirming the factorability of the correlation matrices. The EFA showed that 3 constructs explained a cumulative total of 69.3% variance in the attitude scale, and these subscales represented perceived risks, attitude to technology/social influence, and anxiety. For the ChatGPT usage scale, EFA showed that 4 constructs explained a cumulative total of 72% variance in the data and comprised the perceived usefulness, perceived risks, perceived ease of use, and behavior/cognitive factors. All the ChatGPT attitude and usage subscales showed good reliability with Cronbach α values >.78 for all the deduced subscales. Conclusions: The TAME-ChatGPT demonstrated good reliability, validity, and usefulness in assessing health care students' attitudes toward ChatGPT. The findings highlighted the importance of considering risk perceptions, usefulness, ease of use, attitudes toward technology, and behavioral factors when adopting ChatGPT as a tool in health care education. This information can aid the stakeholders in creating strategies to support the optimal and ethical use of ChatGPT and to identify the potential challenges hindering its successful implementation. Future research is recommended to guide the effective adoption of ChatGPT in health care education.
UR - https://mededu.jmir.org/2023/1/e48254 UR - http://dx.doi.org/10.2196/48254 UR - http://www.ncbi.nlm.nih.gov/pubmed/37578934 ID - info:doi/10.2196/48254 ER - TY - JOUR AU - Roos, Jonas AU - Kasapovic, Adnan AU - Jansen, Tom AU - Kaczmarczyk, Robert PY - 2023/9/4 TI - Artificial Intelligence in Medical Education: Comparative Analysis of ChatGPT, Bing, and Medical Students in Germany JO - JMIR Med Educ SP - e46482 VL - 9 KW - medical education KW - state examinations KW - exams KW - large language models KW - artificial intelligence KW - ChatGPT N2 - Background: Large language models (LLMs) have demonstrated significant potential in diverse domains, including medicine. Nonetheless, there is a scarcity of studies examining their performance in medical examinations, especially those conducted in languages other than English, and in direct comparison with medical students. Analyzing the performance of LLMs in state medical examinations can provide insights into their capabilities and limitations and evaluate their potential role in medical education and examination preparation.  Objective: This study aimed to assess and compare the performance of 3 LLMs, GPT-4, Bing, and GPT-3.5-Turbo, in the German Medical State Examinations of 2022 and to evaluate their performance relative to that of medical students.  Methods: The LLMs were assessed on a total of 630 questions from the spring and fall German Medical State Examinations of 2022. The performance was evaluated with and without media-related questions. Statistical analyses included 1-way ANOVA and independent samples t tests for pairwise comparisons. The relative strength of the LLMs in comparison with that of the students was also evaluated.  Results: GPT-4 achieved the highest overall performance, correctly answering 88.1% of questions, closely followed by Bing (86.0%) and GPT-3.5-Turbo (65.7%). The students had an average correct answer rate of 74.6%. Both GPT-4 and Bing significantly outperformed the students in both examinations. When media questions were excluded, Bing achieved the highest performance of 90.7%, closely followed by GPT-4 (90.4%), while GPT-3.5-Turbo lagged (68.2%). There was a significant decline in the performance of GPT-4 and Bing in the fall 2022 examination, which was attributed to a higher proportion of media-related questions and a potential increase in question difficulty.  Conclusions: LLMs, particularly GPT-4 and Bing, demonstrate potential as valuable tools in medical education and for pretesting examination questions. Their high performance, even relative to that of medical students, indicates promising avenues for further development and integration into the educational and clinical landscape.  UR - https://mededu.jmir.org/2023/1/e46482 UR - http://dx.doi.org/10.2196/46482 UR - http://www.ncbi.nlm.nih.gov/pubmed/37665620 ID - info:doi/10.2196/46482 ER - TY - JOUR AU - Leung, I. Tiffany AU - Sagar, Ankita AU - Shroff, Swati AU - Henry, L. Tracey PY - 2023/8/23 TI - Can AI Mitigate Bias in Writing Letters of Recommendation? 
JO - JMIR Med Educ SP - e51494 VL - 9 KW - sponsorship KW - implicit bias KW - gender bias KW - bias KW - letters of recommendation KW - artificial intelligence KW - large language models KW - medical education KW - career advancement KW - tenure and promotion KW - promotion KW - leadership UR - https://mededu.jmir.org/2023/1/e51494 UR - http://dx.doi.org/10.2196/51494 UR - http://www.ncbi.nlm.nih.gov/pubmed/37610808 ID - info:doi/10.2196/51494 ER - TY - JOUR AU - Hsu, Hsing-Yu AU - Hsu, Kai-Cheng AU - Hou, Shih-Yen AU - Wu, Ching-Lung AU - Hsieh, Yow-Wen AU - Cheng, Yih-Dih PY - 2023/8/21 TI - Examining Real-World Medication Consultations and Drug-Herb Interactions: ChatGPT Performance Evaluation JO - JMIR Med Educ SP - e48433 VL - 9 KW - ChatGPT KW - large language model KW - natural language processing KW - real-world medication consultation questions KW - NLP KW - drug-herb interactions KW - pharmacist KW - LLM KW - language models KW - chat generative pre-trained transformer N2 - Background: Since OpenAI released ChatGPT, with its strong capability in handling natural tasks and its user-friendly interface, it has garnered significant attention. Objective: A prospective analysis is required to evaluate the accuracy and appropriateness of medication consultation responses generated by ChatGPT. Methods: A prospective cross-sectional study was conducted by the pharmacy department of a medical center in Taiwan. The test data set comprised retrospective medication consultation questions collected from February 1, 2023, to February 28, 2023, along with common questions about drug-herb interactions. Two distinct sets of questions were tested: real-world medication consultation questions and common questions about interactions between traditional Chinese and Western medicines. We used the conventional double-review mechanism. The appropriateness of each response from ChatGPT was assessed by 2 experienced pharmacists. In the event of a discrepancy between the assessments, a third pharmacist stepped in to make the final decision. Results: Of 293 real-world medication consultation questions, a random selection of 80 was used to evaluate ChatGPT's performance. ChatGPT exhibited a higher appropriateness rate in responding to public medication consultation questions compared to those asked by health care providers in a hospital setting (31/51, 61% vs 20/51, 39%; P=.01). Conclusions: The findings from this study suggest that ChatGPT could potentially be used for answering basic medication consultation questions. Our analysis of the erroneous information allowed us to identify potential medical risks associated with certain questions; this problem deserves our close attention. UR - https://mededu.jmir.org/2023/1/e48433 UR - http://dx.doi.org/10.2196/48433 UR - http://www.ncbi.nlm.nih.gov/pubmed/37561097 ID - info:doi/10.2196/48433 ER - TY - JOUR AU - Lee, Hyeonhoon PY - 2023/8/17 TI - Using ChatGPT as a Learning Tool in Acupuncture Education: Comparative Study JO - JMIR Med Educ SP - e47427 VL - 9 KW - ChatGPT KW - educational tool KW - artificial intelligence KW - acupuncture KW - AI KW - personalized education KW - students N2 - Background: ChatGPT (OpenAI) is a state-of-the-art artificial intelligence model with potential applications in the medical fields of clinical practice, research, and education.
Objective: This study aimed to evaluate the potential of ChatGPT as an educational tool in college acupuncture programs, focusing on its ability to support students in learning acupuncture point selection, treatment planning, and decision-making. Methods: We collected case studies published in Acupuncture in Medicine between June 2022 and May 2023. Both ChatGPT-3.5 and ChatGPT-4 were used to generate suggestions for acupuncture points based on case presentations. A Wilcoxon signed-rank test was conducted to compare the number of acupuncture points generated by ChatGPT-3.5 and ChatGPT-4, and the overlapping ratio of acupuncture points was calculated. Results: Among the 21 case studies, 14 studies were included for analysis. ChatGPT-4 generated significantly more acupuncture points (9.0, SD 1.1) compared to ChatGPT-3.5 (5.6, SD 0.6; P<.001). The overlapping ratios of acupuncture points for ChatGPT-3.5 (0.40, SD 0.28) and ChatGPT-4 (0.34, SD 0.27; P=.67) were not significantly different. Conclusions: ChatGPT may be a useful educational tool for acupuncture students, providing valuable insights into personalized treatment plans. However, it cannot fully replace traditional diagnostic methods, and further studies are needed to ensure its safe and effective implementation in acupuncture education. UR - https://mededu.jmir.org/2023/1/e47427 UR - http://dx.doi.org/10.2196/47427 UR - http://www.ncbi.nlm.nih.gov/pubmed/37590034 ID - info:doi/10.2196/47427 ER - TY - JOUR AU - Borchert, J. Robin AU - Hickman, R. Charlotte AU - Pepys, Jack AU - Sadler, J. Timothy PY - 2023/8/7 TI - Performance of ChatGPT on the Situational Judgement Test–A Professional Dilemmas–Based Examination for Doctors in the United Kingdom JO - JMIR Med Educ SP - e48978 VL - 9 KW - ChatGPT KW - language models KW - Situational Judgement Test KW - medical education KW - artificial intelligence KW - language model KW - exam KW - examination KW - SJT KW - judgement KW - reasoning KW - communication KW - chatbot N2 - Background: ChatGPT is a large language model that has performed well on professional examinations in the fields of medicine, law, and business. However, it is unclear how ChatGPT would perform on an examination assessing professionalism and situational judgement for doctors. Objective: We evaluated the performance of ChatGPT on the Situational Judgement Test (SJT): a national examination taken by all final-year medical students in the United Kingdom. This examination is designed to assess attributes such as communication, teamwork, patient safety, prioritization skills, professionalism, and ethics. Methods: All questions from the UK Foundation Programme Office's (UKFPO's) 2023 SJT practice examination were inputted into ChatGPT. For each question, ChatGPT's answers and rationales were recorded and assessed on the basis of the official UK Foundation Programme Office scoring template. Questions were categorized into domains of Good Medical Practice on the basis of the domains referenced in the rationales provided in the scoring sheet. Questions without clear domain links were screened by reviewers and assigned one or multiple domains. ChatGPT's overall performance, as well as its performance across the domains of Good Medical Practice, was evaluated. Results: Overall, ChatGPT performed well, scoring 76% on the SJT but scoring full marks on only a few questions (9%), which may reflect possible flaws in ChatGPT's situational judgement or inconsistencies in the reasoning across questions (or both) in the examination itself.
ChatGPT demonstrated consistent performance across the 4 outlined domains in Good Medical Practice for doctors. Conclusions: Further research is needed to understand the potential applications of large language models, such as ChatGPT, in medical education for standardizing questions and providing consistent rationales for examinations assessing professionalism and ethics. UR - https://mededu.jmir.org/2023/1/e48978 UR - http://dx.doi.org/10.2196/48978 UR - http://www.ncbi.nlm.nih.gov/pubmed/37548997 ID - info:doi/10.2196/48978 ER - TY - JOUR AU - Gilson, Aidan AU - Safranek, W. Conrad AU - Huang, Thomas AU - Socrates, Vimig AU - Chi, Ling AU - Taylor, Andrew Richard AU - Chartash, David PY - 2023/7/13 TI - Authors' Reply to: Variability in Large Language Models' Responses to Medical Licensing and Certification Examinations JO - JMIR Med Educ SP - e50336 VL - 9 KW - natural language processing KW - NLP KW - MedQA KW - generative pre-trained transformer KW - GPT KW - medical education KW - chatbot KW - artificial intelligence KW - AI KW - education technology KW - ChatGPT KW - conversational agent KW - machine learning KW - large language models KW - knowledge assessment UR - https://mededu.jmir.org/2023/1/e50336 UR - http://dx.doi.org/10.2196/50336 UR - http://www.ncbi.nlm.nih.gov/pubmed/37440299 ID - info:doi/10.2196/50336 ER - TY - JOUR AU - Epstein, H. Richard AU - Dexter, Franklin PY - 2023/7/13 TI - Variability in Large Language Models' Responses to Medical Licensing and Certification Examinations. Comment on "How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment" JO - JMIR Med Educ SP - e48305 VL - 9 KW - natural language processing KW - NLP KW - MedQA KW - generative pre-trained transformer KW - GPT KW - medical education KW - chatbot KW - artificial intelligence KW - AI KW - education technology KW - ChatGPT KW - Google Bard KW - conversational agent KW - machine learning KW - large language models KW - knowledge assessment UR - https://mededu.jmir.org/2023/1/e48305 UR - http://dx.doi.org/10.2196/48305 UR - http://www.ncbi.nlm.nih.gov/pubmed/37440293 ID - info:doi/10.2196/48305 ER - TY - JOUR AU - Seth, Puneet AU - Hueppchen, Nancy AU - Miller, D. Steven AU - Rudzicz, Frank AU - Ding, Jerry AU - Parakh, Kapil AU - Record, D. Janet PY - 2023/7/11 TI - Data Science as a Core Competency in Undergraduate Medical Education in the Age of Artificial Intelligence in Health Care JO - JMIR Med Educ SP - e46344 VL - 9 KW - data science KW - medical education KW - machine learning KW - health data KW - artificial intelligence KW - AI KW - application KW - health care delivery KW - health care KW - develop KW - medical educators KW - physician KW - education KW - training KW - barriers KW - optimize KW - integration KW - competency UR - https://mededu.jmir.org/2023/1/e46344 UR - http://dx.doi.org/10.2196/46344 UR - http://www.ncbi.nlm.nih.gov/pubmed/37432728 ID - info:doi/10.2196/46344 ER - TY - JOUR AU - Nov, Oded AU - Singh, Nina AU - Mann, Devin PY - 2023/7/10 TI - Putting ChatGPT's Medical Advice to the (Turing) Test: Survey Study JO - JMIR Med Educ SP - e46939 VL - 9 KW - artificial intelligence KW - AI KW - ChatGPT KW - large language model KW - patient-provider interaction KW - chatbot KW - feasibility KW - ethics KW - privacy KW - language model KW - machine learning N2 - Background: Chatbots are being piloted to draft responses to patient questions, but patients'
ability to distinguish between provider and chatbot responses and patients' trust in chatbots' functions are not well established. Objective: This study aimed to assess the feasibility of using ChatGPT (Chat Generative Pre-trained Transformer) or a similar artificial intelligence-based chatbot for patient-provider communication. Methods: A survey study was conducted in January 2023. Ten representative, nonadministrative patient-provider interactions were extracted from the electronic health record. Patients' questions were entered into ChatGPT with a request for the chatbot to respond using approximately the same word count as the human provider's response. In the survey, each patient question was followed by a provider- or ChatGPT-generated response. Participants were informed that 5 responses were provider generated and 5 were chatbot generated. Participants were asked, and incentivized financially, to correctly identify the response source. Participants were also asked about their trust in chatbots' functions in patient-provider communication, using a Likert scale from 1-5. Results: A US-representative sample of 430 study participants aged 18 and older were recruited on Prolific, a crowdsourcing platform for academic studies. In all, 426 participants filled out the full survey. After removing participants who spent less than 3 minutes on the survey, 392 respondents remained. Overall, 53.3% (209/392) of respondents analyzed were women, and the average age was 47.1 (range 18-91) years. The correct classification of responses ranged from 49% (192/392) to 85.7% (336/392) for different questions. On average, chatbot responses were identified correctly in 65.5% (1284/1960) of the cases, and human provider responses were identified correctly in 65.1% (1276/1960) of the cases. On average, responses regarding patients' trust in chatbots' functions were weakly positive (mean Likert score 3.4 out of 5), with lower trust as the health-related complexity of the task in the questions increased. Conclusions: ChatGPT responses to patient questions were weakly distinguishable from provider responses. Laypeople appear to trust the use of chatbots to answer lower-risk health questions. It is important to continue studying patient-chatbot interaction as chatbots move from administrative to more clinical roles in health care. UR - https://mededu.jmir.org/2023/1/e46939 UR - http://dx.doi.org/10.2196/46939 UR - http://www.ncbi.nlm.nih.gov/pubmed/37428540 ID - info:doi/10.2196/46939 ER - TY - JOUR AU - Takagi, Soshi AU - Watari, Takashi AU - Erabi, Ayano AU - Sakaguchi, Kota PY - 2023/6/29 TI - Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: Comparison Study JO - JMIR Med Educ SP - e48002 VL - 9 KW - ChatGPT KW - Chat Generative Pre-trained Transformer KW - GPT-4 KW - Generative Pre-trained Transformer 4 KW - artificial intelligence KW - AI KW - medical education KW - Japanese Medical Licensing Examination KW - medical licensing KW - clinical support KW - learning model N2 - Background: The competence of ChatGPT (Chat Generative Pre-Trained Transformer) in non-English languages is not well studied. Objective: This study compared the performances of GPT-3.5 (Generative Pre-trained Transformer) and GPT-4 on the Japanese Medical Licensing Examination (JMLE) to evaluate the reliability of these models for clinical reasoning and medical knowledge in non-English languages.
Methods: This study used the default mode of ChatGPT, which is based on GPT-3.5; the GPT-4 model of ChatGPT Plus; and the 117th JMLE in 2023. A total of 254 questions were included in the final analysis, which were categorized into 3 types, namely general, clinical, and clinical sentence questions. Results: The results indicated that GPT-4 outperformed GPT-3.5 in terms of accuracy, particularly for general, clinical, and clinical sentence questions. GPT-4 also performed better on difficult questions and specific disease questions. Furthermore, GPT-4 achieved the passing criteria for the JMLE, indicating its reliability for clinical reasoning and medical knowledge in non-English languages. Conclusions: GPT-4 could become a valuable tool for medical education and clinical support in non-English-speaking regions, such as Japan. UR - https://mededu.jmir.org/2023/1/e48002 UR - http://dx.doi.org/10.2196/48002 UR - http://www.ncbi.nlm.nih.gov/pubmed/37384388 ID - info:doi/10.2196/48002 ER - TY - JOUR AU - Karabacak, Mert AU - Ozkara, Berksu Burak AU - Margetis, Konstantinos AU - Wintermark, Max AU - Bisdas, Sotirios PY - 2023/6/6 TI - The Advent of Generative Language Models in Medical Education JO - JMIR Med Educ SP - e48163 VL - 9 KW - generative language model KW - artificial intelligence KW - medical education KW - ChatGPT KW - academic integrity KW - AI-driven feedback KW - stimulation KW - evaluation KW - technology KW - learning environment KW - medical student UR - https://mededu.jmir.org/2023/1/e48163 UR - http://dx.doi.org/10.2196/48163 UR - http://www.ncbi.nlm.nih.gov/pubmed/37279048 ID - info:doi/10.2196/48163 ER - TY - JOUR AU - Abd-alrazaq, Alaa AU - AlSaad, Rawan AU - Alhuwail, Dari AU - Ahmed, Arfan AU - Healy, Mark Padraig AU - Latifi, Syed AU - Aziz, Sarah AU - Damseh, Rafat AU - Alabed Alrazak, Sadam AU - Sheikh, Javaid PY - 2023/6/1 TI - Large Language Models in Medical Education: Opportunities, Challenges, and Future Directions JO - JMIR Med Educ SP - e48291 VL - 9 KW - large language models KW - artificial intelligence KW - medical education KW - ChatGPT KW - GPT-4 KW - generative AI KW - students KW - educators UR - https://mededu.jmir.org/2023/1/e48291 UR - http://dx.doi.org/10.2196/48291 UR - http://www.ncbi.nlm.nih.gov/pubmed/37261894 ID - info:doi/10.2196/48291 ER - TY - JOUR AU - Giannos, Panagiotis AU - Delardas, Orestis PY - 2023/4/26 TI - Performance of ChatGPT on UK Standardized Admission Tests: Insights From the BMAT, TMUA, LNAT, and TSA Examinations JO - JMIR Med Educ SP - e47737 VL - 9 KW - standardized admissions tests KW - GPT KW - ChatGPT KW - medical education KW - medicine KW - law KW - natural language processing KW - BMAT KW - TMUA KW - LNAT KW - TSA N2 - Background: Large language models, such as ChatGPT by OpenAI, have demonstrated potential in various applications, including medical education. Previous studies have assessed ChatGPT's performance in university or professional settings. However, the model's potential in the context of standardized admission tests remains unexplored. Objective: This study evaluated ChatGPT's performance on standardized admission tests in the United Kingdom, including the BioMedical Admissions Test (BMAT), Test of Mathematics for University Admission (TMUA), Law National Aptitude Test (LNAT), and Thinking Skills Assessment (TSA), to understand its potential as an innovative tool for education and test preparation.
Methods: Recent public resources (2019-2022) were used to compile a data set of 509 questions from the BMAT, TMUA, LNAT, and TSA covering diverse topics in aptitude, scientific knowledge and applications, mathematical thinking and reasoning, critical thinking, problem-solving, reading comprehension, and logical reasoning. This evaluation assessed ChatGPT's performance using the legacy GPT-3.5 model, focusing on multiple-choice questions for consistency. The model's performance was analyzed based on question difficulty, the proportion of correct responses when aggregating exams from all years, and a comparison of test scores between papers of the same exam using binomial distribution and paired-sample (2-tailed) t tests. Results: The proportion of correct responses was significantly lower than incorrect ones in BMAT section 2 (P<.001) and TMUA paper 1 (P<.001) and paper 2 (P<.001). No significant differences were observed in BMAT section 1 (P=.2), TSA section 1 (P=.7), or LNAT papers 1 and 2, section A (P=.3). ChatGPT performed better in BMAT section 1 than section 2 (P=.047), with a maximum candidate ranking of 73% compared to a minimum of 1%. In the TMUA, it engaged with questions but had limited accuracy and no performance difference between papers (P=.6), with candidate rankings below 10%. In the LNAT, it demonstrated moderate success, especially in paper 2's questions; however, student performance data were unavailable. TSA performance varied across years with generally moderate results and fluctuating candidate rankings. Similar trends were observed for easy to moderate difficulty questions (BMAT section 1, P=.3; BMAT section 2, P=.04; TMUA paper 1, P<.001; TMUA paper 2, P=.003; TSA section 1, P=.8; and LNAT papers 1 and 2, section A, P>.99) and hard to challenging ones (BMAT section 1, P=.7; BMAT section 2, P<.001; TMUA paper 1, P=.007; TMUA paper 2, P<.001; TSA section 1, P=.3; and LNAT papers 1 and 2, section A, P=.2). Conclusions: ChatGPT shows promise as a supplementary tool for subject areas and test formats that assess aptitude, problem-solving and critical thinking, and reading comprehension. However, its limitations in areas such as scientific and mathematical knowledge and applications highlight the need for continuous development and integration with conventional learning strategies in order to fully harness its potential. UR - https://mededu.jmir.org/2023/1/e47737 UR - http://dx.doi.org/10.2196/47737 UR - http://www.ncbi.nlm.nih.gov/pubmed/37099373 ID - info:doi/10.2196/47737 ER - TY - JOUR AU - Thirunavukarasu, James Arun AU - Hassan, Refaat AU - Mahmood, Shathar AU - Sanghera, Rohan AU - Barzangi, Kara AU - El Mukashfi, Mohanned AU - Shah, Sachin PY - 2023/4/21 TI - Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care JO - JMIR Med Educ SP - e46599 VL - 9 KW - ChatGPT KW - large language model KW - natural language processing KW - decision support techniques KW - artificial intelligence KW - AI KW - deep learning KW - primary care KW - general practice KW - family medicine KW - chatbot N2 - Background: Large language models exhibiting human-level performance in specialized tasks are emerging; examples include Generative Pretrained Transformer 3.5, which underlies the processing of ChatGPT. Rigorous trials are required to understand the capabilities of emerging technology, so that innovation can be directed to benefit patients and practitioners.
Objective: Here, we evaluated the strengths and weaknesses of ChatGPT in primary care using the Membership of the Royal College of General Practitioners Applied Knowledge Test (AKT) as a medium. Methods: AKT questions were sourced from a web-based question bank and 2 AKT practice papers. In total, 674 unique AKT questions were inputted to ChatGPT, with the model's answers recorded and compared to correct answers provided by the Royal College of General Practitioners. Each question was inputted twice in separate ChatGPT sessions, with answers on repeated trials compared to gauge consistency. Subject difficulty was gauged by referring to examiners' reports from 2018 to 2022. Novel explanations from ChatGPT (defined as information provided that was not inputted within the question or multiple answer choices) were recorded. Performance was analyzed with respect to subject, difficulty, question source, and novel model outputs to explore ChatGPT's strengths and weaknesses. Results: Average overall performance of ChatGPT was 60.17%, which is below the mean passing mark in the last 2 years (70.42%). Accuracy differed between sources (P=.04 and .06). ChatGPT's performance varied with subject category (P=.02 and .02), but variation did not correlate with difficulty (Spearman ρ=-0.241 and -0.238; P=.19 and .20). The proclivity of ChatGPT to provide novel explanations did not affect accuracy (P>.99 and .23). Conclusions: Large language models are approaching human expert-level performance, although further development is required to match the performance of qualified primary care physicians in the AKT. Validated high-performance models may serve as assistants or autonomous clinical tools to ameliorate the general practice workforce crisis. UR - https://mededu.jmir.org/2023/1/e46599 UR - http://dx.doi.org/10.2196/46599 UR - http://www.ncbi.nlm.nih.gov/pubmed/37083633 ID - info:doi/10.2196/46599 ER - TY - JOUR AU - Adams, C. Lisa AU - Busch, Felix AU - Truhn, Daniel AU - Makowski, R. Marcus AU - Aerts, L. Hugo J. W. AU - Bressem, K. Keno PY - 2023/3/16 TI - What Does DALL-E 2 Know About Radiology? JO - J Med Internet Res SP - e43110 VL - 25 KW - DALL-E KW - creating images from text KW - image creation KW - image generation KW - transformer language model KW - machine learning KW - generative model KW - radiology KW - x-ray KW - artificial intelligence KW - medical imaging KW - text-to-image KW - diagnostic imaging UR - https://www.jmir.org/2023/1/e43110 UR - http://dx.doi.org/10.2196/43110 UR - http://www.ncbi.nlm.nih.gov/pubmed/36927634 ID - info:doi/10.2196/43110 ER - TY - JOUR AU - Sabry Abdel-Messih, Mary AU - Kamel Boulos, N.
Maged PY - 2023/3/8 TI - ChatGPT in Clinical Toxicology JO - JMIR Med Educ SP - e46876 VL - 9 KW - ChatGPT KW - clinical toxicology KW - organophosphates KW - artificial intelligence KW - AI KW - medical education UR - https://mededu.jmir.org/2023/1/e46876 UR - http://dx.doi.org/10.2196/46876 UR - http://www.ncbi.nlm.nih.gov/pubmed/36867743 ID - info:doi/10.2196/46876 ER - TY - JOUR AU - Eysenbach, Gunther PY - 2023/3/6 TI - The Role of ChatGPT, Generative Language Models, and Artificial Intelligence in Medical Education: A Conversation With ChatGPT and a Call for Papers JO - JMIR Med Educ SP - e46885 VL - 9 KW - artificial intelligence KW - AI KW - ChatGPT KW - generative language model KW - medical education KW - interview KW - future of education UR - https://mededu.jmir.org/2023/1/e46885 UR - http://dx.doi.org/10.2196/46885 UR - http://www.ncbi.nlm.nih.gov/pubmed/36863937 ID - info:doi/10.2196/46885 ER - TY - JOUR AU - Gilson, Aidan AU - Safranek, W. Conrad AU - Huang, Thomas AU - Socrates, Vimig AU - Chi, Ling AU - Taylor, Andrew Richard AU - Chartash, David PY - 2023/2/8 TI - How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment JO - JMIR Med Educ SP - e45312 VL - 9 KW - natural language processing KW - NLP KW - MedQA KW - generative pre-trained transformer KW - GPT KW - medical education KW - chatbot KW - artificial intelligence KW - education technology KW - ChatGPT KW - conversational agent KW - machine learning KW - USMLE N2 - Background: Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input. Objective: This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 exams, as well as to analyze responses for user interpretability. Methods: We used 2 sets of multiple-choice questions to evaluate ChatGPT's performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and the performance on an exam relative to the user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT's performance was compared to 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question. Results: Of the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. We found that logical justification for ChatGPT's answer selection was present in 100% of outputs of the NBME data sets. Internal information to the question was present in 96.8% (183/189) of all questions.
The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively. Conclusions: ChatGPT marks a significant improvement in natural language processing models on the tasks of medical question answering. By performing at a greater than 60% threshold on the NBME-Free-Step-1 data set, we show that the model achieves the equivalent of a passing score for a third-year medical student. Additionally, we highlight ChatGPT's capacity to provide logic and informational context across the majority of answers. These facts taken together make a compelling case for the potential applications of ChatGPT as an interactive medical education tool to support learning. UR - https://mededu.jmir.org/2023/1/e45312 UR - http://dx.doi.org/10.2196/45312 UR - http://www.ncbi.nlm.nih.gov/pubmed/36753318 ID - info:doi/10.2196/45312 ER -