%0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e67244 %T Large Language Models in Biochemistry Education: Comparative Evaluation of Performance %A Bolgova,Olena %A Shypilova,Inna %A Mavrych,Volodymyr %K ChatGPT %K Claude %K Gemini %K Copilot %K biochemistry %K LLM %K medical education %K artificial intelligence %K NLP %K natural language processing %K machine learning %K large language model %K AI %K ML %K comprehensive analysis %K medical students %K GPT-4 %K questionnaire %K medical course %K bioenergetics %D 2025 %7 10.4.2025 %9 %J JMIR Med Educ %G English %X Background: Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs), have started a new era of innovation across various fields, with medicine at the forefront of this technological revolution. Many studies indicated that at the current level of development, LLMs can pass different board exams. However, the ability to answer specific subject-related questions requires validation. Objective: The objective of this study was to conduct a comprehensive analysis comparing the performance of advanced LLM chatbots—Claude (Anthropic), GPT-4 (OpenAI), Gemini (Google), and Copilot (Microsoft)—against the academic results of medical students in the medical biochemistry course. Methods: We used 200 USMLE (United States Medical Licensing Examination)–style multiple-choice questions (MCQs) selected from the course exam database. They encompassed various complexity levels and were distributed across 23 distinctive topics. The questions with tables and images were not included in the study. The results of 5 successive attempts by Claude 3.5 Sonnet, GPT-4‐1106, Gemini 1.5 Flash, and Copilot to answer this questionnaire set were evaluated based on accuracy in August 2024. Statistica 13.5.0.17 (TIBCO Software Inc) was used to analyze the data’s basic statistics. Considering the binary nature of the data, the chi-square test was used to compare results among the different chatbots, with a statistical significance level of P<.05. Results: On average, the selected chatbots correctly answered 81.1% (SD 12.8%) of the questions, surpassing the students’ performance by 8.3% (P=.02). In this study, Claude showed the best performance in biochemistry MCQs, correctly answering 92.5% (185/200) of questions, followed by GPT-4 (170/200, 85%), Gemini (157/200, 78.5%), and Copilot (128/200, 64%). The chatbots demonstrated the best results in the following 4 topics: eicosanoids (mean 100%, SD 0%), bioenergetics and electron transport chain (mean 96.4%, SD 7.2%), hexose monophosphate pathway (mean 91.7%, SD 16.7%), and ketone bodies (mean 93.8%, SD 12.5%). The Pearson chi-square test indicated a statistically significant association between the answers of all 4 chatbots (P<.001 to P<.04). Conclusions: Our study suggests that different AI models may have unique strengths in specific medical fields, which could be leveraged for targeted support in biochemistry courses. This performance highlights the potential of AI in medical education and assessment. 
%R 10.2196/67244 %U https://mededu.jmir.org/2025/1/e67244 %U https://doi.org/10.2196/67244 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 9 %N %P e65726 %T Using a Hybrid of AI and Template-Based Method in Automatic Item Generation to Create Multiple-Choice Questions in Medical Education: Hybrid AIG %A Kıyak,Yavuz Selim %A Kononowicz,Andrzej A %K automatic item generation %K ChatGPT %K artificial intelligence %K large language models %K medical education %K AI %K hybrid %K template-based method %K hybrid AIG %K mixed-method %K multiple-choice question %K multiple-choice %K human-AI collaboration %K human-AI %K algorithm %K expert %D 2025 %7 4.4.2025 %9 %J JMIR Form Res %G English %X Background: Template-based automatic item generation (AIG) is more efficient than traditional item writing, but it still heavily relies on expert effort in model development. While nontemplate-based AIG, leveraging artificial intelligence (AI), offers efficiency, it faces accuracy challenges. Medical education, a field that relies heavily on both formative and summative assessments with multiple-choice questions, is in dire need of AI-based support for the efficient automatic generation of items. Objective: We aimed to propose a hybrid AIG method and to demonstrate whether it is possible to generate item templates using AI in the field of medical education. Methods: This is a mixed-methods methodological study with proof-of-concept elements. We propose the hybrid AIG method as a structured series of interactions between a human subject matter expert and AI, designed as a collaborative authoring effort. The method leverages AI to generate item models (templates) and cognitive models to combine the advantages of the two AIG approaches. To demonstrate how to create item models using hybrid AIG, we used 2 medical multiple-choice questions: one on respiratory infections in adults and another on acute allergic reactions in the pediatric population. Results: The hybrid AIG method we propose consists of 7 steps. The first 5 steps are performed by an expert in a customized AI environment. These involve providing a parent item, identifying elements for manipulation, selecting options and assigning values to elements, and generating the cognitive model. After a final expert review (Step 6), the content in the template can be used for item generation through traditional (non-AI) software (Step 7). We showed that AI is capable of generating item templates for AIG under the control of a human expert in only 10 minutes. Leveraging AI in template development made it less challenging. Conclusions: The hybrid AIG method transcends the traditional template-based approach by marrying the “art” that comes from AI as a “black box” with the “science” of algorithmic generation under the oversight of an expert as a “marriage registrar”. It not only capitalizes on the strengths of both approaches but also mitigates their weaknesses, offering a human-AI collaboration to increase efficiency in medical education. 
%R 10.2196/65726 %U https://formative.jmir.org/2025/1/e65726 %U https://doi.org/10.2196/65726 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e71844 %T Knowledge Mapping and Global Trends in Simulation in Medical Education: Bibliometric and Visual Analysis %A Ba,Hongjun %A Zhang,Lili %A He,Xiufang %A Li,Shujuan %+ Department of Pediatrics, The First Affiliated Hospital, Sun Yat-sen University, 58 Zhongshan Road 2, Guangzhou, 510080, China, 86 15920109625, bahj3@mail.sysu.edu.cn %K medical education %K simulation-based teaching %K bibliometrics %K visualization analysis %K knowledge mapping %D 2025 %7 26.3.2025 %9 Original Paper %J JMIR Med Educ %G English %X Background: With the increasing recognition of the importance of simulation-based teaching in medical education, research in this field has developed rapidly. To comprehensively understand the research dynamics and trends in this area, we conducted an analysis of knowledge mapping and global trends. Objective: This study aims to reveal the research hotspots and development trends in the field of simulation-based teaching in medical education from 2004 to 2024 through bibliometric and visualization analyses. Methods: Using CiteSpace and VOSviewer, we conducted bibliometric and visualization analyses of 6743 articles related to simulation-based teaching in medical education, published in core journals from 2004 to 2024. The analysis included publication trends, contributions by countries and institutions, author contributions, keyword co-occurrence and clustering, and keyword bursts. Results: From 2004 to 2008, the number of articles published annually did not exceed 100. However, starting from 2009, the number increased year by year, reaching a peak of 850 articles in 2024, indicating rapid development in this research field. The United States, Canada, the United Kingdom, Australia, and China published the most articles. Harvard University emerged as a research hub with 1799 collaborative links, although the overall collaboration density was low. Among the 6743 core journal articles, a total of 858 authors were involved, with Lars Konge and Adam Dubrowski being the most prolific. However, collaboration density was low, and the collaboration network was relatively dispersed. A total of 812 common keywords were identified, forming 4189 links. The keywords “medical education,” “education,” and “simulation” had the highest frequency of occurrence. Cluster analysis indicated that “cardiopulmonary resuscitation” and “surgical education” were major research hotspots. From 2004 to 2024, a total of 20 burst keywords were identified, among which “patient simulation,” “randomized controlled trial,” “clinical competence,” and “deliberate practice” had high burst strength. In recent years, “application of simulation in medical education,” “3D printing,” “augmented reality,” and “simulation training” have become research frontiers. Conclusions: Research on the application of simulation-based teaching in medical education has become a hotspot, with expanding research areas and hotspots. Future research should strengthen interinstitutional collaboration and focus on the application of emerging technologies in simulation-based teaching. 
%M 40139212 %R 10.2196/71844 %U https://mededu.jmir.org/2025/1/e71844 %U https://doi.org/10.2196/71844 %U http://www.ncbi.nlm.nih.gov/pubmed/40139212 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e58375 %T Performance of Plug-In Augmented ChatGPT and Its Ability to Quantify Uncertainty: Simulation Study on the German Medical Board Examination %A Madrid,Julian %A Diehl,Philipp %A Selig,Mischa %A Rolauffs,Bernd %A Hans,Felix Patricius %A Busch,Hans-Jörg %A Scheef,Tobias %A Benning,Leo %K medical education %K artificial intelligence %K generative AI %K large language model %K LLM %K ChatGPT %K GPT-4 %K board licensing examination %K professional education %K examination %K student %K experimental %K bootstrapping %K confidence interval %D 2025 %7 21.3.2025 %9 %J JMIR Med Educ %G English %X Background: GPT-4 is a large language model (LLM) trained and fine-tuned on an extensive dataset. After the public release of its predecessor in November 2022, interest in LLMs has spiked, and a multitude of potential use cases have been proposed. In parallel, however, important limitations have been outlined; in particular, current LLMs encounter difficulties with symbolic representation and with accessing contemporary data. The recent version of GPT-4, alongside newly released plugin features, has been introduced to mitigate some of these limitations. Objective: Against this background, this work aims to investigate the performance of GPT-3.5, GPT-4, GPT-4 with plugins, and GPT-4 with plugins using pretranslated English text on the German medical board examination. Recognizing the critical importance of quantifying uncertainty for LLM applications in medicine, we furthermore assess this ability and develop a new metric termed “confidence accuracy” to evaluate it. Methods: We used GPT-3.5, GPT-4, GPT-4 with plugins, and GPT-4 with plugins and translation to answer questions from the German medical board examination. Additionally, we conducted an analysis to assess how the models justify their answers, the accuracy of their responses, and the error structure of their answers. Bootstrapping and CIs were used to evaluate the statistical significance of our findings. Results: This study demonstrated that available GPT models, as LLM examples, exceeded the minimum competency threshold established by the German medical board for medical students to obtain board certification to practice medicine. Moreover, the models could assess the uncertainty in their responses, albeit exhibiting overconfidence. Additionally, this work unraveled certain justification and reasoning structures that emerge when GPT generates answers. Conclusions: The high performance of GPTs in answering medical questions positions them well for applications in academia and, potentially, clinical practice. Their capability to quantify uncertainty in answers suggests they could serve as valuable artificial intelligence agents within the clinical decision-making loop. Nevertheless, significant challenges must be addressed before artificial intelligence agents can be robustly and safely implemented in the medical domain. 
%R 10.2196/58375 %U https://mededu.jmir.org/2025/1/e58375 %U https://doi.org/10.2196/58375 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e68309 %T Feedback From Dental Students Using Two Alternate Coaching Methods: Qualitative Focus Group Study %A Alreshaid,Lulwah %A Alkattan,Rana %K student feedback %K coaching %K dental education %K student evaluation %K teaching methods %K educational intervention %D 2025 %7 18.3.2025 %9 %J JMIR Med Educ %G English %X Background: Student feedback is crucial for evaluating the effectiveness of institutions. However, implementing feedback can be challenging due to practical difficulties. While student feedback on courses can improve teaching, there is debate about its effectiveness when it is not written well enough to provide helpful information to the receiver. Objective: This study aimed to evaluate the impact of coaching on proper feedback given by dental students in Saudi Arabia. Methods: A total of 47 first-year dental students from a public dental school in Riyadh, Saudi Arabia, completed 3 surveys throughout the academic year. The surveys assessed their feedback on a Dental Anatomy and Operative Dentistry course, including their feedback on the lectures, practical sessions, examinations, and overall experience. The surveys focused on assessing student feedback on the knowledge, understanding, and practical skills achieved during the course, as aligned with the defined course learning outcomes. The surveys were distributed without coaching, after handout coaching, and after workshop coaching on how to provide feedback, designated as survey #1, survey #2, and survey #3, respectively. The same group of students received all 3 surveys consecutively (repeated measures design). The responses were then rated as neutral, positive, negative, or constructive by 2 raters. The feedback was analyzed using the McNemar test to compare the effectiveness of the different coaching approaches. Results: While no significant changes were found between the first 2 surveys, a significant increase in constructive feedback was observed in survey #3 after workshop coaching compared with both other surveys (P<.001). The results also showed a higher proportion of desired changes in feedback, defined as any change from positive, negative, or neutral to constructive, after survey #3 (P<.001). Overall, 20.2% reported desired changes at survey #2 and 41.5% at survey #3 compared with survey #1. Conclusions: This study suggests that workshops on feedback coaching can effectively improve the quality of feedback provided by dental students. Incorporating feedback coaching into dental school curricula could help students communicate their concerns more effectively, ultimately enhancing the learning experience. %R 10.2196/68309 %U https://mededu.jmir.org/2025/1/e68309 %U https://doi.org/10.2196/68309 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 4 %N %P e67696 %T Evaluation of ChatGPT Performance on Emergency Medicine Board Examination Questions: Observational Study %A Pastrak,Mila %A Kajitani,Sten %A Goodings,Anthony James %A Drewek,Austin %A LaFree,Andrew %A Murphy,Adrian %K artificial intelligence %K ChatGPT-4 %K medical education %K emergency medicine %K examination %K examination preparation %D 2025 %7 12.3.2025 %9 %J JMIR AI %G English %X Background: The ever-evolving field of medicine has highlighted the potential for ChatGPT as an assistive platform. However, its use in medical board examination preparation and completion remains unclear. 
Objective: This study aimed to evaluate the performance of a custom-modified version of ChatGPT-4, tailored with emergency medicine board examination preparatory materials (Anki flashcard deck), compared to its default version and previous iteration (3.5). The goal was to assess the accuracy of ChatGPT-4 answering board-style questions and its suitability as a tool to aid students and trainees in standardized examination preparation. Methods: A comparative analysis was conducted using a random selection of 598 questions from the Rosh In-Training Examination Question Bank. The subjects of the study included three versions of ChatGPT: the Default, a Custom, and ChatGPT-3.5. The accuracy, response length, medical discipline subgroups, and underlying causes of error were analyzed. Results: The Custom version did not demonstrate a significant improvement in accuracy over the Default version (P=.61), although both significantly outperformed ChatGPT-3.5 (P<.001). The Default version produced significantly longer responses than the Custom version, with the mean (SD) values being 1371 (444) and 929 (408), respectively (P<.001). Subgroup analysis revealed no significant difference in the performance across different medical subdisciplines between the versions (P>.05 in all cases). Both the versions of ChatGPT-4 had similar underlying error types (P>.05 in all cases) and had a 99% predicted probability of passing while ChatGPT-3.5 had an 85% probability. Conclusions: The findings suggest that while newer versions of ChatGPT exhibit improved performance in emergency medicine board examination preparation, specific enhancement with a comprehensive Anki flashcard deck on the topic does not significantly impact accuracy. The study highlights the potential of ChatGPT-4 as a tool for medical education, capable of providing accurate support across a wide range of topics in emergency medicine in its default form. %R 10.2196/67696 %U https://ai.jmir.org/2025/1/e67696 %U https://doi.org/10.2196/67696 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e57634 %T Comparison of Learning Outcomes Among Medical Students in Thailand to Determine the Right Time to Teach Forensic Medicine: Retrospective Study %A Chudoung,Ubon %A Saengon,Wilaipon %A Peonim,Vichan %A Worasuwannarak,Wisarn %K multiple-choice question %K MCQ %K forensic medicine %K preclinic %K clinic %K medical student %D 2025 %7 10.2.2025 %9 %J JMIR Med Educ %G English %X Background: Forensic medicine requires background medical knowledge and the ability to apply it to legal cases. Medical students have different levels of medical knowledge and are therefore likely to perform differently when learning forensic medicine. However, different medical curricula in Thailand deliver forensic medicine courses at different stages of medical study; most curricula deliver these courses in the clinical years, while others offer them in the preclinical years. This raises questions about the differences in learning effectiveness. Objective: We aimed to compare the learning outcomes of medical students in curricula that either teach forensic medicine at the clinical level or teach it at the preclinical level. Methods: This was a 5-year retrospective study that compared multiple-choice question (MCQ) scores in a forensic medicine course for fifth- and third-year medical students. The fifth-year students’ program was different from that of the third-year students, but both programs were offered by Mahidol University. 
The students were taught forensic medicine by the same instructors, used similar content, and were evaluated via examinations of similar difficulty. Of the 1063 medical students included in this study, 782 were fifth-year clinical students, and 281 were third-year preclinical students. Results: The average scores of the fifth- and third-year medical students were 76.09% (SD 6.75%) and 62.94% (SD 8.33%), respectively. The difference was statistically significant (Kruskal-Wallis test: P<.001). Additionally, the average score of fifth-year medical students was significantly higher than that of third-year students in every academic year (all P values were <.001). Conclusions: Teaching forensic medicine during the preclinical years may be too early, and preclinical students may not understand the clinical content sufficiently. Attention should be paid to ensuring that students have an adequate clinical background before teaching subjects that require clinical applications, especially in forensic medicine. %R 10.2196/57634 %U https://mededu.jmir.org/2025/1/e57634 %U https://doi.org/10.2196/57634 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e57424 %T Barriers to and Facilitators of Implementing Team-Based Extracorporeal Membrane Oxygenation Simulation Study: Exploratory Analysis %A Brown,Joan %A De-Oliveira,Sophia %A Mitchell,Christopher %A Cesar,Rachel Carmen %A Ding,Li %A Fix,Melissa %A Stemen,Daniel %A Yacharn,Krisda %A Wong,Se Fum %A Dhillon,Anahat %K intensive care unit %K ICU %K teamwork in the ICU %K team dynamics %K collaboration %K interprofessional collaboration %K simulation %K simulation training %K ECMO %K extracorporeal membrane oxygenation %K life support %K cardiorespiratory dysfunction %K cardiorespiratory %K cardiology %K respiratory %K heart %K lungs %D 2025 %7 24.1.2025 %9 %J JMIR Med Educ %G English %X Introduction: Extracorporeal membrane oxygenation (ECMO) is a critical tool in the care of severe cardiorespiratory dysfunction. Simulation training for ECMO has become standard practice. Therefore, Keck Medicine of the University of Southern California (USC) holds simulation-training sessions to reinforce and improve providers’ knowledge. Objective: This study aimed to understand the impact of simulation training approaches on interprofessional collaboration. We believed simulation-based ECMO training would improve interprofessional collaboration through increased communication and enhanced teamwork. Methods: This was a single-center, mixed methods study of the Cardiac and Vascular Institute Intensive Care Unit at Keck Medicine of USC conducted from September 2021 to April 2023. Simulation training was offered for 1 hour monthly to the clinical team focused on the collaboration and decision-making needed to evaluate the initiation of ECMO therapy. Electronic surveys were distributed before, after, and 3 months post training. The survey evaluated teamwork and the effectiveness of training, and focus groups were held to understand social environment factors. Additionally, trainee and peer evaluation focus groups were held to understand socioenvironmental factors. Results: In total, 37 trainees attended the training simulation from August 2021 to August 2022. Exploratory factor analysis of 27 records yielded a standardized Cronbach α of 0.717. The survey results descriptively demonstrated a positive shift in teamwork ability. Qualitative themes identified improved confidence and decision-making. 
Conclusions: The study design was flawed, indicating improvement opportunities for future research on simulation training in the clinical setting. The paper outlines what to avoid when designing and implementing studies that assess an educational intervention in a complex clinical setting. The hypothesis deserves further exploration and is supported by the results of this study. %R 10.2196/57424 %U https://mededu.jmir.org/2025/1/e57424 %U https://doi.org/10.2196/57424 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e56850 %T Performance of ChatGPT-3.5 and ChatGPT-4 in the Taiwan National Pharmacist Licensing Examination: Comparative Evaluation Study %A Wang,Ying-Mei %A Shen,Hung-Wei %A Chen,Tzeng-Ji %A Chiang,Shu-Chiung %A Lin,Ting-Guan %K artificial intelligence %K ChatGPT %K chat generative pre-trained transformer %K GPT-4 %K medical education %K educational measurement %K pharmacy licensure %K Taiwan %K Taiwan national pharmacist licensing examination %K learning model %K AI %K Chatbot %K pharmacist %K evaluation and comparison study %K pharmacy %K statistical analyses %K medical databases %K medical decision-making %K generative AI %K machine learning %D 2025 %7 17.1.2025 %9 %J JMIR Med Educ %G English %X Background: OpenAI released versions ChatGPT-3.5 and GPT-4 between 2022 and 2023. GPT-3.5 has demonstrated proficiency in various examinations, particularly the United States Medical Licensing Examination. However, GPT-4 has more advanced capabilities. Objective: This study aims to examine the efficacy of GPT-3.5 and GPT-4 within the Taiwan National Pharmacist Licensing Examination and to ascertain their utility and potential application in clinical pharmacy and education. Methods: The pharmacist examination in Taiwan consists of 2 stages: basic subjects and clinical subjects. In this study, exam questions were manually fed into the GPT-3.5 and GPT-4 models, and their responses were recorded; graphic-based questions were excluded. This study encompassed three steps: (1) determining the answering accuracy of GPT-3.5 and GPT-4, (2) categorizing question types and observing differences in model performance across these categories, and (3) comparing model performance on calculation and situational questions. Microsoft Excel and R software were used for statistical analyses. Results: GPT-4 achieved an accuracy rate of 72.9%, overshadowing GPT-3.5, which achieved 59.1% (P<.001). In the basic subjects category, GPT-4 significantly outperformed GPT-3.5 (73.4% vs 53.2%; P<.001). However, in clinical subjects, only minor differences in accuracy were observed. Specifically, GPT-4 outperformed GPT-3.5 in the calculation and situational questions. Conclusions: This study demonstrates that GPT-4 outperforms GPT-3.5 in the Taiwan National Pharmacist Licensing Examination, particularly in basic subjects. While GPT-4 shows potential for use in clinical practice and pharmacy education, its limitations warrant caution. Future research should focus on refining prompts, improving model stability, integrating medical databases, and designing questions that better assess student competence and minimize guessing. 
%R 10.2196/56850 %U https://mededu.jmir.org/2025/1/e56850 %U https://doi.org/10.2196/56850 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e64284 %T Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis %A Wei,Boxiong %K large language models %K LLM %K artificial intelligence %K AI %K GPT-4 %K radiology exams %K medical education %K diagnostics %K medical training %K radiology %K ultrasound %D 2025 %7 16.1.2025 %9 %J JMIR Med Educ %G English %X Background: Artificial intelligence advancements have enabled large language models to significantly impact radiology education and diagnostic accuracy. Objective: This study evaluates the performance of mainstream large language models, including GPT-4, Claude, Bard, Tongyi Qianwen, and Gemini Pro, in radiology board exams. Methods: A comparative analysis of 150 multiple-choice questions from radiology board exams without images was conducted. Model accuracy on these text-based questions was assessed, with questions categorized by cognitive level and medical specialty, and differences were analyzed using χ2 tests and ANOVA. Results: GPT-4 achieved the highest accuracy (83.3%, 125/150), significantly outperforming all other models. Specifically, Claude achieved an accuracy of 62% (93/150; P<.001), Bard 54.7% (82/150; P<.001), Tongyi Qianwen 70.7% (106/150; P=.009), and Gemini Pro 55.3% (83/150; P<.001). The odds ratios compared to GPT-4 were 0.33 (95% CI 0.18‐0.60) for Claude, 0.24 (95% CI 0.13‐0.44) for Bard, and 0.25 (95% CI 0.14‐0.45) for Gemini Pro. Tongyi Qianwen performed relatively well with an accuracy of 70.7% (106/150; P=.02) and had an odds ratio of 0.48 (95% CI 0.27‐0.87) compared to GPT-4. Performance varied across question types and specialties, with GPT-4 excelling in both lower-order and higher-order questions, while Claude and Bard struggled with complex diagnostic questions. Conclusions: GPT-4 and Tongyi Qianwen show promise in medical education and training. The study emphasizes the need for domain-specific training datasets to enhance large language models’ effectiveness in specialized fields like radiology. %R 10.2196/64284 %U https://mededu.jmir.org/2025/1/e64284 %U https://doi.org/10.2196/64284 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e58898 %T Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study %A Kaewboonlert,Naritsaret %A Poontananggul,Jiraphon %A Pongsuwan,Natthipong %A Bhakdisongkhram,Gun %K accuracy %K performance %K artificial intelligence %K AI %K ChatGPT %K large language model %K LLM %K difficulty index %K basic medical science examination %K cross-sectional study %K medical education %K datasets %K assessment %K medical science %K tool %K Google %D 2025 %7 13.1.2025 %9 %J JMIR Med Educ %G English %X Background: Artificial intelligence (AI) has become widely applied across many fields, including medical education. The validity of AI-generated content and answers depends on the training datasets and the optimization of each model. The accuracy of large language models (LLMs) in basic medical examinations and the factors related to their accuracy have also been explored. Objective: We evaluated factors associated with the accuracy of LLMs (GPT-3.5, GPT-4, Google Bard, and Microsoft Bing) in answering multiple-choice questions from basic medical science examinations. 
Methods: We used questions that were closely aligned with the content and topic distribution of Thailand’s Step 1 National Medical Licensing Examination. Variables such as the difficulty index, discrimination index, and question characteristics were collected. These questions were then simultaneously input into ChatGPT (with GPT-3.5 and GPT-4), Microsoft Bing, and Google Bard, and their responses were recorded. The accuracy of these LLMs and the associated factors were analyzed using multivariable logistic regression. This analysis aimed to assess the effect of various factors on model accuracy, with results reported as odds ratios (ORs). Results: The study revealed that GPT-4 was the top-performing model, with an overall accuracy of 89.07% (95% CI 84.76%‐92.41%), significantly outperforming the others (P<.001). Microsoft Bing followed with an accuracy of 83.69% (95% CI 78.85%‐87.80%), GPT-3.5 at 67.02% (95% CI 61.20%‐72.48%), and Google Bard at 63.83% (95% CI 57.92%‐69.44%). The multivariable logistic regression analysis showed a correlation between question difficulty and model performance, with GPT-4 demonstrating the strongest association. Interestingly, no significant correlation was found between model accuracy and question length, negative wording, clinical scenarios, or the discrimination index for most models, except for Google Bard, which showed varying correlations. Conclusions: The GPT-4 and Microsoft Bing models demonstrated comparable accuracy to each other and superior accuracy compared with GPT-3.5 and Google Bard in the domain of basic medical science. The accuracy of these models was significantly influenced by the item’s difficulty index, indicating that the LLMs are more accurate when answering easier questions. This suggests that the more accurate models, such as GPT-4 and Bing, can be valuable tools for understanding and learning basic medical science concepts. %R 10.2196/58898 %U https://mededu.jmir.org/2025/1/e58898 %U https://doi.org/10.2196/58898 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 13 %N %P e63731 %T Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study %A Zhu,Shiben %A Hu,Wanqin %A Yang,Zhi %A Yan,Jiani %A Zhang,Fang %+ Department of Science and Education, Shenzhen Baoan Women's and Children's Hospital, 56 Yulu Road, Xin'an Street, Bao'an District, Shenzhen, 518001, China, 86 13686891225, zhangfangf11@163.com %K large language models %K LLMs %K Chinese National Nursing Licensing Examination %K ChatGPT %K Qwen-2.5 %K multiple-choice questions %D 2025 %7 10.1.2025 %9 Original Paper %J JMIR Med Inform %G English %X Background: Large language models (LLMs) have been proposed as valuable tools in medical education and practice. The Chinese National Nursing Licensing Examination (CNNLE) presents unique challenges for LLMs due to its requirement for both deep domain-specific nursing knowledge and the ability to make complex clinical decisions, which differentiates it from more general medical examinations. However, their potential application in the CNNLE remains unexplored. Objective: This study aims to evaluate the accuracy of 7 LLMs (GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5) on the CNNLE, focusing on their ability to handle domain-specific nursing knowledge and clinical decision-making. We also explore whether combining their outputs using machine learning techniques can improve their overall accuracy. 
Methods: This retrospective cross-sectional study analyzed all 1200 multiple-choice questions from the CNNLE conducted between 2019 and 2023. Seven LLMs were evaluated on these multiple-choice questions, and 9 machine learning models (logistic regression, support vector machine, multilayer perceptron, k-nearest neighbors, random forest, LightGBM, AdaBoost, XGBoost, and CatBoost) were used to optimize overall performance through ensemble techniques. Results: Qwen-2.5 achieved the highest overall accuracy of 88.9%, followed by GPT-4o (80.7%), ERNIE Bot-3.5 (78.1%), GPT-4.0 (70.3%), SPARK (65.0%), and GPT-3.5 (49.5%). Qwen-2.5 demonstrated superior accuracy in the Practical Skills section compared with the Professional Practice section across most years. It also performed well in brief clinical case summaries and questions involving shared clinical scenarios. When the outputs of the 7 LLMs were combined using 9 machine learning models, XGBoost yielded the best performance, increasing accuracy to 90.8%. XGBoost also achieved an area under the curve of 0.961, sensitivity of 0.905, specificity of 0.978, F1-score of 0.901, positive predictive value of 0.901, and negative predictive value of 0.977. Conclusions: This study is the first to evaluate the performance of 7 LLMs on the CNNLE and to show that integrating their outputs via machine learning significantly boosted accuracy, reaching 90.8%. These findings demonstrate the transformative potential of LLMs in revolutionizing health care education and call for further research to refine their capabilities and expand their impact on examination preparation and professional training. %M 39793017 %R 10.2196/63731 %U https://medinform.jmir.org/2025/1/e63731 %U https://doi.org/10.2196/63731 %U http://www.ncbi.nlm.nih.gov/pubmed/39793017 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 13 %N %P e63924 %T Performance of Artificial Intelligence Chatbots on Ultrasound Examinations: Cross-Sectional Comparative Analysis %A Zhang,Yong %A Lu,Xiao %A Luo,Yan %A Zhu,Ying %A Ling,Wenwu %K chatbots %K ChatGPT %K ERNIE Bot %K performance %K accuracy rates %K ultrasound %K language %K examination %D 2025 %7 9.1.2025 %9 %J JMIR Med Inform %G English %X Background: Artificial intelligence chatbots are being increasingly used for medical inquiries, particularly in the field of ultrasound medicine. However, their performance varies and is influenced by factors such as language, question type, and topic. Objective: This study aimed to evaluate the performance of ChatGPT and ERNIE Bot in answering ultrasound-related medical examination questions, providing insights for users and developers. Methods: We curated 554 questions from ultrasound medicine examinations, covering various question types and topics. The questions were posed in both English and Chinese. Objective questions were scored based on accuracy rates, whereas subjective questions were rated by 5 experienced doctors using a Likert scale. The data were analyzed in Excel. Results: Of the 554 questions included in this study, single-choice questions comprised the largest share (354/554, 64%), followed by short answers (69/554, 12%) and noun explanations (63/554, 11%). The accuracy rates for objective questions ranged from 8.33% to 80%, with true or false questions scoring highest. Subjective questions received acceptability rates ranging from 47.62% to 75.36%. ERNIE Bot was superior to ChatGPT in many aspects (P<.05). 
Both models showed a performance decline in English, but ERNIE Bot’s decline was less significant. The models performed better in terms of basic knowledge, ultrasound methods, and diseases than in terms of ultrasound signs and diagnosis. Conclusions: Chatbots can provide valuable ultrasound-related answers, but performance differs by model and is influenced by language, question type, and topic. In general, ERNIE Bot outperforms ChatGPT. Users and developers should understand model performance characteristics and select appropriate models for different questions and languages to optimize chatbot use. %R 10.2196/63924 %U https://medinform.jmir.org/2025/1/e63924 %U https://doi.org/10.2196/63924 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e63129 %T Performance of ChatGPT-4o on the Japanese Medical Licensing Examination: Evaluation of Accuracy in Text-Only and Image-Based Questions %A Miyazaki,Yuki %A Hata,Masahiro %A Omori,Hisaki %A Hirashima,Atsuya %A Nakagawa,Yuta %A Eto,Mitsuhiro %A Takahashi,Shun %A Ikeda,Manabu %K medical education %K artificial intelligence %K clinical decision-making %K GPT-4o %K medical licensing examination %K Japan %K images %K accuracy %K AI technology %K application %K decision-making %K image-based %K reliability %K ChatGPT %D 2024 %7 24.12.2024 %9 %J JMIR Med Educ %G English %X This study evaluated the performance of ChatGPT with GPT-4 Omni (GPT-4o) on the 118th Japanese Medical Licensing Examination. The study focused on both text-only and image-based questions. The model demonstrated a high level of accuracy overall, with no significant difference in performance between text-only and image-based questions. Common errors included clinical judgment mistakes and prioritization issues, underscoring the need for further improvement in the integration of artificial intelligence into medical education and practice. %R 10.2196/63129 %U https://mededu.jmir.org/2024/1/e63129 %U https://doi.org/10.2196/63129 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e56132 %T Long-Term Knowledge Retention of Biochemistry Among Medical Students in Riyadh, Saudi Arabia: Cross-Sectional Survey %A Mehyar,Nimer %A Awawdeh,Mohammed %A Omair,Aamir %A Aldawsari,Adi %A Alshudukhi,Abdullah %A Alzeer,Ahmed %A Almutairi,Khaled %A Alsultan,Sultan %K biochemistry %K knowledge %K retention %K medical students %K retention interval %K Saudi Arabia %D 2024 %7 16.12.2024 %9 %J JMIR Med Educ %G English %X Background: Biochemistry is a cornerstone of medical education. Knowledge of biochemistry is integral to understanding complex biological processes and how they apply to several areas of health care. Its significance is also reflected in the way it informs the practice of medicine, guiding both diagnosis and treatment. However, the retention of biochemistry knowledge over time remains a challenge. Long-term retention of such crucial information is extremely important, as it forms the foundation upon which clinical skills are developed and refined. The effectiveness of biochemistry education, and consequently its long-term retention, is influenced by several factors. Educational methods play a critical role; interactional and integrative teaching approaches have been suggested to enhance retention compared with traditional didactic methods. The frequency and context in which biochemistry knowledge is applied in clinical settings can significantly impact its retention. 
Practical application reinforces theoretical understanding, making the knowledge more accessible in the long term. Prior knowledge of (ie, familiarity with) information suggests that it is stored in long-term memory, which makes it easier to recall in the long term. Objectives: This investigation was conducted at King Saud bin Abdulaziz University for Health Sciences in Riyadh, Saudi Arabia. The aim of the study was to understand the dynamics of long-term retention of biochemistry among medical students. Specifically, it examined the association between students’ familiarity with biochemistry content and their actual knowledge retention levels. Methods: A cross-sectional correlational survey involving 240 students from King Saud bin Abdulaziz University for Health Sciences was conducted. Participants were recruited via nonprobability convenience sampling. A validated biochemistry assessment tool with 20 questions was used to gauge students’ retention in biomolecules, catalysis, bioenergetics, and metabolism. To assess students’ familiarity with the knowledge content of the test questions, each question was accompanied by options indicating the student’s prior knowledge of its content. Statistical tests such as the Mann-Whitney U test, the Kruskal-Wallis test, and the chi-square test were used. Results: Our findings revealed a significant correlation between students’ familiarity with the content and their knowledge retention in the biomolecules (r=0.491; P<.001), catalysis (r=0.500; P<.001), bioenergetics (r=0.528; P<.001), and metabolism (r=0.564; P<.001) biochemistry knowledge domains. Conclusions: This study highlights the significance of familiarity (prior knowledge) in evaluating the retention of biochemistry knowledge. Although limited by its generalizability and inherent biases, the research highlights the crucial significance of students’ familiarity in actual knowledge retention of several biochemistry domains. These results might be used by educators to customize instructional methods in order to improve students’ long-term retention of biochemistry information and boost their clinical performance. %R 10.2196/56132 %U https://mededu.jmir.org/2024/1/e56132 %U https://doi.org/10.2196/56132 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e52068 %T Evaluation of a Computer-Based Morphological Analysis Method for Free-Text Responses in the General Medicine In-Training Examination: Algorithm Validation Study %A Yokokawa,Daiki %A Shikino,Kiyoshi %A Nishizaki,Yuji %A Fukui,Sho %A Tokuda,Yasuharu %K General Medicine In-Training Examination %K free-text response %K morphological analysis %K Situation, Background, Assessment, and Recommendation %K video-based question %D 2024 %7 5.12.2024 %9 %J JMIR Med Educ %G English %X Background: The General Medicine In-Training Examination (GM-ITE) tests clinical knowledge in a 2-year postgraduate residency program in Japan. In the academic year 2021, as a domain of medical safety, the GM-ITE included questions on making a diagnosis from the medical history and physical findings obtained through video viewing and on the skills needed to present a case. Examinees watched a video or audio recording of a patient examination and provided free-text responses. However, the human cost of scoring free-text answers may limit the implementation of GM-ITE. A simple morphological analysis and word-matching model can therefore be used to score free-text responses. 
Objective: This study aimed to compare human versus computer scoring of free-text responses and qualitatively evaluate the discrepancies between human- and machine-generated scores to assess the efficacy of machine scoring. Methods: After obtaining consent for participation in the study, the authors used text data from residents who voluntarily answered the GM-ITE patient reproduction video-based questions involving simulated patients. The GM-ITE used video-based questions to simulate a patient’s consultation in the emergency room with a diagnosis of pulmonary embolism following a fracture. Residents provided statements for the case presentation. We obtained human-generated scores by collating the results of 2 independent scorers and machine-generated scores by converting the free-text responses into a word sequence through segmentation and morphological analysis and matching them with a prepared list of correct answers in 2022. Results: Of the 104 responses collected—63 for postgraduate year 1 and 41 for postgraduate year 2—39 cases remained for final analysis after excluding invalid responses. The authors found discrepancies between human and machine scoring in 14 questions (7.2%); some were due to shortcomings in machine scoring that could be resolved by maintaining a list of correct words and dictionaries, whereas others were due to human error. Conclusions: Machine scoring is comparable to human scoring. It requires a simple program and calibration but can potentially reduce the cost of scoring free-text responses. %R 10.2196/52068 %U https://mededu.jmir.org/2024/1/e52068 %U https://doi.org/10.2196/52068 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e59902 %T Performance Comparison of Junior Residents and ChatGPT in the Objective Structured Clinical Examination (OSCE) for Medical History Taking and Documentation of Medical Records: Development and Usability Study %A Huang,Ting-Yun %A Hsieh,Pei Hsing %A Chang,Yung-Chun %K large language model %K medical history taking %K clinical documentation %K simulation-based evaluation %K OSCE standards %K LLM %D 2024 %7 21.11.2024 %9 %J JMIR Med Educ %G English %X Background: This study explores the cutting-edge abilities of large language models (LLMs) such as ChatGPT in medical history taking and medical record documentation, with a focus on their practical effectiveness in clinical settings—an area vital for the progress of medical artificial intelligence. Objective: Our aim was to assess the capability of ChatGPT versions 3.5 and 4.0 in performing medical history taking and medical record documentation in simulated clinical environments. The study compared the performance of nonmedical individuals using ChatGPT with that of junior medical residents. Methods: A simulation involving standardized patients was designed to mimic authentic medical history–taking interactions. Five nonmedical participants used ChatGPT versions 3.5 and 4.0 to conduct medical histories and document medical records, mirroring the tasks performed by 5 junior residents in identical scenarios. A total of 10 diverse scenarios were examined. Results: Evaluation of the medical documentation created by laypersons with ChatGPT assistance and those created by junior residents was conducted by 2 senior emergency physicians using audio recordings and the final medical records. The assessment used the Objective Structured Clinical Examination benchmarks in Taiwan as a reference. 
ChatGPT-4.0 exhibited substantial enhancements over its predecessor and met or exceeded the performance of human counterparts in terms of both checklist and global assessment scores. Although the overall quality of human consultations remained higher, ChatGPT-4.0’s proficiency in medical documentation was notably promising. Conclusions: The performance of ChatGPT 4.0 was on par with that of human participants in Objective Structured Clinical Examination evaluations, signifying its potential in medical history and medical record documentation. Despite this, the superiority of human consultations in terms of quality was evident. The study underscores both the promise and the current limitations of LLMs in the realm of clinical practice. %R 10.2196/59902 %U https://mededu.jmir.org/2024/1/e59902 %U https://doi.org/10.2196/59902 %0 Journal Article %@ 2369-2960 %I JMIR Publications %V 10 %N %P e62887 %T Competence and Training Needs in Infectious Disease Emergency Response Among Chinese Nurses: Cross-Sectional Study %A Zhang,Dandan %A Chen,Yong-Jun %A Cui,Tianxin %A Zhang,Jianzhong %A Chen,Si-Ying %A Zhang,Yin-Ping %K competence %K preparedness %K infectious disease emergency %K Chinese %K nurse %K cross-sectional study %K COVID-19 %K pandemic %K public health %K health crises %K emergency response %K emergency preparedness %K medical institution %K health care worker %K linear regression %D 2024 %7 18.11.2024 %9 %J JMIR Public Health Surveill %G English %X Background: In recent years, the frequent outbreaks of infectious diseases and insufficient emergency response capabilities, particularly issues exposed during the COVID-19 pandemic, have underscored the critical role of nurses in addressing public health crises. It is now necessary to investigate the emergency preparedness of nursing personnel after the full lifting of COVID-19 pandemic restrictions, aiming to identify weaknesses and optimize response strategies. Objective: This study aimed to assess the emergency response competence of nurses, identify their specific training needs, and explore the various elements that impact their emergency response competence. Methods: Using a multistage stratified sampling method, 5 provinces from different geographical locations nationwide were initially randomly selected using random number tables. Subsequently, within each province, 2 tertiary hospitals, 4 secondary hospitals, and 10 primary hospitals were randomly selected for the survey. The random selection and stratification of the hospitals took into account various aspects such as geographical locations, different levels, scale, and number of nurses. This study involved 80 hospitals (including 10 tertiary hospitals, 20 secondary hospitals, and 50 primary hospitals), where nurses from different departments, specialties, and age groups anonymously completed a questionnaire on infectious disease emergency response capabilities. Results: This study involved 2055 participants representing various health care institutions. The nurses’ mean score in infectious disease emergency response competence was 141.75 (SD 20.09), indicating a moderate to above-average level. Nearly one-fifth (n=397, 19.32%) of nurses had experience in responding to infectious disease emergencies; however, many acknowledged insufficient drills (n=615, 29.93%) and training (n=502, 24.43%). Notably, 1874 (91.19%) nurses expressed a willingness to undergo further training. 
Multiple linear regression analysis indicated that significant factors affecting infectious disease emergency response competence included the highest educational degree attained, the frequency of drills and training, and the willingness to undertake further training (B=−11.455, 7.344, 11.639, 14.432, 10.255, 7.364, and −11.216; all P<.05). Notably, a higher frequency of participation in drills and training sessions correlated with better outcomes (P<.001 or P<.05). Nurses holding a master’s degree or higher demonstrated significantly lower competence scores in responding to infectious diseases compared with nurses with a diploma or associate degree (P=.001). In total, 1644 (80%) of the nurses preferred training lasting from 3 days to 1 week, with scenario simulations and emergency drills considered the most popular training methods. Conclusions: These findings highlight both the potential of and the need for nurses competent in infectious disease emergency response. Frequent drills and training will significantly enhance response competence; however, a lack of practical experience in higher education may have a negative impact on emergency performance. The study emphasizes the critical need for personalized training to boost nurses’ abilities, especially through short-term, intensive methods and simulation drills. Further training and tailored plans are essential to improve nurses’ overall proficiency and ensure effective responses to infectious disease emergencies. %R 10.2196/62887 %U https://publichealth.jmir.org/2024/1/e62887 %U https://doi.org/10.2196/62887 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e63430 %T ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis %A Bicknell,Brenton T %A Butler,Danner %A Whalen,Sydney %A Ricks,James %A Dixon,Cory J %A Clark,Abigail B %A Spaedy,Olivia %A Skelton,Adam %A Edupuganti,Neel %A Dzubinski,Lance %A Tate,Hudson %A Dyess,Garrett %A Lindeman,Brenessa %A Lehmann,Lisa Soleymani %K large language model %K ChatGPT %K medical education %K USMLE %K AI in medical education %K medical student resources %K educational technology %K artificial intelligence in medicine %K clinical skills %K LLM %K medical licensing examination %K medical students %K United States Medical Licensing Examination %K ChatGPT 4 Omni %K ChatGPT 4 %K ChatGPT 3.5 %D 2024 %7 6.11.2024 %9 %J JMIR Med Educ %G English %X Background: Recent studies, including those by the National Board of Medical Examiners, have highlighted the remarkable capabilities of recent large language models (LLMs) such as ChatGPT in passing the United States Medical Licensing Examination (USMLE). However, there is a gap in detailed analysis of LLM performance in specific medical content areas, thus limiting an assessment of their potential utility in medical education. Objective: This study aimed to assess and compare the accuracy of successive ChatGPT versions (GPT-3.5, GPT-4, and GPT-4 Omni) in USMLE disciplines, clinical clerkships, and the clinical skills of diagnostics and management. Methods: This study used 750 clinical vignette-based multiple-choice questions to characterize the performance of successive ChatGPT versions (ChatGPT 3.5 [GPT-3.5], ChatGPT 4 [GPT-4], and ChatGPT 4 Omni [GPT-4o]) across USMLE disciplines, clinical clerkships, and clinical skills (diagnostics and management). Accuracy was assessed using a standardized protocol, with statistical analyses conducted to compare the models’ performances. 
Results: GPT-4o achieved the highest accuracy across 750 multiple-choice questions at 90.4%, outperforming GPT-4 and GPT-3.5, which scored 81.1% and 60.0%, respectively. GPT-4o’s highest performances were in social sciences (95.5%), behavioral and neuroscience (94.2%), and pharmacology (93.2%). In clinical skills, GPT-4o’s diagnostic accuracy was 92.7% and management accuracy was 88.8%, significantly higher than its predecessors. Notably, both GPT-4o and GPT-4 significantly outperformed the medical student average accuracy of 59.3% (95% CI 58.3‐60.3). Conclusions: GPT-4o’s performance in USMLE disciplines, clinical clerkships, and clinical skills indicates substantial improvements over its predecessors, suggesting significant potential for the use of this technology as an educational aid for medical students. These findings underscore the need for careful consideration when integrating LLMs into medical education, emphasizing the importance of structured curricula to guide their appropriate use and the need for ongoing critical analyses to ensure their reliability and effectiveness. %R 10.2196/63430 %U https://mededu.jmir.org/2024/1/e63430 %U https://doi.org/10.2196/63430 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e53151 %T Evaluating the Effectiveness of an Online Course on Pediatric Malnutrition for Syrian Health Professionals: Qualitative Delphi Study %A Sahyouni,Amal %A Zoukar,Imad %A Dashash,Mayssoon %K effectiveness %K online course %K pediatric %K malnutrition %K essential competencies %K e-learning %K health professional %K Syria %K pilot study %K acquisition knowledge %D 2024 %7 28.10.2024 %9 %J JMIR Med Educ %G English %X Background: There is a shortage of competent health professionals in managing malnutrition. Online education may be a practical and flexible approach to address this gap. Objective: This study aimed to identify essential competencies and assess the effectiveness of an online course on pediatric malnutrition in improving the knowledge of pediatricians and health professionals. Methods: A focus group (n=5) and Delphi technique (n=21 health professionals) were used to identify 68 essential competencies. An online course consisting of 4 educational modules in Microsoft PowerPoint (Microsoft Corp) slide form with visual aids (photos and videos) was designed and published on the Syrian Virtual University platform website using an asynchronous e-learning system. The course covered definition, classification, epidemiology, anthropometrics, treatment, and consequences. Participants (n=10) completed a pretest of 40 multiple-choice questions, accessed the course, completed a posttest after a specified period, and filled out a questionnaire to measure their attitude and assess their satisfaction. Results: A total of 68 essential competencies were identified, categorized into 3 domains: knowledge (24 competencies), skills (29 competencies), and attitudes (15 competencies). These competencies were further classified based on their focus area: etiology (10 competencies), assessment and diagnosis (21 competencies), and management (37 competencies). Further, 10 volunteers, consisting of 5 pediatricians and 5 health professionals, participated in this study over a 2-week period. A statistically significant increase in knowledge was observed among participants following completion of the online course (pretest mean 24.2, SD 6.1, and posttest mean 35.2, SD 3.3; P<.001). 
Pediatricians demonstrated higher pre- and posttest scores compared to other health care professionals (all P values were <.05). Prior malnutrition training within the past year positively impacted pretest scores (P=.03). Participants highly rated the course (mean satisfaction score >3.0 on a 5-point Likert scale), with 60% (6/10) favoring a blended learning approach. Conclusions: In total, 68 essential competencies are required for pediatricians to manage children who are malnourished. The online course effectively improved knowledge acquisition among health care professionals, with high participant satisfaction and approval of the e-learning environment. %R 10.2196/53151 %U https://mededu.jmir.org/2024/1/e53151 %U https://doi.org/10.2196/53151 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e56128 %T Assessment of ChatGPT-4 in Family Medicine Board Examinations Using Advanced AI Learning and Analytical Methods: Observational Study %A Goodings,Anthony James %A Kajitani,Sten %A Chhor,Allison %A Albakri,Ahmad %A Pastrak,Mila %A Kodancha,Megha %A Ives,Rowan %A Lee,Yoo Bin %A Kajitani,Kari %K ChatGPT-4 %K Family Medicine Board Examination %K artificial intelligence in medical education %K AI performance assessment %K prompt engineering %K ChatGPT %K artificial intelligence %K AI %K medical education %K assessment %K observational %K analytical method %K data analysis %K examination %D 2024 %7 8.10.2024 %9 %J JMIR Med Educ %G English %X Background: This research explores the capabilities of ChatGPT-4 in passing the American Board of Family Medicine (ABFM) Certification Examination. Addressing a gap in existing literature, where earlier artificial intelligence (AI) models showed limitations in medical board examinations, this study evaluates the enhanced features and potential of ChatGPT-4, especially in document analysis and information synthesis. Objective: The primary goal is to assess whether ChatGPT-4, when provided with extensive preparation resources and when using sophisticated data analysis, can achieve a score equal to or above the passing threshold for the Family Medicine Board Examinations. Methods: In this study, ChatGPT-4 was embedded in a specialized subenvironment, “AI Family Medicine Board Exam Taker,” designed to closely mimic the conditions of the ABFM Certification Examination. This subenvironment enabled the AI to access and analyze a range of relevant study materials, including a primary medical textbook and supplementary web-based resources. The AI was presented with a series of ABFM-type examination questions, reflecting the breadth and complexity typical of the examination. Emphasis was placed on assessing the AI’s ability to interpret and respond to these questions accurately, leveraging its advanced data processing and analysis capabilities within this controlled subenvironment. Results: In our study, ChatGPT-4’s performance was quantitatively assessed on 300 practice ABFM examination questions. The AI achieved a correct response rate of 88.67% (95% CI 85.08%-92.25%) for the Custom Robot version and 87.33% (95% CI 83.57%-91.10%) for the Regular version. Statistical analysis, including the McNemar test (P=.45), indicated no significant difference in accuracy between the 2 versions. In addition, the chi-square test for error-type distribution (P=.32) revealed no significant variation in the pattern of errors across versions. 
These results highlight ChatGPT-4’s capacity for high-level performance and consistency in responding to complex medical examination questions under controlled conditions. Conclusions: The study demonstrates that ChatGPT-4, particularly when equipped with specialized preparation and when operating in a tailored subenvironment, shows promising potential in handling the intricacies of medical board examinations. While its performance is comparable with the expected standards for passing the ABFM Certification Examination, further enhancements in AI technology and tailored training methods could push these capabilities to new heights. This exploration opens avenues for integrating AI tools such as ChatGPT-4 in medical education and assessment, emphasizing the importance of continuous advancement and specialized training in medical applications of AI. %R 10.2196/56128 %U https://mededu.jmir.org/2024/1/e56128 %U https://doi.org/10.2196/56128 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e52746 %T Performance of ChatGPT on Nursing Licensure Examinations in the United States and China: Cross-Sectional Study %A Wu,Zelin %A Gan,Wenyi %A Xue,Zhaowen %A Ni,Zhengxin %A Zheng,Xiaofei %A Zhang,Yiyi %K artificial intelligence %K ChatGPT %K nursing licensure examination %K nursing %K LLMs %K large language models %K nursing education %K AI %K nursing student %K large language model %K licensing %K observation %K observational study %K China %K USA %K United States of America %K auxiliary tool %K accuracy rate %K theoretical %D 2024 %7 3.10.2024 %9 %J JMIR Med Educ %G English %X Background: The creation of large language models (LLMs) such as ChatGPT is an important step in the development of artificial intelligence, which shows great potential in medical education due to its powerful language understanding and generative capabilities. The purpose of this study was to quantitatively evaluate and comprehensively analyze ChatGPT’s performance in handling questions for the National Nursing Licensure Examination (NNLE) in China and the United States, including the National Council Licensure Examination for Registered Nurses (NCLEX-RN) and the NNLE. Objective: This study aims to examine how well LLMs respond to NCLEX-RN and NNLE multiple-choice questions (MCQs) across different language inputs, to evaluate whether LLMs can serve as multilingual learning assistants for nursing, and to assess whether they possess a repository of professional knowledge applicable to clinical nursing practice. Methods: First, we compiled 150 NCLEX-RN Practical MCQs, 240 NNLE Theoretical MCQs, and 240 NNLE Practical MCQs. Then, the translation function of ChatGPT 3.5 was used to translate NCLEX-RN questions from English to Chinese and NNLE questions from Chinese to English. Finally, the original version and the translated version of the MCQs were inputted into ChatGPT 4.0, ChatGPT 3.5, and Google Bard. The LLMs were compared according to accuracy rate, and differences between the language inputs were also examined. Results: The accuracy rates of ChatGPT 4.0 for NCLEX-RN practical questions and Chinese-translated NCLEX-RN practical questions were 88.7% (133/150) and 79.3% (119/150), respectively. Despite the statistical significance of the difference (P=.03), the correct rate was generally satisfactory. Around 71.9% (169/235) of NNLE Theoretical MCQs and 69.1% (161/233) of NNLE Practical MCQs were correctly answered by ChatGPT 4.0. 
The accuracy of ChatGPT 4.0 in processing NNLE Theoretical MCQs and NNLE Practical MCQs translated into English was 71.5% (168/235; P=.92) and 67.8% (158/233; P=.77), respectively, and there was no statistically significant difference between the results of text input in different languages. ChatGPT 3.5 (NCLEX-RN P=.003, NNLE Theoretical P<.001, NNLE Practical P=.12) and Google Bard (NCLEX-RN P<.001, NNLE Theoretical P<.001, NNLE Practical P<.001) had lower accuracy rates for nursing-related MCQs than ChatGPT 4.0 in English input. For ChatGPT 3.5, accuracy with English input was higher than with Chinese input, and the difference was statistically significant (NCLEX-RN P=.02, NNLE Practical P=.02). Whether submitted in Chinese or English, the MCQs from the NCLEX-RN and NNLE demonstrated that ChatGPT 4.0 had the highest number of unique correct responses and the lowest number of unique incorrect responses among the 3 LLMs. Conclusions: This study, focusing on 618 nursing MCQs including NCLEX-RN and NNLE exams, found that ChatGPT 4.0 outperformed ChatGPT 3.5 and Google Bard in accuracy. It excelled in processing English and Chinese inputs, underscoring its potential as a valuable tool in nursing education and clinical decision-making. %R 10.2196/52746 %U https://mededu.jmir.org/2024/1/e52746 %U https://doi.org/10.2196/52746 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e53314 %T Creation of an Automated and Comprehensive Resident Progress System for Residents and to Save Hours of Faculty Time: Mixed Methods Study %A Perotte,Rimma %A Berns,Alyssa %A Shaker,Lana %A Ophaswongse,Chayapol %A Underwood,Joseph %A Hajicharalambous,Christina %+ Hackensack University Medical Center, 30 Prospect Ave, Hackensack, NJ, 07601, United States, 1 5519962470, rimma.perotte@hmhn.org %K progress dashboard %K informatics in medical education %K residency learning management system %K residency progress system %K residency education system %K summarization %K administrative burden %K medical education %K resident %K residency %K resident data %K longitudinal %K pilot study %K competency %K dashboards %K dashboard %K faculty %K residents %D 2024 %7 23.9.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: It is vital for residents to have a longitudinal view of their educational progression, and it is crucial for the medical education team to have a clear way to track resident progress over time. Current tools for aggregating resident data are difficult to use and do not provide a comprehensive way to evaluate and display resident educational advancement. Objective: This study aims to describe the creation and assessment of a system designed to improve the longitudinal presentation, quality, and synthesis of educational progress for trainees. We created a new system for residency progress management with 3 goals in mind: (1) a long-term and centralized location for residency education data, (2) a clear and intuitive interface that is easy to access for both the residents and faculty involved in medical education, and (3) automated data input, transformation, and analysis. We present evaluations regarding whether residents find the system useful, and whether faculty like the system and perceive that it helps them save time with administrative duties. Methods: The system was created using a suite of Google Workspace tools including Forms, Sheets, Gmail, and a collection of Apps Scripts triggered at various times and events. 
To assess whether the system had an effect on the residents, we surveyed them on how often they accessed the system and interviewed them about whether they found it useful. To understand what the faculty thought of the system, we conducted a 14-person focus group and asked the faculty to self-report their time spent preparing for residency progress meetings before and after the system’s debut. Results: The system went live in February 2022 as a quality improvement project, evolving through multiple iterations of feedback. The authors found that the system was accessed differently by different postgraduate years (PGY), with the most usage reported in the PGY1 class (weekly) and the least in the PGY3 class (once or twice). However, all of the residents reported finding the system useful, specifically for aggregating all of their evaluations in the same place. Faculty members felt that the system enabled a higher-quality biannual clinical competency committee meeting, and they reported a combined time savings of 8 hours in preparation for each clinical competency committee as a result of reviewing resident data through the system. Conclusions: Our study reports on the creation of an automated, instantaneous, and comprehensive resident progress management system. The system has been shown to be well-liked by both residents and faculty. Younger PGY classes reported more frequent system usage than older PGY classes. Faculty reported that it helped facilitate more meaningful discussion of training progression and reduced the administrative burden by 8 hours per biannual session. %M 39312292 %R 10.2196/53314 %U https://formative.jmir.org/2024/1/e53314 %U https://doi.org/10.2196/53314 %U http://www.ncbi.nlm.nih.gov/pubmed/39312292 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e58753 %T Enhancing Medical Interview Skills Through AI-Simulated Patient Interactions: Nonrandomized Controlled Trial %A Yamamoto,Akira %A Koda,Masahide %A Ogawa,Hiroko %A Miyoshi,Tomoko %A Maeda,Yoshinobu %A Otsuka,Fumio %A Ino,Hideo %+ Department of Hematology and Oncology, Okayama University Hospital, Okayama, Japan, 2-5-1 Shikata-cho, Kita-ku, Okayama, 700-8558, Japan, 81 86 235 7342, ymtakira@gmail.com %K medical interview %K generative pretrained transformer %K large language model %K simulation-based learning %K OSCE %K artificial intelligence %K medical education %K simulated patients %K nonrandomized controlled trial %D 2024 %7 23.9.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Medical interviewing is a critical skill in clinical practice, yet opportunities for practical training are limited in Japanese medical schools, necessitating urgent measures. Given advancements in artificial intelligence (AI) technology, its application in the medical field is expanding. However, reports on its application in medical interviews in medical education are scarce. Objective: This study aimed to investigate whether medical students’ interview skills could be improved by engaging with AI-simulated patients using large language models, including the provision of feedback. Methods: This nonrandomized controlled trial was conducted with fourth-year medical students in Japan. A simulation program using large language models was provided to 35 students in the intervention group in 2023, while 110 students from 2022 who did not participate in the intervention were selected as the control group. 
The primary outcome was the score on the Pre-Clinical Clerkship Objective Structured Clinical Examination (pre-CC OSCE), a national standardized clinical skills examination, in medical interviewing. Secondary outcomes included surveys such as the Simulation-Based Training Quality Assurance Tool (SBT-QA10), administered at the start and end of the study. Results: The AI intervention group showed significantly higher scores on medical interviews than the control group (AI group vs control group: mean 28.1, SD 1.6 vs 27.1, SD 2.2; P=.01). There was a trend of inverse correlation between the SBT-QA10 and pre-CC OSCE scores (regression coefficient –2.0 to –2.1). No significant safety concerns were observed. Conclusions: Education through medical interviews using AI-simulated patients has demonstrated safety and a certain level of educational effectiveness. However, at present, the educational effects of this platform on nonverbal communication skills are limited, suggesting that it should be used as a supplementary tool to traditional simulation education. %M 39312284 %R 10.2196/58753 %U https://mededu.jmir.org/2024/1/e58753 %U https://doi.org/10.2196/58753 %U http://www.ncbi.nlm.nih.gov/pubmed/39312284 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e56859 %T Performance of ChatGPT in the In-Training Examination for Anesthesiology and Pain Medicine Residents in South Korea: Observational Study %A Yoon,Soo-Hyuk %A Oh,Seok Kyeong %A Lim,Byung Gun %A Lee,Ho-Jin %+ Department of Anesthesiology and Pain Medicine, Seoul National University Hospital, Seoul National University College of Medicine, Daehak-ro 101, Jongno-gu, Seoul, 03080, Republic of Korea, 82 220720039, hjpainfree@snu.ac.kr %K AI tools %K problem solving %K anesthesiology %K artificial intelligence %K pain medicine %K ChatGPT %K health care %K medical education %K South Korea %D 2024 %7 16.9.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT has been tested in health care, including the US Medical Licensing Examination and specialty exams, showing near-passing results. Its performance in the field of anesthesiology has been assessed using English board examination questions; however, its effectiveness in Korea remains unexplored. Objective: This study investigated the problem-solving performance of ChatGPT in the fields of anesthesiology and pain medicine in the Korean language context, highlighted advancements in artificial intelligence (AI), and explored its potential applications in medical education. Methods: We investigated the performance (number of correct answers/number of questions) of GPT-4, GPT-3.5, and CLOVA X in the fields of anesthesiology and pain medicine, using in-training examinations that have been administered to Korean anesthesiology residents over the past 5 years, with an annual composition of 100 questions. Questions containing images, diagrams, or photographs were excluded from the analysis. Furthermore, to assess the performance differences of the GPT across different languages, we conducted a comparative analysis of the GPT-4’s problem-solving proficiency using both the original Korean texts and their English translations. Results: A total of 398 questions were analyzed. GPT-4 (67.8%) demonstrated a significantly better overall performance than GPT-3.5 (37.2%) and CLOVA-X (36.7%). However, GPT-3.5 and CLOVA X did not show significant differences in their overall performance. 
Additionally, the GPT-4 showed superior performance on questions translated into English, indicating a language processing discrepancy (English: 75.4% vs Korean: 67.8%; difference 7.5%; 95% CI 3.1%-11.9%; P=.001). Conclusions: This study underscores the potential of AI tools, such as ChatGPT, in medical education and practice but emphasizes the need for cautious application and further refinement, especially in non-English medical contexts. The findings suggest that although AI advancements are promising, they require careful evaluation and development to ensure acceptable performance across diverse linguistic and professional settings. %M 39284182 %R 10.2196/56859 %U https://mededu.jmir.org/2024/1/e56859 %U https://doi.org/10.2196/56859 %U http://www.ncbi.nlm.nih.gov/pubmed/39284182 %0 Journal Article %@ 2562-7600 %I JMIR Publications %V 7 %N %P e48810 %T Experiences of Using a Digital Guidance and Assessment Tool (the Technology-Optimized Practice Process in Nursing Application) During Clinical Practice in a Nursing Home: Focus Group Study Among Nursing Students %A Johnsen,Hege Mari %A Nes,Andréa Aparecida Gonçalves %A Haddeland,Kristine %+ Department of Health and Nursing Science, University of Agder, Jon Lilletuns vei 9, Grimstad, 4879, Norway, 47 97515773, hege.mari.johnsen@uia.no %K application %K assessment of clinical education %K AssCE %K clinical education assessment tool %K electronic reports %K feedback %K guidance model %K smartphone %K Technology-Optimized Practice Process in Nursing %K TOPP-N %K information system success model %K nurse %K nursing %K allied health %K education %K focus group %K focus groups %K technology enhanced learning %K digital health %K content analysis %K student %K students %K nursing home %K long-term care %K learning management %K mobile phone %D 2024 %7 10.9.2024 %9 Original Paper %J JMIR Nursing %G English %X Background: Nursing students’ learning during clinical practice is largely influenced by the quality of the guidance they receive from their nurse preceptors. Students that have attended placement in nursing home settings have called for more time with nurse preceptors and an opportunity for more help from the nurses for reflection and developing critical thinking skills. To strengthen students’ guidance and assessment and enhance students’ learning in the practice setting, it has also been recommended to improve the collaboration between faculties and nurse preceptors. Objective: This study explores first-year nursing students’ experiences of using the Technology-Optimized Practice Process in Nursing (TOPP-N) application in 4 nursing homes in Norway. TOPP-N was developed to support guidance and assessment in clinical practice in nursing education. Methods: Four focus groups were conducted with 19 nursing students from 2 university campuses in Norway. The data collection and directed content analysis were based on DeLone and McLean’s information system success model. Results: Some participants had difficulties learning to use the TOPP-N tool, particularly those who had not attended the 1-hour digital course. Furthermore, participants remarked that the content of the TOPP-N guidance module could be better adjusted to the current clinical placement, level of education, and individual achievements to be more usable. Despite this, most participants liked the TOPP-N application’s concept. Using the TOPP-N mobile app for guidance and assessment was found to be very flexible. 
The frequency and ways of using the application varied among the participants. Most participants perceived that the use of TOPP-N facilitated awareness of learning objectives and enabled continuous reflection and feedback from nurse preceptors. However, the findings indicate that the TOPP-N application’s perceived usefulness was highly dependent on the preparedness and use of the app among nurse preceptors (or absence thereof). Conclusions: This study offers information about critical success factors perceived by nursing students related to the use of the TOPP-N application. To develop similar learning management systems that are usable and efficient, developers should focus on personalizing the content, clarifying procedures for use, and enhancing the training and motivation of users, that is, students, nurse preceptors, and educators. %M 39255477 %R 10.2196/48810 %U https://nursing.jmir.org/2024/1/e48810 %U https://doi.org/10.2196/48810 %U http://www.ncbi.nlm.nih.gov/pubmed/39255477 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e50545 %T Integration of ChatGPT Into a Course for Medical Students: Explorative Study on Teaching Scenarios, Students’ Perception, and Applications %A Thomae,Anita V %A Witt,Claudia M %A Barth,Jürgen %K medical education %K ChatGPT %K artificial intelligence %K information for patients %K critical appraisal %K evaluation %K blended learning %K AI %K digital skills %K teaching %D 2024 %7 22.8.2024 %9 %J JMIR Med Educ %G English %X Background: Text-generating artificial intelligence (AI) such as ChatGPT offers many opportunities and challenges in medical education. Acquiring practical skills necessary for using AI in a clinical context is crucial, especially for medical education. Objective: This explorative study aimed to investigate the feasibility of integrating ChatGPT into teaching units and to evaluate the course and the importance of AI-related competencies for medical students. Since a possible application of ChatGPT in the medical field could be the generation of information for patients, we further investigated how such information is perceived by students in terms of persuasiveness and quality. Methods: ChatGPT was integrated into 3 different teaching units of a blended learning course for medical students. Using a mixed methods approach, quantitative and qualitative data were collected. As baseline data, we assessed students’ characteristics, including their openness to digital innovation. The students evaluated the integration of ChatGPT into the course and shared their thoughts regarding the future of text-generating AI in medical education. The course was evaluated based on the Kirkpatrick Model, with satisfaction, learning progress, and applicable knowledge considered as key assessment levels. In ChatGPT-integrating teaching units, students evaluated videos featuring information for patients regarding their persuasiveness on treatment expectations in a self-experience experiment and critically reviewed information for patients written using ChatGPT 3.5 based on different prompts. Results: A total of 52 medical students participated in the study. The comprehensive evaluation of the course revealed elevated levels of satisfaction, learning progress, and applicability specifically in relation to the ChatGPT-integrating teaching units. Furthermore, all evaluation levels demonstrated an association with each other. 
Higher openness to digital innovation was associated with higher satisfaction and, to a lesser extent, with higher applicability. AI-related competencies in other courses of the medical curriculum were perceived as highly important by medical students. Qualitative analysis highlighted potential use cases of ChatGPT in teaching and learning. In ChatGPT-integrating teaching units, students rated information for patients generated using a basic ChatGPT prompt as “moderate” in terms of comprehensibility, patient safety, and the correct application of communication rules taught during the course. The students’ ratings were considerably improved using an extended prompt. The same text, however, showed the smallest increase in treatment expectations when compared with information provided by humans (patient, clinician, and expert) via videos. Conclusions: This study offers valuable insights into integrating the development of AI competencies into a blended learning course. Integration of ChatGPT enhanced learning experiences for medical students. %R 10.2196/50545 %U https://mededu.jmir.org/2024/1/e50545 %U https://doi.org/10.2196/50545 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e57037 %T Integrating ChatGPT in Orthopedic Education for Medical Undergraduates: Randomized Controlled Trial %A Gan,Wenyi %A Ouyang,Jianfeng %A Li,Hua %A Xue,Zhaowen %A Zhang,Yiming %A Dong,Qiu %A Huang,Jiadong %A Zheng,Xiaofei %A Zhang,Yiyi %+ The First Clinical Medical College of Jinan University, The First Affiliated Hospital of Jinan University, No. 613, Huangpu Avenue West, Tianhe District, Guangzhou, 510630, China, 86 130 76855735, yiyizjun@126.com %K ChatGPT %K medical education %K orthopedics %K artificial intelligence %K large language model %K natural language processing %K randomized controlled trial %K learning aid %D 2024 %7 20.8.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: ChatGPT is a natural language processing model developed by OpenAI, which can be iteratively updated and optimized to accommodate the changing and complex requirements of human verbal communication. Objective: The study aimed to evaluate ChatGPT’s accuracy in answering orthopedics-related multiple-choice questions (MCQs) and assess its short-term effects as a learning aid through a randomized controlled trial. In addition, long-term effects on student performance in other subjects were measured using final examination results. Methods: We first evaluated ChatGPT’s accuracy in answering MCQs pertaining to orthopedics across various question formats. Then, 129 undergraduate medical students participated in a randomized controlled study in which the ChatGPT group used ChatGPT as a learning tool, while the control group was prohibited from using artificial intelligence software to support learning. Following a 2-week intervention, the 2 groups’ understanding of orthopedics was assessed by an orthopedics test, and variations in the 2 groups’ performance in other disciplines were noted through a follow-up at the end of the semester. Results: ChatGPT-4.0 answered 1051 orthopedics-related MCQs with a 70.60% (742/1051) accuracy rate, including 71.8% (237/330) accuracy for A1 MCQs, 73.7% (330/448) accuracy for A2 MCQs, 70.2% (92/131) accuracy for A3/4 MCQs, and 58.5% (83/142) accuracy for case analysis MCQs. As of April 7, 2023, a total of 129 individuals participated in the experiment. 
However, 19 individuals withdrew from the experiment at various phases; thus, as of July 1, 2023, a total of 110 individuals completed the trial and all follow-up work. After the short-term intervention in the students’ learning approach, the ChatGPT group answered more questions correctly than the control group (ChatGPT group: mean 141.20, SD 26.68; control group: mean 130.80, SD 25.56; P=.04) in the orthopedics test, particularly on A1 (ChatGPT group: mean 46.57, SD 8.52; control group: mean 42.18, SD 9.43; P=.01), A2 (ChatGPT group: mean 60.59, SD 10.58; control group: mean 56.66, SD 9.91; P=.047), and A3/4 MCQs (ChatGPT group: mean 19.57, SD 5.48; control group: mean 16.46, SD 4.58; P=.002). At the end of the semester, we found that the ChatGPT group performed better on final examinations in surgery (ChatGPT group: mean 76.54, SD 9.79; control group: mean 72.54, SD 8.11; P=.02) and obstetrics and gynecology (ChatGPT group: mean 75.98, SD 8.94; control group: mean 72.54, SD 8.66; P=.04) than the control group. Conclusions: ChatGPT answers orthopedics-related MCQs accurately, and students using it excel in both short-term and long-term assessments. Our findings strongly support ChatGPT’s integration into medical education, enhancing contemporary instructional methods. Trial Registration: Chinese Clinical Trial Registry Chictr2300071774; https://www.chictr.org.cn/hvshowproject.html?id=225740&v=1.0 %M 39163598 %R 10.2196/57037 %U https://www.jmir.org/2024/1/e57037 %U https://doi.org/10.2196/57037 %U http://www.ncbi.nlm.nih.gov/pubmed/39163598 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e52784 %T Influence of Model Evolution and System Roles on ChatGPT’s Performance in Chinese Medical Licensing Exams: Comparative Study %A Ming,Shuai %A Guo,Qingge %A Cheng,Wenjun %A Lei,Bo %K ChatGPT %K Chinese National Medical Licensing Examination %K large language models %K medical education %K system role %K LLM %K LLMs %K language model %K language models %K artificial intelligence %K chatbot %K chatbots %K conversational agent %K conversational agents %K exam %K exams %K examination %K examinations %K OpenAI %K answer %K answers %K response %K responses %K accuracy %K performance %K China %K Chinese %D 2024 %7 13.8.2024 %9 %J JMIR Med Educ %G English %X Background: With the increasing application of large language models like ChatGPT in various industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research. Objective: The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE). Methods: The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the version of GPT-3.5 and 4.0, the prompt’s designation of system roles tailored to medical subspecialties, and repetition for coherence. A passing accuracy threshold was established as 60%. The χ2 tests and κ values were employed to evaluate the model’s accuracy and consistency. Results: GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). 
However, both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%‐3.7%) and GPT-3.5 (1.3%‐4.5%), and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response. Conclusions: GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role numerically enhanced the model’s reliability and answer coherence, although the improvement was not statistically significant. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study. %R 10.2196/52784 %U https://mededu.jmir.org/2024/1/e52784 %U https://doi.org/10.2196/52784 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e56342 %T Assessing the Ability of a Large Language Model to Score Free-Text Medical Student Clinical Notes: Quantitative Study %A Burke,Harry B %A Hoang,Albert %A Lopreiato,Joseph O %A King,Heidi %A Hemmer,Paul %A Montgomery,Michael %A Gagarin,Viktoria %K medical education %K generative artificial intelligence %K natural language processing %K ChatGPT %K generative pretrained transformer %K standardized patients %K clinical notes %K free-text notes %K history and physical examination %K large language model %K LLM %K medical student %K medical students %K clinical information %K artificial intelligence %K AI %K patients %K patient %K medicine %D 2024 %7 25.7.2024 %9 %J JMIR Med Educ %G English %X Background: Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback regarding the quality of their free-text clinical notes. Objective: The goal of this study was to assess the ability of ChatGPT 3.5, a large language model, to score medical students’ free-text history and physical notes. Methods: This is a single-institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free-text history and physical note of their interaction. The students’ notes were scored independently by the standardized patients and ChatGPT using a prespecified scoring rubric that consisted of 85 case elements. The measure of accuracy was percent correct. Results: The study population consisted of 168 first-year medical students. There was a total of 14,280 scores. The ChatGPT incorrect scoring rate was 1.0%, and the standardized patient incorrect scoring rate was 7.2%. The ChatGPT error rate was 86% lower than the standardized patient error rate. The ChatGPT mean incorrect scoring rate of 12 (SD 11) was significantly lower than the standardized patient mean incorrect scoring rate of 85 (SD 74; P=.002). Conclusions: ChatGPT demonstrated a significantly lower error rate compared to standardized patients. This is the first study to assess the ability of a generative pretrained transformer (GPT) program to score medical students’ standardized patient-based free-text clinical notes. It is expected that, in the near future, large language models will provide real-time feedback to practicing physicians regarding their free-text notes. 
GPT artificial intelligence programs represent an important advance in medical education and medical practice. %R 10.2196/56342 %U https://mededu.jmir.org/2024/1/e56342 %U https://doi.org/10.2196/56342 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e52818 %T Appraisal of ChatGPT’s Aptitude for Medical Education: Comparative Analysis With Third-Year Medical Students in a Pulmonology Examination %A Cherif,Hela %A Moussa,Chirine %A Missaoui,Abdel Mouhaymen %A Salouage,Issam %A Mokaddem,Salma %A Dhahri,Besma %+ Faculté de Médecine de Tunis, Université de Tunis El Manar, 15, Rue Djebel Lakhdhar – Bab Saadoun, Tunis, 1007, Tunisia, 216 50424534, hela.cherif@fmt.utm.tn %K medical education %K ChatGPT %K GPT %K artificial intelligence %K natural language processing %K NLP %K pulmonary medicine %K pulmonary %K lung %K lungs %K respiratory %K respiration %K pneumology %K comparative analysis %K large language models %K LLMs %K LLM %K language model %K generative AI %K generative artificial intelligence %K generative %K exams %K exam %K examinations %K examination %D 2024 %7 23.7.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: The rapid evolution of ChatGPT has generated substantial interest and led to extensive discussions in both public and academic domains, particularly in the context of medical education. Objective: This study aimed to evaluate ChatGPT’s performance in a pulmonology examination through a comparative analysis with that of third-year medical students. Methods: In this cross-sectional study, we conducted a comparative analysis with 2 distinct groups. The first group comprised 244 third-year medical students who had previously taken our institution’s 2020 pulmonology examination, which was conducted in French. The second group involved ChatGPT-3.5 in 2 separate sets of conversations: without contextualization (V1) and with contextualization (V2). In both V1 and V2, ChatGPT received the same set of questions administered to the students. Results: V1 demonstrated exceptional proficiency in radiology, microbiology, and thoracic surgery, surpassing the majority of medical students in these domains. However, it faced challenges in pathology, pharmacology, and clinical pneumology. In contrast, V2 consistently delivered more accurate responses across various question categories, regardless of the specialization. ChatGPT exhibited suboptimal performance in multiple-choice questions compared to medical students. V2 excelled in responding to structured open-ended questions. Both ChatGPT conversations, particularly V2, outperformed students in addressing questions of low and intermediate difficulty. Interestingly, students showcased enhanced proficiency when confronted with highly challenging questions. V1 fell short of passing the examination. Conversely, V2 passed the examination, outperforming 139 (62.1%) medical students. Conclusions: While ChatGPT has access to a comprehensive web-based data set, its performance closely mirrors that of an average medical student. Outcomes are influenced by question format, item complexity, and contextual nuances. The model faces challenges in medical contexts requiring information synthesis, advanced analytical aptitude, and clinical judgment, as well as in non-English language assessments and when confronted with data outside mainstream internet sources. 
%M 39042876 %R 10.2196/52818 %U https://mededu.jmir.org/2024/1/e52818 %U https://doi.org/10.2196/52818 %U http://www.ncbi.nlm.nih.gov/pubmed/39042876 %0 Journal Article %@ 2369-3762 %I %V 10 %N %P e58126 %T Use of Multiple-Choice Items in Summative Examinations: Questionnaire Survey Among German Undergraduate Dental Training Programs %A Rössler,Lena %A Herrmann,Manfred %A Wiegand,Annette %A Kanzow,Philipp %K alternate-choice %K assessment %K best-answer %K dental %K dental schools %K dental training %K education %K educational assessment %K educational measurement %K examination %K German %K Germany %K k of n %K Kprim %K K’ %K medical education %K medical student %K MTF %K Multiple-True-False %K multiple choice %K multiple-select %K Pick-N %K scoring %K scoring system %K single choice %K single response %K test %K testing %K true/false %K true-false %K Type A %K Type K %K Type K’ %K Type R %K Type X %K undergraduate %K undergraduate curriculum %K undergraduate education %D 2024 %7 27.6.2024 %9 %J JMIR Med Educ %G English %X Background: Multiple-choice examinations are frequently used in German dental schools. However, details regarding the item types used and the scoring methods applied are lacking. Objective: This study aims to gain insight into the current use of multiple-choice items (ie, questions) in summative examinations in German undergraduate dental training programs. Methods: A paper-based 10-item questionnaire regarding the assessment methods, multiple-choice item types, and scoring methods used was designed. The pilot-tested questionnaire was mailed to the deans of studies and to the heads of the Department of Operative/Restorative Dentistry at all 30 dental schools in Germany in February 2023. Statistical analysis was performed using the Fisher exact test (P<.05). Results: The response rate amounted to 90% (27/30 dental schools). All responding dental schools used multiple-choice examinations for summative assessments. Examinations were delivered electronically by 70% (19/27) of the dental schools. Almost all dental schools used single-choice Type A items (24/27, 89%), which accounted for the largest number of items in approximately half of the dental schools (13/27, 48%). Further item types (eg, conventional multiple-select items, Multiple-True-False, and Pick-N) were used by fewer dental schools (≤67%, up to 18 out of 27 dental schools). For the multiple-select item types, the applied scoring methods varied considerably (ie, awarding [intermediate] partial credit and requirements for partial credit). Dental schools with the possibility of electronic examinations used multiple-select items slightly more often (14/19, 74% vs 4/8, 50%). However, this difference was not statistically significant (P=.38). Dental schools used items either individually or as key feature problems consisting of a clinical case scenario followed by a number of items focusing on critical treatment steps (15/27, 56%). Not a single school used alternative testing methods (eg, answer-until-correct). A formal item review process was established at about half of the dental schools (15/27, 56%). Conclusions: Summative assessment methods among German dental schools vary widely. In particular, a large variability was found regarding the use and scoring of multiple-select multiple-choice items. 
%R 10.2196/58126 %U https://mededu.jmir.org/2024/1/e58126 %U https://doi.org/10.2196/58126 %0 Journal Article %@ 2369-3762 %I %V 10 %N %P e56117 %T A Use Case for Generative AI in Medical Education %A Sekhar,Tejas C %A Nayak,Yash R %A Abdoler,Emily A %K medical education %K med ed %K generative artificial intelligence %K artificial intelligence %K GAI %K AI %K Anki %K flashcard %K undergraduate medical education %K UME %D 2024 %7 7.6.2024 %9 %J JMIR Med Educ %G English %X %R 10.2196/56117 %U https://mededu.jmir.org/2024/1/e56117 %U https://doi.org/10.2196/56117 %0 Journal Article %@ 2369-3762 %I %V 10 %N %P e58370 %T Authors’ Reply: A Use Case for Generative AI in Medical Education %A Pendergrast,Tricia %A Chalmers,Zachary %K ChatGPT %K undergraduate medical education %K large language models %D 2024 %7 7.6.2024 %9 %J JMIR Med Educ %G English %X %R 10.2196/58370 %U https://mededu.jmir.org/2024/1/e58370 %U https://doi.org/10.2196/58370 %0 Journal Article %@ 2562-0959 %I JMIR Publications %V 7 %N %P e55898 %T Assessing the Application of Large Language Models in Generating Dermatologic Patient Education Materials According to Reading Level: Qualitative Study %A Lambert,Raphaella %A Choo,Zi-Yi %A Gradwohl,Kelsey %A Schroedl,Liesl %A Ruiz De Luzuriaga,Arlene %+ Pritzker School of Medicine, University of Chicago, 924 East 57th Street #104, Chicago, IL, 60637, United States, 1 7737021937, aleksalambert@uchicagomedicine.org %K artificial intelligence %K large language models %K large language model %K LLM %K LLMs %K machine learning %K natural language processing %K deep learning %K ChatGPT %K health literacy %K health knowledge %K health information %K patient education %K dermatology %K dermatologist %K dermatologists %K derm %K dermatology resident %K dermatology residents %K dermatologic patient education material %K dermatologic patient education materials %K patient education material %K patient education materials %K education material %K education materials %D 2024 %7 16.5.2024 %9 Original Paper %J JMIR Dermatol %G English %X Background: Dermatologic patient education materials (PEMs) are often written above the national average seventh- to eighth-grade reading level. ChatGPT-3.5, GPT-4, DermGPT, and DocsGPT are large language models (LLMs) that are responsive to user prompts. Our project assesses their use in generating dermatologic PEMs at specified reading levels. Objective: This study aims to assess the ability of select LLMs to generate PEMs for common and rare dermatologic conditions at unspecified and specified reading levels. Further, the study aims to assess the preservation of meaning across such LLM-generated PEMs, as assessed by dermatology resident trainees. Methods: The Flesch-Kincaid reading level (FKRL) of current American Academy of Dermatology PEMs was evaluated for 4 common (atopic dermatitis, acne vulgaris, psoriasis, and herpes zoster) and 4 rare (epidermolysis bullosa, bullous pemphigoid, lamellar ichthyosis, and lichen planus) dermatologic conditions. We prompted ChatGPT-3.5, GPT-4, DermGPT, and DocsGPT to “Create a patient education handout about [condition] at a [FKRL]” to iteratively generate 10 PEMs per condition at unspecified, fifth-, and seventh-grade FKRLs, evaluated with Microsoft Word readability statistics. The preservation of meaning across LLMs was assessed by 2 dermatology resident trainees. Results: The current American Academy of Dermatology PEMs had an average (SD) FKRL of 9.35 (1.26) and 9.50 (2.3) for common and rare diseases, respectively. 
For common diseases, the FKRLs of LLM-produced PEMs ranged between 9.8 and 11.21 (unspecified prompt), between 4.22 and 7.43 (fifth-grade prompt), and between 5.98 and 7.28 (seventh-grade prompt). For rare diseases, the FKRLs of LLM-produced PEMs ranged between 9.85 and 11.45 (unspecified prompt), between 4.22 and 7.43 (fifth-grade prompt), and between 5.98 and 7.28 (seventh-grade prompt). At the fifth-grade reading level, GPT-4 was better at producing PEMs for both common and rare conditions than ChatGPT-3.5 (P=.001 and P=.01, respectively), DermGPT (P<.001 and P=.03, respectively), and DocsGPT (P<.001 and P=.02, respectively). At the seventh-grade reading level, no significant difference was found between ChatGPT-3.5, GPT-4, DocsGPT, or DermGPT in producing PEMs for common conditions (all P>.05); however, for rare conditions, ChatGPT-3.5 and DocsGPT outperformed GPT-4 (P=.003 and P<.001, respectively). The preservation of meaning analysis revealed that for common conditions, DermGPT ranked the highest for overall ease of reading, patient understandability, and accuracy (14.75/15, 98%); for rare conditions, handouts generated by GPT-4 ranked the highest (14.5/15, 97%). Conclusions: GPT-4 appeared to outperform ChatGPT-3.5, DocsGPT, and DermGPT at the fifth-grade FKRL for both common and rare conditions, although both ChatGPT-3.5 and DocsGPT performed better than GPT-4 at the seventh-grade FKRL for rare conditions. LLM-produced PEMs may reliably meet seventh-grade FKRLs for select common and rare dermatologic conditions and are easy to read, understandable for patients, and mostly accurate. LLMs may play a role in enhancing health literacy and disseminating accessible, understandable PEMs in dermatology. %M 38754096 %R 10.2196/55898 %U https://derma.jmir.org/2024/1/e55898 %U https://doi.org/10.2196/55898 %U http://www.ncbi.nlm.nih.gov/pubmed/38754096 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e56005 %T Training Family Medicine Residents in Dermoscopy Using an e-Learning Course: Pilot Interventional Study %A Friche,Pauline %A Moulis,Lionel %A Du Thanh,Aurélie %A Dereure,Olivier %A Duflos,Claire %A Carbonnel,Francois %+ Desbrest Institute of Epidemiology and Public Health, Unité Mixte de Recherche, Unité d'accueil 11, University of Montpellier, Institut national de la santé et de la recherche médicale, Camps ADV, IURC, 641 Avenue du Doyen Gaston Giraud, Montpellier, 34093, France, 33 684014834, Francois.carbonnel@umontpellier.fr %K dermoscopy %K dermatoscope %K dermatoscopes %K dermatological %K skin %K training %K GP %K family practitioner %K family practitioners %K family physician %K family physicians %K general practice %K family medicine %K primary health care %K internship and residency %K education %K e-learning %K eLearning %K dermatology %K resident %K residency %K intern %K interns %K internship %K internships %D 2024 %7 13.5.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: Skin cancers are the most common group of cancers diagnosed worldwide. Aging and sun exposure increase their risk. The decline in the number of dermatologists is pushing the issue of dermatological screening back onto family doctors. Dermoscopy is an easy-to-use tool that increases the sensitivity of melanoma diagnosis by 60% to 90%, but its use is limited due to lack of training. The characteristics of “ideal” dermoscopy training have yet to be established. We created a Moodle (Moodle HQ)-based e-learning course to train family medicine residents in dermoscopy. 
Objective: This study aimed to evaluate the evolution of dermoscopy knowledge among family doctors immediately and 1 and 3 months after e-learning training. Methods: We conducted a prospective interventional study between April and November 2020 to evaluate an educational program intended for family medicine residents at the University of Montpellier-Nîmes, France. They were asked to complete an e-learning course consisting of 2 modules, with an assessment quiz repeated at 1 (M1) and 3 months (M3). The course was based on a 2-step algorithm, a method of dermoscopic analysis of pigmented skin lesions that is internationally accepted. The objectives of modules 1 and 2 were to differentiate melanocytic lesions from nonmelanocytic lesions and to precisely identify skin lesions by looking for dermoscopic morphological criteria specific to each lesion. Each module consisted of 15 questions with immediate feedback after each question. Results: In total, 134 residents were included, and 66.4% (n=89) and 47% (n=63) of trainees fully participated in the evaluation of module 1 and module 2, respectively. This study showed a significant score improvement 3 months after the training course in 92.1% (n=82) of participants for module 1 and 87.3% (n=55) of participants for module 2 (P<.001). The majority of the participants expressed satisfaction (n=48, 90.6%) with the training course, and 96.3% (n=51) planned to use a dermatoscope in their future practice. Regarding final scores, the only variable that was statistically significant was the resident’s initial scores (P=.003) for module 1. No measured variable was found to be associated with retention (midtraining or final evaluation) for module 2. Residents who had completed at least 1 dermatology rotation during medical school had significantly higher initial scores in module 1 at M0 (P=.03). Residents who reported having completed at least 1 dermatology rotation during their family medicine training had a statistically significant higher score at M1 for module 1 and M3 for module 2 (P=.01 and P=.001). Conclusions: The integration of an e-learning training course in dermoscopy into the curriculum of FM residents results in a significant improvement in their diagnosis skills and meets their expectations. Developing a program combining an e-learning course and face-to-face training for residents is likely to result in more frequent and effective dermoscopy use by family doctors. %M 38739910 %R 10.2196/56005 %U https://formative.jmir.org/2024/1/e56005 %U https://doi.org/10.2196/56005 %U http://www.ncbi.nlm.nih.gov/pubmed/38739910 %0 Journal Article %@ 2369-3762 %I %V 10 %N %P e55048 %T Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study %A Rojas,Marcos %A Rojas,Marcelo %A Burgess,Valentina %A Toro-Pérez,Javier %A Salehi,Shima %K artificial intelligence %K AI %K generative artificial intelligence %K medical education %K ChatGPT %K EUNACOM %K medical licensure %K medical license %K medical licensing exam %D 2024 %7 29.4.2024 %9 %J JMIR Med Educ %G English %X Background: The deployment of OpenAI’s ChatGPT-3.5 and its subsequent versions, ChatGPT-4 and ChatGPT-4 With Vision (4V; also known as “GPT-4 Turbo With Vision”), has notably influenced the medical field. Having demonstrated remarkable performance in medical examinations globally, these models show potential for educational applications. 
However, their effectiveness in non-English contexts, particularly in Chile’s medical licensing examinations—a critical step for medical practitioners in Chile—is less explored. This gap highlights the need to evaluate ChatGPT’s adaptability to diverse linguistic and cultural contexts. Objective: This study aims to evaluate the performance of ChatGPT versions 3.5, 4, and 4V in the EUNACOM (Examen Único Nacional de Conocimientos de Medicina), a major medical examination in Chile. Methods: Three official practice drills (540 questions) from the University of Chile, mirroring the EUNACOM’s structure and difficulty, were used to test ChatGPT versions 3.5, 4, and 4V. The 3 ChatGPT versions were provided 3 attempts for each drill. Responses to questions during each attempt were systematically categorized and analyzed to assess their accuracy rate. Results: All versions of ChatGPT passed the EUNACOM drills. Specifically, versions 4 and 4V outperformed version 3.5, achieving average accuracy rates of 79.32% and 78.83%, respectively, compared to 57.53% for version 3.5 (P<.001). Version 4V, however, did not outperform version 4 (P=.73), despite the additional visual capabilities. We also evaluated ChatGPT’s performance in different medical areas of the EUNACOM and found that versions 4 and 4V consistently outperformed version 3.5. Across the different medical areas, version 3.5 displayed the highest accuracy in psychiatry (69.84%), while versions 4 and 4V achieved the highest accuracy in surgery (90.00% and 86.11%, respectively). Versions 3.5 and 4 had the lowest performance in internal medicine (52.74% and 75.62%, respectively), while version 4V had the lowest performance in public health (74.07%). Conclusions: This study reveals ChatGPT’s ability to pass the EUNACOM, with distinct proficiencies across versions 3.5, 4, and 4V. Notably, advancements in artificial intelligence (AI) have not significantly led to enhancements in performance on image-based questions. The variations in proficiency across medical fields suggest the need for more nuanced AI training. Additionally, the study underscores the importance of exploring innovative approaches to using AI to augment human cognition and enhance the learning process. Such advancements have the potential to significantly influence medical education, fostering not only knowledge acquisition but also the development of critical thinking and problem-solving skills among health care professionals. 
%R 10.2196/55048 %U https://mededu.jmir.org/2024/1/e55048 %U https://doi.org/10.2196/55048 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e57054 %T Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study %A Noda,Masao %A Ueno,Takayoshi %A Koshu,Ryota %A Takaso,Yuji %A Shimada,Mari Dias %A Saito,Chizu %A Sugimoto,Hisashi %A Fushiki,Hiroaki %A Ito,Makoto %A Nomura,Akihiro %A Yoshizaki,Tomokazu %+ Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Yakushiji 3311-1, Shimotsuke, 329-0498, Japan, 1 0285442111, doforanabdosuc@gmail.com %K artificial intelligence %K GPT-4v %K large language model %K otolaryngology %K GPT %K ChatGPT %K LLM %K LLMs %K language model %K language models %K head %K respiratory %K ENT: ear %K nose %K throat %K neck %K NLP %K natural language processing %K image %K images %K exam %K exams %K examination %K examinations %K answer %K answers %K answering %K response %K responses %D 2024 %7 28.3.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Artificial intelligence models can learn from medical literature and clinical cases and generate answers that rival those of human experts. However, challenges remain in the analysis of complex data containing images and diagrams. Objective: This study aims to assess the answering capabilities and accuracy of ChatGPT-4 Vision (GPT-4V) for a set of 100 questions, including image-based questions, from the 2023 otolaryngology board certification examination. Methods: Answers to 100 questions from the 2023 otolaryngology board certification examination, including image-based questions, were generated using GPT-4V. The accuracy rate was evaluated using different prompts, and the presence of images, clinical area of the questions, and variations in the answer content were examined. Results: The accuracy rate for text-only input was, on average, 24.7% but improved to 47.3% with the addition of English translation and prompts (P<.001). The average nonresponse rate for text-only input was 46.3%; this decreased to 2.7% with the addition of English translation and prompts (P<.001). The accuracy rate was lower for image-based questions than for text-only questions across all types of input, with a relatively high nonresponse rate. General questions and questions from the fields of head and neck allergies and nasal allergies had relatively high accuracy rates, which increased with the addition of translation and prompts. In terms of content, questions related to anatomy had the highest accuracy rate. For all content types, the addition of translation and prompts increased the accuracy rate. For image-based questions, the average correct answer rate was 30.4% with text-only input and 41.3% with text-plus-image input (P=.02). Conclusions: Examination of artificial intelligence’s answering capabilities for the otolaryngology board certification examination improves our understanding of its potential and limitations in this field. Although improvement was noted with the addition of translation and prompts, the accuracy rate for image-based questions was lower than that for text-based questions, suggesting room for improvement in GPT-4V at this stage. Furthermore, text-plus-image input achieved a higher correct answer rate on image-based questions than text-only input. Our findings suggest the usefulness and potential of GPT-4V in medicine; however, further consideration of methods for its safe use is needed. 
%M 38546736 %R 10.2196/57054 %U https://mededu.jmir.org/2024/1/e57054 %U https://doi.org/10.2196/57054 %U http://www.ncbi.nlm.nih.gov/pubmed/38546736 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e54401 %T Development of a Clinical Simulation Video to Evaluate Multiple Domains of Clinical Competence: Cross-Sectional Study %A Shikino,Kiyoshi %A Nishizaki,Yuji %A Fukui,Sho %A Yokokawa,Daiki %A Yamamoto,Yu %A Kobayashi,Hiroyuki %A Shimizu,Taro %A Tokuda,Yasuharu %+ Department of Community-Oriented Medical Education, Chiba University Graduate School of Medicine, 1-8-1, Inohana, Chiba, 2608677, Japan, 81 43 222 7171, kshikino@gmail.com %K discrimination index %K General Medicine In-Training Examination %K clinical simulation video %K postgraduate medical education %K video %K videos %K training %K examination %K examinations %K medical education %K resident %K residents %K postgraduate %K postgraduates %K simulation %K simulations %K diagnosis %K diagnoses %K diagnose %K general medicine %K general practice %K general practitioner %K skill %K skills %D 2024 %7 29.2.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Medical students in Japan undergo a 2-year postgraduate residency program to acquire clinical knowledge and general medical skills. The General Medicine In-Training Examination (GM-ITE) assesses postgraduate residents’ clinical knowledge. A clinical simulation video (CSV) may assess learners’ interpersonal abilities. Objective: This study aimed to evaluate the relationship between GM-ITE scores and resident physicians’ diagnostic skills by having them watch a CSV and to explore resident physicians’ perceptions of the CSV’s realism, educational value, and impact on their motivation to learn. Methods: The participants included 56 postgraduate medical residents who took the GM-ITE between January 21 and January 28, 2021; watched the CSV; and then provided a diagnosis. The CSV and GM-ITE scores were compared, and the validity of the simulations was examined using discrimination indices, wherein ≥0.20 indicated high discriminatory power and >0.40 indicated a very good measure of the subject’s qualifications. Additionally, we administered an anonymous questionnaire to ascertain participants’ views on the realism and educational value of the CSV and its impact on their motivation to learn. Results: Of the 56 participants, 6 (11%) provided the correct diagnosis, and all were from the second postgraduate year. All domains indicated high discriminatory power. The anonymous survey revealed that 12 (52%) participants found the CSV format more suitable than the GM-ITE for assessing clinical competence, 18 (78%) affirmed the realism of the video simulation, and 17 (74%) indicated that the experience increased their motivation to learn. Conclusions: The findings indicated that CSV modules simulating real-world clinical examinations were successful in assessing examinees’ clinical competence across multiple domains. The study demonstrated that the CSV not only augmented the assessment of diagnostic skills but also positively impacted learners’ motivation, suggesting a multifaceted role for simulation in medical education. 
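The clinical simulation video study above interprets discrimination indices against the ≥0.20 and >0.40 thresholds. As a rough sketch, assuming a classical upper-lower group formulation (the authors' exact computation may differ), the index for one item can be computed as follows; the scores and item responses are invented.

```python
# Assumed classical formulation: D = p_upper - p_lower, using the top and
# bottom fraction of examinees ranked by total score. Data are hypothetical.
def discrimination_index(total_scores, item_correct, group_fraction=0.27):
    """total_scores: overall scores; item_correct: 1/0 responses to one item."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    k = max(1, int(len(order) * group_fraction))
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower

scores = [55, 60, 62, 70, 71, 75, 80, 84, 90, 95]   # hypothetical total scores
item = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]               # hypothetical item results

d = discrimination_index(scores, item)
label = "very good" if d > 0.40 else "high" if d >= 0.20 else "low"
print(f"D = {d:.2f} ({label} discriminatory power)")
```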
%M 38421691 %R 10.2196/54401 %U https://mededu.jmir.org/2024/1/e54401 %U https://doi.org/10.2196/54401 %U http://www.ncbi.nlm.nih.gov/pubmed/38421691 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e50965 %T Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study %A Meyer,Annika %A Riese,Janik %A Streichert,Thomas %+ Institute for Clinical Chemistry, University Hospital Cologne, Kerpener Str 62, Cologne, 50937, Germany, annika.meyer1@uk-koeln.de %K ChatGPT %K artificial intelligence %K large language model %K medical exams %K medical examinations %K medical education %K LLM %K public trust %K trust %K medical accuracy %K licensing exam %K licensing examination %K improvement %K patient care %K general population %K licensure examination %D 2024 %7 8.2.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: The potential of artificial intelligence (AI)–based large language models, such as ChatGPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility, but also by the prospect of democratizing medical knowledge and promoting equitable health care. However, the performance of ChatGPT is substantially influenced by the input language, and given the growing public trust in this AI tool compared to that in traditional sources of information, investigating its medical accuracy across different languages is of particular importance. Objective: This study aimed to compare the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination. Methods: To assess GPT-3.5’s and GPT-4's medical proficiency, we used 937 original multiple-choice questions from 3 written German medical licensing examinations in October 2021, April 2022, and October 2022. Results: GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same examinations in October 2021, April 2022, and October 2022, respectively. This represents a substantial improvement of 27% compared to GPT-3.5, which only passed 1 out of the 3 examinations. While GPT-3.5 performed well in psychiatry questions, GPT-4 exhibited strengths in internal medicine and surgery but showed weakness in academic research. Conclusions: The study results highlight ChatGPT’s remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While its predecessor (GPT-3.5) was imprecise and inconsistent, GPT-4 demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its results. As the replacement of search engines by AI tools seems possible in the future, further studies with nonprofessional questions are needed to assess the safety and accuracy of ChatGPT for the general population. 
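The German licensing examination study above expresses GPT-4's performance as percentile ranks within the student cohorts. The sketch below illustrates the general idea against a simulated score distribution; the cohort parameters are assumptions, and only the 85% average model score is taken from the abstract.

```python
# Illustrative sketch: percentile rank of a fixed model score within a
# simulated student score distribution (the real cohort data are not used here).
import numpy as np
from scipy.stats import percentileofscore

rng = np.random.default_rng(0)
student_scores = rng.normal(loc=72, scale=9, size=2000).clip(0, 100)  # assumed cohort
model_score = 85.0  # GPT-4's reported average score

pct = percentileofscore(student_scores, model_score, kind="mean")
print(f"A score of {model_score} lies at percentile {pct:.1f} of this simulated cohort.")
```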
%M 38329802 %R 10.2196/50965 %U https://mededu.jmir.org/2024/1/e50965 %U https://doi.org/10.2196/50965 %U http://www.ncbi.nlm.nih.gov/pubmed/38329802 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 7 %N %P e53293 %T Evaluating Clinical Outcomes in Patients Being Treated Exclusively via Telepsychiatry: Retrospective Data Analysis %A Person,Cheryl %A O'Connor,Nicola %A Koehler,Lucy %A Venkatachalam,Kartik %A Gaveras,Georgia %+ Talkiatry, 109 W 27th Street Suite 5S, New York, NY, 10001, United States, 1 833 351 8255, cheryl.person@talkiatry.com %K telepsychiatry %K PHQ-8 %K GAD-7 %K clinical outcomes %K rural %K commercial insurance %K telehealth %K depression %K anxiety %K telemental health %K psychiatry %K Generalized Anxiety Disorder-7 %K Patient Health Questionnaire-8 %D 2023 %7 8.12.2023 %9 Original Paper %J JMIR Form Res %G English %X Background: Depression and anxiety are highly prevalent conditions in the United States. Despite the availability of suitable therapeutic options, limited access to high-quality psychiatrists represents a major barrier to treatment. Although telepsychiatry has the potential to improve access to psychiatrists, treatment efficacy in the telepsychiatry model remains unclear. Objective: Our primary objective was to determine whether there was a clinically meaningful change in 1 of 2 validated outcome measures of depression and anxiety—the Patient Health Questionnaire–8 (PHQ-8) or the Generalized Anxiety Disorder–7 (GAD-7)—after receiving at least 8 weeks of treatment in an outpatient telepsychiatry setting. Methods: We included treatment-seeking patients enrolled in a large outpatient telepsychiatry service that accepts commercial insurance. All analyzed patients completed the GAD-7 and PHQ-8 prior to their first appointment and at least once after 8 weeks of treatment. Treatments included comprehensive diagnostic evaluation, supportive psychotherapy, and medication management. Results: In total, 1826 treatment-seeking patients were evaluated for clinically meaningful changes in GAD-7 and PHQ-8 scores during treatment. Mean treatment duration was 103 (SD 34) days. At baseline, 58.8% (1074/1826) and 60.1% (1097/1826) of patients exhibited at least moderate anxiety and depression, respectively. In response to treatment, mean change for GAD-7 was –6.71 (95% CI –7.03 to –6.40) and for PHQ-8 was –6.85 (95% CI –7.18 to –6.52). Patients with at least moderate symptoms at baseline showed a 45.7% reduction in GAD-7 scores and a 43.1% reduction in PHQ-8 scores. Effect sizes for GAD-7 and PHQ-8, as measured by Cohen d for paired samples, were d=1.30 (P<.001) and d=1.23 (P<.001), respectively. Changes in GAD-7 and PHQ-8 scores correlated with the type of insurance held by the patients. Greatest reductions in scores were observed among patients with commercial insurance (45% and 43.9% reductions in GAD-7 and PHQ-8 scores, respectively). Although patients with Medicare did exhibit statistically significant reductions in GAD-7 and PHQ-8 scores from baseline (P<.001), these improvements were attenuated compared to those in patients with commercial insurance (29.2% and 27.6% reduction in GAD-7 and PHQ-8 scores, respectively). Pairwise comparison tests revealed significant differences in treatment responses in patients with Medicare versus commercial insurance (P<.001). Responses were independent of patient geographic classification (urban vs rural; P=.48 for GAD-7 and P=.07 for PHQ-8). 
The finding that treatment efficacy was comparable among rural and urban patients indicated that telepsychiatry is a promising approach to overcome treatment disparities that stem from geographical constraints. Conclusions: In this large retrospective data analysis of treatment-seeking patients using a telepsychiatry platform, we found robust and clinically significant improvement in depression and anxiety symptoms during treatment. The results provide further evidence that telepsychiatry is highly effective and has the potential to improve access to psychiatric care. %M 37991899 %R 10.2196/53293 %U https://formative.jmir.org/2023/1/e53293 %U https://doi.org/10.2196/53293 %U http://www.ncbi.nlm.nih.gov/pubmed/37991899 %0 Journal Article %@ 1929-0748 %I JMIR Publications %V 12 %N %P e48672 %T Evaluation of Incremental Validity of Casper in Predicting Program and National Licensure Performance of Undergraduate Nursing Students: Protocol for a Mixed Methods Study %A Stevens,Kathleen %A Moralejo,Donna %A Crossman,Renee %+ Faculty of Nursing, Memorial University, 300 Prince Phillip Drive, St John's, NL, A1B 3V6, Canada, 1 7098647100, kathleen.stevens@mun.ca %K communication %K empathy %K incremental validity %K mixed methods %K nursing school admissions %K problem-solving %K professionalism %K situational judgement testing %K undergraduate nursing students %D 2023 %7 18.10.2023 %9 Protocol %J JMIR Res Protoc %G English %X Background: Academic success has been the primary criterion for admission to many nursing programs. However, academic success as an admission criterion may have limited predictive value for success in noncognitive skills. Adding situational judgment tests, such as Casper, to admissions procedures may be one strategy to strengthen decisions and address the limited predictive value of academic admission criteria. In 2021, admissions processes were modified to include Casper based on concerns identified with noncognitive skills. Objective: This study aims to (1) assess the incremental validity of Casper scores in predicting nursing student performance at years 1, 2, 3, and 4 and on the National Council Licensing Examination (NCLEX) performance; and (2) examine faculty members’ perceptions of student performance and influences related to communication, professionalism, empathy, and problem-solving. Methods: We will use a multistage evaluation mixed methods design with 5 phases. At the end of each year, students will complete questionnaires related to empathy and professionalism and have their performance assessed for communication and problem-solving in psychomotor laboratory sessions. The final phase will assess graduate performance on the NCLEX. Each phase also includes qualitative data collection (ie, focus groups with faculty members). The goal of the focus groups is to help explain the quantitative findings (explanatory phase) as well as inform data collection (eg, focus group questions) in the subsequent phase (exploratory sequence). All students enrolled in the first year of the nursing program in 2021 were asked to participate (n=290). Faculty will be asked to participate in the focus groups at the end of each year of the program. 
Hierarchical multiple regression will be conducted for each outcome of interest (eg, communication, professionalism, empathy, and problem-solving) to determine the extent to which scores on Casper with admission grades, compared to admission grades alone, predict nursing student performance at years 1-4 of the program and success on the national exam. Thematic analysis of focus group transcripts will be conducted using interpretive description. The quantitative and qualitative data will be integrated after each phase is complete and at the end of the study. Results: This study was funded in September 2021, and data collection began in March 2022. Year 1 data collection and analysis are complete. Year 2 data collection is complete, and data analysis is in progress. Conclusions: At the end of the study, we will provide the results of a comprehensive analysis to determine the extent to which the addition of scores on Casper compared to admission grades alone predicts nursing student performance at years 1-4 of the program and on the NCLEX exam. International Registered Report Identifier (IRRID): RR1-10.2196/48672 %M 37851504 %R 10.2196/48672 %U https://www.researchprotocols.org/2023/1/e48672 %U https://doi.org/10.2196/48672 %U http://www.ncbi.nlm.nih.gov/pubmed/37851504 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 7 %N %P e48023 %T Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study %A Yanagita,Yasutaka %A Yokokawa,Daiki %A Uchida,Shun %A Tawara,Junsuke %A Ikusaka,Masatomi %+ Department of General Medicine, Chiba University Hospital, 1-8-1 Inohana, Chuo-ku, Chiba, 260-8677, Japan, 81 43 222 7171 ext 6438, y.yanagita@gmail.com %K artificial intelligence %K ChatGPT %K GPT-4 %K AI %K National Medical Licensing Examination %K Japanese %K NMLE %D 2023 %7 13.10.2023 %9 Original Paper %J JMIR Form Res %G English %X Background: ChatGPT (OpenAI) has gained considerable attention because of its natural and intuitive responses. ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers, as stated by OpenAI as a limitation. However, considering that ChatGPT is an interactive AI that has been trained to reduce the output of unethical sentences, the reliability of the training data is high and the usefulness of the output content is promising. Fortunately, in March 2023, a new version of ChatGPT, GPT-4, was released, which, according to internal evaluations, was expected to increase the likelihood of producing factual responses by 40% compared with its predecessor, GPT-3.5. The usefulness of this version of ChatGPT in English is widely appreciated. It is also increasingly being evaluated as a system for obtaining medical information in languages other than English. Although it does not reach a passing score on the national medical examination in Chinese, its accuracy is expected to gradually improve. Evaluation of ChatGPT with Japanese input is limited, although there have been reports on the accuracy of ChatGPT’s answers to clinical questions regarding the Japanese Society of Hypertension guidelines and on the performance of the National Nursing Examination. Objective: The objective of this study is to evaluate whether ChatGPT can provide accurate diagnoses and medical knowledge for Japanese input. Methods: Questions from the National Medical Licensing Examination (NMLE) in Japan, administered by the Japanese Ministry of Health, Labour and Welfare in 2022, were used. All 400 questions were included. 
Exclusion criteria were figures and tables that ChatGPT could not recognize; only text questions were extracted. We instructed GPT-3.5 and GPT-4 to input the Japanese questions as they were and to output the correct answers for each question. The output of ChatGPT was verified by 2 general practice physicians. In case of discrepancies, they were checked by another physician to make a final decision. The overall performance was evaluated by calculating the percentage of correct answers output by GPT-3.5 and GPT-4. Results: Of the 400 questions, 292 were analyzed. Questions containing charts, which are not supported by ChatGPT, were excluded. The correct response rate for GPT-4 was 81.5% (237/292), which was significantly higher than the rate for GPT-3.5, 42.8% (125/292). Moreover, GPT-4 surpassed the passing standard (>72%) for the NMLE, indicating its potential as a diagnostic and therapeutic decision aid for physicians. Conclusions: GPT-4 reached the passing standard for the NMLE in Japan, entered in Japanese, although it is limited to written questions. As the accelerated progress in the past few months has shown, the performance of the AI will improve as the large language model continues to learn more, and it may well become a decision support system for medical professionals by providing more accurate information. %M 37831496 %R 10.2196/48023 %U https://formative.jmir.org/2023/1/e48023 %U https://doi.org/10.2196/48023 %U http://www.ncbi.nlm.nih.gov/pubmed/37831496 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e50514 %T Assessment of Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: Comparative Study %A Huang,Ryan ST %A Lu,Kevin Jia Qi %A Meaney,Christopher %A Kemppainen,Joel %A Punnett,Angela %A Leung,Fok-Han %+ Temerty Faculty of Medicine, University of Toronto, 1 King’s College Cir, Toronto, ON, M5S 1A8, Canada, 1 416 978 6585, ry.huang@mail.utoronto.ca %K medical education %K medical knowledge exam %K artificial intelligence %K AI %K natural language processing %K NLP %K large language model %K LLM %K machine learning, ChatGPT %K GPT-3.5 %K GPT-4 %K education %K language model %K education examination %K testing %K utility %K family medicine %K medical residents %K test %K community %D 2023 %7 19.9.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Large language model (LLM)–based chatbots are evolving at an unprecedented pace with the release of ChatGPT, specifically GPT-3.5, and its successor, GPT-4. Their capabilities in general-purpose tasks and language generation have advanced to the point of performing excellently on various educational examination benchmarks, including medical knowledge tests. Comparing the performance of these 2 LLM models to that of Family Medicine residents on a multiple-choice medical knowledge test can provide insights into their potential as medical education tools. Objective: This study aimed to quantitatively and qualitatively compare the performance of GPT-3.5, GPT-4, and Family Medicine residents in a multiple-choice medical knowledge test appropriate for the level of a Family Medicine resident. Methods: An official University of Toronto Department of Family and Community Medicine Progress Test consisting of multiple-choice questions was inputted into GPT-3.5 and GPT-4. 
The artificial intelligence chatbot’s responses were manually reviewed to determine the selected answer, response length, response time, provision of a rationale for the outputted response, and the root cause of all incorrect responses (classified into arithmetic, logical, and information errors). The performance of the artificial intelligence chatbots were compared against a cohort of Family Medicine residents who concurrently attempted the test. Results: GPT-4 performed significantly better compared to GPT-3.5 (difference 25.0%, 95% CI 16.3%-32.8%; McNemar test: P<.001); it correctly answered 89/108 (82.4%) questions, while GPT-3.5 answered 62/108 (57.4%) questions correctly. Further, GPT-4 scored higher across all 11 categories of Family Medicine knowledge. In 86.1% (n=93) of the responses, GPT-4 provided a rationale for why other multiple-choice options were not chosen compared to the 16.7% (n=18) achieved by GPT-3.5. Qualitatively, for both GPT-3.5 and GPT-4 responses, logical errors were the most common, while arithmetic errors were the least common. The average performance of Family Medicine residents was 56.9% (95% CI 56.2%-57.6%). The performance of GPT-3.5 was similar to that of the average Family Medicine resident (P=.16), while the performance of GPT-4 exceeded that of the top-performing Family Medicine resident (P<.001). Conclusions: GPT-4 significantly outperforms both GPT-3.5 and Family Medicine residents on a multiple-choice medical knowledge test designed for Family Medicine residents. GPT-4 provides a logical rationale for its response choice, ruling out other answer choices efficiently and with concise justification. Its high degree of accuracy and advanced reasoning capabilities facilitate its potential applications in medical education, including the creation of exam questions and scenarios as well as serving as a resource for medical knowledge or information on community services. %M 37725411 %R 10.2196/50514 %U https://mededu.jmir.org/2023/1/e50514 %U https://doi.org/10.2196/50514 %U http://www.ncbi.nlm.nih.gov/pubmed/37725411 %0 Journal Article %@ 2292-9495 %I JMIR Publications %V 10 %N %P e46120 %T Evaluation of Eye Gaze Dynamics During Physician-Patient-Computer Interaction in Federally Qualified Health Centers: Systematic Analysis %A Almansour,Amal %A Montague,Enid %A Furst,Jacob %A Raicu,Daniela %+ Department of Mechanical & Industrial Engineering, University of Toronto, 5 King’s College Road, Toronto, ON, M5S 3G8, Canada, 1 416 978 3040, enid.montague@utoronto.ca %K patient-physician-computer interaction %K nonverbal communication %K Federally Qualified Health Centers %K primary care encounter %D 2023 %7 8.9.2023 %9 Original Paper %J JMIR Hum Factors %G English %X Background: Understanding the communication between physicians and patients can identify areas where they can improve and build stronger relationships. This led to better patient outcomes including increased engagement, enhanced adherence to treatment plan, and a boost in trust. Objective: This study investigates eye gaze directions of physicians, patients, and computers in naturalistic medical encounters at Federally Qualified Health Centers to understand communication patterns given different patients’ diverse backgrounds. The aim is to support the building and designing of health information technologies, which will facilitate the improvement of patient outcomes. 
Methods: Data were obtained from 77 videotaped medical encounters in 2014 from 3 Federally Qualified Health Centers in Chicago, Illinois, that included 11 physicians and 77 patients. Self-reported surveys were collected from physicians and patients. A systematic analysis approach was used to thoroughly examine and analyze the data. The dynamics of eye gazes during interactions between physicians, patients, and computers were evaluated using the lag sequential analysis method. The objective of the study was to identify significant behavior patterns from the 6 predefined patterns initiated by both physicians and patients. The association between eye gaze patterns was examined using the Pearson chi-square test and the Yule Q test. Results: The results of the lag sequential method showed that 3 out of 6 doctor-initiated gaze patterns were followed by patient-response gaze patterns. Moreover, 4 out of 6 patient-initiated patterns were significantly followed by doctor-response gaze patterns. Unlike the findings in previous studies, doctor-initiated eye gaze behavior patterns were not leading patients’ eye gaze. Moreover, patient-initiated eye gaze behavior patterns were significant in certain circumstances, particularly when interacting with physicians. Conclusions: This study examined several physician-patient-computer interaction patterns in naturalistic settings using lag sequential analysis. The data indicated a significant influence of the patients’ gazes on physicians. The findings revealed that physicians demonstrated a higher tendency to engage with patients by reciprocating the patient’s eye gaze when the patient looked at them. However, the reverse pattern was not observed, suggesting a lack of reciprocal gaze from patients toward physicians and a tendency to not direct their gaze toward a specific object. Furthermore, patients exhibited a preference for the computer when physicians directed their eye gaze toward it. %M 37682590 %R 10.2196/46120 %U https://humanfactors.jmir.org/2023/1/e46120 %U https://doi.org/10.2196/46120 %U http://www.ncbi.nlm.nih.gov/pubmed/37682590 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e50336 %T Authors’ Reply to: Variability in Large Language Models’ Responses to Medical Licensing and Certification Examinations %A Gilson,Aidan %A Safranek,Conrad W %A Huang,Thomas %A Socrates,Vimig %A Chi,Ling %A Taylor,Richard Andrew %A Chartash,David %+ Section for Biomedical Informatics and Data Science, Yale University School of Medicine, 100 College Street, 9th Fl, New Haven, CT, 06510, United States, 1 203 737 5379, david.chartash@yale.edu %K natural language processing %K NLP %K MedQA %K generative pre-trained transformer %K GPT %K medical education %K chatbot %K artificial intelligence %K AI %K education technology %K ChatGPT %K conversational agent %K machine learning %K large language models %K knowledge assessment %D 2023 %7 13.7.2023 %9 Letter to the Editor %J JMIR Med Educ %G English %X %M 37440299 %R 10.2196/50336 %U https://mededu.jmir.org/2023/1/e50336 %U https://doi.org/10.2196/50336 %U http://www.ncbi.nlm.nih.gov/pubmed/37440299 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48305 %T Variability in Large Language Models’ Responses to Medical Licensing and Certification Examinations. Comment on “How Does ChatGPT Perform on the United States Medical Licensing Examination? 
The Implications of Large Language Models for Medical Education and Knowledge Assessment” %A Epstein,Richard H %A Dexter,Franklin %+ Department of Anesthesiology, Perioperative Medicine and Pain Management, University of Miami Miller School of Medicine, 1400 NW 12th Ave, Suite 4022F, Miami, FL, 33136, United States, 1 215 896 7850, repstein@med.miami.edu %K natural language processing %K NLP %K MedQA %K generative pre-trained transformer %K GPT %K medical education %K chatbot %K artificial intelligence %K AI %K education technology %K ChatGPT %K Google Bard %K conversational agent %K machine learning %K large language models %K knowledge assessment %D 2023 %7 13.7.2023 %9 Letter to the Editor %J JMIR Med Educ %G English %X %M 37440293 %R 10.2196/48305 %U https://mededu.jmir.org/2023/1/e48305 %U https://doi.org/10.2196/48305 %U http://www.ncbi.nlm.nih.gov/pubmed/37440293 %0 Journal Article %@ 1929-0748 %I JMIR Publications %V 12 %N %P e40545 %T Variation in Experiences and Attainment in Surgery Between Ethnicities of UK Medical Students and Doctors (ATTAIN): Protocol for a Cross-Sectional Study %A Babiker,Samar %A Ogunmwonyi,Innocent %A Georgi,Maria W %A Tan,Lawrence %A Haque,Sharmi %A Mullins,William %A Singh,Prisca %A Ang,Nadya %A Fu,Howell %A Patel,Krunal %A Khera,Jevan %A Fricker,Monty %A Fleming,Simon %A Giwa-Brown,Lolade %A A Brennan,Peter %A Irune,Ekpemi %A Vig,Stella %A Nathan,Arjun %+ University College London Medical School, Goodge Street, London, WC1E 6BT, United Kingdom, 44 07578636123, maria.georgi@icloud.com %K diversity in surgery %K Black and Minority Ethnic %K BME in surgery %K differential attainment %K diversity %K surgery %K health care system %K surgical training %K disparity %K ethnic disparity %K ethnicity %K medical student %K doctor %K training experience %K surgical placements %K physician %K health care provider %K experience %K perception %K cross-sectional %K doctor in training %K resident %K fellow %K fellowship %K questionnaire %K survey %K Everyday Discrimination Scale %K Maslach Burnout Inventory %K Higher Education %K ethnicities %D 2023 %7 16.6.2023 %9 Protocol %J JMIR Res Protoc %G English %X Background: The unequal distribution of academic and professional outcomes between different minority groups is a pervasive issue in many fields, including surgery. The implications of differential attainment remain significant, not only for the individuals affected but also for the wider health care system. An inclusive health care system is crucial in meeting the needs of an increasingly diverse patient population, thereby leading to better outcomes. One barrier to diversifying the workforce is the differential attainment in educational outcomes between Black and Minority Ethnic (BME) and White medical students and doctors in the United Kingdom. BME trainees are known to have lower performance rates in medical examinations, including undergraduate and postgraduate exams, Annual Review of Competence Progression, as well as training and consultant job applications. Studies have shown that BME candidates have a higher likelihood of failing both parts of the Membership of the Royal Colleges of Surgeons exams and are 10% less likely to be considered suitable for core surgical training. Several contributing factors have been identified; however, there has been limited evidence investigating surgical training experiences and their relationship to differential attainment. 
To understand the nature of differential attainment in surgery and to develop effective strategies to address it, it is essential to examine the underlying causes and contributing factors. The Variation in Experiences and Attainment in Surgery Between Ethnicities of UK Medical Students and Doctors (ATTAIN) study aims to describe and compare the factors and outcomes of attainment between different ethnicities of doctors and medical students. Objective: The primary aim will be to compare the effect of experiences and perceptions of surgical education of students and doctors of different ethnicities. Methods: This protocol describes a nationwide cross-sectional study of medical students and nonconsultant grade doctors in the United Kingdom. Participants will complete a web-based questionnaire collecting data on experiences and perceptions of surgical placements as well as self-reported academic attainment data. A comprehensive data collection strategy will be used to collect a representative sample of the population. A set of surrogate markers relevant to surgical training will be used to establish a primary outcome to determine variations in attainment. Regression analyses will be used to identify potential causes for the variation in attainment. Results: Data collected between February 2022 and September 2022 yielded 1603 respondents. Data analysis is yet to be competed. The protocol was approved by the University College London Research Ethics Committee on September 16, 2021 (ethics approval reference 19071/004). The findings will be disseminated through peer-reviewed publications and conference presentations. Conclusions: Drawing upon the conclusions of this study, we aim to make recommendations on educational policy reforms. Additionally, the creation of a large, comprehensive data set can be used for further research. International Registered Report Identifier (IRRID): DERR1-10.2196/40545 %M 37327055 %R 10.2196/40545 %U https://www.researchprotocols.org/2023/1/e40545 %U https://doi.org/10.2196/40545 %U http://www.ncbi.nlm.nih.gov/pubmed/37327055 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e44084 %T Scoring Single-Response Multiple-Choice Items: Scoping Review and Comparison of Different Scoring Methods %A Kanzow,Amelie Friederike %A Schmidt,Dennis %A Kanzow,Philipp %+ Department of Preventive Dentistry, Periodontology and Cariology, University Medical Center Göttingen, Robert-Koch-Strasse 40, Göttingen, 37075, Germany, 49 551 3960870, philipp.kanzow@med.uni-goettingen.de %K alternate-choice %K best-answer %K education %K education system %K educational assessment %K educational measurement %K examination %K multiple choice %K results %K scoring %K scoring system %K single choice %K single response %K scoping review %K test %K testing %K true/false %K true-false %K Type A %D 2023 %7 19.5.2023 %9 Review %J JMIR Med Educ %G English %X Background: Single-choice items (eg, best-answer items, alternate-choice items, single true-false items) are 1 type of multiple-choice items and have been used in examinations for over 100 years. At the end of every examination, the examinees’ responses have to be analyzed and scored to derive information about examinees’ true knowledge. Objective: The aim of this paper is to compile scoring methods for individual single-choice items described in the literature. Furthermore, the metric expected chance score and the relation between examinees’ true knowledge and expected scoring results (averaged percentage score) are analyzed. 
Besides, implications for potential pass marks to be used in examinations to test examinees for a predefined level of true knowledge are derived. Methods: Scoring methods for individual single-choice items were extracted from various databases (ERIC, PsycInfo, Embase via Ovid, MEDLINE via PubMed) in September 2020. Eligible sources reported on scoring methods for individual single-choice items in written examinations including but not limited to medical education. Separately for items with n=2 answer options (eg, alternate-choice items, single true-false items) and best-answer items with n=5 answer options (eg, Type A items) and for each identified scoring method, the metric expected chance score and the expected scoring results as a function of examinees’ true knowledge using fictitious examinations with 100 single-choice items were calculated. Results: A total of 21 different scoring methods were identified from the 258 included sources, with varying consideration of correctly marked, omitted, and incorrectly marked items. Resulting credit varied between –3 and +1 credit points per item. For items with n=2 answer options, expected chance scores from random guessing ranged between –1 and +0.75 credit points. For items with n=5 answer options, expected chance scores ranged between –2.2 and +0.84 credit points. All scoring methods showed a linear relation between examinees’ true knowledge and the expected scoring results. Depending on the scoring method used, examination results differed considerably: Expected scoring results from examinees with 50% true knowledge ranged between 0.0% (95% CI 0% to 0%) and 87.5% (95% CI 81.0% to 94.0%) for items with n=2 and between –60.0% (95% CI –60% to –60%) and 92.0% (95% CI 86.7% to 97.3%) for items with n=5. Conclusions: In examinations with single-choice items, the scoring result is not always equivalent to examinees’ true knowledge. When interpreting examination scores and setting pass marks, the number of answer options per item must usually be taken into account in addition to the scoring method used. %M 37001510 %R 10.2196/44084 %U https://mededu.jmir.org/2023/1/e44084 %U https://doi.org/10.2196/44084 %U http://www.ncbi.nlm.nih.gov/pubmed/37001510 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e43792 %T Use of Multiple-Select Multiple-Choice Items in a Dental Undergraduate Curriculum: Retrospective Study Involving the Application of Different Scoring Methods %A Kanzow,Philipp %A Schmidt,Dennis %A Herrmann,Manfred %A Wassmann,Torsten %A Wiegand,Annette %A Raupach,Tobias %+ Department of Preventive Dentistry, Periodontology and Cariology, University Medical Center Göttingen, Robert-Koch-Str 40, Göttingen, 37075, Germany, 49 551 3960870, philipp.kanzow@med.uni-goettingen.de %K dental education %K education system %K educational assessment %K educational measurement %K examination %K k of n %K Kprim %K K’ %K MTF %K Multiple-True-False %K Pick-N %K scoring %K scoring system %K Type X %K undergraduate %K undergraduate curriculum %K undergraduate education %D 2023 %7 27.3.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Scoring and awarding credit are more complex for multiple-select items than for single-choice items. Forty-one different scoring methods were retrospectively applied to 2 multiple-select multiple-choice item types (Pick-N and Multiple-True-False [MTF]) from existing examination data. 
Objective: This study aimed to calculate and compare the mean scores for both item types by applying different scoring methods, and to investigate the effect of item quality on mean raw scores and the likelihood of resulting scores at or above the pass level (≥0.6). Methods: Items and responses from examinees (ie, marking events) were retrieved from previous examinations. Different scoring methods were retrospectively applied to the existing examination data to calculate corresponding examination scores. In addition, item quality was assessed using a validated checklist. Statistical analysis was performed using the Kruskal-Wallis test, Wilcoxon rank-sum test, and multiple logistic regression analysis (P<.05). Results: We analyzed 1931 marking events of 48 Pick-N items and 828 marking events of 18 MTF items. For both item types, scoring results widely differed between scoring methods (minimum: 0.02, maximum: 0.98; P<.001). Both the use of an inappropriate item type (34 items) and the presence of cues (30 items) impacted the scoring results. Inappropriately used Pick-N items resulted in lower mean raw scores (0.88 vs 0.93; P<.001), while inappropriately used MTF items resulted in higher mean raw scores (0.88 vs 0.85; P=.001). Mean raw scores were higher for MTF items with cues than for those without cues (0.91 vs 0.8; P<.001), while mean raw scores for Pick-N items with and without cues did not differ (0.89 vs 0.90; P=.09). Item quality also impacted the likelihood of resulting scores at or above the pass level (odds ratio ≤6.977). Conclusions: Educators should pay attention when using multiple-select multiple-choice items and select the most appropriate item type. Different item types, different scoring methods, and presence of cues are likely to impact examinees’ scores and overall examination results. %M 36841970 %R 10.2196/43792 %U https://mededu.jmir.org/2023/1/e43792 %U https://doi.org/10.2196/43792 %U http://www.ncbi.nlm.nih.gov/pubmed/36841970 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e45312 %T How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment %A Gilson,Aidan %A Safranek,Conrad W %A Huang,Thomas %A Socrates,Vimig %A Chi,Ling %A Taylor,Richard Andrew %A Chartash,David %+ Section for Biomedical Informatics and Data Science, Yale University School of Medicine, 300 George Street, Suite 501, New Haven, CT, 06511, United States, 1 203 737 5379, david.chartash@yale.edu %K natural language processing %K NLP %K MedQA %K generative pre-trained transformer %K GPT %K medical education %K chatbot %K artificial intelligence %K education technology %K ChatGPT %K conversational agent %K machine learning %K USMLE %D 2023 %7 8.2.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input. Objective: This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 exams, as well as to analyze responses for user interpretability. Methods: We used 2 sets of multiple-choice questions to evaluate ChatGPT’s performance, each with questions pertaining to Step 1 and Step 2. 
The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and the performance on an exam relative to the user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT’s performance was compared to 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question. Results: Of the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. We found that logical justification for ChatGPT’s answer selection was present in 100% of outputs of the NBME data sets. Internal information to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively. Conclusions: ChatGPT marks a significant improvement in natural language processing models on the tasks of medical question answering. By performing at a greater than 60% threshold on the NBME-Free-Step-1 data set, we show that the model achieves the equivalent of a passing score for a third-year medical student. Additionally, we highlight ChatGPT’s capacity to provide logic and informational context across the majority of answers. These facts taken together make a compelling case for the potential applications of ChatGPT as an interactive medical education tool to support learning. %M 36753318 %R 10.2196/45312 %U https://mededu.jmir.org/2023/1/e45312 %U https://doi.org/10.2196/45312 %U http://www.ncbi.nlm.nih.gov/pubmed/36753318 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 8 %N 3 %P e32840 %T Patterns of Skills Acquisition in Anesthesiologists During Simulated Interscalene Block Training on a Soft Embalmed Thiel Cadaver: Cohort Study %A McLeod,Graeme %A McKendrick,Mel %A Tafili,Tedis %A Obregon,Mateo %A Neary,Ruth %A Mustafa,Ayman %A Raju,Pavan %A Kean,Donna %A McKendrick,Gary %A McKendrick,Tuesday %+ Ninewells Hospital, James Arnott Drive, Dundee, DD1 9SY, United Kingdom, 44 1382 632175, g.a.mcleod@dundee.ac.uk %K regional anesthesia %K ultrasonography %K simulation %K learning curves %K eye tracking %D 2022 %7 11.8.2022 %9 Original Paper %J JMIR Med Educ %G English %X Background: The demand for regional anesthesia for major surgery has increased considerably, but only a small number of anesthesiologists can provide such care. Simulations may improve clinical performance. However, opportunities to rehearse procedures are limited, and the clinical educational outcomes prescribed by the Royal College of Anesthesiologists training curriculum 2021 are difficult to attain. Educational paradigms, such as mastery learning and dedicated practice, are increasingly being used to teach technical skills to enhance skills acquisition. 
Moreover, high-fidelity, resilient cadaver simulators are now available: the soft embalmed Thiel cadaver shows physical characteristics and functional alignment similar to those of patients. Tissue elasticity allows tissues to expand and relax, fluid to drain away, and hundreds of repeated injections to be tolerated without causing damage. Learning curves and their intra- and interindividual dynamics have not hitherto been measured on the Thiel cadaver simulator using the mastery learning and dedicated practice educational paradigm coupled with validated, quantitative metrics, such as checklists, eye tracking metrics, and self-rating scores. Objective: Our primary objective was to measure the learning slopes of the scanning and needling phases of an interscalene block conducted repeatedly on a soft embalmed Thiel cadaver over a 3-hour period of training. Methods: A total of 30 anesthesiologists, with a wide range of experience, conducted up to 60 ultrasound-guided interscalene blocks over 3 hours on the left side of 2 soft embalmed Thiel cadavers. The duration of the scanning and needling phases was defined as the time taken to perform all the steps correctly. The primary outcome was the best-fit linear slope of the log-log transformed time to complete each phase. Our secondary objectives were to measure preprocedural psychometrics, describe deviations from the learning slope, correlate scanning and needling phase data, characterize skills according to clinical grade, measure learning curves using objective eye gaze tracking and subjective self-rating measures, and use cluster analysis to categorize performance irrespective of grade. Results: The median (IQR; range) log-log learning slopes were −0.47 (−0.62 to −0.32; −0.96 to 0.30) and −0.23 (−0.34 to −0.19; −0.71 to 0.27) during the scanning and needling phases, respectively. Locally Weighted Scatterplot Smoother curves showed wide variability in within-participant performance. The learning slopes of the scanning and needling phases correlated: ρ=0.55 (0.23-0.76), P<.001, and ρ=−0.72 (−0.46 to −0.87), P<.001, respectively. Eye gaze fixation count and glance count during the scanning and needling phases best reflected block duration. Using clustering techniques, fixation count and glance were used to identify 4 distinct patterns of learning behavior. Conclusions: We quantified learning slopes by log-log transformation of the time taken to complete the scanning and needling phases of interscalene blocks and identified intraindividual and interindividual patterns of variability. %M 35543314 %R 10.2196/32840 %U https://mededu.jmir.org/2022/3/e32840 %U https://doi.org/10.2196/32840 %U http://www.ncbi.nlm.nih.gov/pubmed/35543314 %0 Journal Article %@ 1929-0748 %I JMIR Publications %V 11 %N 5 %P e34990 %T A Scalable Service to Improve Health Care Quality Through Precision Audit and Feedback: Proposal for a Randomized Controlled Trial %A Landis-Lewis,Zach %A Flynn,Allen %A Janda,Allison %A Shah,Nirav %+ Department of Learning Health Sciences, University of Michigan Medical School, 1161J North Ingalls Building, 300 N. Ingalls St, Ann Arbor, MI, 48109-5403, United States, 1 7346151313, zachll@umich.edu %K learning health system %K audit and feedback %K anesthesiology %K knowledge-based system %K human-centered design %D 2022 %7 10.5.2022 %9 Proposal %J JMIR Res Protoc %G English %X Background: Health care delivery organizations lack evidence-based strategies for using quality measurement data to improve performance. 
Audit and feedback (A&F), the delivery of clinical performance summaries to providers, demonstrates the potential for large effects on clinical practice but is currently implemented as a blunt one size fits most intervention. Each provider in a care setting typically receives a performance summary of identical metrics in a common format despite the growing recognition that precisionizing interventions hold significant promise in improving their impact. A precision approach to A&F prioritizes the display of information in a single metric that, for each recipient, carries the highest value for performance improvement, such as when the metric’s level drops below a peer benchmark or minimum standard for the first time, thereby revealing an actionable performance gap. Furthermore, precision A&F uses an optimal message format (including framing and visual displays) based on what is known about the recipient and the intended gist meaning being communicated to improve message interpretation while reducing the cognitive processing burden. Well-established psychological principles, frameworks, and theories form a feedback intervention knowledge base to achieve precision A&F. From an informatics perspective, precision A&F requires a knowledge-based system that enables mass customization by representing knowledge configurable at the group and individual levels. Objective: This study aims to implement and evaluate a demonstration system for precision A&F in anesthesia care and to assess the effect of precision feedback emails on care quality and outcomes in a national quality improvement consortium. Methods: We propose to achieve our aims by conducting 3 studies: a requirements analysis and preferences elicitation study using human-centered design and conjoint analysis methods, a software service development and implementation study, and a cluster randomized controlled trial of a precision A&F service with a concurrent process evaluation. This study will be conducted with the Multicenter Perioperative Outcomes Group, a national anesthesia quality improvement consortium with >60 member hospitals in >20 US states. This study will extend the Multicenter Perioperative Outcomes Group quality improvement infrastructure by using existing data and performance measurement processes. Results: The proposal was funded in September 2021 with a 4-year timeline. Data collection for Aim 1 began in March 2022. We plan for a 24-month trial timeline, with the intervention period of the trial beginning in March 2024. Conclusions: The proposed aims will collectively demonstrate a precision feedback service developed using an open-source technical infrastructure for computable knowledge management. By implementing and evaluating a demonstration system for precision feedback, we create the potential to observe the conditions under which feedback interventions are effective. 
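The precision audit-and-feedback proposal above describes prioritizing, for each recipient, the metric that currently carries the most value for improvement, for example when a provider's level drops below a peer benchmark for the first time. The sketch below encodes that single rule in isolation; the data structures, benchmark value, and function name are hypothetical and do not represent the Multicenter Perioperative Outcomes Group infrastructure.

```python
# Illustrative sketch of one "actionable performance gap" rule: flag the first
# time a monthly metric falls below a peer benchmark after having met it.
from typing import Optional

def first_drop_below_benchmark(levels: list[float], benchmark: float) -> Optional[int]:
    """Return the index of the first period where the metric drops below the
    benchmark after being at or above it, or None if no such drop occurs."""
    for i in range(1, len(levels)):
        if levels[i - 1] >= benchmark and levels[i] < benchmark:
            return i
    return None

monthly_levels = [0.93, 0.91, 0.90, 0.86, 0.88]  # hypothetical monthly performance
peer_benchmark = 0.89                            # hypothetical peer median

gap_at = first_drop_below_benchmark(monthly_levels, peer_benchmark)
if gap_at is not None:
    print(f"Actionable gap in period {gap_at}: "
          f"{monthly_levels[gap_at]:.2f} is below the benchmark of {peer_benchmark:.2f}")
```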
International Registered Report Identifier (IRRID): PRR1-10.2196/34990 %M 35536637 %R 10.2196/34990 %U https://www.researchprotocols.org/2022/5/e34990 %U https://doi.org/10.2196/34990 %U http://www.ncbi.nlm.nih.gov/pubmed/35536637 %0 Journal Article %@ 2292-9495 %I JMIR Publications %V 9 %N 1 %P e35199 %T The Effects of Introducing a Mobile App–Based Procedural Logbook on Trainee Compliance to a Central Venous Catheter Insertion Accreditation Program: Before-and-After Study %A Tamblyn,Robert %A Brieva,Jorge %A Cain,Madeleine %A Martinez,F Eduardo %+ Division of Critical Care, John Hunter Hospital, Hunter New England Local Health District, Lookout Rd, New Lambton Heights, Newcastle, 2305, Australia, 61 432032151, robert.d.tamblyn@gmail.com %K logbook %K education %K training %K central venous catheter %K CVC %K intensive care %K smartphone %K mobile phone %K mobile apps %K mHealth %K mobile health %K accreditation program %K digital health %K digital record %D 2022 %7 7.3.2022 %9 Original Paper %J JMIR Hum Factors %G English %X Background: To reduce complications associated with central venous catheter (CVC) insertions, local accreditation programs using a supervised procedural logbook are essential. To increase compliance with such a logbook, a mobile app could provide the ideal platform for training doctors in an adult intensive care unit (ICU). Objective: The aim of this paper was to compare trainee compliance with the completion of a logbook as part of a CVC insertion accreditation program, before and after the introduction of an app-based logbook. Methods: This is a retrospective observational study of logbook data, before and after the introduction of a purpose-built, app-based, electronic logbook to complement an existing paper-based logbook. Carried out over a 2-year period in the adult ICU of the John Hunter Hospital, Newcastle, NSW, Australia, the participants were ICU trainee medical officers completing a CVC insertion accreditation program. The primary outcome was the proportion of all CVC insertions documented in the patients’ electronic medical records appearing as logbook entries. To assess logbook entry quality, we measured and compared the proportion of logbook entries that were approved by a supervisor and contained a supervisor’s signature for the before and after periods. We also analyzed trainee participation before and after the intervention by comparing the total number of active logbook users, and the proportion of first-time users who logged 3 or more CVC insertions. Results: Of the 2987 CVC insertions documented in the electronic medical records between April 7, 2019, and April 6, 2021, 2161 (72%) were included and separated into cohorts before and after the app’s introduction. Following the introduction of the app-based logbook, the percentage of CVC insertions appearing as logbook entries increased from 3.6% (38/1059) to 20.5% (226/1102; P<.001). There was no difference in the proportion of supervisor-approved entries containing a supervisor’s signature before and after the introduction of the app, with 76.3% (29/38) and 83.2% (188/226), respectively (P=.31). After the introduction of the app, there was an increase in the percentage of active logbook users from 15.3% (13/85) to 62.8% (54/86; P<.001). Adherence to one’s logbook was similar in both groups with 60% (6/10) of first-time users in the before group and 79.5% (31/39) in the after group going on to log at least 3 or more CVCs during their time working in ICU. 
Conclusions: The addition of an electronic app-based logbook to a preexisting paper-based logbook was associated with a higher rate of logbook compliance in trainee doctors undertaking an accreditation program for CVC insertion in an adult ICU. There was a large increase in logbook use observed without a reduction in the quality of logbook entries. The overall trainee participation also improved with an observed increase in active logbook users and no reduction in the average number of entries per user following the introduction of the app. Further studies on app-based logbooks for ICU procedural accreditation programs are warranted. %M 35051900 %R 10.2196/35199 %U https://humanfactors.jmir.org/2022/1/e35199 %U https://doi.org/10.2196/35199 %U http://www.ncbi.nlm.nih.gov/pubmed/35051900 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 5 %N 8 %P e23834 %T Efficiency, Usability, and Outcomes of Proctored Next-Level Exams for Proficiency Testing in Primary Care Education: Observational Study %A Schoenmakers,Birgitte %A Wens,Johan %+ Department of Public Health and Primary Care, KU Leuven, Kapucijnenvoer 33/7001, Leuven, 3000, Belgium, 32 495235639, birgitte.schoenmakers@kuleuven.be %K primary care %K education %K graduate %K medical education %K testing %K assessment %K app %K COVID-19 %K efficiency %K accuracy %D 2021 %7 16.8.2021 %9 Original Paper %J JMIR Form Res %G English %X Background: The COVID-19 pandemic has affected education and assessment programs and has resulted in complex planning. Therefore, we organized the proficiency test for admission to the Family Medicine program as a proctored exam. To prevent fraud, we developed a web-based supervisor app for tracking and tracing candidates’ behaviors. Objective: We aimed to assess the efficiency and usability of the proctored exam procedure and to analyze the procedure’s impact on exam scores. Methods: The application operated on the following three levels to register events: the recording of actions, analyses of behavior, and live supervision. Each suspicious event was given a score. To assess efficiency, we logged the technical issues and the interventions. To test usability, we counted the number of suspicious students and behaviors. To analyze the impact that the supervisor app had on students’ exam outcomes, we compared the scores of the proctored group and those of the on-campus group. Candidates were free to register for off-campus participation or on-campus participation. Results: Of the 593 candidates who subscribed to the exam, 472 (79.6%) used the supervisor app and 121 (20.4%) were on campus. The test results of both groups were comparable. We registered 15 technical issues that occurred off campus. Further, 2 candidates experienced a negative impact on their exams due to technical issues. The application detected 22 candidates with a suspicion rating of >1. Suspicion ratings mainly increased due to background noise. All events occurred without fraudulent intent. Conclusions: This pilot observational study demonstrated that a supervisor app that records and registers behavior was able to detect suspicious events without having an impact on exams. Background noise was the most critical event. There was no fraud detected. A supervisor app that registers and records behavior to prevent fraud during exams was efficient and did not affect exam outcomes. 
In future research, a controlled study design should be used to compare the cost-benefit balance between the complex interventions of the supervisor app and candidates’ awareness of being monitored via a safe browser plug-in for exams. %M 34398786 %R 10.2196/23834 %U https://formative.jmir.org/2021/8/e23834 %U https://doi.org/10.2196/23834 %U http://www.ncbi.nlm.nih.gov/pubmed/34398786 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 7 %N 2 %P e28733 %T Audiovisual Content for a Radiology Fellowship Selection Process During the COVID-19 Pandemic: Pilot Web-Based Questionnaire Study %A Godoy,Ivan Rodrigues Barros %A Neto,Luís Pecci %A Skaf,Abdalla %A Leão-Filho,Hilton Muniz %A Freddi,Tomás De Andrade Lourenço %A Jasinowodolinski,Dany %A Yamada,André Fukunishi %+ Department of Radiology, Hospital do Coração and Teleimagem, Rua Desembargador Eliseu Guilherme 147, São Paulo, 04004-030, Brazil, 55 11996171704, ivanrbgodoy@gmail.com %K audiovisual reports %K COVID-19 %K fellowship %K radiology %K smartphones %K video recording %K web technology %D 2021 %7 20.5.2021 %9 Original Paper %J JMIR Med Educ %G English %X Background: Traditional radiology fellowships are usually 1- or 2-year clinical training programs in a specific area after completion of a 4-year residency program. Objective: This study aimed to investigate the experience of fellowship applicants in answering radiology questions in an audiovisual format using their own smartphones after answering radiology questions in a traditional printed text format as part of the application process during the COVID-19 pandemic. We hypothesized that fellowship applicants would find that recorded audiovisual radiology content adds value to the conventional selection process, increases engagement through use of their own smartphone device, and facilitates the understanding of imaging findings in radiology-based questions, while maintaining social distancing. Methods: One senior staff radiologist from each subspecialty prepared 4 audiovisual radiology questions for that subspecialty. We conducted a survey using web-based questionnaires for 123 fellowship applications for musculoskeletal (n=39), internal medicine (n=61), and neuroradiology (n=23) programs to evaluate the experience of using audiovisual radiology content as a substitute for the conventional text evaluation. Results: Most of the applicants (n=122, 99%) answered positively (with responses of “agree” or “strongly agree”) that images in digital forms are of superior quality to those printed on paper. In total, 101 (82%) applicants agreed with the statement that the presentation of cases in audiovisual format facilitates the understanding of the findings. Furthermore, 81 (65%) candidates agreed or strongly agreed that answering digital forms is more practical than conventional paper forms. Conclusions: The use of audiovisual content as part of the selection process for radiology fellowships is a new approach with the potential to enhance the applicant’s experience during this process. This technology also allows for the evaluation of candidates without the need for in-person interaction. Further studies could streamline these methods to minimize work redundancy with traditional text assessments or even evaluate the acceptance of using only audiovisual content on smartphones. 
%M 33956639 %R 10.2196/28733 %U https://mededu.jmir.org/2021/2/e28733 %U https://doi.org/10.2196/28733 %U http://www.ncbi.nlm.nih.gov/pubmed/33956639 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 7 %N 2 %P e25903 %T The United States Medical Licensing Exam Step 2 Clinical Skills Examination: Potential Alternatives During and After the COVID-19 Pandemic %A Fatima,Rawish %A Assaly,Ahmad R %A Aziz,Muhammad %A Moussa,Mohamad %A Assaly,Ragheb %+ Department of Internal Medicine, University of Toledo Medical Center, 2100 W Central Avenue, Toledo, OH, 43606, United States, 1 5674201613, rawish.f@gmail.com %K USMLE %K United States Medical Licensing Examination %K The National Resident Matching Program %K NRMP %K Step 2 Clinical Skills %K Step 2 CS %K medical school %K medical education %K test %K medical student %K United States %K online learning %K exam %K alternative %K model %K COVID-19 %D 2021 %7 30.4.2021 %9 Viewpoint %J JMIR Med Educ %G English %X We feel that the current COVID-19 crisis has created great uncertainty and anxiety among medical students. With medical school classes initially being conducted on the web and the approaching season of “the Match” (a uniform system by which residency candidates and residency programs in the United States simultaneously “match” with the aid of a computer algorithm to fill first-year and second-year postgraduate training positions accredited by the Accreditation Council for Graduate Medical Education), the situation did not seem to be improving. The National Resident Matching Program made an official announcement on May 26, 2020, that candidates would not be required to take or pass the United States Medical Licensing Examination Step 2 Clinical Skills (CS) examination to participate in the Match. On January 26, 2021, formal discontinuation of Step 2 CS was announced; for this reason, we have provided our perspective of possible alternative solutions to the Step 2 CS examination. A successful alternative model can be implemented in future residency match seasons as well. %M 33878014 %R 10.2196/25903 %U https://mededu.jmir.org/2021/2/e25903 %U https://doi.org/10.2196/25903 %U http://www.ncbi.nlm.nih.gov/pubmed/33878014 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 23 %N 3 %P e21196 %T Assessment of Diagnostic Competences With Standardized Patients Versus Virtual Patients: Experimental Study in the Context of History Taking %A Fink,Maximilian C %A Reitmeier,Victoria %A Stadler,Matthias %A Siebeck,Matthias %A Fischer,Frank %A Fischer,Martin R %+ Institute for Medical Education, University Hospital, LMU Munich, Pettenkoferstraße 8a, Munich, 80336, Germany, 49 089 4400 57428, maximilian.fink@yahoo.com %K clinical reasoning %K medical education %K performance-based assessment %K simulation %K standardized patient %K virtual patient %D 2021 %7 4.3.2021 %9 Original Paper %J J Med Internet Res %G English %X Background: Standardized patients (SPs) have been one of the popular assessment methods in clinical teaching for decades, although they are resource intensive. Nowadays, simulated virtual patients (VPs) are increasingly used because they are permanently available and fully scalable to a large audience. However, empirical studies comparing the differential effects of these assessment methods are lacking. Similarly, the relationships between key variables associated with diagnostic competences (ie, diagnostic accuracy and evidence generation) in these assessment methods still require further research. 
Objective: The aim of this study is to compare perceived authenticity, cognitive load, and diagnostic competences in performance-based assessment using SPs and VPs. This study also aims to examine the relationships of perceived authenticity, cognitive load, and quality of evidence generation with diagnostic accuracy. Methods: We conducted an experimental study with 86 medical students (mean 26.03 years, SD 4.71) focusing on history taking in dyspnea cases. Participants solved three cases with SPs and three cases with VPs in this repeated measures study. After each case, students provided a diagnosis and rated perceived authenticity and cognitive load. The provided diagnosis was scored in terms of diagnostic accuracy; the questions asked by the medical students were rated with respect to their quality of evidence generation. In addition to regular null hypothesis testing, this study used equivalence testing to investigate the absence of meaningful effects. Results: Perceived authenticity (1-tailed t81=11.12; P<.001) was higher for SPs than for VPs. The correlation between diagnostic accuracy and perceived authenticity was very small (r=0.05) and neither equivalent (P=.09) nor statistically significant (P=.32). Cognitive load was equivalent in both assessment methods (t82=2.81; P=.003). Intrinsic cognitive load (1-tailed r=−0.30; P=.003) and extraneous load (1-tailed r=−0.29; P=.003) correlated negatively with the combined score for diagnostic accuracy. The quality of evidence generation was positively related to diagnostic accuracy for VPs (1-tailed r=0.38; P<.001); this finding did not hold for SPs (1-tailed r=0.05; P=.32). Comparing both assessment methods with each other, diagnostic accuracy was higher for SPs than for VPs (2-tailed t85=2.49; P=.01). Conclusions: The results on perceived authenticity demonstrate that learners experience SPs as more authentic than VPs. As higher amounts of intrinsic and extraneous cognitive loads are detrimental to performance, both types of cognitive load must be monitored and manipulated systematically in the assessment. Diagnostic accuracy was higher for SPs than for VPs, which could potentially negatively affect students’ grades with VPs. We identify and discuss possible reasons for this performance difference between both assessment methods. %M 33661122 %R 10.2196/21196 %U https://www.jmir.org/2021/3/e21196 %U https://doi.org/10.2196/21196 %U http://www.ncbi.nlm.nih.gov/pubmed/33661122 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 22 %N 12 %P e23254 %T Simulation Game Versus Multiple Choice Questionnaire to Assess the Clinical Competence of Medical Students: Prospective Sequential Trial %A Fonteneau,Tristan %A Billion,Elodie %A Abdoul,Cindy %A Le,Sebastien %A Hadchouel,Alice %A Drummond,David %+ Department of Paediatric Pulmonology and Allergology, University Hospital Necker-Enfants Malades, Assistance Publique - Hôpitaux de Paris, 149 rue de Sèvres, Paris, , France, 33 1 44 49 48 48, david.drummond@aphp.fr %K serious game %K simulation game %K assessment %K professional competence %K asthma %K pediatrics %D 2020 %7 16.12.2020 %9 Original Paper %J J Med Internet Res %G English %X Background: The use of simulation games (SG) to assess the clinical competence of medical students has been poorly studied. Objective: The objective of this study was to assess whether an SG better reflects the clinical competence of medical students than a multiple choice questionnaire (MCQ). 
Methods: Fifth-year medical students in Paris (France) were included and individually evaluated on a case of pediatric asthma exacerbation using three successive modalities: high-fidelity simulation (HFS), considered the gold standard for the evaluation of clinical competence, the SG Effic’Asthme, and an MCQ designed for the study. The primary endpoint was the median kappa coefficient evaluating the correlation of the actions performed by the students between the SG and HFS modalities and the MCQ and HFS modalities. Student satisfaction was also evaluated. Results: Forty-two students were included. The actions performed by the students were more reproducible between the SG and HFS modalities than between the MCQ and HFS modalities (P=.04). Students reported significantly higher satisfaction with the SG (P<.01) than with the MCQ modality. Conclusions: The SG Effic’Asthme better reflected the actions performed by medical students during an HFS session than an MCQ on the same asthma exacerbation case. Because SGs allow the assessment of more dimensions of clinical competence than MCQs, they are particularly appropriate for the assessment of medical students on situations involving symptom recognition, prioritization of decisions, and technical skills. Trial Registration: ClinicalTrials.gov NCT03884114; https://clinicaltrials.gov/ct2/show/NCT03884114 %M 33325833 %R 10.2196/23254 %U http://www.jmir.org/2020/12/e23254/ %U https://doi.org/10.2196/23254 %U http://www.ncbi.nlm.nih.gov/pubmed/33325833 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 22 %N 11 %P e23299 %T Role of Technology in Self-Assessment and Feedback Among Hospitalist Physicians: Semistructured Interviews and Thematic Analysis %A Yin,Andrew Lukas %A Gheissari,Pargol %A Lin,Inna Wanyin %A Sobolev,Michael %A Pollak,John P %A Cole,Curtis %A Estrin,Deborah %+ Medical College, Weill Cornell Medicine, 1300 York Avenue, New York, NY, , United States, 1 212 746 5454, aly2011@med.cornell.edu %K feedback %K self-assessment %K self-learning %K hospitalist %K electronic medical record %K digital health %K assessment %K learning %D 2020 %7 3.11.2020 %9 Original Paper %J J Med Internet Res %G English %X Background: Lifelong learning is embedded in the culture of medicine, but there are limited tools currently available for many clinicians, including hospitalists, to help improve their own practice. Although there are requirements for continuing medical education, resources for learning new clinical guidelines, and developing fields aimed at facilitating peer-to-peer feedback, there is a gap in the availability of tools that enable clinicians to learn based on their own patients and clinical decisions. Objective: The aim of this study was to explore the technologies or modifications to existing systems that could be used to benefit hospitalist physicians in pursuing self-assessment and improvement by understanding physicians’ current practices and their reactions to proposed possibilities. Methods: Semistructured interviews were conducted in two separate stages with analysis performed after each stage. In the first stage, interviews (N=12) were conducted to understand the ways in which hospitalist physicians are currently gathering feedback and assessing their practice. A thematic analysis of these interviews informed the prototype used to elicit responses in the second stage. Results: Clinicians actively look for feedback that they can apply to their practice, with the majority of the feedback obtained through self-assessment. 
The following three themes surrounding this aspect were identified in the first round of semistructured interviews: collaboration, self-reliance, and uncertainty, each with three related subthemes. In the second round of interviews, a wireframe was used to identify the features that are currently challenging to use or could be made available with technology. Conclusions: Based on each theme and subtheme, we provide targeted recommendations for use by relevant stakeholders such as institutions, clinicians, and technologists. Most hospitalist self-assessments occur on a rolling basis, specifically using data in electronic medical records as their primary source. Specific objective data points or subjective patient relationships lead clinicians to review their patient cases and to assess their own performance. However, current systems are not built for these analyses or for clinicians to perform self-assessment, making this a burdensome and incomplete process. Building a platform that focuses on providing and curating the information used for self-assessment could help physicians make better-informed changes to their own clinical practice and decision-making. %M 33141098 %R 10.2196/23299 %U http://www.jmir.org/2020/11/e23299/ %U https://doi.org/10.2196/23299 %U http://www.ncbi.nlm.nih.gov/pubmed/33141098 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 22 %N 8 %P e17719 %T Use of Eye-Tracking Technology by Medical Students Taking the Objective Structured Clinical Examination: Descriptive Study %A Grima-Murcia,M D %A Sanchez-Ferrer,Francisco %A Ramos-Rincón,Jose Manuel %A Fernández,Eduardo %+ Facultad de Medicina, University Miguel Hernández, Edificio de Departamentos, Pediatría, Avinguda de la Universitat d'Elx, Elche, 032, Spain, 34 965169538, pacosanchezferrer0@hotmail.com %K visual perception %K medical education %K eye tracking %K objective structured clinical examination %K medical evaluation %D 2020 %7 21.8.2020 %9 Original Paper %J J Med Internet Res %G English %X Background: The objective structured clinical examination (OSCE) is a test used throughout Spain to evaluate the clinical competencies, decision making, problem solving, and other skills of sixth-year medical students. Objective: The main goal of this study is to explore the possible applications and utility of portable eye-tracking systems in the setting of the OSCE, particularly questions associated with attention and engagement. Methods: We used a portable Tobii Glasses 2 eye tracker, which allows real-time monitoring of where the students are looking and records the voice and ambient sounds. We then performed a qualitative and a quantitative analysis of the fields of vision and gaze points attracting attention as well as the visual itinerary. Results: Eye-tracking technology was used in the OSCE with no major issues. This portable system was of the greatest value in the patient simulators and mannequin stations, where interaction with the simulated patient or areas of interest in the mannequin can be quantified. This technology proved useful to better identify the areas of interest in the medical images provided. Conclusions: Portable eye trackers offer the opportunity to improve the objective evaluation of candidates, the self-evaluation of the stations used, and medical simulations by examiners. 
We suggest that this technology has enough resolution to identify where a student is looking and could be useful for developing new approaches for evaluating specific aspects of clinical competencies. %M 32821060 %R 10.2196/17719 %U http://www.jmir.org/2020/8/e17719/ %U https://doi.org/10.2196/17719 %U http://www.ncbi.nlm.nih.gov/pubmed/32821060 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 6 %N 2 %P e20182 %T The United States Medical Licensing Examination Step 1 Is Changing—US Medical Curricula Should Too %A Liu,Benjamin %+ Medical College of Wisconsin, 8701 W Watertown Plank Rd, Wauwatosa, WI, 53226, United States, 1 414 397 1602, beliu@mcw.edu %K USMLE %K US medical students %K USMLE pass/fail %K new curricula %K medical education %K medical learning %K medical school %D 2020 %7 30.7.2020 %9 Viewpoint %J JMIR Med Educ %G English %X In recent years, US medical students have been increasingly absent from medical school classrooms. They skip class to maximize their competitiveness for a good residency program by achieving high scores on the United States Medical Licensing Examination (USMLE) Step 1. As a US medical student, I know that most of these class-skipping students are utilizing external learning resources, which are perceived to be more efficient than traditional lectures. Now that the USMLE Step 1 is adopting a pass/fail grading system, it may be tempting to expect students to return to traditional basic science lectures. Unfortunately, my experiences tell me this will not happen. Instead, US medical schools must adapt their curricula. These new curricula should focus on clinical decision making, team-based learning, and new medical decision technologies, while leveraging the validated ability of these external resources to teach the basic sciences. In doing so, faculty will not only increase student engagement but also modernize the curricula to meet new standards on effective medical learning. %M 32667900 %R 10.2196/20182 %U http://mededu.jmir.org/2020/2/e20182/ %U https://doi.org/10.2196/20182 %U http://www.ncbi.nlm.nih.gov/pubmed/32667900 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 6 %N 1 %P e15444 %T An Objective Structured Clinical Examination for Medical Student Radiology Clerkships: Reproducibility Study %A Staziaki,Pedro Vinícius %A Sarangi,Rutuparna %A Parikh,Ujas N %A Brooks,Jeffrey G %A LeBedis,Christina Alexandra %A Shaffer,Kitt %+ Department of Radiology, Boston Medical Center, Boston University School of Medicine, 820 Harrison Ave, FGH Building, 4th Floor, Boston, MA, United States, 1 6174145135, staziaki@gmail.com %K radiology %K education %K education methods %K medical education %K undergraduate %D 2020 %7 6.5.2020 %9 Original Paper %J JMIR Med Educ %G English %X Background: Objective structured clinical examinations (OSCEs) are a useful method to evaluate medical students’ performance in the clerkship years. OSCEs are designed to assess skills and knowledge in a standardized clinical setting and through use of a preset standard grading sheet, so that clinical knowledge can be evaluated at a high level and in a reproducible way. 
Objective: This study aimed to present our OSCE assessment tool designed specifically for radiology clerkship medical students, which we called the objective structured radiology examination (OSRE), with the intent of advancing the assessment of clerkship medical students by providing an objective, structured, reproducible, and low-cost method to evaluate medical students’ radiology knowledge, and to assess the reproducibility of this assessment tool. Methods: We designed 9 different OSRE cases for radiology clerkship classes with participating third- and fourth-year medical students. Each examination comprises 1 to 3 images, a clinical scenario, and structured questions, along with a standardized scoring sheet that allows for an objective and low-cost assessment. Each medical student completed 3 of 9 random examination cases during their rotation. To evaluate the reproducibility of our scoring sheet assessment tool, we used 5 examiners to grade the same students. Reproducibility for each case and consistency for each grader were assessed with a two-way mixed effects intraclass correlation coefficient (ICC). An ICC below 0.4 was deemed poor to fair, an ICC of 0.41 to 0.60 was moderate, an ICC of 0.6 to 0.8 was substantial, and an ICC greater than 0.8 was almost perfect. We also assessed the correlation of scores and the students’ clinical experience with a linear regression model and compared mean grades between third- and fourth-year students. Results: A total of 181 students (156 third- and 25 fourth-year students) were included in the study for a full academic year. Moreover, 6 of 9 cases demonstrated average ICCs greater than 0.6 (substantial correlation), and the average ICCs ranged from 0.36 to 0.80 (P<.001 for all the cases). The average ICC for each grader was greater than 0.60 (substantial correlation). The average grade among the third-year students was 11.9 (SD 4.9), compared with 12.8 (SD 5) among the fourth-year students (P=.005). There was no correlation between clinical experience and OSRE grade (−0.02; P=.48), adjusting for the medical school year. Conclusions: Our OSRE is a reproducible assessment tool, with most of the OSRE cases (6 of 9) showing substantial correlation. No expertise in radiology is needed to grade these examinations using our scoring sheet. There was no correlation between scores and the clinical experience of the medical students tested. %M 32374267 %R 10.2196/15444 %U http://mededu.jmir.org/2020/1/e15444/ %U https://doi.org/10.2196/15444 %U http://www.ncbi.nlm.nih.gov/pubmed/32374267 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 5 %N 1 %P e10400 %T Video-Based Communication Assessment: Development of an Innovative System for Assessing Clinician-Patient Communication %A Mazor,Kathleen M %A King,Ann M %A Hoppe,Ruth B %A Kochersberger,Annie O %A Yan,Jie %A Reim,Jesse D %+ Meyers Primary Care Institute, University of Massachusetts Medical School, Reliant Medical Group and Fallon Health Plan, 385 Grove St, Meyers Primary Care Institute, Worcester, MA, 01605, United States, 1 508 791 7392, Kathleen.mazor@umassmed.edu %K communication %K crowdsourcing %K health care %K mobile phone %K patient-centered care %K video-based communication assessment %D 2019 %7 14.02.2019 %9 Viewpoint %J JMIR Med Educ %G English %X Good clinician-patient communication is essential to provide quality health care and is key to patient-centered care. However, individuals and organizations seeking to improve in this area face significant challenges. 
A major barrier is the absence of an efficient system for assessing clinicians’ communication skills and providing meaningful, individual-level feedback. The purpose of this paper is to describe the design and creation of the Video-Based Communication Assessment (VCA), an innovative, flexible system for assessing and ultimately enhancing clinicians’ communication skills. We began by developing the VCA concept. Specifically, we determined that it should be convenient and efficient, accessible via computer, tablet, or smartphone; be case based, using video patient vignettes to which users respond as if speaking to the patient in the vignette; be flexible, allowing content to be tailored to the purpose of the assessment; allow incorporation of the patient’s voice by crowdsourcing ratings from analog patients; provide robust feedback including ratings, links to highly rated responses as examples, and learning points; and ultimately, have strong psychometric properties. We collected feedback on the concept and then proceeded to create the system. We identified several important research questions, which will be answered in subsequent studies. The VCA is a flexible, innovative system for assessing clinician-patient communication. It enables efficient sampling of clinicians’ communication skills, supports crowdsourced ratings of these spoken samples using analog patients, and offers multifaceted feedback reports. %M 30710460 %R 10.2196/10400 %U http://mededu.jmir.org/2019/1/e10400/ %U https://doi.org/10.2196/10400 %U http://www.ncbi.nlm.nih.gov/pubmed/30710460 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 4 %N 2 %P e17 %T Development of a Web-Based Formative Self-Assessment Tool for Physicians to Practice Breaking Bad News (BRADNET) %A Rat,Anne-Christine %A Ricci,Laetitia %A Guillemin,Francis %A Ricatte,Camille %A Pongy,Manon %A Vieux,Rachel %A Spitz,Elisabeth %A Muller,Laurent %+ EA 4360 APEMAC, Université de Lorraine, Rue du Morvan, Nancy,, France, 33 383153203, ac.rat@chru-nancy.fr %K bad news disclosure %K health communication %K physician-patient relationship %K distance e-learning %D 2018 %7 19.07.2018 %9 Original Paper %J JMIR Med Educ %G English %X Background: Although most physicians in medical settings have to deliver bad news, the skills of delivering bad news to patients have been given insufficient attention. Delivering bad news is a complex communication task that includes verbal and nonverbal skills, the ability to recognize and respond to patients’ emotions, and the importance of considering the patient’s environment, such as culture and social status. How bad news is delivered can have consequences that may affect patients, sometimes over the long term. Objective: This project aimed to develop a Web-based formative self-assessment tool for physicians to practice delivering bad news, with the aim of minimizing the deleterious effects of poorly delivered bad news about a disease, whatever the disease. Methods: BReaking bAD NEws Tool (BRADNET) items were developed by reviewing existing protocols and recommendations for delivering bad news. We also examined instruments for assessing patient-physician communication and conducted semistructured interviews with patients and physicians. From this step, we selected specific themes and then pooled these themes before consensus was achieved on a good practices communication framework list. Items were then created from this list. 
To ensure that physicians found BRADNET acceptable, understandable, and relevant to their patients’ condition, the tool was refined by a working group of clinicians familiar with delivering bad news. The think-aloud approach was used to explore the impact of the items and messages and to understand why and how these messages could change physicians’ relations with patients or the way they deliver bad news. Finally, formative self-assessment sessions were constructed according to a twofold progression: a chronological progression through the disclosure of the bad news and the growing difficulty of items (difficulty concerning the expected level of self-reflection). Results: The good practices communication framework list comprised 70 specific issues related to breaking bad news pooled into 8 main domains: opening, preparing for the delivery of bad news, communication techniques, consultation content, attention, physician emotional management, shared decision making, and the relationship between the physician and the medical team. After the items were constructed from this list, they were extensively refined to make them more useful to the target audience, and one item was added. BRADNET contains 71 items, each including a question, response options, and a corresponding message, which were divided into 8 domains and assessed with 12 self-assessment sessions. The BRADNET Web-based platform was developed according to the cognitive load theory and the cognitive theory of multimedia learning. Conclusions: The objective of this Web-based assessment tool was to create a “space” for reflection. It contained items leading to self-reflection and messages that introduced recommended communication behaviors. Our approach was innovative as it provided an inexpensive distance-learning self-assessment tool that was manageable and less time-consuming for physicians with often overwhelming schedules. %M 30026180 %R 10.2196/mededu.9551 %U http://mededu.jmir.org/2018/2/e17/ %U https://doi.org/10.2196/mededu.9551 %U http://www.ncbi.nlm.nih.gov/pubmed/30026180 %0 Journal Article %@ 2291-9279 %I JMIR Publications %V 5 %N 2 %P e11 %T Medical Student Evaluation With a Serious Game Compared to Multiple Choice Questions Assessment %A Adjedj,Julien %A Ducrocq,Gregory %A Bouleti,Claire %A Reinhart,Louise %A Fabbro,Eleonora %A Elbez,Yedid %A Fischer,Quentin %A Tesniere,Antoine %A Feldman,Laurent %A Varenne,Olivier %+ AP-HP, Hôpital Cochin, Cardiology, 27 rue du Faubourg Saint Jacques, Paris,, France, 33 158412750, olivier.varenne@aphp.fr %K serious game %K multiple choice questions %K medical student %K student evaluation %D 2017 %7 16.05.2017 %9 Original Paper %J JMIR Serious Games %G English %X Background: The gold standard for evaluating medical students’ knowledge is multiple choice question (MCQ) tests: an objective and effective means of assessing the recall of book-based knowledge. However, concerns have been raised regarding their effectiveness in evaluating global medical skills. Furthermore, MCQs of unequal difficulty can generate frustration and may also lead to a sizable proportion of close results with low score variability. Serious games (SGs) have recently been introduced to better evaluate students’ medical skills. Objectives: The study aimed to compare MCQs with an SG for medical student evaluation. Methods: We designed a cross-over randomized study including volunteer medical students from two medical schools in Paris (France) from January to September 2016. 
The students were randomized into two groups and evaluated on a cardiology clinical case either by the SG first and then the MCQs, or vice versa. The primary endpoint was score variability evaluated by variance comparison. Secondary endpoints were differences in and correlation between the MCQ and SG results, and student satisfaction. Results: A total of 68 medical students were included. The score variability was significantly higher in the SG group (σ2=265.4) than in the MCQ group (σ2=140.2; P=.009). The mean score was significantly lower for the SG than the MCQs at 66.1 (SD 16.3) and 75.7 (SD 11.8) points out of 100, respectively (P<.001). No correlation was found between the two test results (R2=0.04, P=.58). The self-reported satisfaction was significantly higher for the SG (P<.001). Conclusions: Our study suggests that SGs are more effective than MCQs in terms of score variability. In addition, they are associated with a higher student satisfaction rate. SGs could represent a new evaluation modality for medical students. %M 28512082 %R 10.2196/games.7033 %U http://games.jmir.org/2017/2/e11/ %U https://doi.org/10.2196/games.7033 %U http://www.ncbi.nlm.nih.gov/pubmed/28512082 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 3 %N 1 %P e4 %T When Educational Material Is Delivered: A Mixed Methods Content Validation Study of the Information Assessment Method %A Badran,Hani %A Pluye,Pierre %A Grad,Roland %+ Information Technology Primary Care Research Group, Department of Family Medicine, McGill University, Department of Family Medicine, 3rd floor, 5858, chemin de la Côte-des-Neige, Montreal, QC, H3S 1Z1, Canada, 1 514 398 8483, pierre.pluye@mcgill.ca %K validity and reliability %K continuing education %K Internet %K electronic mail %K physicians, family %K knowledge translation %K primary health care %D 2017 %7 14.03.2017 %9 Original Paper %J JMIR Med Educ %G English %X Background: The Information Assessment Method (IAM) allows clinicians to report the cognitive impact, clinical relevance, intention to use, and expected patient health benefits associated with clinical information received by email. More than 15,000 Canadian physicians and pharmacists use the IAM in continuing education programs. In addition, information providers can use IAM ratings and feedback comments from clinicians to improve their products. Objective: Our general objective was to validate the IAM questionnaire for the delivery of educational material (ecological and logical content validity). Our specific objectives were to measure the relevance and evaluate the representativeness of IAM items for assessing information received by email. Methods: A 3-part mixed methods study was conducted (convergent design). In part 1 (quantitative longitudinal study), the relevance of IAM items was measured. Participants were 5596 physician members of the Canadian Medical Association who used the IAM. A total of 234,196 ratings were collected in 2012. The relevance of IAM items with respect to their main construct was calculated using descriptive statistics (relevance ratio R). In part 2 (qualitative descriptive study), the representativeness of IAM items was evaluated. A total of 15 family physicians completed semistructured face-to-face interviews. For each construct, we evaluated the representativeness of IAM items using a deductive-inductive thematic qualitative data analysis. 
In part 3 (mixing quantitative and qualitative parts), results from quantitative and qualitative analyses were reviewed, juxtaposed in a table, discussed with experts, and integrated. Thus, our final results are derived from the views of users (ecological content validation) and experts (logical content validation). Results: Of the 23 IAM items, 21 were validated for content, while 2 were removed. In part 1 (quantitative results), 21 items were deemed relevant, while 2 items were deemed not relevant (R=4.86% [N=234,196] and R=3.04% [n=45,394], respectively). In part 2 (qualitative results), 22 items were deemed representative, while 1 item was not representative. In part 3 (mixing quantitative and qualitative results), the content validity of 21 items was confirmed, and the 2 nonrelevant items were excluded. A fully validated version was generated (IAM-v2014). Conclusions: This study produced a content-validated IAM questionnaire that is used by clinicians and information providers to assess the clinical information delivered in continuing education programs. %M 28292738 %R 10.2196/mededu.6415 %U http://mededu.jmir.org/2017/1/e4/ %U https://doi.org/10.2196/mededu.6415 %U http://www.ncbi.nlm.nih.gov/pubmed/28292738 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 18 %N 8 %P e213 %T Web-Based Virtual Microscopy of Digitized Blood Slides for Malaria Diagnosis: An Effective Tool for Skills Assessment in Different Countries and Environments %A Ahmed,Laura %A Seal,Leonard H %A Ainley,Carol %A De la Salle,Barbara %A Brereton,Michelle %A Hyde,Keith %A Burthem,John %A Gilmore,William Samuel %+ Manchester Metropolitan University, School of Healthcare Science, Faculty of Science and Engineering, John Dalton Building, Chester Street, Manchester, M1 5GD, United Kingdom, 44 1612471485, l.ahmed@mmu.ac.uk %K Malaria %K Virtual microscopy %K External quality assessment %K Internet %D 2016 %7 11.08.2016 %9 Original Paper %J J Med Internet Res %G English %X Background: Morphological examination of blood films remains the reference standard for malaria diagnosis. Supporting the skills required to make an accurate morphological diagnosis is therefore essential. However, providing support across different countries and environments is a substantial challenge. Objective: This paper reports a scheme supplying digital slides of malaria-infected blood within an Internet-based virtual microscope environment to users with different access to training and computing facilities. The feasibility of the approach was established, allowing users to test, record, and compare their own performance with that of other users. Methods: From Giemsa-stained thick and thin blood films, 56 large high-resolution digital slides were prepared using high-quality image capture and a 63x oil-immersion objective lens. The individual images were combined using the Photomerge function of Adobe Photoshop and then adjusted to ensure resolution and reproduction of essential diagnostic features. Web delivery employed the Digital Slidebox platform, which allowed digital microscope viewing, image annotation, and data gathering from participants. Results: Engagement was high, with images viewed by 38 participants in five countries across a range of environments and a mean completion rate of 42/56 cases. The rate of parasite detection was 78%, and the accuracy of species identification was 53%, which was comparable with the results of similar studies using glass slides. 
Data collection allowed users to compare performance with other users over time or for each individual case. Conclusions: Overall, these results demonstrate that users worldwide can effectively engage with the system in a range of environments, with the potential to enhance personal performance through education, external quality assessment, and personal professional development, especially in regions where educational resources are difficult to access. %M 27515009 %R 10.2196/jmir.6027 %U http://www.jmir.org/2016/8/e213/ %U https://doi.org/10.2196/jmir.6027 %U http://www.ncbi.nlm.nih.gov/pubmed/27515009 %0 Journal Article %@ 1438-8871 %I JMIR Publications Inc. %V 17 %N 9 %P e221 %T Designing and Testing an Inventory for Measuring Social Media Competency of Certified Health Education Specialists %A Alber,Julia M %A Bernhardt,Jay M %A Stellefson,Michael %A Weiler,Robert M %A Anderson-Lewis,Charkarra %A Miller,M David %A MacInnes,Jann %+ Center for Health Behavior Research, Perelman School of Medicine, University of Pennsylvania, 110 Blockley Hall, 423 Guardian Drive, Philadelphia, PA, 19104, United States, 1 215 573 9894, alberj@upenn.edu %K social media %K health education %K professional competence %D 2015 %7 23.09.2015 %9 Original Paper %J J Med Internet Res %G English %X Background: Social media can promote healthy behaviors by facilitating engagement and collaboration among health professionals and the public. Thus, social media is quickly becoming a vital tool for health promotion. While guidelines and trainings exist for public health professionals, there are currently no standardized measures to assess individual social media competency among Certified Health Education Specialists (CHES) and Master Certified Health Education Specialists (MCHES). Objective: The aim of this study was to design, develop, and test the Social Media Competency Inventory (SMCI) for CHES and MCHES. Methods: The SMCI was designed in three sequential phases: (1) Conceptualization and Domain Specifications, (2) Item Development, and (3) Inventory Testing and Finalization. Phase 1 consisted of a literature review, concept operationalization, and expert reviews. Phase 2 involved an expert panel (n=4) review, think-aloud sessions with a small representative sample of CHES/MCHES (n=10), a pilot test (n=36), and classical test theory analyses to develop the initial version of the SMCI. Phase 3 included a field test of the SMCI with a random sample of CHES and MCHES (n=353), factor and Rasch analyses, and development of SMCI administration and interpretation guidelines. Results: Six constructs adapted from the unified theory of acceptance and use of technology and the integrated behavioral model were identified for assessing social media competency: (1) Social Media Self-Efficacy, (2) Social Media Experience, (3) Effort Expectancy, (4) Performance Expectancy, (5) Facilitating Conditions, and (6) Social Influence. The initial item pool included 148 items. After the pilot test, 16 items were removed or revised because of low item discrimination (r<.30), high interitem correlations (r>.90), or feedback received from pilot participants. During the psychometric analysis of the field test data, 52 items were removed due to low discrimination, evidence of content redundancy, low R-squared values, or poor item infit or outfit. 
Psychometric analyses of the data revealed acceptable reliability evidence for the following scales: Social Media Self-Efficacy (alpha=.98, item reliability=.98, item separation=6.76), Social Media Experience (alpha=.98, item reliability=.98, item separation=6.24), Effort Expectancy (alpha=.74, item reliability=.95, item separation=4.15), Performance Expectancy (alpha=.81, item reliability=.99, item separation=10.09), Facilitating Conditions (alpha=.66, item reliability=.99, item separation=16.04), and Social Influence (alpha=.66, item reliability=.93, item separation=3.77). There was some evidence of local dependence among the scales, with several observed residual correlations above |.20|. Conclusions: Through the multistage instrument-development process, sufficient reliability and validity evidence was collected in support of the purpose and intended use of the SMCI. The SMCI can be used to assess the readiness of health education specialists to effectively use social media for health promotion research and practice. Future research should explore associations across constructs within the SMCI and evaluate the ability of SMCI scores to predict social media use and performance among CHES and MCHES. %M 26399428 %R 10.2196/jmir.4943 %U http://www.jmir.org/2015/9/e221/ %U https://doi.org/10.2196/jmir.4943 %U http://www.ncbi.nlm.nih.gov/pubmed/26399428
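The item-screening criteria reported in the final record above (removing items with item-total discrimination below .30 or interitem correlations above .90, then reporting a Cronbach alpha per scale) can be illustrated with a minimal sketch. The Python code below is not the SMCI study's analysis code; it uses synthetic responses, an assumed 1-5 rating scale, and illustrative helper functions purely to show how such screening statistics are commonly computed.

# Minimal illustrative sketch (not the SMCI study's code): screening a synthetic
# item pool with Cronbach's alpha, corrected item-total discrimination (< .30),
# and interitem correlations (> .90), mirroring the thresholds quoted above.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of numeric ratings."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def corrected_item_total(items: np.ndarray) -> np.ndarray:
    """Correlation of each item with the sum of the remaining items."""
    return np.array([
        np.corrcoef(items[:, j], np.delete(items, j, axis=1).sum(axis=1))[0, 1]
        for j in range(items.shape[1])
    ])

rng = np.random.default_rng(0)
latent = rng.normal(size=(36, 1))                  # pilot-sized sample (n=36 assumed)
responses = np.clip(np.rint(3 + latent + rng.normal(scale=1.0, size=(36, 8))), 1, 5)

alpha = cronbach_alpha(responses)
disc = corrected_item_total(responses)
corr = np.corrcoef(responses, rowvar=False)        # interitem correlation matrix
low_disc = np.where(disc < 0.30)[0]                # weakly discriminating items
redundant = np.argwhere(np.triu(corr, k=1) > 0.90) # near-duplicate item pairs
print(f"alpha={alpha:.2f}", "low discrimination:", low_disc.tolist(),
      "redundant pairs:", redundant.tolist())

With a real instrument, flags of this kind would feed the expert review and the Rasch-based follow-up analyses that the abstract describes, rather than being applied mechanically.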