TY - JOUR AU - Bolgova, Olena AU - Shypilova, Inna AU - Mavrych, Volodymyr PY - 2025/4/10 TI - Large Language Models in Biochemistry Education: Comparative Evaluation of Performance JO - JMIR Med Educ SP - e67244 VL - 11 KW - ChatGPT KW - Claude KW - Gemini KW - Copilot KW - biochemistry KW - LLM KW - medical education KW - artificial intelligence KW - NLP KW - natural language processing KW - machine learning KW - large language model KW - AI KW - ML KW - comprehensive analysis KW - medical students KW - GPT-4 KW - questionnaire KW - medical course KW - bioenergetics N2 - Background: Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs), have started a new era of innovation across various fields, with medicine at the forefront of this technological revolution. Many studies indicated that at the current level of development, LLMs can pass different board exams. However, the ability to answer specific subject-related questions requires validation. Objective: The objective of this study was to conduct a comprehensive analysis comparing the performance of advanced LLM chatbots, namely Claude (Anthropic), GPT-4 (OpenAI), Gemini (Google), and Copilot (Microsoft), against the academic results of medical students in the medical biochemistry course. Methods: We used 200 USMLE (United States Medical Licensing Examination)-style multiple-choice questions (MCQs) selected from the course exam database. They encompassed various complexity levels and were distributed across 23 distinctive topics. The questions with tables and images were not included in the study. The results of 5 successive attempts by Claude 3.5 Sonnet, GPT-4-1106, Gemini 1.5 Flash, and Copilot to answer this questionnaire set were evaluated based on accuracy in August 2024. Statistica 13.5.0.17 (TIBCO Software Inc) was used to analyze the data's basic statistics. Considering the binary nature of the data, the chi-square test was used to compare results among the different chatbots, with a statistical significance level of P<.05. Results: On average, the selected chatbots correctly answered 81.1% (SD 12.8%) of the questions, surpassing the students' performance by 8.3% (P=.02). In this study, Claude showed the best performance in biochemistry MCQs, correctly answering 92.5% (185/200) of questions, followed by GPT-4 (170/200, 85%), Gemini (157/200, 78.5%), and Copilot (128/200, 64%). The chatbots demonstrated the best results in the following 4 topics: eicosanoids (mean 100%, SD 0%), bioenergetics and electron transport chain (mean 96.4%, SD 7.2%), hexose monophosphate pathway (mean 91.7%, SD 16.7%), and ketone bodies (mean 93.8%, SD 12.5%). The Pearson chi-square test indicated a statistically significant association between the answers of all 4 chatbots (P<.001 to P<.04). Conclusions: Our study suggests that different AI models may have unique strengths in specific medical fields, which could be leveraged for targeted support in biochemistry courses. This performance highlights the potential of AI in medical education and assessment. UR - https://mededu.jmir.org/2025/1/e67244 UR - http://dx.doi.org/10.2196/67244 ID - info:doi/10.2196/67244 ER - TY - JOUR AU - Kıyak, Selim Yavuz AU - Kononowicz, A.
Andrzej PY - 2025/4/4 TI - Using a Hybrid of AI and Template-Based Method in Automatic Item Generation to Create Multiple-Choice Questions in Medical Education: Hybrid AIG JO - JMIR Form Res SP - e65726 VL - 9 KW - automatic item generation KW - ChatGPT KW - artificial intelligence KW - large language models KW - medical education KW - AI KW - hybrid KW - template-based method KW - hybrid AIG KW - mixed-method KW - multiple-choice question KW - multiple-choice KW - human-AI collaboration KW - human-AI KW - algorithm KW - expert N2 - Background: Template-based automatic item generation (AIG) is more efficient than traditional item writing, but it still relies heavily on expert effort in model development. While nontemplate-based AIG, leveraging artificial intelligence (AI), offers efficiency, it faces accuracy challenges. Medical education, a field that relies heavily on both formative and summative assessments with multiple-choice questions, is in dire need of AI-based support for the efficient automatic generation of items. Objective: We aimed to propose a hybrid AIG method to demonstrate whether it is possible to generate item templates using AI in the field of medical education. Methods: This is a mixed-methods methodological study with proof-of-concept elements. We propose the hybrid AIG method as a structured series of interactions between a human subject matter expert and AI, designed as a collaborative authoring effort. The method leverages AI to generate item models (templates) and cognitive models to combine the advantages of the two AIG approaches. To demonstrate how to create item models using hybrid AIG, we used 2 medical multiple-choice questions: one on respiratory infections in adults and another on acute allergic reactions in the pediatric population. Results: The hybrid AIG method we propose consists of 7 steps. The first 5 steps are performed by an expert in a customized AI environment. These involve providing a parent item, identifying elements for manipulation, selecting options and assigning values to elements, and generating the cognitive model. After a final expert review (Step 6), the content in the template can be used for item generation through traditional (non-AI) software (Step 7). We showed that AI is capable of generating item templates for AIG under the control of a human expert in only 10 minutes. Leveraging AI in template development made it less challenging. Conclusions: The hybrid AIG method transcends the traditional template-based approach by marrying the "art" that comes from AI as a "black box" with the "science" of algorithmic generation under the oversight of an expert as a "marriage registrar". It not only capitalizes on the strengths of both approaches but also mitigates their weaknesses, offering a human-AI collaboration that increases efficiency in medical education. UR - https://formative.jmir.org/2025/1/e65726 UR - http://dx.doi.org/10.2196/65726 ID - info:doi/10.2196/65726 ER - TY - JOUR AU - Ba, Hongjun AU - Zhang, Lili AU - He, Xiufang AU - Li, Shujuan PY - 2025/3/26 TI - Knowledge Mapping and Global Trends in Simulation in Medical Education: Bibliometric and Visual Analysis JO - JMIR Med Educ SP - e71844 VL - 11 KW - medical education KW - simulation-based teaching KW - bibliometrics KW - visualization analysis KW - knowledge mapping N2 - Background: With the increasing recognition of the importance of simulation-based teaching in medical education, research in this field has developed rapidly.
To comprehensively understand the research dynamics and trends in this area, we conducted an analysis of knowledge mapping and global trends. Objective: This study aims to reveal the research hotspots and development trends in the field of simulation-based teaching in medical education from 2004 to 2024 through bibliometric and visualization analyses. Methods: Using CiteSpace and VOSviewer, we conducted bibliometric and visualization analyses of 6743 articles related to simulation-based teaching in medical education, published in core journals from 2004 to 2024. The analysis included publication trends, contributions by countries and institutions, author contributions, keyword co-occurrence and clustering, and keyword bursts. Results: From 2004 to 2008, the number of articles published annually did not exceed 100. However, starting from 2009, the number increased year by year, reaching a peak of 850 articles in 2024, indicating rapid development in this research field. The United States, Canada, the United Kingdom, Australia, and China published the most articles. Harvard University emerged as a research hub with 1799 collaborative links, although the overall collaboration density was low. Among the 6743 core journal articles, a total of 858 authors were involved, with Lars Konge and Adam Dubrowski being the most prolific. However, collaboration density was low, and the collaboration network was relatively dispersed. A total of 812 common keywords were identified, forming 4189 links. The keywords "medical education," "education," and "simulation" had the highest frequency of occurrence. Cluster analysis indicated that "cardiopulmonary resuscitation" and "surgical education" were major research hotspots. From 2004 to 2024, a total of 20 burst keywords were identified, among which "patient simulation," "randomized controlled trial," "clinical competence," and "deliberate practice" had high burst strength. In recent years, "application of simulation in medical education," "3D printing," "augmented reality," and "simulation training" have become research frontiers. Conclusions: Research on the application of simulation-based teaching in medical education has become a hotspot, with expanding research areas and emerging hotspots. Future research should strengthen interinstitutional collaboration and focus on the application of emerging technologies in simulation-based teaching. UR - https://mededu.jmir.org/2025/1/e71844 UR - http://dx.doi.org/10.2196/71844 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/71844 ER - TY - JOUR AU - Madrid, Julian AU - Diehl, Philipp AU - Selig, Mischa AU - Rolauffs, Bernd AU - Hans, Patricius Felix AU - Busch, Hans-Jörg AU - Scheef, Tobias AU - Benning, Leo PY - 2025/3/21 TI - Performance of Plug-In Augmented ChatGPT and Its Ability to Quantify Uncertainty: Simulation Study on the German Medical Board Examination JO - JMIR Med Educ SP - e58375 VL - 11 KW - medical education KW - artificial intelligence KW - generative AI KW - large language model KW - LLM KW - ChatGPT KW - GPT-4 KW - board licensing examination KW - professional education KW - examination KW - student KW - experimental KW - bootstrapping KW - confidence interval N2 - Background: GPT-4 is a large language model (LLM) trained and fine-tuned on an extensive dataset. After the public release of its predecessor in November 2022, interest in the use of LLMs has spiked significantly, and a multitude of potential use cases have been proposed.
In parallel, however, important limitations have been outlined. In particular, current LLMs encounter limitations in symbolic representation and in accessing contemporary data. The recent version of GPT-4, alongside newly released plugin features, has been introduced to mitigate some of these limitations. Objective: Against this background, this work aims to investigate the performance of GPT-3.5, GPT-4, GPT-4 with plugins, and GPT-4 with plugins using pretranslated English text on the German medical board examination. Recognizing the critical importance of quantifying uncertainty for LLM applications in medicine, we furthermore assess this ability and develop a new metric termed "confidence accuracy" to evaluate it. Methods: We used GPT-3.5, GPT-4, GPT-4 with plugins, and GPT-4 with plugins and translation to answer questions from the German medical board examination. Additionally, we conducted an analysis to assess how the models justify their answers, the accuracy of their responses, and the error structure of their answers. Bootstrapping and CIs were used to evaluate the statistical significance of our findings. Results: This study demonstrated that available GPT models, as LLM examples, exceeded the minimum competency threshold established by the German medical board for medical students to obtain board certification to practice medicine. Moreover, the models could assess the uncertainty in their responses, albeit exhibiting overconfidence. Additionally, this work unraveled certain justification and reasoning structures that emerge when GPT generates answers. Conclusions: The high performance of GPT models in answering medical questions positions them well for applications in academia and, potentially, clinical practice. Their capability to quantify uncertainty in answers suggests they could serve as valuable artificial intelligence agents within the clinical decision-making loop. Nevertheless, significant challenges must be addressed before artificial intelligence agents can be robustly and safely implemented in the medical domain. UR - https://mededu.jmir.org/2025/1/e58375 UR - http://dx.doi.org/10.2196/58375 ID - info:doi/10.2196/58375 ER - TY - JOUR AU - Alreshaid, Lulwah AU - Alkattan, Rana PY - 2025/3/18 TI - Feedback From Dental Students Using Two Alternate Coaching Methods: Qualitative Focus Group Study JO - JMIR Med Educ SP - e68309 VL - 11 KW - student feedback KW - coaching KW - dental education KW - student evaluation KW - teaching methods KW - educational intervention N2 - Background: Student feedback is crucial for evaluating the effectiveness of institutions. However, implementing feedback can be challenging due to practical difficulties. While student feedback on courses can improve teaching, there is debate about its effectiveness when it is not written well enough to provide helpful information to the receiver. Objective: This study aimed to evaluate the impact of coaching on the quality of feedback given by dental students in Saudi Arabia. Methods: A total of 47 first-year dental students from a public dental school in Riyadh, Saudi Arabia, completed 3 surveys throughout the academic year. The surveys assessed their feedback on a Dental Anatomy and Operative Dentistry course, including their feedback on the lectures, practical sessions, examinations, and overall experience. The surveys focused on assessing student feedback on the knowledge, understanding, and practical skills achieved during the course, as aligned with the defined course learning outcomes.
The surveys were distributed without coaching, after handout coaching, and after workshop coaching on how to provide feedback, designated as survey #1, survey #2, and survey #3, respectively. The same group of students received all 3 surveys consecutively (repeated measures design). The responses were then rated as neutral, positive, negative, or constructive by 2 raters. The feedback was analyzed using the McNemar test to compare the effectiveness of the different coaching approaches. Results: While no significant changes were found between the first 2 surveys, a significant increase in constructive feedback was observed in survey #3 after workshop coaching compared with both other surveys (P<.001). The results also showed a higher proportion of desired changes in feedback, defined as any change from positive, negative, or neutral to constructive, after survey #3 (P<.001). Overall, 20.2% of responses showed desired changes at survey #2 and 41.5% at survey #3, compared with survey #1. Conclusions: This study suggests that workshops on feedback coaching can effectively improve the quality of feedback provided by dental students. Incorporating feedback coaching into dental school curricula could help students communicate their concerns more effectively, ultimately enhancing the learning experience. UR - https://mededu.jmir.org/2025/1/e68309 UR - http://dx.doi.org/10.2196/68309 ID - info:doi/10.2196/68309 ER - TY - JOUR AU - Pastrak, Mila AU - Kajitani, Sten AU - Goodings, James Anthony AU - Drewek, Austin AU - LaFree, Andrew AU - Murphy, Adrian PY - 2025/3/12 TI - Evaluation of ChatGPT Performance on Emergency Medicine Board Examination Questions: Observational Study JO - JMIR AI SP - e67696 VL - 4 KW - artificial intelligence KW - ChatGPT-4 KW - medical education KW - emergency medicine KW - examination KW - examination preparation N2 - Background: The ever-evolving field of medicine has highlighted the potential for ChatGPT as an assistive platform. However, its use in medical board examination preparation and completion remains unclear. Objective: This study aimed to evaluate the performance of a custom-modified version of ChatGPT-4, tailored with emergency medicine board examination preparatory materials (Anki flashcard deck), compared to its default version and previous iteration (3.5). The goal was to assess the accuracy of ChatGPT-4 in answering board-style questions and its suitability as a tool to aid students and trainees in standardized examination preparation. Methods: A comparative analysis was conducted using a random selection of 598 questions from the Rosh In-Training Examination Question Bank. The subjects of the study included three versions of ChatGPT: a Default version, a Custom version, and ChatGPT-3.5. The accuracy, response length, medical discipline subgroups, and underlying causes of error were analyzed. Results: The Custom version did not demonstrate a significant improvement in accuracy over the Default version (P=.61), although both significantly outperformed ChatGPT-3.5 (P<.001). The Default version produced significantly longer responses than the Custom version, with the mean (SD) values being 1371 (444) and 929 (408), respectively (P<.001). Subgroup analysis revealed no significant difference in the performance across different medical subdisciplines between the versions (P>.05 in all cases). Both versions of ChatGPT-4 had similar underlying error types (P>.05 in all cases) and had a 99% predicted probability of passing, while ChatGPT-3.5 had an 85% probability.
Conclusions: The findings suggest that while newer versions of ChatGPT exhibit improved performance in emergency medicine board examination preparation, specific enhancement with a comprehensive Anki flashcard deck on the topic does not significantly impact accuracy. The study highlights the potential of ChatGPT-4 as a tool for medical education, capable of providing accurate support across a wide range of topics in emergency medicine in its default form. UR - https://ai.jmir.org/2025/1/e67696 UR - http://dx.doi.org/10.2196/67696 ID - info:doi/10.2196/67696 ER - TY - JOUR AU - Chudoung, Ubon AU - Saengon, Wilaipon AU - Peonim, Vichan AU - Worasuwannarak, Wisarn PY - 2025/2/10 TI - Comparison of Learning Outcomes Among Medical Students in Thailand to Determine the Right Time to Teach Forensic Medicine: Retrospective Study JO - JMIR Med Educ SP - e57634 VL - 11 KW - multiple-choice question KW - MCQ KW - forensic medicine KW - preclinic KW - clinic KW - medical student N2 - Background: Forensic medicine requires background medical knowledge and the ability to apply it to legal cases. Medical students have different levels of medical knowledge and are therefore likely to perform differently when learning forensic medicine. However, different medical curricula in Thailand deliver forensic medicine courses at different stages of medical study; most curricula deliver these courses in the clinical years, while others offer them in the preclinical years. This raises questions about the differences in learning effectiveness. Objective: We aimed to compare the learning outcomes of medical students in curricula that either teach forensic medicine at the clinical level or teach it at the preclinical level. Methods: This was a 5-year retrospective study that compared multiple-choice question (MCQ) scores in a forensic medicine course for fifth- and third-year medical students. The fifth-year students' program was different from that of the third-year students, but both programs were offered by Mahidol University. The students were taught forensic medicine by the same instructors, used similar content, and were evaluated via examinations of similar difficulty. Of the 1063 medical students included in this study, 782 were fifth-year clinical students, and 281 were third-year preclinical students. Results: The average scores of the fifth- and third-year medical students were 76.09% (SD 6.75%) and 62.94% (SD 8.33%), respectively. The difference was statistically significant (Kruskal-Wallis test: P<.001). Additionally, the average score of fifth-year medical students was significantly higher than that of third-year students in every academic year (all P values were <.001). Conclusions: Teaching forensic medicine during the preclinical years may be too early, and preclinical students may not understand the clinical content sufficiently. Attention should be paid to ensuring that students have an adequate clinical background before teaching subjects that require clinical applications, especially in forensic medicine.
UR - https://mededu.jmir.org/2025/1/e57634 UR - http://dx.doi.org/10.2196/57634 ID - info:doi/10.2196/57634 ER - TY - JOUR AU - Brown, Joan AU - De-Oliveira, Sophia AU - Mitchell, Christopher AU - Cesar, Carmen Rachel AU - Ding, Li AU - Fix, Melissa AU - Stemen, Daniel AU - Yacharn, Krisda AU - Wong, Fum Se AU - Dhillon, Anahat PY - 2025/1/24 TI - Barriers to and Facilitators of Implementing Team-Based Extracorporeal Membrane Oxygenation Simulation Study: Exploratory Analysis JO - JMIR Med Educ SP - e57424 VL - 11 KW - intensive care unit KW - ICU KW - teamwork in the ICU KW - team dynamics KW - collaboration KW - interprofessional collaboration KW - simulation KW - simulation training KW - ECMO KW - extracorporeal membrane oxygenation KW - life support KW - cardiorespiratory dysfunction KW - cardiorespiratory KW - cardiology KW - respiratory KW - heart KW - lungs N2 - Introduction: Extracorporeal membrane oxygenation (ECMO) is a critical tool in the care of severe cardiorespiratory dysfunction. Simulation training for ECMO has become standard practice. Therefore, Keck Medicine of the University of Southern California (USC) holds simulation-training sessions to reinforce and improve providers' knowledge. Objective: This study aimed to understand the impact of simulation training approaches on interprofessional collaboration. We believed simulation-based ECMO training would improve interprofessional collaboration through increased communication and enhanced teamwork. Methods: This was a single-center, mixed methods study of the Cardiac and Vascular Institute Intensive Care Unit at Keck Medicine of USC conducted from September 2021 to April 2023. Simulation training was offered for 1 hour monthly to the clinical team and focused on the collaboration and decision-making needed to evaluate the initiation of ECMO therapy. Electronic surveys were distributed before, after, and 3 months post training. The survey evaluated teamwork and the effectiveness of training, and focus groups were held to understand social environment factors. Additionally, trainee and peer evaluation focus groups were held to understand socioenvironmental factors. Results: In total, 37 trainees attended the training simulation from August 2021 to August 2022. Using 27 records for exploratory factor analysis, the standardized Cronbach α was 0.717. The survey results descriptively demonstrated a positive shift in teamwork ability. Qualitative themes identified improved confidence and decision-making. Conclusions: The study design was flawed, indicating improvement opportunities for future research on simulation training in the clinical setting. The paper outlines what to avoid when designing and implementing studies that assess an educational intervention in a complex clinical setting. The hypothesis deserves further exploration and is supported by the results of this study.
UR - https://mededu.jmir.org/2025/1/e57424 UR - http://dx.doi.org/10.2196/57424 ID - info:doi/10.2196/57424 ER - TY - JOUR AU - Wang, Ying-Mei AU - Shen, Hung-Wei AU - Chen, Tzeng-Ji AU - Chiang, Shu-Chiung AU - Lin, Ting-Guan PY - 2025/1/17 TI - Performance of ChatGPT-3.5 and ChatGPT-4 in the Taiwan National Pharmacist Licensing Examination: Comparative Evaluation Study JO - JMIR Med Educ SP - e56850 VL - 11 KW - artificial intelligence KW - ChatGPT KW - chat generative pre-trained transformer KW - GPT-4 KW - medical education KW - educational measurement KW - pharmacy licensure KW - Taiwan KW - Taiwan national pharmacist licensing examination KW - learning model KW - AI KW - Chatbot KW - pharmacist KW - evaluation and comparison study KW - pharmacy KW - statistical analyses KW - medical databases KW - medical decision-making KW - generative AI KW - machine learning N2 - Background: OpenAI released versions ChatGPT-3.5 and GPT-4 between 2022 and 2023. GPT-3.5 has demonstrated proficiency in various examinations, particularly the United States Medical Licensing Examination. However, GPT-4 has more advanced capabilities. Objective: This study aims to examine the efficacy of GPT-3.5 and GPT-4 within the Taiwan National Pharmacist Licensing Examination and to ascertain their utility and potential application in clinical pharmacy and education. Methods: The pharmacist examination in Taiwan consists of 2 stages: basic subjects and clinical subjects. In this study, exam questions were manually fed into the GPT-3.5 and GPT-4 models, and their responses were recorded; graphic-based questions were excluded. This study encompassed three steps: (1) determining the answering accuracy of GPT-3.5 and GPT-4, (2) categorizing question types and observing differences in model performance across these categories, and (3) comparing model performance on calculation and situational questions. Microsoft Excel and R software were used for statistical analyses. Results: GPT-4 achieved an accuracy rate of 72.9%, overshadowing GPT-3.5, which achieved 59.1% (P<.001). In the basic subjects category, GPT-4 significantly outperformed GPT-3.5 (73.4% vs 53.2%; P<.001). However, in clinical subjects, only minor differences in accuracy were observed. Specifically, GPT-4 outperformed GPT-3.5 in the calculation and situational questions. Conclusions: This study demonstrates that GPT-4 outperforms GPT-3.5 in the Taiwan National Pharmacist Licensing Examination, particularly in basic subjects. While GPT-4 shows potential for use in clinical practice and pharmacy education, its limitations warrant caution. Future research should focus on refining prompts, improving model stability, integrating medical databases, and designing questions that better assess student competence and minimize guessing. UR - https://mededu.jmir.org/2025/1/e56850 UR - http://dx.doi.org/10.2196/56850 ID - info:doi/10.2196/56850 ER - TY - JOUR AU - Wei, Boxiong PY - 2025/1/16 TI - Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis JO - JMIR Med Educ SP - e64284 VL - 11 KW - large language models KW - LLM KW - artificial intelligence KW - AI KW - GPT-4 KW - radiology exams KW - medical education KW - diagnostics KW - medical training KW - radiology KW - ultrasound N2 - Background: Artificial intelligence advancements have enabled large language models to significantly impact radiology education and diagnostic accuracy. 
Objective: This study evaluates the performance of mainstream large language models, including GPT-4, Claude, Bard, Tongyi Qianwen, and Gemini Pro, in radiology board exams. Methods: A comparative analysis of 150 multiple-choice questions from radiology board exams without images was conducted. Models were assessed on their accuracy for text-based questions, which were categorized by cognitive levels and medical specialties, using χ2 tests and ANOVA. Results: GPT-4 achieved the highest accuracy (83.3%, 125/150), significantly outperforming all other models. Specifically, Claude achieved an accuracy of 62% (93/150; P<.001), Bard 54.7% (82/150; P<.001), Tongyi Qianwen 70.7% (106/150; P=.009), and Gemini Pro 55.3% (83/150; P<.001). The odds ratios compared to GPT-4 were 0.33 (95% CI 0.18-0.60) for Claude, 0.24 (95% CI 0.13-0.44) for Bard, and 0.25 (95% CI 0.14-0.45) for Gemini Pro. Tongyi Qianwen performed relatively well with an accuracy of 70.7% (106/150; P=.02) and had an odds ratio of 0.48 (95% CI 0.27-0.87) compared to GPT-4. Performance varied across question types and specialties, with GPT-4 excelling in both lower-order and higher-order questions, while Claude and Bard struggled with complex diagnostic questions. Conclusions: GPT-4 and Tongyi Qianwen show promise in medical education and training. The study emphasizes the need for domain-specific training datasets to enhance large language models' effectiveness in specialized fields like radiology. UR - https://mededu.jmir.org/2025/1/e64284 UR - http://dx.doi.org/10.2196/64284 ID - info:doi/10.2196/64284 ER - TY - JOUR AU - Kaewboonlert, Naritsaret AU - Poontananggul, Jiraphon AU - Pongsuwan, Natthipong AU - Bhakdisongkhram, Gun PY - 2025/1/13 TI - Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study JO - JMIR Med Educ SP - e58898 VL - 11 KW - accuracy KW - performance KW - artificial intelligence KW - AI KW - ChatGPT KW - large language model KW - LLM KW - difficulty index KW - basic medical science examination KW - cross-sectional study KW - medical education KW - datasets KW - assessment KW - medical science KW - tool KW - Google N2 - Background: Artificial intelligence (AI) has become widely applied across many fields, including medical education. The validity of its content and answers depends on the training datasets and the optimization of each model. The accuracy of large language models (LLMs) in basic medical examinations and factors related to their accuracy have also been explored. Objective: We evaluated factors associated with the accuracy of LLMs (GPT-3.5, GPT-4, Google Bard, and Microsoft Bing) in answering multiple-choice questions from basic medical science examinations. Methods: We used questions that were closely aligned with the content and topic distribution of Thailand's Step 1 National Medical Licensing Examination. Variables such as the difficulty index, discrimination index, and question characteristics were collected. These questions were then simultaneously input into ChatGPT (with GPT-3.5 and GPT-4), Microsoft Bing, and Google Bard, and their responses were recorded. The accuracy of these LLMs and the associated factors were analyzed using multivariable logistic regression. This analysis aimed to assess the effect of various factors on model accuracy, with results reported as odds ratios (ORs).
Results: The study revealed that GPT-4 was the top-performing model, with an overall accuracy of 89.07% (95% CI 84.76%-92.41%), significantly outperforming the others (P<.001). Microsoft Bing followed with an accuracy of 83.69% (95% CI 78.85%-87.80%), GPT-3.5 at 67.02% (95% CI 61.20%-72.48%), and Google Bard at 63.83% (95% CI 57.92%-69.44%). The multivariable logistic regression analysis showed a correlation between question difficulty and model performance, with GPT-4 demonstrating the strongest association. Interestingly, no significant correlation was found between model accuracy and question length, negative wording, clinical scenarios, or the discrimination index for most models, except for Google Bard, which showed varying correlations. Conclusions: The GPT-4 and Microsoft Bing models demonstrated comparable accuracy to each other and superior accuracy compared to GPT-3.5 and Google Bard in the domain of basic medical science. The accuracy of these models was significantly influenced by the item's difficulty index, indicating that the LLMs are more accurate when answering easier questions. This suggests that the more accurate models, such as GPT-4 and Bing, can be valuable tools for understanding and learning basic medical science concepts. UR - https://mededu.jmir.org/2025/1/e58898 UR - http://dx.doi.org/10.2196/58898 ID - info:doi/10.2196/58898 ER - TY - JOUR AU - Zhu, Shiben AU - Hu, Wanqin AU - Yang, Zhi AU - Yan, Jiani AU - Zhang, Fang PY - 2025/1/10 TI - Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study JO - JMIR Med Inform SP - e63731 VL - 13 KW - large language models KW - LLMs KW - Chinese National Nursing Licensing Examination KW - ChatGPT KW - Qwen-2.5 KW - multiple-choice questions N2 - Background: Large language models (LLMs) have been proposed as valuable tools in medical education and practice. The Chinese National Nursing Licensing Examination (CNNLE) presents unique challenges for LLMs due to its requirement for both deep domain-specific nursing knowledge and the ability to make complex clinical decisions, which differentiates it from more general medical examinations. However, their potential application in the CNNLE remains unexplored. Objective: This study aims to evaluate the accuracy of 7 LLMs, including GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5, on the CNNLE, focusing on their ability to handle domain-specific nursing knowledge and clinical decision-making. We also explore whether combining their outputs using machine learning techniques can improve their overall accuracy. Methods: This retrospective cross-sectional study analyzed all 1200 multiple-choice questions from the CNNLE conducted between 2019 and 2023. Seven LLMs were evaluated on these multiple-choice questions, and 9 machine learning models, including Logistic Regression, Support Vector Machine, Multilayer Perceptron, k-nearest neighbors, Random Forest, LightGBM, AdaBoost, XGBoost, and CatBoost, were used to optimize overall performance through ensemble techniques. Results: Qwen-2.5 achieved the highest overall accuracy of 88.9%, followed by GPT-4o (80.7%), ERNIE Bot-3.5 (78.1%), GPT-4.0 (70.3%), SPARK (65.0%), and GPT-3.5 (49.5%). Qwen-2.5 demonstrated superior accuracy in the Practical Skills section compared with the Professional Practice section across most years. It also performed well in brief clinical case summaries and questions involving shared clinical scenarios.
When the outputs of the 7 LLMs were combined using 9 machine learning models, XGBoost yielded the best performance, increasing accuracy to 90.8%. XGBoost also achieved an area under the curve of 0.961, sensitivity of 0.905, specificity of 0.978, F1-score of 0.901, positive predictive value of 0.901, and negative predictive value of 0.977. Conclusions: This study is the first to evaluate the performance of 7 LLMs on the CNNLE and to show that integrating their outputs via machine learning significantly boosts accuracy, reaching 90.8%. These findings demonstrate the transformative potential of LLMs in revolutionizing health care education and call for further research to refine their capabilities and expand their impact on examination preparation and professional training. UR - https://medinform.jmir.org/2025/1/e63731 UR - http://dx.doi.org/10.2196/63731 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/63731 ER - TY - JOUR AU - Zhang, Yong AU - Lu, Xiao AU - Luo, Yan AU - Zhu, Ying AU - Ling, Wenwu PY - 2025/1/9 TI - Performance of Artificial Intelligence Chatbots on Ultrasound Examinations: Cross-Sectional Comparative Analysis JO - JMIR Med Inform SP - e63924 VL - 13 KW - chatbots KW - ChatGPT KW - ERNIE Bot KW - performance KW - accuracy rates KW - ultrasound KW - language KW - examination N2 - Background: Artificial intelligence chatbots are being increasingly used for medical inquiries, particularly in the field of ultrasound medicine. However, their performance varies and is influenced by factors such as language, question type, and topic. Objective: This study aimed to evaluate the performance of ChatGPT and ERNIE Bot in answering ultrasound-related medical examination questions, providing insights for users and developers. Methods: We curated 554 questions from ultrasound medicine examinations, covering various question types and topics. The questions were posed in both English and Chinese. Objective questions were scored based on accuracy rates, whereas subjective questions were rated by 5 experienced doctors using a Likert scale. The data were analyzed in Excel. Results: Of the 554 questions included in this study, single-choice questions comprised the largest share (354/554, 64%), followed by short answers (69/554, 12%) and noun explanations (63/554, 11%). The accuracy rates for objective questions ranged from 8.33% to 80%, with true or false questions scoring highest. Subjective questions received acceptability rates ranging from 47.62% to 75.36%. ERNIE Bot was superior to ChatGPT in many aspects (P<.05). Both models showed a performance decline in English, but ERNIE Bot's decline was less significant. The models performed better in terms of basic knowledge, ultrasound methods, and diseases than in terms of ultrasound signs and diagnosis. Conclusions: Chatbots can provide valuable ultrasound-related answers, but performance differs by model and is influenced by language, question type, and topic. In general, ERNIE Bot outperforms ChatGPT. Users and developers should understand model performance characteristics and select appropriate models for different questions and languages to optimize chatbot use.
UR - https://medinform.jmir.org/2025/1/e63924 UR - http://dx.doi.org/10.2196/63924 ID - info:doi/10.2196/63924 ER - TY - JOUR AU - Miyazaki, Yuki AU - Hata, Masahiro AU - Omori, Hisaki AU - Hirashima, Atsuya AU - Nakagawa, Yuta AU - Eto, Mitsuhiro AU - Takahashi, Shun AU - Ikeda, Manabu PY - 2024/12/24 TI - Performance of ChatGPT-4o on the Japanese Medical Licensing Examination: Evaluation of Accuracy in Text-Only and Image-Based Questions JO - JMIR Med Educ SP - e63129 VL - 10 KW - medical education KW - artificial intelligence KW - clinical decision-making KW - GPT-4o KW - medical licensing examination KW - Japan KW - images KW - accuracy KW - AI technology KW - application KW - decision-making KW - image-based KW - reliability KW - ChatGPT UR - https://mededu.jmir.org/2024/1/e63129 UR - http://dx.doi.org/10.2196/63129 ID - info:doi/10.2196/63129 ER - TY - JOUR AU - Mehyar, Nimer AU - Awawdeh, Mohammed AU - Omair, Aamir AU - Aldawsari, Adi AU - Alshudukhi, Abdullah AU - Alzeer, Ahmed AU - Almutairi, Khaled AU - Alsultan, Sultan PY - 2024/12/16 TI - Long-Term Knowledge Retention of Biochemistry Among Medical Students in Riyadh, Saudi Arabia: Cross-Sectional Survey JO - JMIR Med Educ SP - e56132 VL - 10 KW - biochemistry KW - knowledge KW - retention KW - medical students KW - retention interval KW - Saudi Arabia N2 - Background: Biochemistry is a cornerstone of medical education. Its knowledge is integral to the understanding of complex biological processes and how they are applied in several areas in health care. Its significance is also reflected in the way it informs the practice of medicine, guiding and supporting both diagnosis and treatment. However, the retention of biochemistry knowledge over time remains a challenge. Long-term retention of such crucial information is extremely important, as it forms the foundation upon which clinical skills are developed and refined. The effectiveness of biochemistry education, and consequently its long-term retention, is influenced by several factors. Educational methods play a critical role; interactional and integrative teaching approaches have been suggested to enhance retention compared with traditional didactic methods. The frequency and context in which biochemistry knowledge is applied in clinical settings can significantly impact its retention. Practical application reinforces theoretical understanding, making the knowledge more accessible in the long term. Familiarity with (prior knowledge of) information suggests that it is stored in long-term memory, which makes it easier to recall over the long term. Objectives: This investigation was conducted at King Saud bin Abdulaziz University for Health Sciences in Riyadh, Saudi Arabia. The aim of the study is to understand the dynamics of long-term retention of biochemistry among medical students. Specifically, it examines the association between students' familiarity with biochemistry content and actual knowledge retention levels. Methods: A cross-sectional correlational survey involving 240 students from King Saud bin Abdulaziz University for Health Sciences was conducted. Participants were recruited via nonprobability convenience sampling. A validated biochemistry assessment tool with 20 questions was used to gauge students' retention in biomolecules, catalysis, bioenergetics, and metabolism. To assess students' familiarity with the knowledge content of test questions, each question was accompanied by options indicating students' prior knowledge of its content.
Statistical analyses such as the Mann-Whitney U test, Kruskal-Wallis test, and chi-square test were used. Results: Our findings revealed a significant correlation between students' familiarity with the content and their knowledge retention in the biomolecules (r=0.491; P<.001), catalysis (r=0.500; P<.001), bioenergetics (r=0.528; P<.001), and metabolism (r=0.564; P<.001) biochemistry knowledge domains. Conclusions: This study highlights the significance of familiarity (prior knowledge) in evaluating the retention of biochemistry knowledge. Although limited in terms of generalizability and inherent biases, the research highlights the crucial significance of students' familiarity in actual knowledge retention across several biochemistry domains. These results might be used by educators to customize instructional methods in order to improve students' long-term retention of biochemistry information and boost their clinical performance. UR - https://mededu.jmir.org/2024/1/e56132 UR - http://dx.doi.org/10.2196/56132 ID - info:doi/10.2196/56132 ER - TY - JOUR AU - Yokokawa, Daiki AU - Shikino, Kiyoshi AU - Nishizaki, Yuji AU - Fukui, Sho AU - Tokuda, Yasuharu PY - 2024/12/5 TI - Evaluation of a Computer-Based Morphological Analysis Method for Free-Text Responses in the General Medicine In-Training Examination: Algorithm Validation Study JO - JMIR Med Educ SP - e52068 VL - 10 KW - General Medicine In-Training Examination KW - free-text response KW - morphological analysis KW - Situation, Background, Assessment, and Recommendation KW - video-based question N2 - Background: The General Medicine In-Training Examination (GM-ITE) tests clinical knowledge in a 2-year postgraduate residency program in Japan. In the academic year 2021, as a domain of medical safety, the GM-ITE included questions regarding diagnosis from the medical history and physical findings obtained through video viewing, as well as the skills involved in presenting a case. Examinees watched a video or audio recording of a patient examination and provided free-text responses. However, the human cost of scoring free-text answers may limit the implementation of the GM-ITE. A simple morphological analysis and word-matching model can thus be used to score free-text responses. Objective: This study aimed to compare human versus computer scoring of free-text responses and qualitatively evaluate the discrepancies between human- and machine-generated scores to assess the efficacy of machine scoring. Methods: After obtaining consent for participation in the study, the authors used text data from residents who voluntarily answered the GM-ITE patient reproduction video-based questions involving simulated patients. The GM-ITE used video-based questions to simulate a patient's consultation in the emergency room with a diagnosis of pulmonary embolism following a fracture. Residents provided statements for the case presentation. We obtained human-generated scores by collating the results of 2 independent scorers and machine-generated scores by converting the free-text responses into a word sequence through segmentation and morphological analysis and matching them with a prepared list of correct answers in 2022. Results: Of the 104 responses collected (63 for postgraduate year 1 and 41 for postgraduate year 2), 39 cases remained for final analysis after excluding invalid responses.
The authors found discrepancies between human and machine scoring in 14 questions (7.2%); some were due to shortcomings in machine scoring that could be resolved by maintaining a list of correct words and dictionaries, whereas others were due to human error. Conclusions: Machine scoring is comparable to human scoring. It requires a simple program and calibration but can potentially reduce the cost of scoring free-text responses. UR - https://mededu.jmir.org/2024/1/e52068 UR - http://dx.doi.org/10.2196/52068 ID - info:doi/10.2196/52068 ER - TY - JOUR AU - Huang, Ting-Yun AU - Hsieh, Hsing Pei AU - Chang, Yung-Chun PY - 2024/11/21 TI - Performance Comparison of Junior Residents and ChatGPT in the Objective Structured Clinical Examination (OSCE) for Medical History Taking and Documentation of Medical Records: Development and Usability Study JO - JMIR Med Educ SP - e59902 VL - 10 KW - large language model KW - medical history taking KW - clinical documentation KW - simulation-based evaluation KW - OSCE standards KW - LLM N2 - Background: This study explores the cutting-edge abilities of large language models (LLMs) such as ChatGPT in medical history taking and medical record documentation, with a focus on their practical effectiveness in clinical settings, an area vital for the progress of medical artificial intelligence. Objective: Our aim was to assess the capability of ChatGPT versions 3.5 and 4.0 in performing medical history taking and medical record documentation in simulated clinical environments. The study compared the performance of nonmedical individuals using ChatGPT with that of junior medical residents. Methods: A simulation involving standardized patients was designed to mimic authentic medical history-taking interactions. Five nonmedical participants used ChatGPT versions 3.5 and 4.0 to take medical histories and document medical records, mirroring the tasks performed by 5 junior residents in identical scenarios. A total of 10 diverse scenarios were examined. Results: Evaluation of the medical documentation created by laypersons with ChatGPT assistance and that created by junior residents was conducted by 2 senior emergency physicians using audio recordings and the final medical records. The assessment used the Objective Structured Clinical Examination benchmarks in Taiwan as a reference. ChatGPT-4.0 exhibited substantial enhancements over its predecessor and met or exceeded the performance of human counterparts in terms of both checklist and global assessment scores. Although the overall quality of human consultations remained higher, ChatGPT-4.0's proficiency in medical documentation was notably promising. Conclusions: The performance of ChatGPT 4.0 was on par with that of human participants in Objective Structured Clinical Examination evaluations, signifying its potential in medical history and medical record documentation. Despite this, the superiority of human consultations in terms of quality was evident. The study underscores both the promise and the current limitations of LLMs in the realm of clinical practice.
UR - https://mededu.jmir.org/2024/1/e59902 UR - http://dx.doi.org/10.2196/59902 ID - info:doi/10.2196/59902 ER - TY - JOUR AU - Zhang, Dandan AU - Chen, Yong-Jun AU - Cui, Tianxin AU - Zhang, Jianzhong AU - Chen, Si-Ying AU - Zhang, Yin-Ping PY - 2024/11/18 TI - Competence and Training Needs in Infectious Disease Emergency Response Among Chinese Nurses: Cross-Sectional Study JO - JMIR Public Health Surveill SP - e62887 VL - 10 KW - competence KW - preparedness KW - infectious disease emergency KW - Chinese KW - nurse KW - cross-sectional study KW - COVID-19 KW - pandemic KW - public health KW - health crises KW - emergency response KW - emergency preparedness KW - medical institution KW - health care worker KW - linear regression N2 - Background: In recent years, the frequent outbreaks of infectious diseases and insufficient emergency response capabilities, particularly issues exposed during the COVID-19 pandemic, have underscored the critical role of nurses in addressing public health crises. It is currently necessary to investigate the emergency preparedness of nursing personnel following the full lifting of COVID-19 pandemic restrictions, with the aim of identifying weaknesses and optimizing response strategies. Objective: This study aimed to assess the emergency response competence of nurses, identify their specific training needs, and explore the various elements that impact their emergency response competence. Methods: Using a multistage stratified sampling method, 5 provinces from different geographical locations nationwide were initially randomly selected using random number tables. Subsequently, within each province, 2 tertiary hospitals, 4 secondary hospitals, and 10 primary hospitals were randomly selected for the survey. The random selection and stratification of the hospitals took into account aspects such as geographical location, hospital level, scale, and number of nurses. This study involved 80 hospitals (including 10 tertiary hospitals, 20 secondary hospitals, and 50 primary hospitals), where nurses from different departments, specialties, and age groups anonymously completed a questionnaire on infectious disease emergency response capabilities. Results: This study involved 2055 participants representing various health care institutions. The nurses' mean score in infectious disease emergency response competence was 141.75 (SD 20.09), indicating a moderate to above-average level. Nearly one-fifth (n=397, 19.32%) of nurses had experience in responding to infectious disease emergencies; however, many acknowledged insufficient drills (n=615, 29.93%) and training (n=502, 24.43%). Notably, 1874 (91.19%) nurses expressed a willingness to undergo further training. Multiple linear regression analysis indicated that significant factors affecting infectious disease emergency response competence included the highest degree, frequency of drills and training, and the willingness to undertake further training (B=-11.455, 7.344, 11.639, 14.432, 10.255, 7.364, and -11.216; all P<.05). Notably, a higher frequency of participation in drills and training sessions correlated with better outcomes (P<.001 or P<.05). Nurses holding a master's degree or higher demonstrated significantly lower competence scores in responding to infectious diseases compared with nurses with a diploma or associate degree (P=.001). Approximately 80% (1644) of the nurses preferred training lasting from 3 days to 1 week, with scenario simulations and emergency drills considered the most popular training methods.
Conclusions: These findings highlight the potential of, and the need for, nurses with infectious disease emergency response competence. Frequent drills and training will significantly enhance response competence; however, a lack of practical experience in higher education may have a negative impact on emergency performance. The study emphasizes the critical need for personalized training to boost nurses' abilities, especially through short-term, intensive methods and simulation drills. Further training and tailored plans are essential to improve nurses' overall proficiency and ensure effective responses to infectious disease emergencies. UR - https://publichealth.jmir.org/2024/1/e62887 UR - http://dx.doi.org/10.2196/62887 ID - info:doi/10.2196/62887 ER - TY - JOUR AU - Bicknell, T. Brenton AU - Butler, Danner AU - Whalen, Sydney AU - Ricks, James AU - Dixon, J. Cory AU - Clark, B. Abigail AU - Spaedy, Olivia AU - Skelton, Adam AU - Edupuganti, Neel AU - Dzubinski, Lance AU - Tate, Hudson AU - Dyess, Garrett AU - Lindeman, Brenessa AU - Lehmann, Soleymani Lisa PY - 2024/11/6 TI - ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis JO - JMIR Med Educ SP - e63430 VL - 10 KW - large language model KW - ChatGPT KW - medical education KW - USMLE KW - AI in medical education KW - medical student resources KW - educational technology KW - artificial intelligence in medicine KW - clinical skills KW - LLM KW - medical licensing examination KW - medical students KW - United States Medical Licensing Examination KW - ChatGPT 4 Omni KW - ChatGPT 4 KW - ChatGPT 3.5 N2 - Background: Recent studies, including those by the National Board of Medical Examiners, have highlighted the remarkable capabilities of recent large language models (LLMs) such as ChatGPT in passing the United States Medical Licensing Examination (USMLE). However, there is a gap in detailed analysis of LLM performance in specific medical content areas, thus limiting an assessment of their potential utility in medical education. Objective: This study aimed to assess and compare the accuracy of successive ChatGPT versions (GPT-3.5, GPT-4, and GPT-4 Omni) in USMLE disciplines, clinical clerkships, and the clinical skills of diagnostics and management. Methods: This study used 750 clinical vignette-based multiple-choice questions to characterize the performance of successive ChatGPT versions (ChatGPT 3.5 [GPT-3.5], ChatGPT 4 [GPT-4], and ChatGPT 4 Omni [GPT-4o]) across USMLE disciplines, clinical clerkships, and clinical skills (diagnostics and management). Accuracy was assessed using a standardized protocol, with statistical analyses conducted to compare the models' performances. Results: GPT-4o achieved the highest accuracy across 750 multiple-choice questions at 90.4%, outperforming GPT-4 and GPT-3.5, which scored 81.1% and 60.0%, respectively. GPT-4o's highest performances were in social sciences (95.5%), behavioral and neuroscience (94.2%), and pharmacology (93.2%). In clinical skills, GPT-4o's diagnostic accuracy was 92.7% and management accuracy was 88.8%, significantly higher than its predecessors. Notably, both GPT-4o and GPT-4 significantly outperformed the medical student average accuracy of 59.3% (95% CI 58.3-60.3). Conclusions: GPT-4o's performance in USMLE disciplines, clinical clerkships, and clinical skills indicates substantial improvements over its predecessors, suggesting significant potential for the use of this technology as an educational aid for medical students.
These findings underscore the need for careful consideration when integrating LLMs into medical education, emphasizing the importance of structured curricula to guide their appropriate use and the need for ongoing critical analyses to ensure their reliability and effectiveness. UR - https://mededu.jmir.org/2024/1/e63430 UR - http://dx.doi.org/10.2196/63430 ID - info:doi/10.2196/63430 ER - TY - JOUR AU - Sahyouni, Amal AU - Zoukar, Imad AU - Dashash, Mayssoon PY - 2024/10/28 TI - Evaluating the Effectiveness of an Online Course on Pediatric Malnutrition for Syrian Health Professionals: Qualitative Delphi Study JO - JMIR Med Educ SP - e53151 VL - 10 KW - effectiveness KW - online course KW - pediatric KW - malnutrition KW - essential competencies KW - e-learning KW - health professional KW - Syria KW - pilot study KW - acquisition knowledge N2 - Background: There is a shortage of competent health professionals in managing malnutrition. Online education may be a practical and flexible approach to address this gap. Objective: This study aimed to identify essential competencies and assess the effectiveness of an online course on pediatric malnutrition in improving the knowledge of pediatricians and health professionals. Methods: A focus group (n=5) and Delphi technique (n=21 health professionals) were used to identify 68 essential competencies. An online course consisting of 4 educational modules in Microsoft PowerPoint (Microsoft Corp) slide form with visual aids (photos and videos) was designed and published on the Syrian Virtual University platform website using an asynchronous e-learning system. The course covered definition, classification, epidemiology, anthropometrics, treatment, and consequences. Participants (n=10) completed a pretest of 40 multiple-choice questions, accessed the course, completed a posttest after a specified period, and filled out a questionnaire to measure their attitude and assess their satisfaction. Results: A total of 68 essential competencies were identified, categorized into 3 domains: knowledge (24 competencies), skills (29 competencies), and attitudes (15 competencies). These competencies were further classified based on their focus area: etiology (10 competencies), assessment and diagnosis (21 competencies), and management (37 competencies). Further, 10 volunteers, consisting of 5 pediatricians and 5 health professionals, participated in this study over a 2-week period. A statistically significant increase in knowledge was observed among participants following completion of the online course (pretest mean 24.2, SD 6.1, and posttest mean 35.2, SD 3.3; P<.001). Pediatricians demonstrated higher pre- and posttest scores compared to other health care professionals (all P values were <.05). Prior malnutrition training within the past year positively impacted pretest scores (P=.03). Participants highly rated the course (mean satisfaction score >3.0 on a 5-point Likert scale), with 60% (6/10) favoring a blended learning approach. Conclusions: In total, 68 essential competencies are required for pediatricians to manage children who are malnourished. The online course effectively improved knowledge acquisition among health care professionals, with high participant satisfaction and approval of the e-learning environment. 
UR - https://mededu.jmir.org/2024/1/e53151 UR - http://dx.doi.org/10.2196/53151 ID - info:doi/10.2196/53151 ER - TY - JOUR AU - Goodings, James Anthony AU - Kajitani, Sten AU - Chhor, Allison AU - Albakri, Ahmad AU - Pastrak, Mila AU - Kodancha, Megha AU - Ives, Rowan AU - Lee, Bin Yoo AU - Kajitani, Kari PY - 2024/10/8 TI - Assessment of ChatGPT-4 in Family Medicine Board Examinations Using Advanced AI Learning and Analytical Methods: Observational Study JO - JMIR Med Educ SP - e56128 VL - 10 KW - ChatGPT-4 KW - Family Medicine Board Examination KW - artificial intelligence in medical education KW - AI performance assessment KW - prompt engineering KW - ChatGPT KW - artificial intelligence KW - AI KW - medical education KW - assessment KW - observational KW - analytical method KW - data analysis KW - examination N2 - Background: This research explores the capabilities of ChatGPT-4 in passing the American Board of Family Medicine (ABFM) Certification Examination. Addressing a gap in existing literature, where earlier artificial intelligence (AI) models showed limitations in medical board examinations, this study evaluates the enhanced features and potential of ChatGPT-4, especially in document analysis and information synthesis. Objective: The primary goal is to assess whether ChatGPT-4, when provided with extensive preparation resources and when using sophisticated data analysis, can achieve a score equal to or above the passing threshold for the Family Medicine Board Examinations. Methods: In this study, ChatGPT-4 was embedded in a specialized subenvironment, "AI Family Medicine Board Exam Taker," designed to closely mimic the conditions of the ABFM Certification Examination. This subenvironment enabled the AI to access and analyze a range of relevant study materials, including a primary medical textbook and supplementary web-based resources. The AI was presented with a series of ABFM-type examination questions, reflecting the breadth and complexity typical of the examination. Emphasis was placed on assessing the AI's ability to interpret and respond to these questions accurately, leveraging its advanced data processing and analysis capabilities within this controlled subenvironment. Results: In our study, ChatGPT-4's performance was quantitatively assessed on 300 practice ABFM examination questions. The AI achieved a correct response rate of 88.67% (95% CI 85.08%-92.25%) for the Custom Robot version and 87.33% (95% CI 83.57%-91.10%) for the Regular version. Statistical analysis, including the McNemar test (P=.45), indicated no significant difference in accuracy between the 2 versions. In addition, the chi-square test for error-type distribution (P=.32) revealed no significant variation in the pattern of errors across versions. These results highlight ChatGPT-4's capacity for high-level performance and consistency in responding to complex medical examination questions under controlled conditions. Conclusions: The study demonstrates that ChatGPT-4, particularly when equipped with specialized preparation and when operating in a tailored subenvironment, shows promising potential in handling the intricacies of medical board examinations. While its performance is comparable with the expected standards for passing the ABFM Certification Examination, further enhancements in AI technology and tailored training methods could push these capabilities to new heights.
This exploration opens avenues for integrating AI tools such as ChatGPT-4 in medical education and assessment, emphasizing the importance of continuous advancement and specialized training in medical applications of AI. UR - https://mededu.jmir.org/2024/1/e56128 UR - http://dx.doi.org/10.2196/56128 ID - info:doi/10.2196/56128 ER - TY - JOUR AU - Wu, Zelin AU - Gan, Wenyi AU - Xue, Zhaowen AU - Ni, Zhengxin AU - Zheng, Xiaofei AU - Zhang, Yiyi PY - 2024/10/3 TI - Performance of ChatGPT on Nursing Licensure Examinations in the United States and China: Cross-Sectional Study JO - JMIR Med Educ SP - e52746 VL - 10 KW - artificial intelligence KW - ChatGPT KW - nursing licensure examination KW - nursing KW - LLMs KW - large language models KW - nursing education KW - AI KW - nursing student KW - large language model KW - licensing KW - observation KW - observational study KW - China KW - USA KW - United States of America KW - auxiliary tool KW - accuracy rate KW - theoretical N2 - Background: The creation of large language models (LLMs) such as ChatGPT is an important step in the development of artificial intelligence, which shows great potential in medical education due to its powerful language understanding and generative capabilities. The purpose of this study was to quantitatively evaluate and comprehensively analyze ChatGPT's performance in handling questions for the National Nursing Licensure Examination (NNLE) in China and the United States, including the National Council Licensure Examination for Registered Nurses (NCLEX-RN) and the NNLE. Objective: This study aims to examine how well LLMs respond to the NCLEX-RN and the NNLE multiple-choice questions (MCQs) in various language inputs. To evaluate whether LLMs can be used as multilingual learning assistance for nursing, and to assess whether they possess a repository of professional knowledge applicable to clinical nursing practice. Methods: First, we compiled 150 NCLEX-RN Practical MCQs, 240 NNLE Theoretical MCQs, and 240 NNLE Practical MCQs. Then, the translation function of ChatGPT 3.5 was used to translate NCLEX-RN questions from English to Chinese and NNLE questions from Chinese to English. Finally, the original version and the translated version of the MCQs were inputted into ChatGPT 4.0, ChatGPT 3.5, and Google Bard. Different LLMs were compared according to the accuracy rate, and the differences between different language inputs were compared. Results: The accuracy rates of ChatGPT 4.0 for NCLEX-RN practical questions and Chinese-translated NCLEX-RN practical questions were 88.7% (133/150) and 79.3% (119/150), respectively. Despite the statistical significance of the difference (P=.03), the correct rate was generally satisfactory. Around 71.9% (169/235) of NNLE Theoretical MCQs and 69.1% (161/233) of NNLE Practical MCQs were correctly answered by ChatGPT 4.0. The accuracy of ChatGPT 4.0 in processing NNLE Theoretical MCQs and NNLE Practical MCQs translated into English was 71.5% (168/235; P=.92) and 67.8% (158/233; P=.77), respectively, and there was no statistically significant difference between the results of text input in different languages. ChatGPT 3.5 (NCLEX-RN P=.003, NNLE Theoretical P<.001, NNLE Practical P=.12) and Google Bard (NCLEX-RN P<.001, NNLE Theoretical P<.001, NNLE Practical P<.001) had lower accuracy rates for nursing-related MCQs than ChatGPT 4.0 in English input.
English accuracy was higher when compared with ChatGPT 3.5's Chinese input, and the difference was statistically significant (NCLEX-RN P=.02, NNLE Practical P=.02). Whether submitted in Chinese or English, the MCQs from the NCLEX-RN and NNLE demonstrated that ChatGPT 4.0 had the highest number of unique correct responses and the lowest number of unique incorrect responses among the 3 LLMs. Conclusions: This study, focusing on 618 nursing MCQs including NCLEX-RN and NNLE exams, found that ChatGPT 4.0 outperformed ChatGPT 3.5 and Google Bard in accuracy. It excelled in processing English and Chinese inputs, underscoring its potential as a valuable tool in nursing education and clinical decision-making. UR - https://mededu.jmir.org/2024/1/e52746 UR - http://dx.doi.org/10.2196/52746 ID - info:doi/10.2196/52746 ER - TY - JOUR AU - Perotte, Rimma AU - Berns, Alyssa AU - Shaker, Lana AU - Ophaswongse, Chayapol AU - Underwood, Joseph AU - Hajicharalambous, Christina PY - 2024/9/23 TI - Creation of an Automated and Comprehensive Resident Progress System for Residents and to Save Hours of Faculty Time: Mixed Methods Study JO - JMIR Form Res SP - e53314 VL - 8 KW - progress dashboard KW - informatics in medical education KW - residency learning management system KW - residency progress system KW - residency education system KW - summarization KW - administrative burden KW - medical education KW - resident KW - residency KW - resident data KW - longitudinal KW - pilot study KW - competency KW - dashboards KW - dashboard KW - faculty KW - residents N2 - Background: It is vital for residents to have a longitudinal view of their educational progression, and it is crucial for the medical education team to have a clear way to track resident progress over time. Current tools for aggregating resident data are difficult to use and do not provide a comprehensive way to evaluate and display resident educational advancement. Objective: This study aims to describe the creation and assessment of a system designed to improve the longitudinal presentation, quality, and synthesis of educational progress for trainees. We created a new system for residency progress management with 3 goals in mind: (1) a long-term and centralized location for residency education data, (2) a clear and intuitive interface that is easy to access for both the residents and faculty involved in medical education, and (3) automated data input, transformation, and analysis. We present evaluations regarding whether residents find the system useful, and whether faculty like the system and perceive that it helps them save time with administrative duties. Methods: The system was created using a suite of Google Workspace tools including Forms, Sheets, Gmail, and a collection of Apps Scripts triggered at various times and events. To assess whether the system had an effect on the residents, we surveyed and asked them to self-report on how often they accessed the system and interviewed them as to whether they found it useful. To understand what the faculty thought of the system, we conducted a 14-person focus group and asked the faculty to self-report their time spent preparing for residency progress meetings before and after the system debut. Results: The system went live in February 2022 as a quality improvement project, evolving through multiple iterations of feedback.
The authors found that the system was accessed differently by different postgraduate years (PGY), with the most usage reported in the PGY1 class (weekly), and the least amount of usage in the PGY3 class (once or twice). However, all of the residents reported finding the system useful, specifically for aggregating all of their evaluations in the same place. Faculty members felt that the system enabled a higher-quality biannual clinical competency committee meeting, and they reported a combined time savings of 8 hours in preparation for each clinical competency committee as a result of reviewing resident data through the system. Conclusions: Our study reports on the creation of an automated, instantaneous, and comprehensive resident progress management system. The system has been shown to be well-liked by both residents and faculty. Younger PGY classes reported more frequent system usage than older PGY classes. Faculty reported that it helped facilitate more meaningful discussion of training progression and reduced the administrative burden by 8 hours per biannual session. UR - https://formative.jmir.org/2024/1/e53314 UR - http://dx.doi.org/10.2196/53314 UR - http://www.ncbi.nlm.nih.gov/pubmed/39312292 ID - info:doi/10.2196/53314 ER - TY - JOUR AU - Yamamoto, Akira AU - Koda, Masahide AU - Ogawa, Hiroko AU - Miyoshi, Tomoko AU - Maeda, Yoshinobu AU - Otsuka, Fumio AU - Ino, Hideo PY - 2024/9/23 TI - Enhancing Medical Interview Skills Through AI-Simulated Patient Interactions: Nonrandomized Controlled Trial JO - JMIR Med Educ SP - e58753 VL - 10 KW - medical interview KW - generative pretrained transformer KW - large language model KW - simulation-based learning KW - OSCE KW - artificial intelligence KW - medical education KW - simulated patients KW - nonrandomized controlled trial N2 - Background: Medical interviewing is a critical skill in clinical practice, yet opportunities for practical training are limited in Japanese medical schools, necessitating urgent measures. Given advancements in artificial intelligence (AI) technology, its application in the medical field is expanding. However, reports on its application in medical interviews in medical education are scarce. Objective: This study aimed to investigate whether medical students' interview skills could be improved by engaging with AI-simulated patients using large language models, including the provision of feedback. Methods: This nonrandomized controlled trial was conducted with fourth-year medical students in Japan. A simulation program using large language models was provided to 35 students in the intervention group in 2023, while 110 students from 2022 who did not participate in the intervention were selected as the control group. The primary outcome was the score on the Pre-Clinical Clerkship Objective Structured Clinical Examination (pre-CC OSCE), a national standardized clinical skills examination, in medical interviewing. Secondary outcomes included surveys such as the Simulation-Based Training Quality Assurance Tool (SBT-QA10), administered at the start and end of the study. Results: The AI intervention group showed significantly higher scores on medical interviews than the control group (AI group vs control group: mean 28.1, SD 1.6 vs 27.1, SD 2.2; P=.01). There was a trend of inverse correlation between the SBT-QA10 and pre-CC OSCE scores (regression coefficient -2.0 to -2.1). No significant safety concerns were observed.
Conclusions: Education through medical interviews using AI-simulated patients has demonstrated safety and a certain level of educational effectiveness. However, at present, the educational effects of this platform on nonverbal communication skills are limited, suggesting that it should be used as a supplementary tool to traditional simulation education. UR - https://mededu.jmir.org/2024/1/e58753 UR - http://dx.doi.org/10.2196/58753 UR - http://www.ncbi.nlm.nih.gov/pubmed/39312284 ID - info:doi/10.2196/58753 ER - TY - JOUR AU - Yoon, Soo-Hyuk AU - Oh, Kyeong Seok AU - Lim, Gun Byung AU - Lee, Ho-Jin PY - 2024/9/16 TI - Performance of ChatGPT in the In-Training Examination for Anesthesiology and Pain Medicine Residents in South Korea: Observational Study JO - JMIR Med Educ SP - e56859 VL - 10 KW - AI tools KW - problem solving KW - anesthesiology KW - artificial intelligence KW - pain medicine KW - ChatGPT KW - health care KW - medical education KW - South Korea N2 - Background: ChatGPT has been tested in health care, including the US Medical Licensing Examination and specialty exams, showing near-passing results. Its performance in the field of anesthesiology has been assessed using English board examination questions; however, its effectiveness in Korea remains unexplored. Objective: This study investigated the problem-solving performance of ChatGPT in the fields of anesthesiology and pain medicine in the Korean language context, highlighted advancements in artificial intelligence (AI), and explored its potential applications in medical education. Methods: We investigated the performance (number of correct answers/number of questions) of GPT-4, GPT-3.5, and CLOVA X in the fields of anesthesiology and pain medicine, using in-training examinations that have been administered to Korean anesthesiology residents over the past 5 years, with an annual composition of 100 questions. Questions containing images, diagrams, or photographs were excluded from the analysis. Furthermore, to assess the performance differences of the GPT across different languages, we conducted a comparative analysis of the GPT-4's problem-solving proficiency using both the original Korean texts and their English translations. Results: A total of 398 questions were analyzed. GPT-4 (67.8%) demonstrated a significantly better overall performance than GPT-3.5 (37.2%) and CLOVA-X (36.7%). However, GPT-3.5 and CLOVA X did not show significant differences in their overall performance. Additionally, the GPT-4 showed superior performance on questions translated into English, indicating a language processing discrepancy (English: 75.4% vs Korean: 67.8%; difference 7.5%; 95% CI 3.1%-11.9%; P=.001). Conclusions: This study underscores the potential of AI tools, such as ChatGPT, in medical education and practice but emphasizes the need for cautious application and further refinement, especially in non-English medical contexts. The findings suggest that although AI advancements are promising, they require careful evaluation and development to ensure acceptable performance across diverse linguistic and professional settings.
UR - https://mededu.jmir.org/2024/1/e56859 UR - http://dx.doi.org/10.2196/56859 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/56859 ER - TY - JOUR AU - Johnsen, Mari Hege AU - Nes, Gonçalves Andréa Aparecida AU - Haddeland, Kristine PY - 2024/9/10 TI - Experiences of Using a Digital Guidance and Assessment Tool (the Technology-Optimized Practice Process in Nursing Application) During Clinical Practice in a Nursing Home: Focus Group Study Among Nursing Students JO - JMIR Nursing SP - e48810 VL - 7 KW - application KW - assessment of clinical education KW - AssCE KW - clinical education assessment tool KW - electronic reports KW - feedback KW - guidance model KW - smartphone KW - Technology-Optimized Practice Process in Nursing KW - TOPP-N KW - information system success model KW - nurse KW - nursing KW - allied health KW - education KW - focus group KW - focus groups KW - technology enhanced learning KW - digital health KW - content analysis KW - student KW - students KW - nursing home KW - long-term care KW - learning management KW - mobile phone N2 - Background: Nursing students' learning during clinical practice is largely influenced by the quality of the guidance they receive from their nurse preceptors. Students who have attended placement in nursing home settings have called for more time with nurse preceptors and an opportunity for more help from the nurses for reflection and developing critical thinking skills. To strengthen students' guidance and assessment and enhance students' learning in the practice setting, it has also been recommended to improve the collaboration between faculties and nurse preceptors. Objective: This study explores first-year nursing students' experiences of using the Technology-Optimized Practice Process in Nursing (TOPP-N) application in 4 nursing homes in Norway. TOPP-N was developed to support guidance and assessment in clinical practice in nursing education. Methods: Four focus groups were conducted with 19 nursing students from 2 university campuses in Norway. The data collection and directed content analysis were based on DeLone and McLean's information system success model. Results: Some participants had difficulties learning to use the TOPP-N tool, particularly those who had not attended the 1-hour digital course. Furthermore, participants remarked that the content of the TOPP-N guidance module could be better adjusted to the current clinical placement, level of education, and individual achievements to be more usable. Despite this, most participants liked the TOPP-N application's concept. Using the TOPP-N mobile app for guidance and assessment was found to be very flexible. The frequency and ways of using the application varied among the participants. Most participants perceived that the use of TOPP-N facilitated awareness of learning objectives and enabled continuous reflection and feedback from nurse preceptors. However, the findings indicate that the TOPP-N application's perceived usefulness was highly dependent on the preparedness and use of the app among nurse preceptors (or absence thereof). Conclusions: This study offers information about critical success factors perceived by nursing students related to the use of the TOPP-N application. To develop similar learning management systems that are usable and efficient, developers should focus on personalizing the content, clarifying procedures for use, and enhancing the training and motivation of users, that is, students, nurse preceptors, and educators.
UR - https://nursing.jmir.org/2024/1/e48810 UR - http://dx.doi.org/10.2196/48810 UR - http://www.ncbi.nlm.nih.gov/pubmed/39255477 ID - info:doi/10.2196/48810 ER - TY - JOUR AU - Thomae, V. Anita AU - Witt, M. Claudia AU - Barth, Jürgen PY - 2024/8/22 TI - Integration of ChatGPT Into a Course for Medical Students: Explorative Study on Teaching Scenarios, Students' Perception, and Applications JO - JMIR Med Educ SP - e50545 VL - 10 KW - medical education KW - ChatGPT KW - artificial intelligence KW - information for patients KW - critical appraisal KW - evaluation KW - blended learning KW - AI KW - digital skills KW - teaching N2 - Background: Text-generating artificial intelligence (AI) such as ChatGPT offers many opportunities and challenges in medical education. Acquiring practical skills necessary for using AI in a clinical context is crucial, especially for medical education. Objective: This explorative study aimed to investigate the feasibility of integrating ChatGPT into teaching units and to evaluate the course and the importance of AI-related competencies for medical students. Since a possible application of ChatGPT in the medical field could be the generation of information for patients, we further investigated how such information is perceived by students in terms of persuasiveness and quality. Methods: ChatGPT was integrated into 3 different teaching units of a blended learning course for medical students. Using a mixed methods approach, quantitative and qualitative data were collected. As baseline data, we assessed students' characteristics, including their openness to digital innovation. The students evaluated the integration of ChatGPT into the course and shared their thoughts regarding the future of text-generating AI in medical education. The course was evaluated based on the Kirkpatrick Model, with satisfaction, learning progress, and applicable knowledge considered as key assessment levels. In ChatGPT-integrating teaching units, students evaluated videos featuring information for patients regarding their persuasiveness on treatment expectations in a self-experience experiment and critically reviewed information for patients written using ChatGPT 3.5 based on different prompts. Results: A total of 52 medical students participated in the study. The comprehensive evaluation of the course revealed elevated levels of satisfaction, learning progress, and applicability specifically in relation to the ChatGPT-integrating teaching units. Furthermore, all evaluation levels demonstrated an association with each other. Higher openness to digital innovation was associated with higher satisfaction and, to a lesser extent, with higher applicability. AI-related competencies in other courses of the medical curriculum were perceived as highly important by medical students. Qualitative analysis highlighted potential use cases of ChatGPT in teaching and learning. In ChatGPT-integrating teaching units, students rated information for patients generated using a basic ChatGPT prompt as "moderate" in terms of comprehensibility, patient safety, and the correct application of communication rules taught during the course. The students' ratings were considerably improved using an extended prompt. The same text, however, showed the smallest increase in treatment expectations when compared with information provided by humans (patient, clinician, and expert) via videos. Conclusions: This study offers valuable insights into integrating the development of AI competencies into a blended learning course.
Integration of ChatGPT enhanced learning experiences for medical students. UR - https://mededu.jmir.org/2024/1/e50545 UR - http://dx.doi.org/10.2196/50545 ID - info:doi/10.2196/50545 ER - TY - JOUR AU - Gan, Wenyi AU - Ouyang, Jianfeng AU - Li, Hua AU - Xue, Zhaowen AU - Zhang, Yiming AU - Dong, Qiu AU - Huang, Jiadong AU - Zheng, Xiaofei AU - Zhang, Yiyi PY - 2024/8/20 TI - Integrating ChatGPT in Orthopedic Education for Medical Undergraduates: Randomized Controlled Trial JO - J Med Internet Res SP - e57037 VL - 26 KW - ChatGPT KW - medical education KW - orthopedics KW - artificial intelligence KW - large language model KW - natural language processing KW - randomized controlled trial KW - learning aid N2 - Background: ChatGPT is a natural language processing model developed by OpenAI, which can be iteratively updated and optimized to accommodate the changing and complex requirements of human verbal communication. Objective: The study aimed to evaluate ChatGPT's accuracy in answering orthopedics-related multiple-choice questions (MCQs) and assess its short-term effects as a learning aid through a randomized controlled trial. In addition, long-term effects on student performance in other subjects were measured using final examination results. Methods: We first evaluated ChatGPT's accuracy in answering MCQs pertaining to orthopedics across various question formats. Then, 129 undergraduate medical students participated in a randomized controlled study in which the ChatGPT group used ChatGPT as a learning tool, while the control group was prohibited from using artificial intelligence software to support learning. Following a 2-week intervention, the 2 groups' understanding of orthopedics was assessed by an orthopedics test, and variations in the 2 groups' performance in other disciplines were noted through a follow-up at the end of the semester. Results: ChatGPT-4.0 answered 1051 orthopedics-related MCQs with a 70.60% (742/1051) accuracy rate, including 71.8% (237/330) accuracy for A1 MCQs, 73.7% (330/448) accuracy for A2 MCQs, 70.2% (92/131) accuracy for A3/4 MCQs, and 58.5% (83/142) accuracy for case analysis MCQs. As of April 7, 2023, a total of 129 individuals participated in the experiment. However, 19 individuals withdrew from the experiment at various phases; thus, as of July 1, 2023, a total of 110 individuals accomplished the trial and completed all follow-up work. After we intervened in the learning style of the students in the short term, the ChatGPT group answered more questions correctly than the control group (ChatGPT group: mean 141.20, SD 26.68; control group: mean 130.80, SD 25.56; P=.04) in the orthopedics test, particularly on A1 (ChatGPT group: mean 46.57, SD 8.52; control group: mean 42.18, SD 9.43; P=.01), A2 (ChatGPT group: mean 60.59, SD 10.58; control group: mean 56.66, SD 9.91; P=.047), and A3/4 MCQs (ChatGPT group: mean 19.57, SD 5.48; control group: mean 16.46, SD 4.58; P=.002). At the end of the semester, we found that the ChatGPT group performed better on final examinations in surgery (ChatGPT group: mean 76.54, SD 9.79; control group: mean 72.54, SD 8.11; P=.02) and obstetrics and gynecology (ChatGPT group: mean 75.98, SD 8.94; control group: mean 72.54, SD 8.66; P=.04) than the control group. Conclusions: ChatGPT answers orthopedics-related MCQs accurately, and students using it excel in both short-term and long-term assessments. Our findings strongly support ChatGPT's integration into medical education, enhancing contemporary instructional methods.
Trial Registration: Chinese Clinical Trial Registry Chictr2300071774; https://www.chictr.org.cn/hvshowproject.html?id=225740&v=1.0 UR - https://www.jmir.org/2024/1/e57037 UR - http://dx.doi.org/10.2196/57037 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/57037 ER - TY - JOUR AU - Ming, Shuai AU - Guo, Qingge AU - Cheng, Wenjun AU - Lei, Bo PY - 2024/8/13 TI - Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study JO - JMIR Med Educ SP - e52784 VL - 10 KW - ChatGPT KW - Chinese National Medical Licensing Examination KW - large language models KW - medical education KW - system role KW - LLM KW - LLMs KW - language model KW - language models KW - artificial intelligence KW - chatbot KW - chatbots KW - conversational agent KW - conversational agents KW - exam KW - exams KW - examination KW - examinations KW - OpenAI KW - answer KW - answers KW - response KW - responses KW - accuracy KW - performance KW - China KW - Chinese N2 - Background: With the increasing application of large language models like ChatGPT in various industries, its potential in the medical domain, especially in standardized examinations, has become a focal point of research. Objective: The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE). Methods: The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the version of GPT-3.5 and 4.0, the prompt's designation of system roles tailored to medical subspecialties, and repetition for coherence. A passing accuracy threshold was established as 60%. The χ2 tests and κ values were employed to evaluate the model's accuracy and consistency. Results: GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%-3.7%) and GPT-3.5 (1.3%-4.5%), and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response. Conclusions: GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role did not significantly enhance the model's reliability and answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study. UR - https://mededu.jmir.org/2024/1/e52784 UR - http://dx.doi.org/10.2196/52784 ID - info:doi/10.2196/52784 ER - TY - JOUR AU - Burke, B. Harry AU - Hoang, Albert AU - Lopreiato, O.
Joseph AU - King, Heidi AU - Hemmer, Paul AU - Montgomery, Michael AU - Gagarin, Viktoria PY - 2024/7/25 TI - Assessing the Ability of a Large Language Model to Score Free-Text Medical Student Clinical Notes: Quantitative Study JO - JMIR Med Educ SP - e56342 VL - 10 KW - medical education KW - generative artificial intelligence KW - natural language processing KW - ChatGPT KW - generative pretrained transformer KW - standardized patients KW - clinical notes KW - free-text notes KW - history and physical examination KW - large language model KW - LLM KW - medical student KW - medical students KW - clinical information KW - artificial intelligence KW - AI KW - patients KW - patient KW - medicine N2 - Background: Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback regarding the quality of their free-text clinical notes. Objective: The goal of this study was to assess the ability of ChatGPT 3.5, a large language model, to score medical students' free-text history and physical notes. Methods: This is a single-institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free-text history and physical note of their interaction. The students' notes were scored independently by the standardized patients and ChatGPT using a prespecified scoring rubric that consisted of 85 case elements. The measure of accuracy was percent correct. Results: The study population consisted of 168 first-year medical students. There was a total of 14,280 scores. The ChatGPT incorrect scoring rate was 1.0%, and the standardized patient incorrect scoring rate was 7.2%. The ChatGPT error rate was 86% lower than the standardized patient error rate. The ChatGPT mean incorrect scoring rate of 12 (SD 11) was significantly lower than the standardized patient mean incorrect scoring rate of 85 (SD 74; P=.002). Conclusions: ChatGPT demonstrated a significantly lower error rate compared to standardized patients. This is the first study to assess the ability of a generative pretrained transformer (GPT) program to score medical students' standardized patient-based free-text clinical notes. It is expected that, in the near future, large language models will provide real-time feedback to practicing physicians regarding their free-text notes. GPT artificial intelligence programs represent an important advance in medical education and medical practice.
UR - https://mededu.jmir.org/2024/1/e56342 UR - http://dx.doi.org/10.2196/56342 ID - info:doi/10.2196/56342 ER - TY - JOUR AU - Cherif, Hela AU - Moussa, Chirine AU - Missaoui, Mouhaymen Abdel AU - Salouage, Issam AU - Mokaddem, Salma AU - Dhahri, Besma PY - 2024/7/23 TI - Appraisal of ChatGPT's Aptitude for Medical Education: Comparative Analysis With Third-Year Medical Students in a Pulmonology Examination JO - JMIR Med Educ SP - e52818 VL - 10 KW - medical education KW - ChatGPT KW - GPT KW - artificial intelligence KW - natural language processing KW - NLP KW - pulmonary medicine KW - pulmonary KW - lung KW - lungs KW - respiratory KW - respiration KW - pneumology KW - comparative analysis KW - large language models KW - LLMs KW - LLM KW - language model KW - generative AI KW - generative artificial intelligence KW - generative KW - exams KW - exam KW - examinations KW - examination N2 - Background: The rapid evolution of ChatGPT has generated substantial interest and led to extensive discussions in both public and academic domains, particularly in the context of medical education. Objective: This study aimed to evaluate ChatGPT's performance in a pulmonology examination through a comparative analysis with that of third-year medical students. Methods: In this cross-sectional study, we conducted a comparative analysis with 2 distinct groups. The first group comprised 244 third-year medical students who had previously taken our institution's 2020 pulmonology examination, which was conducted in French. The second group involved ChatGPT-3.5 in 2 separate sets of conversations: without contextualization (V1) and with contextualization (V2). In both V1 and V2, ChatGPT received the same set of questions administered to the students. Results: V1 demonstrated exceptional proficiency in radiology, microbiology, and thoracic surgery, surpassing the majority of medical students in these domains. However, it faced challenges in pathology, pharmacology, and clinical pneumology. In contrast, V2 consistently delivered more accurate responses across various question categories, regardless of the specialization. ChatGPT exhibited suboptimal performance in multiple choice questions compared to medical students. V2 excelled in responding to structured open-ended questions. Both ChatGPT conversations, particularly V2, outperformed students in addressing questions of low and intermediate difficulty. Interestingly, students showcased enhanced proficiency when confronted with highly challenging questions. V1 fell short of passing the examination. Conversely, V2 successfully achieved examination success, outperforming 139 (62.1%) medical students. Conclusions: While ChatGPT has access to a comprehensive web-based data set, its performance closely mirrors that of an average medical student. Outcomes are influenced by question format, item complexity, and contextual nuances. The model faces challenges in medical contexts requiring information synthesis, advanced analytical aptitude, and clinical judgment, as well as in non-English language assessments and when confronted with data outside mainstream internet sources.
UR - https://mededu.jmir.org/2024/1/e52818 UR - http://dx.doi.org/10.2196/52818 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/52818 ER - TY - JOUR AU - Rössler, Lena AU - Herrmann, Manfred AU - Wiegand, Annette AU - Kanzow, Philipp PY - 2024/6/27 TI - Use of Multiple-Choice Items in Summative Examinations: Questionnaire Survey Among German Undergraduate Dental Training Programs JO - JMIR Med Educ SP - e58126 VL - 10 KW - alternate-choice KW - assessment KW - best-answer KW - dental KW - dental schools KW - dental training KW - education KW - educational assessment KW - educational measurement KW - examination KW - German KW - Germany KW - k of n KW - Kprim KW - K′ KW - medical education KW - medical student KW - MTF KW - Multiple-True-False KW - multiple choice KW - multiple-select KW - Pick-N KW - scoring KW - scoring system KW - single choice KW - single response KW - test KW - testing KW - true/false KW - true-false KW - Type A KW - Type K KW - Type K′ KW - Type R KW - Type X KW - undergraduate KW - undergraduate curriculum KW - undergraduate education N2 - Background: Multiple-choice examinations are frequently used in German dental schools. However, details regarding the used item types and applied scoring methods are lacking. Objective: This study aims to gain insight into the current use of multiple-choice items (ie, questions) in summative examinations in German undergraduate dental training programs. Methods: A paper-based 10-item questionnaire regarding the used assessment methods, multiple-choice item types, and applied scoring methods was designed. The pilot-tested questionnaire was mailed to the deans of studies and to the heads of the Department of Operative/Restorative Dentistry at all 30 dental schools in Germany in February 2023. Statistical analysis was performed using the Fisher exact test (P<.05). Results: The response rate amounted to 90% (27/30 dental schools). All respondent dental schools used multiple-choice examinations for summative assessments. Examinations were delivered electronically by 70% (19/27) of the dental schools. Almost all dental schools used single-choice Type A items (24/27, 89%), which accounted for the largest number of items in approximately half of the dental schools (13/27, 48%). Further item types (eg, conventional multiple-select items, Multiple-True-False, and Pick-N) were used by fewer dental schools (≤67%, up to 18 out of 27 dental schools). For the multiple-select item types, the applied scoring methods varied considerably (ie, awarding [intermediate] partial credit and requirements for partial credit). Dental schools with the possibility of electronic examinations used multiple-select items slightly more often (14/19, 74% vs 4/8, 50%). However, this difference was not statistically significant (P=.38). Dental schools used items either individually or as key feature problems consisting of a clinical case scenario followed by a number of items focusing on critical treatment steps (15/27, 56%). Not a single school used alternative testing methods (eg, answer-until-correct). A formal item review process was established at about half of the dental schools (15/27, 56%). Conclusions: Summative assessment methods among German dental schools vary widely. In particular, a large variability regarding the use and scoring of multiple-select multiple-choice items was found. UR - https://mededu.jmir.org/2024/1/e58126 UR - http://dx.doi.org/10.2196/58126 ID - info:doi/10.2196/58126 ER - TY - JOUR AU - Sekhar, C. Tejas AU - Nayak, R.
Yash AU - Abdoler, A. Emily PY - 2024/6/7 TI - A Use Case for Generative AI in Medical Education JO - JMIR Med Educ SP - e56117 VL - 10 KW - medical education KW - med ed KW - generative artificial intelligence KW - artificial intelligence KW - GAI KW - AI KW - Anki KW - flashcard KW - undergraduate medical education KW - UME UR - https://mededu.jmir.org/2024/1/e56117 UR - http://dx.doi.org/10.2196/56117 ID - info:doi/10.2196/56117 ER - TY - JOUR AU - Pendergrast, Tricia AU - Chalmers, Zachary PY - 2024/6/7 TI - Authors' Reply: A Use Case for Generative AI in Medical Education JO - JMIR Med Educ SP - e58370 VL - 10 KW - ChatGPT KW - undergraduate medical education KW - large language models UR - https://mededu.jmir.org/2024/1/e58370 UR - http://dx.doi.org/10.2196/58370 ID - info:doi/10.2196/58370 ER - TY - JOUR AU - Lambert, Raphaella AU - Choo, Zi-Yi AU - Gradwohl, Kelsey AU - Schroedl, Liesl AU - Ruiz De Luzuriaga, Arlene PY - 2024/5/16 TI - Assessing the Application of Large Language Models in Generating Dermatologic Patient Education Materials According to Reading Level: Qualitative Study JO - JMIR Dermatol SP - e55898 VL - 7 KW - artificial intelligence KW - large language models KW - large language model KW - LLM KW - LLMs KW - machine learning KW - natural language processing KW - deep learning KW - ChatGPT KW - health literacy KW - health knowledge KW - health information KW - patient education KW - dermatology KW - dermatologist KW - dermatologists KW - derm KW - dermatology resident KW - dermatology residents KW - dermatologic patient education material KW - dermatologic patient education materials KW - patient education material KW - patient education materials KW - education material KW - education materials N2 - Background: Dermatologic patient education materials (PEMs) are often written above the national average seventh- to eighth-grade reading level. ChatGPT-3.5, GPT-4, DermGPT, and DocsGPT are large language models (LLMs) that are responsive to user prompts. Our project assesses their use in generating dermatologic PEMs at specified reading levels. Objective: This study aims to assess the ability of select LLMs to generate PEMs for common and rare dermatologic conditions at unspecified and specified reading levels. Further, the study aims to assess the preservation of meaning across such LLM-generated PEMs, as assessed by dermatology resident trainees. Methods: The Flesch-Kincaid reading level (FKRL) of current American Academy of Dermatology PEMs was evaluated for 4 common (atopic dermatitis, acne vulgaris, psoriasis, and herpes zoster) and 4 rare (epidermolysis bullosa, bullous pemphigoid, lamellar ichthyosis, and lichen planus) dermatologic conditions. We prompted ChatGPT-3.5, GPT-4, DermGPT, and DocsGPT to "Create a patient education handout about [condition] at a [FKRL]" to iteratively generate 10 PEMs per condition at unspecified fifth- and seventh-grade FKRLs, evaluated with Microsoft Word readability statistics. The preservation of meaning across LLMs was assessed by 2 dermatology resident trainees. Results: The current American Academy of Dermatology PEMs had an average (SD) FKRL of 9.35 (1.26) and 9.50 (2.3) for common and rare diseases, respectively. For common diseases, the FKRLs of LLM-produced PEMs ranged between 9.8 and 11.21 (unspecified prompt), between 4.22 and 7.43 (fifth-grade prompt), and between 5.98 and 7.28 (seventh-grade prompt).
For rare diseases, the FKRLs of LLM-produced PEMs ranged between 9.85 and 11.45 (unspecified prompt), between 4.22 and 7.43 (fifth-grade prompt), and between 5.98 and 7.28 (seventh-grade prompt). At the fifth-grade reading level, GPT-4 was better at producing PEMs for both common and rare conditions than ChatGPT-3.5 (P=.001 and P=.01, respectively), DermGPT (P<.001 and P=.03, respectively), and DocsGPT (P<.001 and P=.02, respectively). At the seventh-grade reading level, no significant difference was found between ChatGPT-3.5, GPT-4, DocsGPT, or DermGPT in producing PEMs for common conditions (all P>.05); however, for rare conditions, ChatGPT-3.5 and DocsGPT outperformed GPT-4 (P=.003 and P<.001, respectively). The preservation of meaning analysis revealed that for common conditions, DermGPT ranked the highest for overall ease of reading, patient understandability, and accuracy (14.75/15, 98%); for rare conditions, handouts generated by GPT-4 ranked the highest (14.5/15, 97%). Conclusions: GPT-4 appeared to outperform ChatGPT-3.5, DocsGPT, and DermGPT at the fifth-grade FKRL for both common and rare conditions, although both ChatGPT-3.5 and DocsGPT performed better than GPT-4 at the seventh-grade FKRL for rare conditions. LLM-produced PEMs may reliably meet seventh-grade FKRLs for select common and rare dermatologic conditions and are easy to read, understandable for patients, and mostly accurate. LLMs may play a role in enhancing health literacy and disseminating accessible, understandable PEMs in dermatology. UR - https://derma.jmir.org/2024/1/e55898 UR - http://dx.doi.org/10.2196/55898 UR - http://www.ncbi.nlm.nih.gov/pubmed/38754096 ID - info:doi/10.2196/55898 ER - TY - JOUR AU - Friche, Pauline AU - Moulis, Lionel AU - Du Thanh, Aurélie AU - Dereure, Olivier AU - Duflos, Claire AU - Carbonnel, Francois PY - 2024/5/13 TI - Training Family Medicine Residents in Dermoscopy Using an e-Learning Course: Pilot Interventional Study JO - JMIR Form Res SP - e56005 VL - 8 KW - dermoscopy KW - dermatoscope KW - dermatoscopes KW - dermatological KW - skin KW - training KW - GP KW - family practitioner KW - family practitioners KW - family physician KW - family physicians KW - general practice KW - family medicine KW - primary health care KW - internship and residency KW - education KW - e-learning KW - eLearning KW - dermatology KW - resident KW - residency KW - intern KW - interns KW - internship KW - internships N2 - Background: Skin cancers are the most common group of cancers diagnosed worldwide. Aging and sun exposure increase their risk. The decline in the number of dermatologists is pushing the issue of dermatological screening back onto family doctors. Dermoscopy is an easy-to-use tool that increases the sensitivity of melanoma diagnosis by 60% to 90%, but its use is limited due to lack of training. The characteristics of "ideal" dermoscopy training have yet to be established. We created a Moodle (Moodle HQ)-based e-learning course to train family medicine residents in dermoscopy. Objective: This study aimed to evaluate the evolution of dermoscopy knowledge among family doctors immediately and 1 and 3 months after e-learning training. Methods: We conducted a prospective interventional study between April and November 2020 to evaluate an educational program intended for family medicine residents at the University of Montpellier-Nîmes, France. They were asked to complete an e-learning course consisting of 2 modules, with an assessment quiz repeated at 1 (M1) and 3 months (M3).
The course was based on a 2-step algorithm, a method of dermoscopic analysis of pigmented skin lesions that is internationally accepted. The objectives of modules 1 and 2 were to differentiate melanocytic lesions from nonmelanocytic lesions and to precisely identify skin lesions by looking for dermoscopic morphological criteria specific to each lesion. Each module consisted of 15 questions with immediate feedback after each question. Results: In total, 134 residents were included, and 66.4% (n=89) and 47% (n=63) of trainees fully participated in the evaluation of module 1 and module 2, respectively. This study showed a significant score improvement 3 months after the training course in 92.1% (n=82) of participants for module 1 and 87.3% (n=55) of participants for module 2 (P<.001). The majority of the participants expressed satisfaction (n=48, 90.6%) with the training course, and 96.3% (n=51) planned to use a dermatoscope in their future practice. Regarding final scores, the only variable that was statistically significant was the resident's initial scores (P=.003) for module 1. No measured variable was found to be associated with retention (midtraining or final evaluation) for module 2. Residents who had completed at least 1 dermatology rotation during medical school had significantly higher initial scores in module 1 at M0 (P=.03). Residents who reported having completed at least 1 dermatology rotation during their family medicine training had a statistically significant higher score at M1 for module 1 and M3 for module 2 (P=.01 and P=.001). Conclusions: The integration of an e-learning training course in dermoscopy into the curriculum of FM residents results in a significant improvement in their diagnosis skills and meets their expectations. Developing a program combining an e-learning course and face-to-face training for residents is likely to result in more frequent and effective dermoscopy use by family doctors. UR - https://formative.jmir.org/2024/1/e56005 UR - http://dx.doi.org/10.2196/56005 UR - http://www.ncbi.nlm.nih.gov/pubmed/38739910 ID - info:doi/10.2196/56005 ER - TY - JOUR AU - Rojas, Marcos AU - Rojas, Marcelo AU - Burgess, Valentina AU - Toro-Pérez, Javier AU - Salehi, Shima PY - 2024/4/29 TI - Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study JO - JMIR Med Educ SP - e55048 VL - 10 KW - artificial intelligence KW - AI KW - generative artificial intelligence KW - medical education KW - ChatGPT KW - EUNACOM KW - medical licensure KW - medical license KW - medical licensing exam N2 - Background: The deployment of OpenAI's ChatGPT-3.5 and its subsequent versions, ChatGPT-4 and ChatGPT-4 With Vision (4V; also known as "GPT-4 Turbo With Vision"), has notably influenced the medical field. Having demonstrated remarkable performance in medical examinations globally, these models show potential for educational applications. However, their effectiveness in non-English contexts, particularly in Chile's medical licensing examinations, a critical step for medical practitioners in Chile, is less explored. This gap highlights the need to evaluate ChatGPT's adaptability to diverse linguistic and cultural contexts. Objective: This study aims to evaluate the performance of ChatGPT versions 3.5, 4, and 4V in the EUNACOM (Examen Único Nacional de Conocimientos de Medicina), a major medical examination in Chile.
Methods: Three official practice drills (540 questions) from the University of Chile, mirroring the EUNACOM's structure and difficulty, were used to test ChatGPT versions 3.5, 4, and 4V. The 3 ChatGPT versions were provided 3 attempts for each drill. Responses to questions during each attempt were systematically categorized and analyzed to assess their accuracy rate. Results: All versions of ChatGPT passed the EUNACOM drills. Specifically, versions 4 and 4V outperformed version 3.5, achieving average accuracy rates of 79.32% and 78.83%, respectively, compared to 57.53% for version 3.5 (P<.001). Version 4V, however, did not outperform version 4 (P=.73), despite the additional visual capabilities. We also evaluated ChatGPT's performance in different medical areas of the EUNACOM and found that versions 4 and 4V consistently outperformed version 3.5. Across the different medical areas, version 3.5 displayed the highest accuracy in psychiatry (69.84%), while versions 4 and 4V achieved the highest accuracy in surgery (90.00% and 86.11%, respectively). Versions 3.5 and 4 had the lowest performance in internal medicine (52.74% and 75.62%, respectively), while version 4V had the lowest performance in public health (74.07%). Conclusions: This study reveals ChatGPT's ability to pass the EUNACOM, with distinct proficiencies across versions 3.5, 4, and 4V. Notably, advancements in artificial intelligence (AI) have not significantly led to enhancements in performance on image-based questions. The variations in proficiency across medical fields suggest the need for more nuanced AI training. Additionally, the study underscores the importance of exploring innovative approaches to using AI to augment human cognition and enhance the learning process. Such advancements have the potential to significantly influence medical education, fostering not only knowledge acquisition but also the development of critical thinking and problem-solving skills among health care professionals. UR - https://mededu.jmir.org/2024/1/e55048 UR - http://dx.doi.org/10.2196/55048 ID - info:doi/10.2196/55048 ER - TY - JOUR AU - Noda, Masao AU - Ueno, Takayoshi AU - Koshu, Ryota AU - Takaso, Yuji AU - Shimada, Dias Mari AU - Saito, Chizu AU - Sugimoto, Hisashi AU - Fushiki, Hiroaki AU - Ito, Makoto AU - Nomura, Akihiro AU - Yoshizaki, Tomokazu PY - 2024/3/28 TI - Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study JO - JMIR Med Educ SP - e57054 VL - 10 KW - artificial intelligence KW - GPT-4v KW - large language model KW - otolaryngology KW - GPT KW - ChatGPT KW - LLM KW - LLMs KW - language model KW - language models KW - head KW - respiratory KW - ENT: ear KW - nose KW - throat KW - neck KW - NLP KW - natural language processing KW - image KW - images KW - exam KW - exams KW - examination KW - examinations KW - answer KW - answers KW - answering KW - response KW - responses N2 - Background: Artificial intelligence models can learn from medical literature and clinical cases and generate answers that rival human experts. However, challenges remain in the analysis of complex data containing images and diagrams. Objective: This study aims to assess the answering capabilities and accuracy of ChatGPT-4 Vision (GPT-4V) for a set of 100 questions, including image-based questions, from the 2023 otolaryngology board certification examination.
Methods: Answers to 100 questions from the 2023 otolaryngology board certification examination, including image-based questions, were generated using GPT-4V. The accuracy rate was evaluated using different prompts, and the presence of images, clinical area of the questions, and variations in the answer content were examined. Results: The accuracy rate for text-only input was, on average, 24.7% but improved to 47.3% with the addition of English translation and prompts (P<.001). The average nonresponse rate for text-only input was 46.3%; this decreased to 2.7% with the addition of English translation and prompts (P<.001). The accuracy rate was lower for image-based questions than for text-only questions across all types of input, with a relatively high nonresponse rate. General questions and questions from the fields of head and neck allergies and nasal allergies had relatively high accuracy rates, which increased with the addition of translation and prompts. In terms of content, questions related to anatomy had the highest accuracy rate. For all content types, the addition of translation and prompts increased the accuracy rate. As for the performance on image-based questions, the average correct answer rate with text-only input was 30.4%, and that with text-plus-image input was 41.3% (P=.02). Conclusions: Examination of artificial intelligence's answering capabilities for the otolaryngology board certification examination improves our understanding of its potential and limitations in this field. Although improvement was noted with the addition of translation and prompts, the accuracy rate for image-based questions was lower than that for text-based questions, suggesting room for improvement in GPT-4V at this stage. Furthermore, text-plus-image input yielded a higher correct answer rate on image-based questions. Our findings imply the usefulness and potential of GPT-4V in medicine; however, future consideration of safe use methods is needed. UR - https://mededu.jmir.org/2024/1/e57054 UR - http://dx.doi.org/10.2196/57054 UR - http://www.ncbi.nlm.nih.gov/pubmed/38546736 ID - info:doi/10.2196/57054 ER - TY - JOUR AU - Shikino, Kiyoshi AU - Nishizaki, Yuji AU - Fukui, Sho AU - Yokokawa, Daiki AU - Yamamoto, Yu AU - Kobayashi, Hiroyuki AU - Shimizu, Taro AU - Tokuda, Yasuharu PY - 2024/2/29 TI - Development of a Clinical Simulation Video to Evaluate Multiple Domains of Clinical Competence: Cross-Sectional Study JO - JMIR Med Educ SP - e54401 VL - 10 KW - discrimination index KW - General Medicine In-Training Examination KW - clinical simulation video KW - postgraduate medical education KW - video KW - videos KW - training KW - examination KW - examinations KW - medical education KW - resident KW - residents KW - postgraduate KW - postgraduates KW - simulation KW - simulations KW - diagnosis KW - diagnoses KW - diagnose KW - general medicine KW - general practice KW - general practitioner KW - skill KW - skills N2 - Background: Medical students in Japan undergo a 2-year postgraduate residency program to acquire clinical knowledge and general medical skills. The General Medicine In-Training Examination (GM-ITE) assesses postgraduate residents' clinical knowledge. A clinical simulation video (CSV) may assess learners' interpersonal abilities. Objective: This study aimed to evaluate the relationship between GM-ITE scores and resident physicians' diagnostic skills by having them watch a CSV and to explore resident physicians'
perceptions of the CSV's realism, educational value, and impact on their motivation to learn. Methods: The participants included 56 postgraduate medical residents who took the GM-ITE between January 21 and January 28, 2021; watched the CSV; and then provided a diagnosis. The CSV and GM-ITE scores were compared, and the validity of the simulations was examined using discrimination indices, wherein ≥0.20 indicated high discriminatory power and >0.40 indicated a very good measure of the subject's qualifications. Additionally, we administered an anonymous questionnaire to ascertain participants' views on the realism and educational value of the CSV and its impact on their motivation to learn. Results: Of the 56 participants, 6 (11%) provided the correct diagnosis, and all were from the second postgraduate year. All domains indicated high discriminatory power. The anonymous follow-up survey revealed that 12 (52%) participants found the CSV format more suitable than the conventional GM-ITE for assessing clinical competence, 18 (78%) affirmed the realism of the video simulation, and 17 (74%) indicated that the experience increased their motivation to learn. Conclusions: The findings indicated that CSV modules simulating real-world clinical examinations were successful in assessing examinees' clinical competence across multiple domains. The study demonstrated that the CSV not only augmented the assessment of diagnostic skills but also positively impacted learners' motivation, suggesting a multifaceted role for simulation in medical education. UR - https://mededu.jmir.org/2024/1/e54401 UR - http://dx.doi.org/10.2196/54401 UR - http://www.ncbi.nlm.nih.gov/pubmed/38421691 ID - info:doi/10.2196/54401 ER - TY - JOUR AU - Meyer, Annika AU - Riese, Janik AU - Streichert, Thomas PY - 2024/2/8 TI - Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study JO - JMIR Med Educ SP - e50965 VL - 10 KW - ChatGPT KW - artificial intelligence KW - large language model KW - medical exams KW - medical examinations KW - medical education KW - LLM KW - public trust KW - trust KW - medical accuracy KW - licensing exam KW - licensing examination KW - improvement KW - patient care KW - general population KW - licensure examination N2 - Background: The potential of artificial intelligence (AI)-based large language models, such as ChatGPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility, but also by the prospect of democratizing medical knowledge and promoting equitable health care. However, the performance of ChatGPT is substantially influenced by the input language, and given the growing public trust in this AI tool compared to that in traditional sources of information, investigating its medical accuracy across different languages is of particular importance. Objective: This study aimed to compare the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination. Methods: To assess GPT-3.5's and GPT-4's medical proficiency, we used 937 original multiple-choice questions from 3 written German medical licensing examinations in October 2021, April 2022, and October 2022.
Results: GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same examinations in October 2021, April 2022, and October 2022, respectively. This represents a substantial improvement of 27% compared to GPT-3.5, which only passed 1 out of the 3 examinations. While GPT-3.5 performed well in psychiatry questions, GPT-4 exhibited strengths in internal medicine and surgery but showed weakness in academic research. Conclusions: The study results highlight ChatGPT's remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While GPT-4's predecessor (GPT-3.5) was imprecise and inconsistent, it demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its results. As the replacement of search engines by AI tools seems possible in the future, further studies with nonprofessional questions are needed to assess the safety and accuracy of ChatGPT for the general population. UR - https://mededu.jmir.org/2024/1/e50965 UR - http://dx.doi.org/10.2196/50965 UR - http://www.ncbi.nlm.nih.gov/pubmed/38329802 ID - info:doi/10.2196/50965 ER - TY - JOUR AU - Person, Cheryl AU - O'Connor, Nicola AU - Koehler, Lucy AU - Venkatachalam, Kartik AU - Gaveras, Georgia PY - 2023/12/8 TI - Evaluating Clinical Outcomes in Patients Being Treated Exclusively via Telepsychiatry: Retrospective Data Analysis JO - JMIR Form Res SP - e53293 VL - 7 KW - telepsychiatry KW - PHQ-8 KW - GAD-7 KW - clinical outcomes KW - rural KW - commercial insurance KW - telehealth KW - depression KW - anxiety KW - telemental health KW - psychiatry KW - Generalized Anxiety Disorder-7 KW - Patient Health Questionnaire-8 N2 - Background: Depression and anxiety are highly prevalent conditions in the United States. Despite the availability of suitable therapeutic options, limited access to high-quality psychiatrists represents a major barrier to treatment. Although telepsychiatry has the potential to improve access to psychiatrists, treatment efficacy in the telepsychiatry model remains unclear. Objective: Our primary objective was to determine whether there was a clinically meaningful change in 1 of 2 validated outcome measures of depression and anxiety, the Patient Health Questionnaire-8 (PHQ-8) or the Generalized Anxiety Disorder-7 (GAD-7), after receiving at least 8 weeks of treatment in an outpatient telepsychiatry setting. Methods: We included treatment-seeking patients enrolled in a large outpatient telepsychiatry service that accepts commercial insurance. All analyzed patients completed the GAD-7 and PHQ-8 prior to their first appointment and at least once after 8 weeks of treatment. Treatments included comprehensive diagnostic evaluation, supportive psychotherapy, and medication management. Results: In total, 1826 treatment-seeking patients were evaluated for clinically meaningful changes in GAD-7 and PHQ-8 scores during treatment. Mean treatment duration was 103 (SD 34) days. At baseline, 58.8% (1074/1826) and 60.1% (1097/1826) of patients exhibited at least moderate anxiety and depression, respectively. In response to treatment, mean change for GAD-7 was -6.71 (95% CI -7.03 to -6.40) and for PHQ-8 was -6.85 (95% CI -7.18 to -6.52). Patients with at least moderate symptoms at baseline showed a 45.7% reduction in GAD-7 scores and a 43.1% reduction in PHQ-8 scores.
Effect sizes for GAD-7 and PHQ-8, as measured by Cohen d for paired samples, were d=1.30 (P<.001) and d=1.23 (P<.001), respectively. Changes in GAD-7 and PHQ-8 scores correlated with the type of insurance held by the patients. Greatest reductions in scores were observed among patients with commercial insurance (45% and 43.9% reductions in GAD-7 and PHQ-8 scores, respectively). Although patients with Medicare did exhibit statistically significant reductions in GAD-7 and PHQ-8 scores from baseline (P<.001), these improvements were attenuated compared to those in patients with commercial insurance (29.2% and 27.6% reduction in GAD-7 and PHQ-8 scores, respectively). Pairwise comparison tests revealed significant differences in treatment responses in patients with Medicare versus commercial insurance (P<.001). Responses were independent of patient geographic classification (urban vs rural; P=.48 for GAD-7 and P=.07 for PHQ-8). The finding that treatment efficacy was comparable among rural and urban patients indicated that telepsychiatry is a promising approach to overcome treatment disparities that stem from geographical constraints. Conclusions: In this large retrospective data analysis of treatment-seeking patients using a telepsychiatry platform, we found robust and clinically significant improvement in depression and anxiety symptoms during treatment. The results provide further evidence that telepsychiatry is highly effective and has the potential to improve access to psychiatric care. UR - https://formative.jmir.org/2023/1/e53293 UR - http://dx.doi.org/10.2196/53293 UR - http://www.ncbi.nlm.nih.gov/pubmed/37991899 ID - info:doi/10.2196/53293 ER - TY - JOUR AU - Stevens, Kathleen AU - Moralejo, Donna AU - Crossman, Renee PY - 2023/10/18 TI - Evaluation of Incremental Validity of Casper in Predicting Program and National Licensure Performance of Undergraduate Nursing Students: Protocol for a Mixed Methods Study JO - JMIR Res Protoc SP - e48672 VL - 12 KW - communication KW - empathy KW - incremental validity KW - mixed methods KW - nursing school admissions KW - problem-solving KW - professionalism KW - situational judgement testing KW - undergraduate nursing students N2 - Background: Academic success has been the primary criterion for admission to many nursing programs. However, academic success as an admission criterion may have limited predictive value for success in noncognitive skills. Adding situational judgment tests, such as Casper, to admissions procedures may be one strategy to strengthen decisions and address the limited predictive value of academic admission criteria. In 2021, admissions processes were modified to include Casper based on concerns identified with noncognitive skills. Objective: This study aims to (1) assess the incremental validity of Casper scores in predicting nursing student performance at years 1, 2, 3, and 4 and on the National Council Licensing Examination (NCLEX); and (2) examine faculty members' perceptions of student performance and influences related to communication, professionalism, empathy, and problem-solving. Methods: We will use a multistage evaluation mixed methods design with 5 phases. At the end of each year, students will complete questionnaires related to empathy and professionalism and have their performance assessed for communication and problem-solving in psychomotor laboratory sessions. The final phase will assess graduate performance on the NCLEX.
Each phase also includes qualitative data collection (ie, focus groups with faculty members). The goal of the focus groups is to help explain the quantitative findings (explanatory phase) as well as inform data collection (eg, focus group questions) in the subsequent phase (exploratory sequence). All students enrolled in the first year of the nursing program in 2021 were asked to participate (n=290). Faculty will be asked to participate in the focus groups at the end of each year of the program. Hierarchical multiple regression will be conducted for each outcome of interest (eg, communication, professionalism, empathy, and problem-solving) to determine the extent to which scores on Casper with admission grades, compared to admission grades alone, predict nursing student performance at years 1-4 of the program and success on the national exam. Thematic analysis of focus group transcripts will be conducted using interpretive description. The quantitative and qualitative data will be integrated after each phase is complete and at the end of the study. Results: This study was funded in September 2021, and data collection began in March 2022. Year 1 data collection and analysis are complete. Year 2 data collection is complete, and data analysis is in progress. Conclusions: At the end of the study, we will provide the results of a comprehensive analysis to determine the extent to which the addition of scores on Casper compared to admission grades alone predicts nursing student performance at years 1-4 of the program and on the NCLEX exam. International Registered Report Identifier (IRRID): RR1-10.2196/48672 UR - https://www.researchprotocols.org/2023/1/e48672 UR - http://dx.doi.org/10.2196/48672 UR - http://www.ncbi.nlm.nih.gov/pubmed/37851504 ID - info:doi/10.2196/48672 ER - TY - JOUR AU - Yanagita, Yasutaka AU - Yokokawa, Daiki AU - Uchida, Shun AU - Tawara, Junsuke AU - Ikusaka, Masatomi PY - 2023/10/13 TI - Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study JO - JMIR Form Res SP - e48023 VL - 7 KW - artificial intelligence KW - ChatGPT KW - GPT-4 KW - AI KW - National Medical Licensing Examination KW - Japanese KW - NMLE N2 - Background: ChatGPT (OpenAI) has gained considerable attention because of its natural and intuitive responses. ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers, as stated by OpenAI as a limitation. However, considering that ChatGPT is an interactive AI that has been trained to reduce the output of unethical sentences, the reliability of the training data is high and the usefulness of the output content is promising. Fortunately, in March 2023, a new version of ChatGPT, GPT-4, was released, which, according to internal evaluations, was expected to increase the likelihood of producing factual responses by 40% compared with its predecessor, GPT-3.5. The usefulness of this version of ChatGPT in English is widely appreciated. It is also increasingly being evaluated as a system for obtaining medical information in languages other than English. Although it does not reach a passing score on the national medical examination in Chinese, its accuracy is expected to gradually improve. Evaluation of ChatGPT with Japanese input is limited, although there have been reports on the accuracy of ChatGPT's answers to clinical questions regarding the Japanese Society of Hypertension guidelines and on the performance of the National Nursing Examination.
Objective: The objective of this study is to evaluate whether ChatGPT can provide accurate diagnoses and medical knowledge for Japanese input. Methods: Questions from the National Medical Licensing Examination (NMLE) in Japan, administered by the Japanese Ministry of Health, Labour and Welfare in 2022, were used. All 400 questions were included. Exclusion criteria were figures and tables that ChatGPT could not recognize; only text questions were extracted. We input the Japanese questions as they were into GPT-3.5 and GPT-4 and instructed the models to output the correct answers for each question. The output of ChatGPT was verified by 2 general practice physicians. In case of discrepancies, they were checked by another physician to make a final decision. The overall performance was evaluated by calculating the percentage of correct answers output by GPT-3.5 and GPT-4. Results: Of the 400 questions, 292 were analyzed. Questions containing charts, which are not supported by ChatGPT, were excluded. The correct response rate for GPT-4 was 81.5% (237/292), which was significantly higher than the rate for GPT-3.5, 42.8% (125/292). Moreover, GPT-4 surpassed the passing standard (>72%) for the NMLE, indicating its potential as a diagnostic and therapeutic decision aid for physicians. Conclusions: GPT-4 reached the passing standard for the NMLE in Japan, entered in Japanese, although it is limited to written questions. As the accelerated progress in the past few months has shown, the performance of the AI will improve as the large language model continues to learn more, and it may well become a decision support system for medical professionals by providing more accurate information. UR - https://formative.jmir.org/2023/1/e48023 UR - http://dx.doi.org/10.2196/48023 UR - http://www.ncbi.nlm.nih.gov/pubmed/37831496 ID - info:doi/10.2196/48023 ER - TY - JOUR AU - Huang, ST Ryan AU - Lu, Qi Kevin Jia AU - Meaney, Christopher AU - Kemppainen, Joel AU - Punnett, Angela AU - Leung, Fok-Han PY - 2023/9/19 TI - Assessment of Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: Comparative Study JO - JMIR Med Educ SP - e50514 VL - 9 KW - medical education KW - medical knowledge exam KW - artificial intelligence KW - AI KW - natural language processing KW - NLP KW - large language model KW - LLM KW - machine learning, ChatGPT KW - GPT-3.5 KW - GPT-4 KW - education KW - language model KW - education examination KW - testing KW - utility KW - family medicine KW - medical residents KW - test KW - community N2 - Background: Large language model (LLM)-based chatbots are evolving at an unprecedented pace with the release of ChatGPT, specifically GPT-3.5, and its successor, GPT-4. Their capabilities in general-purpose tasks and language generation have advanced to the point of performing excellently on various educational examination benchmarks, including medical knowledge tests. Comparing the performance of these 2 LLMs to that of Family Medicine residents on a multiple-choice medical knowledge test can provide insights into their potential as medical education tools. Objective: This study aimed to quantitatively and qualitatively compare the performance of GPT-3.5, GPT-4, and Family Medicine residents in a multiple-choice medical knowledge test appropriate for the level of a Family Medicine resident.
Methods: An official University of Toronto Department of Family and Community Medicine Progress Test consisting of multiple-choice questions was inputted into GPT-3.5 and GPT-4. The artificial intelligence chatbot's responses were manually reviewed to determine the selected answer, response length, response time, provision of a rationale for the outputted response, and the root cause of all incorrect responses (classified into arithmetic, logical, and information errors). The performance of the artificial intelligence chatbots was compared against a cohort of Family Medicine residents who concurrently attempted the test. Results: GPT-4 performed significantly better compared to GPT-3.5 (difference 25.0%, 95% CI 16.3%-32.8%; McNemar test: P<.001); it correctly answered 89/108 (82.4%) questions, while GPT-3.5 answered 62/108 (57.4%) questions correctly. Further, GPT-4 scored higher across all 11 categories of Family Medicine knowledge. In 86.1% (n=93) of the responses, GPT-4 provided a rationale for why other multiple-choice options were not chosen compared to the 16.7% (n=18) achieved by GPT-3.5. Qualitatively, for both GPT-3.5 and GPT-4 responses, logical errors were the most common, while arithmetic errors were the least common. The average performance of Family Medicine residents was 56.9% (95% CI 56.2%-57.6%). The performance of GPT-3.5 was similar to that of the average Family Medicine resident (P=.16), while the performance of GPT-4 exceeded that of the top-performing Family Medicine resident (P<.001). Conclusions: GPT-4 significantly outperforms both GPT-3.5 and Family Medicine residents on a multiple-choice medical knowledge test designed for Family Medicine residents. GPT-4 provides a logical rationale for its response choice, ruling out other answer choices efficiently and with concise justification. Its high degree of accuracy and advanced reasoning capabilities facilitate its potential applications in medical education, including the creation of exam questions and scenarios as well as serving as a resource for medical knowledge or information on community services. UR - https://mededu.jmir.org/2023/1/e50514 UR - http://dx.doi.org/10.2196/50514 UR - http://www.ncbi.nlm.nih.gov/pubmed/37725411 ID - info:doi/10.2196/50514 ER - TY - JOUR AU - Almansour, Amal AU - Montague, Enid AU - Furst, Jacob AU - Raicu, Daniela PY - 2023/9/8 TI - Evaluation of Eye Gaze Dynamics During Physician-Patient-Computer Interaction in Federally Qualified Health Centers: Systematic Analysis JO - JMIR Hum Factors SP - e46120 VL - 10 KW - patient-physician-computer interaction KW - nonverbal communication KW - Federally Qualified Health Centers KW - primary care encounter N2 - Background: Understanding the communication between physicians and patients can identify areas where they can improve and build stronger relationships. This leads to better patient outcomes, including increased engagement, enhanced adherence to treatment plans, and a boost in trust. Objective: This study investigates eye gaze directions of physicians, patients, and computers in naturalistic medical encounters at Federally Qualified Health Centers to understand communication patterns given different patients' diverse backgrounds. The aim is to support the building and designing of health information technologies, which will facilitate the improvement of patient outcomes.
Methods: Data were obtained from 77 videotaped medical encounters in 2014 from 3 Federally Qualified Health Centers in Chicago, Illinois, that included 11 physicians and 77 patients. Self-reported surveys were collected from physicians and patients. A systematic analysis approach was used to thoroughly examine and analyze the data. The dynamics of eye gazes during interactions between physicians, patients, and computers were evaluated using the lag sequential analysis method. The objective of the study was to identify significant behavior patterns from the 6 predefined patterns initiated by both physicians and patients. The association between eye gaze patterns was examined using the Pearson chi-square test and the Yule Q test. Results: The results of the lag sequential method showed that 3 out of 6 doctor-initiated gaze patterns were followed by patient-response gaze patterns. Moreover, 4 out of 6 patient-initiated patterns were significantly followed by doctor-response gaze patterns. Unlike the findings in previous studies, doctor-initiated eye gaze behavior patterns were not leading patients' eye gaze. Moreover, patient-initiated eye gaze behavior patterns were significant in certain circumstances, particularly when interacting with physicians. Conclusions: This study examined several physician-patient-computer interaction patterns in naturalistic settings using lag sequential analysis. The data indicated a significant influence of the patients' gazes on physicians. The findings revealed that physicians demonstrated a higher tendency to engage with patients by reciprocating the patient's eye gaze when the patient looked at them. However, the reverse pattern was not observed, suggesting a lack of reciprocal gaze from patients toward physicians and a tendency to not direct their gaze toward a specific object. Furthermore, patients exhibited a preference for the computer when physicians directed their eye gaze toward it. UR - https://humanfactors.jmir.org/2023/1/e46120 UR - http://dx.doi.org/10.2196/46120 UR - http://www.ncbi.nlm.nih.gov/pubmed/37682590 ID - info:doi/10.2196/46120 ER - TY - JOUR AU - Gilson, Aidan AU - Safranek, W. Conrad AU - Huang, Thomas AU - Socrates, Vimig AU - Chi, Ling AU - Taylor, Andrew Richard AU - Chartash, David PY - 2023/7/13 TI - Authors' Reply to: Variability in Large Language Models' Responses to Medical Licensing and Certification Examinations JO - JMIR Med Educ SP - e50336 VL - 9 KW - natural language processing KW - NLP KW - MedQA KW - generative pre-trained transformer KW - GPT KW - medical education KW - chatbot KW - artificial intelligence KW - AI KW - education technology KW - ChatGPT KW - conversational agent KW - machine learning KW - large language models KW - knowledge assessment UR - https://mededu.jmir.org/2023/1/e50336 UR - http://dx.doi.org/10.2196/50336 UR - http://www.ncbi.nlm.nih.gov/pubmed/37440299 ID - info:doi/10.2196/50336 ER - TY - JOUR AU - Epstein, H. Richard AU - Dexter, Franklin PY - 2023/7/13 TI - Variability in Large Language Models' Responses to Medical Licensing and Certification Examinations. Comment on "How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment"
JO - JMIR Med Educ SP - e48305 VL - 9 KW - natural language processing KW - NLP KW - MedQA KW - generative pre-trained transformer KW - GPT KW - medical education KW - chatbot KW - artificial intelligence KW - AI KW - education technology KW - ChatGPT KW - Google Bard KW - conversational agent KW - machine learning KW - large language models KW - knowledge assessment UR - https://mededu.jmir.org/2023/1/e48305 UR - http://dx.doi.org/10.2196/48305 UR - http://www.ncbi.nlm.nih.gov/pubmed/37440293 ID - info:doi/10.2196/48305 ER - TY - JOUR AU - Babiker, Samar AU - Ogunmwonyi, Innocent AU - Georgi, W. Maria AU - Tan, Lawrence AU - Haque, Sharmi AU - Mullins, William AU - Singh, Prisca AU - Ang, Nadya AU - Fu, Howell AU - Patel, Krunal AU - Khera, Jevan AU - Fricker, Monty AU - Fleming, Simon AU - Giwa-Brown, Lolade AU - A Brennan, Peter AU - Irune, Ekpemi AU - Vig, Stella AU - Nathan, Arjun PY - 2023/6/16 TI - Variation in Experiences and Attainment in Surgery Between Ethnicities of UK Medical Students and Doctors (ATTAIN): Protocol for a Cross-Sectional Study JO - JMIR Res Protoc SP - e40545 VL - 12 KW - diversity in surgery KW - Black and Minority Ethnic KW - BME in surgery KW - differential attainment KW - diversity KW - surgery KW - health care system KW - surgical training KW - disparity KW - ethnic disparity KW - ethnicity KW - medical student KW - doctor KW - training experience KW - surgical placements KW - physician KW - health care provider KW - experience KW - perception KW - cross-sectional KW - doctor in training KW - resident KW - fellow KW - fellowship KW - questionnaire KW - survey KW - Everyday Discrimination Scale KW - Maslach Burnout Inventory KW - Higher Education KW - ethnicities N2 - Background: The unequal distribution of academic and professional outcomes between different minority groups is a pervasive issue in many fields, including surgery. The implications of differential attainment remain significant, not only for the individuals affected but also for the wider health care system. An inclusive health care system is crucial in meeting the needs of an increasingly diverse patient population, thereby leading to better outcomes. One barrier to diversifying the workforce is the differential attainment in educational outcomes between Black and Minority Ethnic (BME) and White medical students and doctors in the United Kingdom. BME trainees are known to have lower performance rates in medical examinations, including undergraduate and postgraduate exams, Annual Review of Competence Progression, as well as training and consultant job applications. Studies have shown that BME candidates have a higher likelihood of failing both parts of the Membership of the Royal Colleges of Surgeons exams and are 10% less likely to be considered suitable for core surgical training. Several contributing factors have been identified; however, there has been limited evidence investigating surgical training experiences and their relationship to differential attainment. To understand the nature of differential attainment in surgery and to develop effective strategies to address it, it is essential to examine the underlying causes and contributing factors. The Variation in Experiences and Attainment in Surgery Between Ethnicities of UK Medical Students and Doctors (ATTAIN) study aims to describe and compare the factors and outcomes of attainment between different ethnicities of doctors and medical students.
Objective: The primary aim will be to compare the effect of experiences and perceptions of surgical education of students and doctors of different ethnicities. Methods: This protocol describes a nationwide cross-sectional study of medical students and nonconsultant grade doctors in the United Kingdom. Participants will complete a web-based questionnaire collecting data on experiences and perceptions of surgical placements as well as self-reported academic attainment data. A comprehensive data collection strategy will be used to collect a representative sample of the population. A set of surrogate markers relevant to surgical training will be used to establish a primary outcome to determine variations in attainment. Regression analyses will be used to identify potential causes for the variation in attainment. Results: Data collected between February 2022 and September 2022 yielded 1603 respondents. Data analysis is yet to be completed. The protocol was approved by the University College London Research Ethics Committee on September 16, 2021 (ethics approval reference 19071/004). The findings will be disseminated through peer-reviewed publications and conference presentations. Conclusions: Drawing upon the conclusions of this study, we aim to make recommendations on educational policy reforms. Additionally, the resulting large, comprehensive data set can be used for further research. International Registered Report Identifier (IRRID): DERR1-10.2196/40545 UR - https://www.researchprotocols.org/2023/1/e40545 UR - http://dx.doi.org/10.2196/40545 UR - http://www.ncbi.nlm.nih.gov/pubmed/37327055 ID - info:doi/10.2196/40545 ER - TY - JOUR AU - Kanzow, Friederike Amelie AU - Schmidt, Dennis AU - Kanzow, Philipp PY - 2023/5/19 TI - Scoring Single-Response Multiple-Choice Items: Scoping Review and Comparison of Different Scoring Methods JO - JMIR Med Educ SP - e44084 VL - 9 KW - alternate-choice KW - best-answer KW - education KW - education system KW - educational assessment KW - educational measurement KW - examination KW - multiple choice KW - results KW - scoring KW - scoring system KW - single choice KW - single response KW - scoping review KW - test KW - testing KW - true/false KW - true-false KW - Type A N2 - Background: Single-choice items (eg, best-answer items, alternate-choice items, single true-false items) are 1 type of multiple-choice items and have been used in examinations for over 100 years. At the end of every examination, the examinees' responses have to be analyzed and scored to derive information about examinees' true knowledge. Objective: The aim of this paper is to compile scoring methods for individual single-choice items described in the literature. Furthermore, the metric expected chance score and the relation between examinees' true knowledge and expected scoring results (averaged percentage score) are analyzed. In addition, implications for potential pass marks to be used in examinations to test examinees for a predefined level of true knowledge are derived. Methods: Scoring methods for individual single-choice items were extracted from various databases (ERIC, PsycInfo, Embase via Ovid, MEDLINE via PubMed) in September 2020. Eligible sources reported on scoring methods for individual single-choice items in written examinations including but not limited to medical education.
Separately for items with n=2 answer options (eg, alternate-choice items, single true-false items) and best-answer items with n=5 answer options (eg, Type A items) and for each identified scoring method, the metric expected chance score and the expected scoring results as a function of examinees' true knowledge using fictitious examinations with 100 single-choice items were calculated. Results: A total of 21 different scoring methods were identified from the 258 included sources, with varying consideration of correctly marked, omitted, and incorrectly marked items. Resulting credit varied between -3 and +1 credit points per item. For items with n=2 answer options, expected chance scores from random guessing ranged between -1 and +0.75 credit points. For items with n=5 answer options, expected chance scores ranged between -2.2 and +0.84 credit points. All scoring methods showed a linear relation between examinees' true knowledge and the expected scoring results. Depending on the scoring method used, examination results differed considerably: Expected scoring results from examinees with 50% true knowledge ranged between 0.0% (95% CI 0% to 0%) and 87.5% (95% CI 81.0% to 94.0%) for items with n=2 and between -60.0% (95% CI -60% to -60%) and 92.0% (95% CI 86.7% to 97.3%) for items with n=5. Conclusions: In examinations with single-choice items, the scoring result is not always equivalent to examinees' true knowledge. When interpreting examination scores and setting pass marks, the number of answer options per item must usually be taken into account in addition to the scoring method used. UR - https://mededu.jmir.org/2023/1/e44084 UR - http://dx.doi.org/10.2196/44084 UR - http://www.ncbi.nlm.nih.gov/pubmed/37001510 ID - info:doi/10.2196/44084 ER - TY - JOUR AU - Kanzow, Philipp AU - Schmidt, Dennis AU - Herrmann, Manfred AU - Wassmann, Torsten AU - Wiegand, Annette AU - Raupach, Tobias PY - 2023/3/27 TI - Use of Multiple-Select Multiple-Choice Items in a Dental Undergraduate Curriculum: Retrospective Study Involving the Application of Different Scoring Methods JO - JMIR Med Educ SP - e43792 VL - 9 KW - dental education KW - education system KW - educational assessment KW - educational measurement KW - examination KW - k of n KW - Kprim KW - K' KW - MTF KW - Multiple-True-False KW - Pick-N KW - scoring KW - scoring system KW - Type X KW - undergraduate KW - undergraduate curriculum KW - undergraduate education N2 - Background: Scoring and awarding credit are more complex for multiple-select items than for single-choice items. Forty-one different scoring methods were retrospectively applied to 2 multiple-select multiple-choice item types (Pick-N and Multiple-True-False [MTF]) from existing examination data. Objective: This study aimed to calculate and compare the mean scores for both item types by applying different scoring methods, and to investigate the effect of item quality on mean raw scores and the likelihood of resulting scores at or above the pass level (≥0.6). Methods: Items and responses from examinees (ie, marking events) were retrieved from previous examinations. Different scoring methods were retrospectively applied to the existing examination data to calculate corresponding examination scores. In addition, item quality was assessed using a validated checklist. Statistical analysis was performed using the Kruskal-Wallis test, Wilcoxon rank-sum test, and multiple logistic regression analysis (P<.05).
Results: We analyzed 1931 marking events of 48 Pick-N items and 828 marking events of 18 MTF items. For both item types, scoring results widely differed between scoring methods (minimum: 0.02, maximum: 0.98; P<.001). Both the use of an inappropriate item type (34 items) and the presence of cues (30 items) impacted the scoring results. Inappropriately used Pick-N items resulted in lower mean raw scores (0.88 vs 0.93; P<.001), while inappropriately used MTF items resulted in higher mean raw scores (0.88 vs 0.85; P=.001). Mean raw scores were higher for MTF items with cues than for those without cues (0.91 vs 0.8; P<.001), while mean raw scores for Pick-N items with and without cues did not differ (0.89 vs 0.90; P=.09). Item quality also impacted the likelihood of resulting scores at or above the pass level (odds ratio ?6.977). Conclusions: Educators should pay attention when using multiple-select multiple-choice items and select the most appropriate item type. Different item types, different scoring methods, and presence of cues are likely to impact examinees' scores and overall examination results. UR - https://mededu.jmir.org/2023/1/e43792 UR - http://dx.doi.org/10.2196/43792 UR - http://www.ncbi.nlm.nih.gov/pubmed/36841970 ID - info:doi/10.2196/43792 ER - TY - JOUR AU - Gilson, Aidan AU - Safranek, W. Conrad AU - Huang, Thomas AU - Socrates, Vimig AU - Chi, Ling AU - Taylor, Andrew Richard AU - Chartash, David PY - 2023/2/8 TI - How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment JO - JMIR Med Educ SP - e45312 VL - 9 KW - natural language processing KW - NLP KW - MedQA KW - generative pre-trained transformer KW - GPT KW - medical education KW - chatbot KW - artificial intelligence KW - education technology KW - ChatGPT KW - conversational agent KW - machine learning KW - USMLE N2 - Background: Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input. Objective: This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 exams, as well as to analyze responses for user interpretability. Methods: We used 2 sets of multiple-choice questions to evaluate ChatGPT's performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and the performance on an exam relative to the user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT's performance was compared to 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question. Results: Of the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance.
The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. We found that logical justification for ChatGPT's answer selection was present in 100% of outputs of the NBME data sets. Internal information to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively. Conclusions: ChatGPT marks a significant improvement in natural language processing models on the tasks of medical question answering. By performing at a greater than 60% threshold on the NBME-Free-Step-1 data set, we show that the model achieves the equivalent of a passing score for a third-year medical student. Additionally, we highlight ChatGPT's capacity to provide logic and informational context across the majority of answers. These facts taken together make a compelling case for the potential applications of ChatGPT as an interactive medical education tool to support learning. UR - https://mededu.jmir.org/2023/1/e45312 UR - http://dx.doi.org/10.2196/45312 UR - http://www.ncbi.nlm.nih.gov/pubmed/36753318 ID - info:doi/10.2196/45312 ER - TY - JOUR AU - McLeod, Graeme AU - McKendrick, Mel AU - Tafili, Tedis AU - Obregon, Mateo AU - Neary, Ruth AU - Mustafa, Ayman AU - Raju, Pavan AU - Kean, Donna AU - McKendrick, Gary AU - McKendrick, Tuesday PY - 2022/8/11 TI - Patterns of Skills Acquisition in Anesthesiologists During Simulated Interscalene Block Training on a Soft Embalmed Thiel Cadaver: Cohort Study JO - JMIR Med Educ SP - e32840 VL - 8 IS - 3 KW - regional anesthesia KW - ultrasonography KW - simulation KW - learning curves KW - eye tracking N2 - Background: The demand for regional anesthesia for major surgery has increased considerably, but only a small number of anesthesiologists can provide such care. Simulations may improve clinical performance. However, opportunities to rehearse procedures are limited, and the clinical educational outcomes prescribed by the Royal College of Anesthesiologists training curriculum 2021 are difficult to attain. Educational paradigms, such as mastery learning and dedicated practice, are increasingly being used to teach technical skills to enhance skills acquisition. Moreover, high-fidelity, resilient cadaver simulators are now available: the soft embalmed Thiel cadaver shows physical characteristics and functional alignment similar to those of patients. Tissue elasticity allows tissues to expand and relax, fluid to drain away, and hundreds of repeated injections to be tolerated without causing damage. Learning curves and their intra- and interindividual dynamics have not hitherto been measured on the Thiel cadaver simulator using the mastery learning and dedicated practice educational paradigm coupled with validated, quantitative metrics, such as checklists, eye tracking metrics, and self-rating scores. Objective: Our primary objective was to measure the learning slopes of the scanning and needling phases of an interscalene block conducted repeatedly on a soft embalmed Thiel cadaver over a 3-hour period of training. Methods: A total of 30 anesthesiologists, with a wide range of experience, conducted up to 60 ultrasound-guided interscalene blocks over 3 hours on the left side of 2 soft embalmed Thiel cadavers.
The duration of the scanning and needling phases was defined as the time taken to perform all the steps correctly. The primary outcome was the best-fit linear slope of the log-log transformed time to complete each phase. Our secondary objectives were to measure preprocedural psychometrics, describe deviations from the learning slope, correlate scanning and needling phase data, characterize skills according to clinical grade, measure learning curves using objective eye gaze tracking and subjective self-rating measures, and use cluster analysis to categorize performance irrespective of grade. Results: The median (IQR; range) log-log learning slopes were -0.47 (-0.62 to -0.32; -0.96 to 0.30) and -0.23 (-0.34 to -0.19; -0.71 to 0.27) during the scanning and needling phases, respectively. Locally Weighted Scatterplot Smoother curves showed wide variability in within-participant performance. The learning slopes of the scanning and needling phases correlated: ρ=0.55 (0.23-0.76), P<.001, and ρ=-0.72 (-0.46 to -0.87), P<.001, respectively. Eye gaze fixation count and glance count during the scanning and needling phases best reflected block duration. Using clustering techniques, fixation count and glance count were used to identify 4 distinct patterns of learning behavior. Conclusions: We quantified learning slopes by log-log transformation of the time taken to complete the scanning and needling phases of interscalene blocks and identified intraindividual and interindividual patterns of variability. UR - https://mededu.jmir.org/2022/3/e32840 UR - http://dx.doi.org/10.2196/32840 UR - http://www.ncbi.nlm.nih.gov/pubmed/35543314 ID - info:doi/10.2196/32840 ER - TY - JOUR AU - Landis-Lewis, Zach AU - Flynn, Allen AU - Janda, Allison AU - Shah, Nirav PY - 2022/5/10 TI - A Scalable Service to Improve Health Care Quality Through Precision Audit and Feedback: Proposal for a Randomized Controlled Trial JO - JMIR Res Protoc SP - e34990 VL - 11 IS - 5 KW - learning health system KW - audit and feedback KW - anesthesiology KW - knowledge-based system KW - human-centered design N2 - Background: Health care delivery organizations lack evidence-based strategies for using quality measurement data to improve performance. Audit and feedback (A&F), the delivery of clinical performance summaries to providers, demonstrates the potential for large effects on clinical practice but is currently implemented as a blunt "one size fits most" intervention. Each provider in a care setting typically receives a performance summary of identical metrics in a common format despite the growing recognition that precisionizing interventions hold significant promise in improving their impact. A precision approach to A&F prioritizes the display of information in a single metric that, for each recipient, carries the highest value for performance improvement, such as when the metric's level drops below a peer benchmark or minimum standard for the first time, thereby revealing an actionable performance gap. Furthermore, precision A&F uses an optimal message format (including framing and visual displays) based on what is known about the recipient and the intended gist meaning being communicated to improve message interpretation while reducing the cognitive processing burden. Well-established psychological principles, frameworks, and theories form a feedback intervention knowledge base to achieve precision A&F.
From an informatics perspective, precision A&F requires a knowledge-based system that enables mass customization by representing knowledge configurable at the group and individual levels. Objective: This study aims to implement and evaluate a demonstration system for precision A&F in anesthesia care and to assess the effect of precision feedback emails on care quality and outcomes in a national quality improvement consortium. Methods: We propose to achieve our aims by conducting 3 studies: a requirements analysis and preferences elicitation study using human-centered design and conjoint analysis methods, a software service development and implementation study, and a cluster randomized controlled trial of a precision A&F service with a concurrent process evaluation. This study will be conducted with the Multicenter Perioperative Outcomes Group, a national anesthesia quality improvement consortium with >60 member hospitals in >20 US states. This study will extend the Multicenter Perioperative Outcomes Group quality improvement infrastructure by using existing data and performance measurement processes. Results: The proposal was funded in September 2021 with a 4-year timeline. Data collection for Aim 1 began in March 2022. We plan for a 24-month trial timeline, with the intervention period of the trial beginning in March 2024. Conclusions: The proposed aims will collectively demonstrate a precision feedback service developed using an open-source technical infrastructure for computable knowledge management. By implementing and evaluating a demonstration system for precision feedback, we create the potential to observe the conditions under which feedback interventions are effective. International Registered Report Identifier (IRRID): PRR1-10.2196/34990 UR - https://www.researchprotocols.org/2022/5/e34990 UR - http://dx.doi.org/10.2196/34990 UR - http://www.ncbi.nlm.nih.gov/pubmed/35536637 ID - info:doi/10.2196/34990 ER - TY - JOUR AU - Tamblyn, Robert AU - Brieva, Jorge AU - Cain, Madeleine AU - Martinez, Eduardo F. PY - 2022/3/7 TI - The Effects of Introducing a Mobile App-Based Procedural Logbook on Trainee Compliance to a Central Venous Catheter Insertion Accreditation Program: Before-and-After Study JO - JMIR Hum Factors SP - e35199 VL - 9 IS - 1 KW - logbook KW - education KW - training KW - central venous catheter KW - CVC KW - intensive care KW - smartphone KW - mobile phone KW - mobile apps KW - mHealth KW - mobile health KW - accreditation program KW - digital health KW - digital record N2 - Background: To reduce complications associated with central venous catheter (CVC) insertions, local accreditation programs using a supervised procedural logbook are essential. To increase compliance with such a logbook, a mobile app could provide the ideal platform for training doctors in an adult intensive care unit (ICU). Objective: The aim of this paper was to compare trainee compliance with the completion of a logbook as part of a CVC insertion accreditation program, before and after the introduction of an app-based logbook. Methods: This is a retrospective observational study of logbook data, before and after the introduction of a purpose-built, app-based, electronic logbook to complement an existing paper-based logbook. The study was carried out over a 2-year period in the adult ICU of the John Hunter Hospital, Newcastle, NSW, Australia; the participants were ICU trainee medical officers completing a CVC insertion accreditation program.
The primary outcome was the proportion of all CVC insertions documented in the patients' electronic medical records appearing as logbook entries. To assess logbook entry quality, we measured and compared the proportion of logbook entries that were approved by a supervisor and contained a supervisor's signature for the before and after periods. We also analyzed trainee participation before and after the intervention by comparing the total number of active logbook users, and the proportion of first-time users who logged 3 or more CVC insertions. Results: Of the 2987 CVC insertions documented in the electronic medical records between April 7, 2019, and April 6, 2021, 2161 (72%) were included and separated into cohorts before and after the app's introduction. Following the introduction of the app-based logbook, the percentage of CVC insertions appearing as logbook entries increased from 3.6% (38/1059) to 20.5% (226/1102; P<.001). There was no difference in the proportion of supervisor-approved entries containing a supervisor's signature before and after the introduction of the app, with 76.3% (29/38) and 83.2% (188/226), respectively (P=.31). After the introduction of the app, there was an increase in the percentage of active logbook users from 15.3% (13/85) to 62.8% (54/86; P<.001). Adherence to one's logbook was similar in both groups with 60% (6/10) of first-time users in the before group and 79.5% (31/39) in the after group going on to log 3 or more CVCs during their time working in ICU. Conclusions: The addition of an electronic app-based logbook to a preexisting paper-based logbook was associated with a higher rate of logbook compliance in trainee doctors undertaking an accreditation program for CVC insertion in an adult ICU. There was a large increase in logbook use observed without a reduction in the quality of logbook entries. The overall trainee participation also improved with an observed increase in active logbook users and no reduction in the average number of entries per user following the introduction of the app. Further studies on app-based logbooks for ICU procedural accreditation programs are warranted. UR - https://humanfactors.jmir.org/2022/1/e35199 UR - http://dx.doi.org/10.2196/35199 UR - http://www.ncbi.nlm.nih.gov/pubmed/35051900 ID - info:doi/10.2196/35199 ER - TY - JOUR AU - Schoenmakers, Birgitte AU - Wens, Johan PY - 2021/8/16 TI - Efficiency, Usability, and Outcomes of Proctored Next-Level Exams for Proficiency Testing in Primary Care Education: Observational Study JO - JMIR Form Res SP - e23834 VL - 5 IS - 8 KW - primary care KW - education KW - graduate KW - medical education KW - testing KW - assessment KW - app KW - COVID-19 KW - efficiency KW - accuracy N2 - Background: The COVID-19 pandemic has affected education and assessment programs and has resulted in complex planning. Therefore, we organized the proficiency test for admission to the Family Medicine program as a proctored exam. To prevent fraud, we developed a web-based supervisor app for tracking and tracing candidates' behaviors. Objective: We aimed to assess the efficiency and usability of the proctored exam procedure and to analyze the procedure's impact on exam scores. Methods: The application operated on the following three levels to register events: the recording of actions, analyses of behavior, and live supervision. Each suspicious event was given a score. To assess efficiency, we logged the technical issues and the interventions.
To test usability, we counted the number of suspicious students and behaviors. To analyze the impact that the supervisor app had on students' exam outcomes, we compared the scores of the proctored group and those of the on-campus group. Candidates were free to register for off-campus participation or on-campus participation. Results: Of the 593 candidates who subscribed to the exam, 472 (79.6%) used the supervisor app and 121 (20.4%) were on campus. The test results of both groups were comparable. We registered 15 technical issues that occurred off campus. Further, 2 candidates experienced a negative impact on their exams due to technical issues. The application detected 22 candidates with a suspicion rating of >1. Suspicion ratings mainly increased due to background noise. All events occurred without fraudulent intent. Conclusions: This pilot observational study demonstrated that a supervisor app that records and registers behavior was able to detect suspicious events without having an impact on exams. Background noise was the most critical event. There was no fraud detected. A supervisor app that registers and records behavior to prevent fraud during exams was efficient and did not affect exam outcomes. In future research, a controlled study design should be used to compare the cost-benefit balance between the complex interventions of the supervisor app and candidates' awareness of being monitored via a safe browser plug-in for exams. UR - https://formative.jmir.org/2021/8/e23834 UR - http://dx.doi.org/10.2196/23834 UR - http://www.ncbi.nlm.nih.gov/pubmed/34398786 ID - info:doi/10.2196/23834 ER - TY - JOUR AU - Godoy, Barros Ivan Rodrigues AU - Neto, Pecci Luís AU - Skaf, Abdalla AU - Leão-Filho, Muniz Hilton AU - Freddi, Lourenço Tomás De Andrade AU - Jasinowodolinski, Dany AU - Yamada, Fukunishi André PY - 2021/5/20 TI - Audiovisual Content for a Radiology Fellowship Selection Process During the COVID-19 Pandemic: Pilot Web-Based Questionnaire Study JO - JMIR Med Educ SP - e28733 VL - 7 IS - 2 KW - audiovisual reports KW - COVID-19 KW - fellowship KW - radiology KW - smartphones KW - video recording KW - web technology N2 - Background: Traditional radiology fellowships are usually 1- or 2-year clinical training programs in a specific area after completion of a 4-year residency program. Objective: This study aimed to investigate the experience of fellowship applicants in answering radiology questions in an audiovisual format using their own smartphones after answering radiology questions in a traditional printed text format as part of the application process during the COVID-19 pandemic. We hypothesized that fellowship applicants would find that recorded audiovisual radiology content adds value to the conventional selection process, may increase engagement by using their own smartphone device, and facilitate the understanding of imaging findings of radiology-based questions, while maintaining social distancing. Methods: One senior staff radiologist of each subspecialty prepared 4 audiovisual radiology questions for each subspecialty. We conducted a survey using web-based questionnaires for 123 fellowship applications for musculoskeletal (n=39), internal medicine (n=61), and neuroradiology (n=23) programs to evaluate the experience of using audiovisual radiology content as a substitute for the conventional text evaluation. Results: Most of the applicants (n=122, 99%) answered positively (with responses of "agree" or "strongly agree")
that images in digital forms are of superior quality to those printed on paper. In total, 101 (82%) applicants agreed with the statement that the presentation of cases in audiovisual format facilitates the understanding of the findings. Furthermore, 81 (65%) candidates agreed or strongly agreed that answering digital forms is more practical than conventional paper forms. Conclusions: The use of audiovisual content as part of the selection process for radiology fellowships is a new approach to evaluate the potential to enhance the applicant's experience during this process. This technology also allows for the evaluation of candidates without the need for in-person interaction. Further studies could streamline these methods to minimize work redundancy with traditional text assessments or even evaluate the acceptance of using only audiovisual content on smartphones. UR - https://mededu.jmir.org/2021/2/e28733 UR - http://dx.doi.org/10.2196/28733 UR - http://www.ncbi.nlm.nih.gov/pubmed/33956639 ID - info:doi/10.2196/28733 ER - TY - JOUR AU - Fatima, Rawish AU - Assaly, R. Ahmad AU - Aziz, Muhammad AU - Moussa, Mohamad AU - Assaly, Ragheb PY - 2021/4/30 TI - The United States Medical Licensing Exam Step 2 Clinical Skills Examination: Potential Alternatives During and After the COVID-19 Pandemic JO - JMIR Med Educ SP - e25903 VL - 7 IS - 2 KW - USMLE KW - United States Medical Licensing Examination KW - The National Resident Matching Program KW - NRMP KW - Step 2 Clinical Skills KW - Step 2 CS KW - medical school KW - medical education KW - test KW - medical student KW - United States KW - online learning KW - exam KW - alternative KW - model KW - COVID-19 UR - https://mededu.jmir.org/2021/2/e25903 UR - http://dx.doi.org/10.2196/25903 UR - http://www.ncbi.nlm.nih.gov/pubmed/33878014 ID - info:doi/10.2196/25903 ER - TY - JOUR AU - Fink, C. Maximilian AU - Reitmeier, Victoria AU - Stadler, Matthias AU - Siebeck, Matthias AU - Fischer, Frank AU - Fischer, R. Martin PY - 2021/3/4 TI - Assessment of Diagnostic Competences With Standardized Patients Versus Virtual Patients: Experimental Study in the Context of History Taking JO - J Med Internet Res SP - e21196 VL - 23 IS - 3 KW - clinical reasoning KW - medical education KW - performance-based assessment KW - simulation KW - standardized patient KW - virtual patient N2 - Background: Standardized patients (SPs) have been one of the popular assessment methods in clinical teaching for decades, although they are resource intensive. Nowadays, simulated virtual patients (VPs) are increasingly used because they are permanently available and fully scalable to a large audience. However, empirical studies comparing the differential effects of these assessment methods are lacking. Similarly, the relationships between key variables associated with diagnostic competences (ie, diagnostic accuracy and evidence generation) in these assessment methods still require further research. Objective: The aim of this study is to compare perceived authenticity, cognitive load, and diagnostic competences in performance-based assessment using SPs and VPs. This study also aims to examine the relationships of perceived authenticity, cognitive load, and quality of evidence generation with diagnostic accuracy. Methods: We conducted an experimental study with 86 medical students (mean 26.03 years, SD 4.71) focusing on history taking in dyspnea cases. Participants solved three cases with SPs and three cases with VPs in this repeated measures study.
After each case, students provided a diagnosis and rated perceived authenticity and cognitive load. The provided diagnosis was scored in terms of diagnostic accuracy; the questions asked by the medical students were rated with respect to their quality of evidence generation. In addition to regular null hypothesis testing, this study used equivalence testing to investigate the absence of meaningful effects. Results: Perceived authenticity (1-tailed t81=11.12; P<.001) was higher for SPs than for VPs. The correlation between diagnostic accuracy and perceived authenticity was very small (r=0.05) and neither equivalent (P=.09) nor statistically significant (P=.32). Cognitive load was equivalent in both assessment methods (t82=2.81; P=.003). Intrinsic cognitive load (1-tailed r=−0.30; P=.003) and extraneous load (1-tailed r=−0.29; P=.003) correlated negatively with the combined score for diagnostic accuracy. The quality of evidence generation was positively related to diagnostic accuracy for VPs (1-tailed r=0.38; P<.001); this finding did not hold for SPs (1-tailed r=0.05; P=.32). Comparing both assessment methods with each other, diagnostic accuracy was higher for SPs than for VPs (2-tailed t85=2.49; P=.01). Conclusions: The results on perceived authenticity demonstrate that learners experience SPs as more authentic than VPs. As higher amounts of intrinsic and extraneous cognitive loads are detrimental to performance, both types of cognitive load must be monitored and manipulated systematically in the assessment. Diagnostic accuracy was higher for SPs than for VPs, which could potentially negatively affect students' grades with VPs. We identify and discuss possible reasons for this performance difference between both assessment methods. UR - https://www.jmir.org/2021/3/e21196 UR - http://dx.doi.org/10.2196/21196 UR - http://www.ncbi.nlm.nih.gov/pubmed/33661122 ID - info:doi/10.2196/21196 ER - TY - JOUR AU - Fonteneau, Tristan AU - Billion, Elodie AU - Abdoul, Cindy AU - Le, Sebastien AU - Hadchouel, Alice AU - Drummond, David PY - 2020/12/16 TI - Simulation Game Versus Multiple Choice Questionnaire to Assess the Clinical Competence of Medical Students: Prospective Sequential Trial JO - J Med Internet Res SP - e23254 VL - 22 IS - 12 KW - serious game KW - simulation game KW - assessment KW - professional competence KW - asthma KW - pediatrics N2 - Background: The use of simulation games (SG) to assess the clinical competence of medical students has been poorly studied. Objective: The objective of this study was to assess whether an SG better reflects the clinical competence of medical students than a multiple choice questionnaire (MCQ). Methods: Fifth-year medical students in Paris (France) were included and individually evaluated on a case of pediatric asthma exacerbation using three successive modalities: high-fidelity simulation (HFS), considered the gold standard for the evaluation of clinical competence, the SG Effic'Asthme, and an MCQ designed for the study. The primary endpoint was the median kappa coefficient evaluating the correlation of the actions performed by the students between the SG and HFS modalities and the MCQ and HFS modalities. Student satisfaction was also evaluated. Results: Forty-two students were included. The actions performed by the students were more reproducible between the SG and HFS modalities than between the MCQ and HFS modalities (P=.04). Students reported significantly higher satisfaction with the SG (P<.01) than with the MCQ modality.
Conclusions: The SG Effic'Asthme better reflected the actions performed by medical students during an HFS session than an MCQ on the same asthma exacerbation case. Because SGs allow the assessment of more dimensions of clinical competence than MCQs, they are particularly appropriate for the assessment of medical students in situations involving symptom recognition, prioritization of decisions, and technical skills. Trial Registration: ClinicalTrials.gov NCT03884114; https://clinicaltrials.gov/ct2/show/NCT03884114 UR - http://www.jmir.org/2020/12/e23254/ UR - http://dx.doi.org/10.2196/23254 UR - http://www.ncbi.nlm.nih.gov/pubmed/33325833 ID - info:doi/10.2196/23254 ER - TY - JOUR AU - Yin, Lukas Andrew AU - Gheissari, Pargol AU - Lin, Wanyin Inna AU - Sobolev, Michael AU - Pollak, P. John AU - Cole, Curtis AU - Estrin, Deborah PY - 2020/11/3 TI - Role of Technology in Self-Assessment and Feedback Among Hospitalist Physicians: Semistructured Interviews and Thematic Analysis JO - J Med Internet Res SP - e23299 VL - 22 IS - 11 KW - feedback KW - self-assessment KW - self-learning KW - hospitalist KW - electronic medical record KW - digital health KW - assessment KW - learning N2 - Background: Lifelong learning is embedded in the culture of medicine, but there are limited tools currently available for many clinicians, including hospitalists, to help improve their own practice. Although there are requirements for continuing medical education, resources for learning new clinical guidelines, and developing fields aimed at facilitating peer-to-peer feedback, there is a gap in the availability of tools that enable clinicians to learn based on their own patients and clinical decisions. Objective: The aim of this study was to explore the technologies or modifications to existing systems that could be used to benefit hospitalist physicians in pursuing self-assessment and improvement by understanding physicians' current practices and their reactions to proposed possibilities. Methods: Semistructured interviews were conducted in two separate stages with analysis performed after each stage. In the first stage, interviews (N=12) were conducted to understand the ways in which hospitalist physicians are currently gathering feedback and assessing their practice. A thematic analysis of these interviews informed the prototype used to elicit responses in the second stage. Results: Clinicians actively look for feedback that they can apply to their practice, with the majority of the feedback obtained through self-assessment. The following three themes surrounding this aspect were identified in the first round of semistructured interviews: collaboration, self-reliance, and uncertainty, each with three related subthemes. Using a wireframe, the second round of interviews led to identifying the features that are currently challenging to use or could be made available with technology. Conclusions: Based on each theme and subtheme, we provide targeted recommendations for use by relevant stakeholders such as institutions, clinicians, and technologists. Most hospitalist self-assessments occur on a rolling basis, specifically using data in electronic medical records as their primary source. Specific objective data points or subjective patient relationships lead clinicians to review their patient cases and to assess their own performance. However, current systems are not built for these analyses or for clinicians to perform self-assessment, making this a burdensome and incomplete process.
Building a platform that focuses on providing and curating the information used for self-assessment could help physicians make more accurately informed changes to their own clinical practice and decision-making. UR - http://www.jmir.org/2020/11/e23299/ UR - http://dx.doi.org/10.2196/23299 UR - http://www.ncbi.nlm.nih.gov/pubmed/33141098 ID - info:doi/10.2196/23299 ER - TY - JOUR AU - Grima-Murcia, D. M. AU - Sanchez-Ferrer, Francisco AU - Ramos-Rincón, Manuel Jose AU - Fernández, Eduardo PY - 2020/8/21 TI - Use of Eye-Tracking Technology by Medical Students Taking the Objective Structured Clinical Examination: Descriptive Study JO - J Med Internet Res SP - e17719 VL - 22 IS - 8 KW - visual perception KW - medical education KW - eye tracking KW - objective structured clinical examination KW - medical evaluation N2 - Background: The objective structured clinical examination (OSCE) is a test used throughout Spain to evaluate the clinical competencies, decision making, problem solving, and other skills of sixth-year medical students. Objective: The main goal of this study is to explore the possible applications and utility of portable eye-tracking systems in the setting of the OSCE, particularly questions associated with attention and engagement. Methods: We used a portable Tobii Glasses 2 eye tracker, which allows real-time monitoring of where the students were looking and records the voice and ambient sounds. We then performed a qualitative and a quantitative analysis of the fields of vision and gaze points attracting attention as well as the visual itinerary. Results: Eye-tracking technology was used in the OSCE with no major issues. This portable system was of the greatest value in the patient simulators and mannequin stations, where interaction with the simulated patient or areas of interest in the mannequin can be quantified. This technology proved useful to better identify the areas of interest in the medical images provided. Conclusions: Portable eye trackers offer the opportunity to improve the objective evaluation of candidates and the self-evaluation of the stations used as well as medical simulations by examiners. We suggest that this technology has enough resolution to identify where a student is looking and could be useful for developing new approaches for evaluating specific aspects of clinical competencies. UR - http://www.jmir.org/2020/8/e17719/ UR - http://dx.doi.org/10.2196/17719 UR - http://www.ncbi.nlm.nih.gov/pubmed/32821060 ID - info:doi/10.2196/17719 ER - TY - JOUR AU - Liu, Benjamin PY - 2020/7/30 TI - The United States Medical Licensing Examination Step 1 Is Changing–US Medical Curricula Should Too JO - JMIR Med Educ SP - e20182 VL - 6 IS - 2 KW - USMLE KW - US medical students KW - USMLE pass/fail KW - new curricula KW - medical education KW - medical learning KW - medical school UR - http://mededu.jmir.org/2020/2/e20182/ UR - http://dx.doi.org/10.2196/20182 UR - http://www.ncbi.nlm.nih.gov/pubmed/32667900 ID - info:doi/10.2196/20182 ER - TY - JOUR AU - Staziaki, Vinícius Pedro AU - Sarangi, Rutuparna AU - Parikh, N. Ujas AU - Brooks, G. 
Jeffrey AU - LeBedis, Alexandra Christina AU - Shaffer, Kitt PY - 2020/5/6 TI - An Objective Structured Clinical Examination for Medical Student Radiology Clerkships: Reproducibility Study JO - JMIR Med Educ SP - e15444 VL - 6 IS - 1 KW - radiology KW - education KW - education methods KW - medical education KW - undergraduate N2 - Background: Objective structured clinical examinations (OSCEs) are a useful method to evaluate medical students' performance in the clerkship years. OSCEs are designed to assess skills and knowledge in a standardized clinical setting and through use of a preset standard grading sheet, so that clinical knowledge can be evaluated at a high level and in a reproducible way. Objective: This study aimed to present our OSCE assessment tool designed specifically for radiology clerkship medical students, which we called the objective structured radiology examination (OSRE), with the intent to advance the assessment of clerkship medical students by providing an objective, structured, reproducible, and low-cost method to evaluate medical students' radiology knowledge and the reproducibility of this assessment tool. Methods: We designed 9 different OSRE cases for radiology clerkship classes with participating third- and fourth-year medical students. Each examination comprises 1 to 3 images, a clinical scenario, and structured questions, along with a standardized scoring sheet that allows for an objective and low-cost assessment. Each medical student completed 3 of 9 random examination cases during their rotation. To evaluate for reproducibility of our scoring sheet assessment tool, we used 5 examiners to grade the same students. Reproducibility for each case and consistency for each grader were assessed with a two-way mixed effects intraclass correlation coefficient (ICC). An ICC below 0.4 was deemed poor to fair, an ICC of 0.41 to 0.60 was moderate, an ICC of 0.6 to 0.8 was substantial, and an ICC greater than 0.8 was almost perfect. We also assessed the correlation of scores and the students' clinical experience with a linear regression model and compared mean grades between third- and fourth-year students. Results: A total of 181 students (156 third- and 25 fourth-year students) were included in the study for a full academic year. Moreover, 6 of 9 cases demonstrated average ICCs more than 0.6 (substantial correlation), and the average ICCs ranged from 0.36 to 0.80 (P<.001 for all the cases). The average ICC for each grader was more than 0.60 (substantial correlation). The average grade among the third-year students was 11.9 (SD 4.9), compared with 12.8 (SD 5) among the fourth-year students (P=.005). There was no correlation between clinical experience and OSRE grade (−0.02; P=.48), adjusting for the medical school year. Conclusions: Our OSRE is a reproducible assessment tool with most of our OSRE cases showing substantial correlation, except for 3 cases. No expertise in radiology is needed to grade these examinations using our scoring sheet. There was no correlation between scores and the clinical experience of the medical students tested. UR - http://mededu.jmir.org/2020/1/e15444/ UR - http://dx.doi.org/10.2196/15444 UR - http://www.ncbi.nlm.nih.gov/pubmed/32374267 ID - info:doi/10.2196/15444 ER - TY - JOUR AU - Mazor, M. Kathleen AU - King, M. Ann AU - Hoppe, B. Ruth AU - Kochersberger, O. Annie AU - Yan, Jie AU - Reim, D. 
Jesse PY - 2019/02/14 TI - Video-Based Communication Assessment: Development of an Innovative System for Assessing Clinician-Patient Communication JO - JMIR Med Educ SP - e10400 VL - 5 IS - 1 KW - communication KW - crowdsourcing KW - health care KW - mobile phone KW - patient-centered care KW - video-based communication assessment UR - http://mededu.jmir.org/2019/1/e10400/ UR - http://dx.doi.org/10.2196/10400 UR - http://www.ncbi.nlm.nih.gov/pubmed/30710460 ID - info:doi/10.2196/10400 ER - TY - JOUR AU - Rat, Anne-Christine AU - Ricci, Laetitia AU - Guillemin, Francis AU - Ricatte, Camille AU - Pongy, Manon AU - Vieux, Rachel AU - Spitz, Elisabeth AU - Muller, Laurent PY - 2018/07/19 TI - Development of a Web-Based Formative Self-Assessment Tool for Physicians to Practice Breaking Bad News (BRADNET) JO - JMIR Med Educ SP - e17 VL - 4 IS - 2 KW - bad news disclosure KW - health communication KW - physician-patient relationship KW - distance e-learning N2 - Background: Although most physicians in medical settings have to deliver bad news, the skills of delivering bad news to patients have been given insufficient attention. Delivering bad news is a complex communication task that includes verbal and nonverbal skills, the ability to recognize and respond to patients' emotions, and the importance of considering the patient's environment such as culture and social status. How bad news is delivered can have consequences that may affect patients, sometimes over the long term. Objective: This project aimed to develop a Web-based formative self-assessment tool for physicians to practice delivering bad news to minimize the deleterious effects of a poor way of breaking bad news about a disease, whatever the disease. Methods: BReaking bAD NEws Tool (BRADNET) items were developed by reviewing existing protocols and recommendations for delivering bad news. We also examined instruments for assessing patient-physician communications and conducted semistructured interviews with patients and physicians. From this step, we selected specific themes and then pooled these themes before consensus was achieved on a good practices communication framework list. Items were then created from this list. To ensure that physicians found BRADNET acceptable, understandable, and relevant to their patients' condition, the tool was refined by a working group of clinicians familiar with delivering bad news. The think-aloud approach was used to explore the impact of the items and messages and why and how these messages could change physicians' relations with patients or how to deliver bad news. Finally, formative self-assessment sessions were constructed according to a double perspective of progression: a chronological progression of the disclosure of the bad news and the growing difficulty of items (difficulty concerning the expected level of self-reflection). Results: The good practices communication framework list comprised 70 specific issues related to breaking bad news pooled into 8 main domains: opening, preparing for the delivery of bad news, communication techniques, consultation content, attention, physician emotional management, shared decision making, and the relationship between the physician and the medical team. After constructing the items from this list, the items were extensively refined to make them more useful to the target audience, and one item was added. 
BRADNET contains 71 items, each including a question, response options, and a corresponding message, which were divided into 8 domains and assessed with 12 self-assessment sessions. The BRADNET Web-based platform was developed according to the cognitive load theory and the cognitive theory of multimedia learning. Conclusions: The objective of this Web-based assessment tool was to create a "space" for reflection. It contained items leading to self-reflection and messages that introduced recommended communication behaviors. Our approach was innovative as it provided an inexpensive distance-learning self-assessment tool that was manageable and less time-consuming for physicians with often overwhelming schedules. UR - http://mededu.jmir.org/2018/2/e17/ UR - http://dx.doi.org/10.2196/mededu.9551 UR - http://www.ncbi.nlm.nih.gov/pubmed/30026180 ID - info:doi/10.2196/mededu.9551 ER - TY - JOUR AU - Adjedj, Julien AU - Ducrocq, Gregory AU - Bouleti, Claire AU - Reinhart, Louise AU - Fabbro, Eleonora AU - Elbez, Yedid AU - Fischer, Quentin AU - Tesniere, Antoine AU - Feldman, Laurent AU - Varenne, Olivier PY - 2017/05/16 TI - Medical Student Evaluation With a Serious Game Compared to Multiple Choice Questions Assessment JO - JMIR Serious Games SP - e11 VL - 5 IS - 2 KW - serious game KW - multiple choice questions KW - medical student KW - student evaluation N2 - Background: The gold standard for evaluating medical students' knowledge is by multiple choice question (MCQ) tests: an objective and effective means of restituting book-based knowledge. However, concerns have been raised regarding their effectiveness to evaluate global medical skills. Furthermore, MCQs of unequal difficulty can generate frustration and may also lead to a sizable proportion of close results with low score variability. Serious games (SG) have recently been introduced to better evaluate students' medical skills. Objectives: The study aimed to compare MCQs with SG for medical student evaluation. Methods: We designed a cross-over randomized study including volunteer medical students from two medical schools in Paris (France) from January to September 2016. The students were randomized into two groups and evaluated either by the SG first and then the MCQs, or vice versa, for a cardiology clinical case. The primary endpoint was score variability evaluated by variance comparison. Secondary endpoints were differences in and correlation between the MCQ and SG results, and student satisfaction. Results: A total of 68 medical students were included. The score variability was significantly higher in the SG group (σ2=265.4) than the MCQs group (σ2=140.2; P=.009). The mean score was significantly lower for the SG than the MCQs at 66.1 (SD 16.3) and 75.7 (SD 11.8) points out of 100, respectively (P<.001). No correlation was found between the two test results (R2=0.04, P=.58). The self-reported satisfaction was significantly higher for SG (P<.001). Conclusions: Our study suggests that SGs are more effective in terms of score variability than MCQs. In addition, they are associated with a higher student satisfaction rate. SGs could represent a new evaluation modality for medical students. 
UR - http://games.jmir.org/2017/2/e11/ UR - http://dx.doi.org/10.2196/games.7033 UR - http://www.ncbi.nlm.nih.gov/pubmed/28512082 ID - info:doi/10.2196/games.7033 ER - TY - JOUR AU - Badran, Hani AU - Pluye, Pierre AU - Grad, Roland PY - 2017/03/14 TI - When Educational Material Is Delivered: A Mixed Methods Content Validation Study of the Information Assessment Method JO - JMIR Med Educ SP - e4 VL - 3 IS - 1 KW - validity and reliability KW - continuing education KW - Internet KW - electronic mail KW - physicians, family KW - knowledge translation KW - primary health care N2 - Background: The Information Assessment Method (IAM) allows clinicians to report the cognitive impact, clinical relevance, intention to use, and expected patient health benefits associated with clinical information received by email. More than 15,000 Canadian physicians and pharmacists use the IAM in continuing education programs. In addition, information providers can use IAM ratings and feedback comments from clinicians to improve their products. Objective: Our general objective was to validate the IAM questionnaire for the delivery of educational material (ecological and logical content validity). Our specific objectives were to measure the relevance and evaluate the representativeness of IAM items for assessing information received by email. Methods: A 3-part mixed methods study was conducted (convergent design). In part 1 (quantitative longitudinal study), the relevance of IAM items was measured. Participants were 5596 physician members of the Canadian Medical Association who used the IAM. A total of 234,196 ratings were collected in 2012. The relevance of IAM items with respect to their main construct was calculated using descriptive statistics (relevance ratio R). In part 2 (qualitative descriptive study), the representativeness of IAM items was evaluated. A total of 15 family physicians completed semistructured face-to-face interviews. For each construct, we evaluated the representativeness of IAM items using a deductive-inductive thematic qualitative data analysis. In part 3 (mixing quantitative and qualitative parts), results from quantitative and qualitative analyses were reviewed, juxtaposed in a table, discussed with experts, and integrated. Thus, our final results are derived from the views of users (ecological content validation) and experts (logical content validation). Results: Of the 23 IAM items, 21 were validated for content, while 2 were removed. In part 1 (quantitative results), 21 items were deemed relevant, while 2 items were deemed not relevant (R=4.86% [N=234,196] and R=3.04% [n=45,394], respectively). In part 2 (qualitative results), 22 items were deemed representative, while 1 item was not representative. In part 3 (mixing quantitative and qualitative results), the content validity of 21 items was confirmed, and the 2 nonrelevant items were excluded. A fully validated version was generated (IAM-v2014). Conclusions: This study produced a content validated IAM questionnaire that is used by clinicians and information providers to assess the clinical information delivered in continuing education programs. UR - http://mededu.jmir.org/2017/1/e4/ UR - http://dx.doi.org/10.2196/mededu.6415 UR - http://www.ncbi.nlm.nih.gov/pubmed/28292738 ID - info:doi/10.2196/mededu.6415 ER - TY - JOUR AU - Ahmed, Laura AU - Seal, H. 
Leonard AU - Ainley, Carol AU - De la Salle, Barbara AU - Brereton, Michelle AU - Hyde, Keith AU - Burthem, John AU - Gilmore, Samuel William PY - 2016/08/11 TI - Web-Based Virtual Microscopy of Digitized Blood Slides for Malaria Diagnosis: An Effective Tool for Skills Assessment in Different Countries and Environments JO - J Med Internet Res SP - e213 VL - 18 IS - 8 KW - Malaria KW - Virtual microscopy KW - External quality assessment KW - Internet N2 - Background: Morphological examination of blood films remains the reference standard for malaria diagnosis. Supporting the skills required to make an accurate morphological diagnosis is therefore essential. However, providing support across different countries and environments is a substantial challenge. Objective: This paper reports a scheme supplying digital slides of malaria-infected blood within an Internet-based virtual microscope environment to users with different access to training and computing facilities. The feasibility of the approach was established, allowing users to test, record, and compare their own performance with that of other users. Methods: From Giemsa stained thick and thin blood films, 56 large high-resolution digital slides were prepared, using high-quality image capture and 63x oil-immersion objective lens. The individual images were combined using the photomerge function of Adobe Photoshop and then adjusted to ensure resolution and reproduction of essential diagnostic features. Web delivery employed the Digital Slidebox platform allowing digital microscope viewing facilities and image annotation with data gathering from participants. Results: Engagement was high with images viewed by 38 participants in five countries in a range of environments and a mean completion rate of 42/56 cases. The rate of parasite detection was 78% and accuracy of species identification was 53%, which was comparable with results of similar studies using glass slides. Data collection allowed users to compare performance with other users over time or for each individual case. Conclusions: Overall, these results demonstrate that users worldwide can effectively engage with the system in a range of environments, with the potential to enhance personal performance through education, external quality assessment, and personal professional development, especially in regions where educational resources are difficult to access. UR - http://www.jmir.org/2016/8/e213/ UR - http://dx.doi.org/10.2196/jmir.6027 UR - http://www.ncbi.nlm.nih.gov/pubmed/27515009 ID - info:doi/10.2196/jmir.6027 ER - TY - JOUR AU - Alber, M. Julia AU - Bernhardt, M. Jay AU - Stellefson, Michael AU - Weiler, M. Robert AU - Anderson-Lewis, Charkarra AU - Miller, David M. AU - MacInnes, Jann PY - 2015/09/23 TI - Designing and Testing an Inventory for Measuring Social Media Competency of Certified Health Education Specialists JO - J Med Internet Res SP - e221 VL - 17 IS - 9 KW - social media KW - health education KW - professional competence N2 - Background: Social media can promote healthy behaviors by facilitating engagement and collaboration among health professionals and the public. Thus, social media is quickly becoming a vital tool for health promotion. While guidelines and trainings exist for public health professionals, there are currently no standardized measures to assess individual social media competency among Certified Health Education Specialists (CHES) and Master Certified Health Education Specialists (MCHES). 
Objective: The aim of this study was to design, develop, and test the Social Media Competency Inventory (SMCI) for CHES and MCHES. Methods: The SMCI was designed in three sequential phases: (1) Conceptualization and Domain Specifications, (2) Item Development, and (3) Inventory Testing and Finalization. Phase 1 consisted of a literature review, concept operationalization, and expert reviews. Phase 2 involved an expert panel (n=4) review, think-aloud sessions with a small representative sample of CHES/MCHES (n=10), a pilot test (n=36), and classical test theory analyses to develop the initial version of the SMCI. Phase 3 included a field test of the SMCI with a random sample of CHES and MCHES (n=353), factor and Rasch analyses, and development of SMCI administration and interpretation guidelines. Results: Six constructs adapted from the unified theory of acceptance and use of technology and the integrated behavioral model were identified for assessing social media competency: (1) Social Media Self-Efficacy, (2) Social Media Experience, (3) Effort Expectancy, (4) Performance Expectancy, (5) Facilitating Conditions, and (6) Social Influence. The initial item pool included 148 items. After the pilot test, 16 items were removed or revised because of low item discrimination (r<.30), high interitem correlations (ρ>.90), or based on feedback received from pilot participants. During the psychometric analysis of the field test data, 52 items were removed due to low discrimination, evidence of content redundancy, low R-squared value, or poor item infit or outfit. Psychometric analyses of the data revealed acceptable reliability evidence for the following scales: Social Media Self-Efficacy (alpha=.98, item reliability=.98, item separation=6.76), Social Media Experience (alpha=.98, item reliability=.98, item separation=6.24), Effort Expectancy (alpha=.74, item reliability=.95, item separation=4.15), Performance Expectancy (alpha=.81, item reliability=.99, item separation=10.09), Facilitating Conditions (alpha=.66, item reliability=.99, item separation=16.04), and Social Influence (alpha=.66, item reliability=.93, item separation=3.77). There was some evidence of local dependence among the scales, with several observed residual correlations above |.20|. Conclusions: Through the multistage instrument-development process, sufficient reliability and validity evidence was collected in support of the purpose and intended use of the SMCI. The SMCI can be used to assess the readiness of health education specialists to effectively use social media for health promotion research and practice. Future research should explore associations across constructs within the SMCI and evaluate the ability of SMCI scores to predict social media use and performance among CHES and MCHES. UR - http://www.jmir.org/2015/9/e221/ UR - http://dx.doi.org/10.2196/jmir.4943 UR - http://www.ncbi.nlm.nih.gov/pubmed/26399428 ID - info:doi/10.2196/jmir.4943 ER -