TY  - JOUR
AU  - Long, Cai
AU  - Lowe, Kayle
AU  - Zhang, Jessica
AU  - Santos, André dos
AU  - Alanazi, Alaa
AU  - O'Brien, Daniel
AU  - Wright, Erin D
AU  - Cote, David
PY  - 2024
DA  - 2024/1/16
TI  - A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology–Head and Neck Surgery Certification Examinations: Performance Study
JO  - JMIR Med Educ
SP  - e49970
VL  - 10
KW  - medical licensing
KW  - otolaryngology
KW  - otology
KW  - laryngology
KW  - ear
KW  - nose
KW  - throat
KW  - ENT
KW  - surgery
KW  - surgical
KW  - exam
KW  - exams
KW  - response
KW  - responses
KW  - answer
KW  - answers
KW  - chatbot
KW  - chatbots
KW  - examination
KW  - examinations
KW  - medical education
KW  - otolaryngology/head and neck surgery
KW  - OHNS
KW  - artificial intelligence
KW  - AI
KW  - ChatGPT
KW  - medical examination
KW  - large language models
KW  - language model
KW  - LLM
KW  - LLMs
KW  - wide range information
KW  - patient safety
KW  - clinical implementation
KW  - safety
KW  - machine learning
KW  - NLP
KW  - natural language processing
AB  - Background: ChatGPT is among the most popular large language models (LLMs), exhibiting proficiency in various standardized tests, including multiple-choice medical board examinations. However, its performance on otolaryngology–head and neck surgery (OHNS) certification examinations and open-ended medical board certification examinations has not been reported. Objective: We aimed to evaluate the performance of ChatGPT on OHNS board examinations and propose a novel method to assess an AI model’s performance on open-ended medical board examination questions. Methods: Twenty-one open-ended questions were adopted from the Royal College of Physicians and Surgeons of Canada’s sample examination to query ChatGPT on April 11, 2023, with and without prompts. A new model, named Concordance, Validity, Safety, Competency (CVSC), was developed to evaluate its performance. Results: In an open-ended question assessment, ChatGPT achieved a passing mark (an average of 75% across 3 trials) in the attempts and demonstrated higher accuracy with prompts. The model demonstrated high concordance (92.06%) and satisfactory validity. While demonstrating considerable consistency in regenerating answers, it often provided only partially correct responses. Notably, concerning features such as hallucinations and self-conflicting answers were observed. Conclusions: ChatGPT achieved a passing score in the sample examination and demonstrated the potential to pass the OHNS certification examination of the Royal College of Physicians and Surgeons of Canada. Some concerns remain due to its hallucinations, which could pose risks to patient safety. Further adjustments are necessary to yield safer and more accurate answers for clinical implementation.
SN  - 2369-3762
UR  - https://mededu.jmir.org/2024/1/e49970
UR  - https://doi.org/10.2196/49970
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38227351
DO  - 10.2196/49970
ID  - info:doi/10.2196/49970
ER  -