Search Results (1 to 9 of 9 Results)

Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study

The combined expertise of both professionals provided a robust and reliable reference standard against which the LLMs’ responses were compared. The questions and answers were derived from the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) checklist for observational studies.

Seyma Handan Akyon, Fatih Cagatay Akyon, Ahmet Sefa Camyar, Fatih Hızlı, Talha Sari, Şamil Hızlı

JMIR Med Inform 2024;12:e59258

Influence of Model Evolution and System Roles on ChatGPT’s Performance in Chinese Medical Licensing Exams: Comparative Study

As a professional chatbot tool, ChatGPT uses sampling to predict the next token with varying distribution probabilities, ensuring responses are varied and natural in real-world applications. Zhu et al [17] found that composite answers derived from repeated questioning can enhance the accuracy of ChatGPT. Typically, 2 or 3 repeated responses are necessary to ensure response stability [18-20].
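The repeated-questioning strategy described in this snippet can be sketched as a simple majority vote over several sampled responses. This is a minimal illustration, not the authors' method; `ask_model` is a hypothetical stand-in for a chatbot API call.

```python
from collections import Counter

def composite_answer(ask_model, question, n_samples=3):
    """Query the model several times and return the most common answer.

    ask_model: callable taking a question and returning one sampled
        answer string (hypothetical stand-in for a chatbot API call).
    n_samples: 2 or 3 repeats are typically enough for a stable answer.
    Returns the majority answer and its agreement fraction.
    """
    answers = [ask_model(question) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / len(answers)
```

With a sampler that returns "B", "A", "A" across three calls, the composite answer is "A" with an agreement fraction of 2/3.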

Shuai Ming, Qingge Guo, Wenjun Cheng, Bo Lei

JMIR Med Educ 2024;10:e52784

Assessing GPT-4’s Performance in Delivering Medical Advice: Comparative Analysis With Human Experts

Moreover, the generation of “hallucinatory” or erroneous responses by GPT raises concerns about nonmedical expert users unintentionally accepting incorrect information as valid [15,16]. Consequently, proposals for regulatory oversight of LLMs have emerged, including the establishment of a new regulatory category specifically addressing LLM-related challenges and risks [4].

Eunbeen Jo, Sanghoun Song, Jong-Ho Kim, Subin Lim, Ju Hyeon Kim, Jung-Joon Cha, Young-Min Kim, Hyung Joon Joo

JMIR Med Educ 2024;10:e51282

Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study

Finally, to examine the impact of image-based questions on the program’s ability to respond, we compared the responses to text-only questions with those to questions that included figures. We then added an English translation of the text (including the text provided along with figures) and analyzed the difference. Regarding statistical methods, comparisons among 3 or more groups were performed using 1-way ANOVA.
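As a reminder of what the 1-way ANOVA mentioned in this snippet computes when comparing 3 or more groups, here is a minimal pure-Python sketch of the F statistic (in practice a statistics package would be used; the data below are illustrative, not from the study):

```python
def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over k >= 3 groups.

    groups: list of lists of numeric observations.
    F = (between-group mean square) / (within-group mean square).
    """
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares (df = k - 1)
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares (df = n - k)
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))
```

For example, three toy groups [1, 2, 3], [2, 3, 4], and [4, 5, 6] yield F = 7.0.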

Masao Noda, Takayoshi Ueno, Ryota Koshu, Yuji Takaso, Mari Dias Shimada, Chizu Saito, Hisashi Sugimoto, Hiroaki Fushiki, Makoto Ito, Akihiro Nomura, Tomokazu Yoshizaki

JMIR Med Educ 2024;10:e57054

Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study

ChatGPT may serve as a valuable teaching assistant in medical education; however, the inaccuracies in its responses are a significant concern [5,7]. Our current findings suggest that, especially with medical-related images, GPT-4V should not be relied upon as a primary source of information for medical education or practice. If used, extreme caution should be exercised regarding the accuracy of its responses.

Takahiro Nakao, Soichiro Miki, Yuta Nakamura, Tomohiro Kikuchi, Yukihiro Nomura, Shouhei Hanaoka, Takeharu Yoshikawa, Osamu Abe

JMIR Med Educ 2024;10:e54393

A Generative Pretrained Transformer (GPT)–Powered Chatbot as a Simulated Patient to Practice History Taking: Prospective, Mixed Methods Study

The prompts were designed to guide GPT’s behavior and ensure it provided medically accurate and relevant responses. Presented in detail next, our prompt included a chatbot-optimized illness script as well as a behavioral instruction prompt. We developed a fictitious medical case in a format that could be posted to GPT. As our learning objective was to take a systematic history, we intended to provide all required details.
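A minimal sketch of how such a two-part prompt (chatbot-optimized illness script plus behavioral instructions) might be assembled into messages for a chat-based LLM. The message format follows the common role/content chat-completion convention, and all content strings are illustrative assumptions, not the authors' actual prompt.

```python
def build_simulated_patient_prompt(illness_script, instructions):
    """Combine an illness script with behavioral instructions into a
    single system message for a chat-based LLM simulated patient."""
    system_prompt = (
        "You are a simulated patient for history-taking practice.\n"
        f"Behavioral instructions: {instructions}\n"
        f"Your illness script (stay consistent with it): {illness_script}"
    )
    return [{"role": "system", "content": system_prompt}]

# Illustrative case details (hypothetical, not from the study):
messages = build_simulated_patient_prompt(
    illness_script="55-year-old with 2 hours of crushing chest pain...",
    instructions="Answer only what is asked; do not volunteer the diagnosis.",
)
```

The returned list would be sent as the opening of the conversation, with the learner's questions appended as user messages.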

Friederike Holderried, Christian Stegemann–Philipps, Lea Herschbach, Julia-Astrid Moldt, Andrew Nevins, Jan Griewatz, Martin Holderried, Anne Herrmann-Werner, Teresa Festl-Wietek, Moritz Mahling

JMIR Med Educ 2024;10:e53961

A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology–Head and Neck Surgery Certification Examinations: Performance Study

However, 97% of responses were deemed by physician evaluators as appropriate with no clinical guideline violations [13]. ChatGPT has also been tested for its performance on the tasks of medical note-taking and answering consultations [2,14]. To the best of our knowledge, ChatGPT and similar LLMs have not been evaluated for their performance in otolaryngology/head and neck surgery (OHNS).

Cai Long, Kayle Lowe, Jessica Zhang, André dos Santos, Alaa Alanazi, Daniel O'Brien, Erin D Wright, David Cote

JMIR Med Educ 2024;10:e49970

Performance Comparison of ChatGPT-4 and Japanese Medical Residents in the General Medicine In-Training Examination: Comparison Study

However, these accomplishments have been attained exclusively in English, and investigations conducted until 2022 cast doubt on its ability to provide medically reliable responses in non-English languages [6]. On March 14, 2023, OpenAI introduced the latest iteration of LLMs, GPT-4 [7,8]. Touted as more reliable and innovative than its predecessor, GPT-3.5, GPT-4 reportedly shows superior performance in non-English languages, particularly in academic and professional contexts [8,9].

Takashi Watari, Soshi Takagi, Kota Sakaguchi, Yuji Nishizaki, Taro Shimizu, Yu Yamamoto, Yasuharu Tokuda

JMIR Med Educ 2023;9:e52202

Assessing the Accuracy and Comprehensiveness of ChatGPT in Offering Clinical Guidance for Atopic Dermatitis and Acne Vulgaris

Across both diseases, 78% (50/64) of ChatGPT responses were correct but inadequate (score ≤2), with 45% (29/64) of answers being fully comprehensive (score 1). No responses were completely inaccurate (score 4). For AD and acne specifically, 88% (28/32) and 66% (21/32) of responses were correct but inadequate (score ≤2), and 53% (17/32) and 34% (11/32) were fully comprehensive (score 1), respectively. This broadly indicates acceptable performance of ChatGPT across both conditions.

Nehal Lakdawala, Leelakrishna Channa, Christian Gronbeck, Nikita Lakdawala, Gillian Weston, Brett Sloan, Hao Feng

JMIR Dermatol 2023;6:e50409