JME

JMIR Med Educ

JMIR Medical Education

2369-3762

JMIR Publications

Toronto, Canada

v9i1e50336

37440299

10.2196/50336

Letter to the Editor

Authors’ Reply to: Variability in Large Language Models’ Responses to Medical Licensing and Certification Examinations

Leung

Tiffany

Aidan Gilson, BS (1, 2), https://orcid.org/0000-0002-4770-4705

Conrad W Safranek, BS (1), https://orcid.org/0000-0003-1985-9432

Thomas Huang, BS (2), https://orcid.org/0000-0001-9056-7016

Vimig Socrates, MS (1, 3), https://orcid.org/0000-0001-7955-9875

Ling Chi, BSE (1), https://orcid.org/0000-0002-8270-9245

Richard Andrew Taylor, MD, MHS (1, 2), https://orcid.org/0000-0002-9082-6644

David Chartash, PhD (1), https://orcid.org/0000-0002-0265-330X

1 Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States

2 Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States

3 Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, United States

4 School of Medicine, University College Dublin, National University of Ireland, Dublin, Dublin, Ireland

Corresponding Author: David Chartash, PhD, Section for Biomedical Informatics and Data Science, Yale University School of Medicine, 100 College Street, 9th Fl, New Haven, CT, 06510, United States. Phone: 1 203 737 5379. Email: david.chartash@yale.edu

2023

13 7 2023

e50336

27 6 2023 5 7 2023

©Aidan Gilson, Conrad W Safranek, Thomas Huang, Vimig Socrates, Ling Chi, Richard Andrew Taylor, David Chartash. Originally published in JMIR Medical Education (https://mededu.jmir.org), 13.07.2023.

2023

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Education, is properly cited. The complete bibliographic information, a link to the original publication on https://mededu.jmir.org/, as well as this copyright and license information must be included.

https://mededu.jmir.org/2023/1/e48305/

https://mededu.jmir.org/2023/1/e45312

Keywords: natural language processing; NLP; MedQA; generative pre-trained transformer; GPT; medical education; chatbot; artificial intelligence; AI; education technology; ChatGPT; conversational agent; machine learning; large language models; knowledge assessment

We thank Epstein and Dexter [1] for their close reading of our paper, “How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment” [2]. In response to their comments, we present the following points for clarification:

While search engines such as Bing (Microsoft Corp) and Google (Google LLC) have been noted to implement geographic tuning when presenting their information retrieval results, there is no evidence or documentation that the version of ChatGPT (OpenAI) used in our work similarly alters its output given the geolocation of the user or the device that is being used. Notably, however, the integration of ChatGPT into other online services, such as Bing or Snapchat (Snap Inc), has made the information provided to those services (eg, time zone or geolocation) available to ChatGPT [3].

Additionally, although it may be true that (dialectal) grammatical differences in the English language result in variability that may mimic the variability of prompt engineering, there is no empirical evidence that this alters the performance of ChatGPT. Further examination of the correlation between prompt engineering methods and within-sentence grammatical tuning or variability may alleviate these concerns in future research.

Although it is a medical knowledge–based examination, the American Board of Preventive Medicine Longitudinal Assessment Program pilot for clinical informatics is not equivalent to the USMLE (United States Medical Licensing Examination). ChatGPT’s performance on this maintenance of certification examination has been examined by Kumah-Crystal et al [4], and we defer to their assessment as a more apt comparator.

While Epstein and Dexter [1] offer a comparison between ChatGPT 3.5, ChatGPT 4.0, and Google Bard, it is unclear how the three have been statistically compared in terms of sample size and answer quality beyond performance on multiple-choice questions. Bootstrapping responses appears to address an element of variability in large language model (LLM) responses; however, a more robust statistical comparison is warranted alongside a comparison of nonbinarized LLM output performance.

While there is no doubt that there is variability in the responses of LLMs to identical inputs (as these tools are nondeterministic in character), we do not believe this devalues the statistical significance or the quantitative validity of our results. As we are evaluating the performance of ChatGPT in the same situation as a student examinee, a single response is more applicable. Additionally, because we used a large sample of questions, which accounted for model variability, we elected not to repeat questions multiple times.
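As an illustration of the sample size argument only, the following minimal sketch computes a percentile bootstrap confidence interval for accuracy from one binarized outcome per question. It is not the analysis performed in our paper or by Epstein and Dexter [1]; the function names, the question counts, and the 60% accuracy are hypothetical values chosen for the example. The sketch resamples questions and therefore does not model within-question variability across repeated LLM responses.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy, given one 0/1 outcome per question."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement and record the accuracy.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def make_outcomes(n_questions, n_correct):
    """Hypothetical binarized results: 1 = correct answer, 0 = incorrect."""
    return [1] * n_correct + [0] * (n_questions - n_correct)


# Hypothetical data: the same 60% accuracy on a small and a large question set.
small_ci = bootstrap_accuracy_ci(make_outcomes(50, 30))
large_ci = bootstrap_accuracy_ci(make_outcomes(500, 300))

print("50 questions: ", small_ci)
print("500 questions:", large_ci)
```

Under these assumptions, the interval for the larger question set is narrower, which is the sense in which a large sample of questions limits the influence of question-level noise on the overall accuracy estimate.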

Abbreviations

LLM: large language model

USMLE: United States Medical Licensing Examination

Conflicts of Interest: None declared.

1. Epstein, Dexter. Variability in Large Language Models’ Responses to Medical Licensing and Certification Examinations. Comment on “How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment”. JMIR Med Educ 2023;9:e48305. doi: 10.2196/48305

2. Gilson, Safranek, Huang, Socrates, Chi, Taylor, Chartash. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ 2023 Feb 08;9:e45312. doi: 10.2196/45312. PMID: 36753318. PMC9947764. v9i1e45312

3. How my AI uses location data. Snapchat Support. 2023-06-25. https://archive.is/wcmk3

4. Kumah-Crystal, Mankowitz Scott, Embi Peter, Lehmann Christoph U. ChatGPT and the clinical informatics board examination: the end of unproctored maintenance of certification? J Am Med Inform Assoc 2023 Jun 19. 104. doi: 10.1093/jamia/ocad104. PMID: 37335851. 7202064