Letter to the Editor
Comment on: https://mededu.jmir.org/2023/1/e45312
We thank Epstein and Dexter  for their close reading of our paper, “How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment” [ ]. In response to their comments, we present the following points for clarification:
- While search engines such as Bing (Microsoft Corp) and Google (Google LLC) have been noted to implement geographic tuning when presenting their information retrieval results, there is no evidence or documentation that the version of ChatGPT (OpenAI) used in our work similarly alters its output given the geolocation of the user or the device that is being used. Notably, however, the integration of ChatGPT into other online services, such as Bing or Snapchat (Snap Inc), has made the information provided to those services (eg, time zone or geolocation) available to ChatGPT [ ].
- Additionally, although it may be true that (dialectic) grammatical differences in the English language result in variability that may mimic the variability of prompt engineering, there is no empirical evidence that this alters the performance of ChatGPT. Further examination of the correlation between prompt engineering methods and within-sentence grammatical tuning or variability may alleviate these concerns in future research.
- Although it is a medical knowledge–based examination, the American Board of Preventive Medicine Longitudinal Assessment Program pilot for clinical informatics is not equivalent to the USMLE (United States Medical Licensing Examination). ChatGPT’s performance on this maintenance of certification examination has been examined by Kumah-Crystal et al [ ], and we defer to their assessment as a more apt comparator.
- While Epstein and Dexter [ ] offer a comparison between ChatGPT 3.5, ChatGPT 4.0, and Google Bard, it is unclear as to how the three have been statistically compared in terms of sample size and answer quality beyond performance on multiple-choice questions. Bootstrapping responses appear to address an element of variability in large language model (LLM) responses; however, a more robust statistical comparison is warranted alongside a comparison of nonbinarized LLM output performance.
- While there is no doubt that there is variability in the responses of LLMs to identical inputs (as these tools are nondeterministic in character), we do not believe this devalues the statistical significance or the quantitative validity of our results. As we are evaluating the performance of ChatGPT in the same situation as a student examinee, a single response is more applicable. Additionally, since we used a large sample size of questions, which accounted for model variability, we elected not to repeat questions multiple times.
Conflicts of Interest
- Epstein R, Dexter F. Variability in Large Language Models’ Responses to Medical Licensing and Certification Examinations. Comment on “How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment”. JMIR Med Educ 2023;9:e48305 [https://mededu.jmir.org/2023/1/e48305/] [CrossRef]
- Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ 2023 Feb 08;9:e45312 [https://doi.org/10.2196/45312] [CrossRef] [Medline]
- How my AI uses location data. Snapchat Support. URL: https://archive.is/wcmk3 [accessed 2023-06-25]
- Kumah-Crystal Y, Mankowitz S, Embi P, Lehmann CU. ChatGPT and the clinical informatics board examination: the end of unproctored maintenance of certification? J Am Med Inform Assoc 2023 Jun 19:104 [CrossRef] [Medline]
|LLM: large language model|
|USMLE: United States Medical Licensing Examination|
Edited by T Leung; This is a non–peer-reviewed article. submitted 27.06.23; accepted 05.07.23; published 13.07.23Copyright
©Aidan Gilson, Conrad W Safranek, Thomas Huang, Vimig Socrates, Ling Chi, Richard Andrew Taylor, David Chartash. Originally published in JMIR Medical Education (https://mededu.jmir.org), 13.07.2023.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Education, is properly cited. The complete bibliographic information, a link to the original publication on https://mededu.jmir.org/, as well as this copyright and license information must be included.