This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Education, is properly cited. The complete bibliographic information, a link to the original publication on https://mededu.jmir.org/, as well as this copyright and license information must be included.
Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input.
This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 exams, as well as to analyze responses for user interpretability.
We used 2 sets of multiple-choice questions to evaluate ChatGPT’s performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and on performance relative to its user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT’s performance was compared to 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question.
Of the 4 data sets, ChatGPT achieved accuracies of 64.4% (NBME-Free-Step1), 57.8% (NBME-Free-Step2), 44% (AMBOSS-Step1), and 42% (AMBOSS-Step2), outperforming both InstructGPT and GPT-3 on every data set. ChatGPT’s accuracy decreased as question difficulty increased, reaching statistical significance on AMBOSS-Step1 (P=.01). Logical justification was present in 100% of ChatGPT’s responses on the NBME data sets, information internal to the question was present in over 90%, and information external to the question appeared significantly more frequently in correct answers than in incorrect ones.
ChatGPT marks a significant improvement in natural language processing models on the tasks of medical question answering. By performing at a greater than 60% threshold on the NBME-Free-Step1 data set, the model achieves the equivalent of a passing score for a third-year medical student, and its narratively coherent responses suggest potential as an interactive tool in medical education.
Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model released by OpenAI that can generate conversation-style responses to user input.
ChatGPT is the latest among a class of large language models (LLMs) known as autoregressive language models, which are trained to predict the next token in a sequence given all of the text that precedes it; GPT-3 and InstructGPT are earlier examples of this class.
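To make the autoregressive objective concrete, the following toy sketch (ours, for illustration only; the two-token context and hand-written probability table stand in for a neural network over a vocabulary of tens of thousands of tokens) generates text one token at a time, conditioning each prediction on the tokens produced so far:

```python
import random

# Toy next-token model: maps a context (tuple of tokens) to a probability
# distribution over the next token. A real LLM computes this distribution
# with a neural network; the table below is purely illustrative.
TOY_MODEL = {
    ("the", "patient"): {"presents": 0.6, "denies": 0.3, "is": 0.1},
    ("patient", "presents"): {"with": 0.9, "today": 0.1},
    ("presents", "with"): {"fever": 0.5, "fatigue": 0.5},
}

def sample_next(context):
    """Sample one token from the model's distribution given the context."""
    dist = TOY_MODEL.get(tuple(context[-2:]), {"<eos>": 1.0})
    tokens, probs = zip(*dist.items())
    return random.choices(tokens, weights=probs)[0]

def generate(prompt_tokens, max_new_tokens=4):
    """Autoregressive decoding: each generated token is appended to the
    context before the following token is predicted."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = sample_next(tokens)
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return " ".join(tokens)

print(generate(["the", "patient"]))  # e.g. "the patient presents with fever"
```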
Within the medical domain, LLMs have been investigated as tools for personalized patient interaction and consumer health education.
In this paper, we aimed to quantify ChatGPT’s performance on examinations that seek to assess the primary competency of medical knowledge—established and evolving biomedical, clinical, epidemiological, and social-behavioral science knowledge—and a facet of its application to patient care through the use of 2 data sets centered around knowledge tested in the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 Clinical Knowledge exams. Step 1 focuses on foundational sciences and their relation to the practice of medicine, whereas Step 2 focuses on the clinical application of those foundational sciences. USMLE Step 3 was excluded as it is intended to assess skills and capacity for independent generalist medical practice rather than foundational knowledge. We also compared the performance of ChatGPT on these examinations to the performances of 2 previously mentioned LLMs, GPT-3 and InstructGPT. In addition, to further assess the ability of ChatGPT to serve as a simulated medical tutor, we qualitatively examined the integrity of ChatGPT’s responses with regard to logical justification and the use of intrinsic and extrinsic information.
We created 2 pairs of data sets to examine ChatGPT’s understanding of medical knowledge related to Step 1 and Step 2. We first selected a subset of 100 questions each for Step 1 and Step 2 from AMBOSS, a widely used question bank that contains over 2700 Step 1 and 3150 Step 2 questions; we refer to these sets as AMBOSS-Step1 and AMBOSS-Step2. AMBOSS rates each question’s difficulty on a scale of 1 (easiest) to 5 (hardest), and questions may include an optional tip (hint); we posed each question both without and with its tip to assess the effect of this additional context.
We also used the list of 120 free Step 1 and Step 2 Clinical Knowledge questions developed by the National Board of Medical Examiners (NBME), which we call NBME-Free-Step1 and NBME-Free-Step2; after excluding questions that relied on images or other media that could not be provided to a text-only model, 87 Step 1 questions and 102 Step 2 questions remained.
Due to the significant impact that prompt engineering has been shown to have on generative LLM output, we standardized the input format of the questions posed to each model using the template captioned below.
Template of question posed to each large language model (LLM), including both AMBOSS and NBME question formats.
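As an illustration of this standardization step, a minimal sketch of how such a template might be assembled programmatically is shown below. The function name, field names, and the sample question are our own placeholders rather than the study’s materials; the authoritative wording of the template is the one in the figure.

```python
def build_prompt(stem, choices, tip=None):
    """Assemble a multiple-choice question into a single standardized
    prompt string. `tip` is the optional AMBOSS hint; NBME questions
    are passed without one. (Illustrative sketch, not the study's code.)"""
    lines = []
    if tip:
        lines.append(f"Tip: {tip}")
    lines.append(stem)
    # Label answer options A), B), C), ... to match the exam format.
    for letter, choice in zip("ABCDEFGH", choices):
        lines.append(f"{letter}) {choice}")
    return "\n".join(lines)

# Hypothetical example question, invented for demonstration only.
prompt = build_prompt(
    stem="A 24-year-old woman presents with 2 days of dysuria. "
         "Which of the following is the most likely causal organism?",
    choices=["Escherichia coli", "Staphylococcus saprophyticus",
             "Klebsiella pneumoniae", "Proteus mirabilis"],
)
print(prompt)
```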
We first recorded all correct answers as they appeared in the AMBOSS and NBME answer keys and scored each model’s selected answer against them.
We then qualified the ChatGPT responses for each question using 3 binary variables characteristic of narrative coherence (a schematic encoding of the full annotation scheme is sketched after the definitions below):
Logical reasoning: The response clearly identifies the logic in selecting between answers given the information presented in the response.
Internal information: The response uses information internal to the question, including information about the question in the response.
External information: The response uses information external to the question, including but not limited to qualifying the answers given or the stem.
Finally, for each question answered incorrectly, we labeled the reason for the incorrect answer as one of the following options:
Logical error: The response adequately found the pertinent information but did not properly convert the information to an answer.
Example: Identifies that a young woman has been having difficulty with taking pills routinely and still recommends oral contraceptives over an intrauterine device.
Information error: ChatGPT did not identify a key piece of information, whether present in the question stem or available as external knowledge, that would be considered expected knowledge.
Example: Recommends antibiotics for sinusitis infection, believing most cases to be of bacterial etiology even when the majority are viral.
Statistical error: An error centered around an arithmetic mistake. This includes explicit errors, such as stating “1 + 1 = 3,” or indirect errors, such as an incorrect estimation of disease prevalence.
Example: Identifies underlying nephrolithiasis but misclassifies the prevalence of different stone types.
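To make the annotation scheme concrete, the sketch below encodes the 3 binary coherence variables and the error taxonomy as a simple data structure. This is our illustrative encoding only; the study’s labels were recorded as a spreadsheet (Multimedia Appendix 1).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ErrorType(Enum):
    """Reason assigned to each incorrectly answered question."""
    LOGICAL = "logical error"          # right facts, wrong conclusion
    INFORMATION = "information error"  # missed expected knowledge
    STATISTICAL = "statistical error"  # arithmetic/estimation mistake
    LOGICAL_AND_INFORMATION = "logical and information errors"

@dataclass
class ResponseAnnotation:
    """Qualitative labels applied to one ChatGPT response."""
    question_id: str
    correct: bool
    logical_reasoning: bool     # response justifies its selection logically
    internal_information: bool  # response uses information from the question
    external_information: bool  # response brings in outside knowledge
    error_type: Optional[ErrorType] = None  # set only when incorrect

# Hypothetical annotation for a single question, for illustration.
example = ResponseAnnotation(
    question_id="nbme_step2_017",
    correct=False,
    logical_reasoning=True,
    internal_information=True,
    external_information=False,
    error_type=ErrorType.INFORMATION,
)
```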
All authors who performed qualitative analysis of the responses (AG, CWS, RAT, and DC) worked collaboratively, and all uncertain labels were reconciled.
All analysis was conducted in Python software (version 3.10.2; Python Software Foundation). Unpaired chi-square tests were used to determine whether question difficulty significantly affected ChatGPT’s performance on the AMBOSS data sets and whether the frequency of information external to the question differed between correct and incorrect responses on the NBME data sets, with significance set at P<.05.
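As a sketch of this analysis, the external-information comparison for NBME-Free-Step1 can be reproduced from the counts reported in Table 3 with a chi-square test. The use of scipy here is our assumption; the paper states only that the analysis was conducted in Python.

```python
from scipy.stats import chi2_contingency

# Counts from Table 3, NBME-Free-Step1: external information present
# (True/False) among correct (n=56) and incorrect (n=31) responses.
contingency = [
    [52, 4],   # correct responses:   external info True, False
    [15, 16],  # incorrect responses: external info True, False
]
chi2, p, dof, expected = chi2_contingency(contingency)
print(f"chi2={chi2:.2f}, dof={dof}, P={p:.4g}")  # P falls well below .05
```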
Table 1. The performance of the 3 large language models (LLMs) on the 4 outlined data sets.

| LLM, response | NBME-Free-Step1a, n (%) | NBME-Free-Step2, n (%) | AMBOSS-Step1, n (%) | AMBOSS-Step2, n (%) |
| --- | --- | --- | --- | --- |
| ChatGPTb | | | | |
| Correct | 56 (64.4) | 59 (57.8) | 44 (44) | 42 (42) |
| Incorrect | 31 (35.6) | 43 (42.2) | 56 (56) | 58 (58) |
| InstructGPT | | | | |
| Correct | 45 (51.7) | 54 (52.9) | 36 (36) | 35 (35) |
| Incorrect | 42 (48.3) | 48 (47.1) | 64 (64) | 65 (65) |
| GPT-3 | | | | |
| Correct | 22 (25.3) | 19 (18.6) | 20 (20) | 17 (17) |
| Incorrect | 65 (74.7) | 83 (81.4) | 80 (80) | 83 (83) |

aNBME: National Board of Medical Examiners.
bChatGPT: Chat Generative Pre-trained Transformer.
From Table 2, ChatGPT’s accuracy on the AMBOSS data sets declined as question difficulty increased; this trend reached statistical significance on AMBOSS-Step1 without the tip (P=.01) and trended in the same direction in the remaining conditions (P=.06-.13). Providing the tip increased overall accuracy from 44% to 56% on AMBOSS-Step1 and from 42% to 53% on AMBOSS-Step2.
Table 2. ChatGPT’sa performance on the AMBOSS-Step1 and AMBOSS-Step2 data sets by question difficulty.

| Step, tip, response | Overall, n (%) | Difficulty 1, n (%) | Difficulty 2, n (%) | Difficulty 3, n (%) | Difficulty 4, n (%) | Difficulty 5, n (%) | P value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AMBOSS-Step1 | | | | | | | |
| Without tip | | | | | | | |
| Correct | 44 (44) | 9 (64.3) | 16 (59.3) | 13 (40.6) | 6 (33.3) | 0 (0) | .01 |
| Incorrect | 56 (56) | 5 (35.7) | 11 (40.7) | 19 (59.4) | 12 (66.7) | 9 (100) | |
| With tip | | | | | | | |
| Correct | 56 (56) | 10 (71.4) | 16 (59.3) | 21 (65.6) | 7 (38.9) | 2 (22.2) | .06 |
| Incorrect | 44 (44) | 4 (28.6) | 11 (40.7) | 11 (34.4) | 11 (61.1) | 7 (77.8) | |
| AMBOSS-Step2 | | | | | | | |
| Without tip | | | | | | | |
| Correct | 42 (42) | 15 (60) | 10 (43.5) | 11 (40.7) | 3 (18.8) | 3 (33.3) | .13 |
| Incorrect | 58 (58) | 10 (40) | 13 (56.5) | 16 (59.3) | 13 (81.2) | 6 (66.7) | |
| With tip | | | | | | | |
| Correct | 53 (53) | 17 (68) | 15 (65.2) | 12 (44.4) | 7 (43.8) | 2 (22.2) | .08 |
| Incorrect | 47 (47) | 8 (32) | 8 (34.8) | 15 (55.6) | 9 (56.2) | 7 (77.8) | |

aChatGPT: Chat Generative Pre-trained Transformer.
Finally, in Table 3, we present the qualitative analysis of ChatGPT’s responses on the NBME-Free-Step1 and NBME-Free-Step2 data sets. Logical reasoning was present in 100% of responses, internal information in more than 93%, and external information was significantly more frequent in correct answers than in incorrect ones; logical errors were the most common cause of incorrect answers.
Table 3. Qualitative analysis of ChatGPT’sa response quality for NBMEb-Free-Step1 and NBME-Free-Step2.

| Metric | Step 1 overall (n=87), n (%) | Step 1 correct (n=56), n (%) | Step 1 incorrect (n=31), n (%) | Step 2 overall (n=102), n (%) | Step 2 correct (n=59), n (%) | Step 2 incorrect (n=43), n (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Logical reasoning | | | | | | |
| True | 87 (100) | 56 (100) | 31 (100) | 102 (100) | 59 (100) | 43 (100) |
| False | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) |
| Internal information | | | | | | |
| True | 84 (96.6) | 55 (98.2) | 29 (93.5) | 99 (97.1) | 59 (100) | 40 (93) |
| False | 3 (3.4) | 1 (1.8) | 2 (6.5) | 3 (2.9) | 0 (0) | 3 (7) |
| External information | | | | | | |
| True | 67 (77) | 52 (92.9) | 15 (48.4) | 80 (78.4) | 53 (89.8) | 27 (62.8) |
| False | 20 (23) | 4 (7.1) | 16 (51.6) | 22 (21.6) | 6 (10.2) | 16 (37.2) |
| Error type | | | | | | |
| Logical error | —c | — | 13 (41.9) | — | — | 16 (37.2) |
| Information error | — | — | 7 (22.6) | — | — | 13 (30.2) |
| Statistical error | — | — | 2 (6.5) | — | — | 1 (2.3) |
| Logical and information errors | — | — | 9 (29) | — | — | 13 (30.2) |

aChatGPT: Chat Generative Pre-trained Transformer.
bNBME: National Board of Medical Examiners.
cNot applicable.
One of the key features touted by the advancement of ChatGPT is its ability to understand context and carry on a conversation that is coherent and relevant to the topic at hand. In this paper, we have shown that this extends into the medical domain by evaluating ChatGPT on 4 unique medical knowledge competency data sets, framing conversation as question answering. We found that the model is capable of correctly answering up to 64.4% of questions (NBME-Free-Step1) representing topics covered in the USMLE Step 1 and Step 2 licensing exams. A threshold of 60% is often considered the benchmark passing standard for both Step 1 and Step 2, indicating that ChatGPT performs at the level expected of a third-year medical student. Additionally, our results demonstrate that even in the case of incorrect answers, the responses provided by the model always contained a logical explanation for the answer selection, and greater than 90% of the time, this response directly included information contained in the question stem. Correct answers were found to contain information external to the question stem significantly more frequently than incorrect answers (given a significance threshold of P<.05), suggesting that the model’s ability to draw on relevant outside knowledge is associated with answer accuracy.
Prior work in the field of medical question answering research has often focused on more specific tasks with the intent of improving model performance at the expense of generalizability. For example, Jin et al trained a model tailored to a single, narrowly scoped biomedical question-answering task; systems of this kind can achieve strong accuracy within their domain but do not support the open-ended, conversational exchange that ChatGPT provides.
This dialogic nature is what separates ChatGPT from previous models in its ability to act as an educational tool. InstructGPT performed at an accuracy above random chance, although still below ChatGPT on all data sets. However, even if InstructGPT performed at an accuracy equal to ChatGPT’s, its responses were not as conducive to student education: InstructGPT’s responses were frequently only the selected answer with no further explanation, and it is impossible to ask follow-up questions to gain more context. Because InstructGPT is not formatted as a dialogic system, the model will often continue the prompt rather than provide a distinct answer. For example, a prompt ending in “G) Delirium” will be extended into “tremens B) Dislodged otoliths” before an answer is provided. GPT-3 suffers from similar shortcomings and requires more prompt engineering to generate the desired output.
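A practical consequence of this completion behavior is that free-text model outputs must be post-processed to recover a discrete answer choice before scoring. The helper below is a minimal sketch of such an extraction step, written by us for illustration; it is not part of the study’s published code.

```python
import re

def extract_choice(response: str) -> str | None:
    """Pull the first answer-option letter (A-H) that the model commits
    to, e.g. from 'The correct answer is C) ...' or 'Answer: B'.
    Returns None when no unambiguous choice is found.
    (Illustrative helper only; not the study's published code.)"""
    match = re.search(r"(?:answer(?:\s+is)?[:\s]+)\(?([A-H])\)?\b",
                      response, flags=re.IGNORECASE)
    if match:
        return match.group(1).upper()
    # Fall back to a bare leading option letter such as "C) Delirium".
    match = re.match(r"\s*\(?([A-H])\)", response)
    return match.group(1) if match else None

print(extract_choice("The correct answer is C) Sinusitis."))  # -> C
print(extract_choice("B) Dislodged otoliths"))                # -> B
```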
One potential use case to highlight for ChatGPT is as an adjunct to, or surrogate for, small (peer) group education. Small group education has been shown to be a highly efficacious method of teaching; a dialogic agent such as ChatGPT could supply the on-demand explanation, justification, and follow-up discussion that make such sessions effective when peers or facilitators are unavailable.
This study has several limitations. First, ChatGPT was trained on a corpus created from data produced on or before 2021, which limits the model’s knowledge to information available before that date. Second, due to the closed nature of the model and the lack of a public API, we were unable to fine-tune it on task-specific data or examine the breadth of its inherent stochasticity. However, this work investigates ChatGPT’s in-context performance on USMLE Step 1 and 2 exams, so these limitations did not hamper our analysis. Third, ChatGPT receives regular updates, believed to be the result of training on inputs as they are provided by users; the version of ChatGPT used in this study was not the most up-to-date model at the time of publication. That said, it is reasonable to hypothesize that each new iteration of the model will not significantly reduce performance on the task we have outlined and, in fact, may improve it.
In conclusion, our results suggest that ChatGPT performs at a level expected of a third-year medical student on the assessment of the primary competency of medical knowledge. Furthermore, the tool has potential as an innovation within the context of small group education in medicine (such as for problem-based learning or clinical problem-solving). ChatGPT’s responses to questions provide interpretable context to justify the written response of the model in most cases, suggesting narrative coherence. By providing sufficiently accurate dialogic responses akin to human learners, the model may facilitate the creation of an on-demand, interactive learning environment for students, which has the potential to support problem-solving and externally supported reflective practice.
To provide additional value to the reader as to the functioning of ChatGPT, a second conclusion to this paper was written by the tool using the prompt “Write a conclusion for the following manuscript:” with the entire text of the manuscript up to but excluding the handwritten conclusion included in the prompt:
In conclusion, this study aimed to assess the performance of the ChatGPT language model on examinations which evaluate medical knowledge and reasoning, as well as its ability to serve as a virtual medical tutor. The study used two pairs of datasets, AMBOSS-Step1 and AMBOSS-Step2, and NBME-Free-Step1 and NBME-Free-Step2, to examine ChatGPT’s understanding of medical knowledge related to USMLE Step 1 and Step 2 Clinical Knowledge exams. The results of the study showed that ChatGPT’s performance on the medical knowledge examination was superior to that of GPT-3 and InstructGPT, and that the additional context provided by the question tips improved its accuracy. The qualitative analysis further showed that ChatGPT’s responses consistently included logical justification and information internal to the question, supporting its potential to serve as an interactive tool in medical education.
Multimedia Appendix 1. Spreadsheet of all questions, annotations, and ChatGPT responses for all 4 data sets.
API: application programming interface
ChatGPT: Chat Generative Pre-trained Transformer
LLM: large language model
NBME: National Board of Medical Examiners
USMLE: United States Medical Licensing Examination
This publication was made possible by the Yale School of Medicine Fellowship for Medical Student Research. Research reported in this publication was supported by the National Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes of Health under award number T35DK104689. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
The data sets analyzed during this study are available in Multimedia Appendix 1.
None declared.