Background: ChatGPT has gained global attention recently owing to its high performance in generating a wide range of information and retrieving any kind of data instantaneously. ChatGPT has also been tested for the United States Medical Licensing Examination (USMLE) and has successfully cleared it. Thus, its usability in medical education is now one of the key discussions worldwide.
Objective: The objective of this study is to evaluate the performance of ChatGPT in medical biochemistry using clinical case vignettes.
Methods: The performance of ChatGPT was evaluated in medical biochemistry using 10 clinical case vignettes. The vignettes were randomly selected and entered into ChatGPT along with the response options. We tested the responses for each clinical case twice. The answers generated by ChatGPT were saved and checked against our reference material.
Results: ChatGPT generated correct answers for 4 questions on the first attempt. For some of the other cases, the responses generated by ChatGPT differed between the first and second attempts. In the second attempt, ChatGPT provided correct answers for 6 questions and incorrect answers for 4 questions out of the 10 cases that were used. Unexpectedly, for case 3, a different answer was obtained on each of multiple attempts. We believe this happened owing to the complexity of the case, which involved addressing various critical medical aspects related to amino acid metabolism in a balanced approach.
Conclusions: According to the findings of our study, ChatGPT may not be considered an accurate information provider for application in medical education to improve learning and assessment. However, our study was limited by a small sample size (10 clinical case vignettes) and the use of the publicly available version of ChatGPT (version 3.5). Although artificial intelligence (AI) has the capability to transform medical education, we emphasize that the data produced by such AI systems should be validated for correctness and dependability before they are implemented in practice.
A new powerful artificial intelligence (AI)–driven large language model called “ChatGPT” has gained increasing attention. Within 3 months of its launch, ChatGPT attracted millions of users with its ability to generate astounding and diverse conversations based on enormous amounts of data, and it achieved milestones by performing well on competitive medical examinations. This impressive conversational chatbot was released by OpenAI (San Francisco, California) on November 30, 2022, and is currently funded by Microsoft and others; it has significantly impacted the field of education. However, educators globally have had conflicting reactions to ChatGPT’s capacity to perform difficult tasks in education because this development in AI appears set to completely transform current educational practices.
In the medical science context, ChatGPT is believed to be able to reshape medical education, research, and clinical decision management by rapidly creating learning content, providing quick access to information, and creating personalized learning experiences. Recently, ChatGPT also cleared the United States Medical Licensing Examination (USMLE) with an acceptable score, thus reinforcing the usability of such AI models to enhance medical education. However, literature about the performance of ChatGPT in biochemistry and its ability to interpret clinical conditions and provide valuable contributions to medical education is lacking. Therefore, we aimed to assess the diagnostic and interpretation ability of ChatGPT using clinical case vignettes in medical biochemistry.
ChatGPT’s performance was evaluated in clinical biochemistry using 10 clinical case vignettes. We used ChatGPT version 3.5 without the Plus subscription. The 10 clinical case vignettes in medical biochemistry were randomly selected from Biochemistry and Genetics PreTest™ Self-Assessment and Review, Third Edition, in which the correct answers and explanations are also available; this book was used as the reference material to evaluate ChatGPT-generated answers. All clinical case vignettes were in the format of clinical case–based multiple-choice questions and were chosen from chapters on carbohydrate metabolism, lipid metabolism, amino acid metabolism, heme metabolism, and acid-base equilibria. All vignettes were typed into ChatGPT’s input field exactly as they appear in the reference material, with the same options. ChatGPT-generated responses were saved and documented. For all 10 clinical cases, ChatGPT chose 1 option from the multiple choices and provided an explanation for the answer. The correctness of ChatGPT-generated answers and explanations was checked against the answers and explanations provided in the reference material by 2 expert faculty members (with postgraduate qualifications and considerable teaching experience in medical biochemistry), who worked independently to avoid bias. All the answers provided in the reference material were cross-referenced with standard biochemistry textbooks, including Harper's Illustrated Biochemistry (31st edition) and Lippincott Illustrated Reviews: Biochemistry. The vignettes were numbered 1 through 10. All the answers were rechecked twice by typing the same question and regenerating the responses. However, while conducting this study, ChatGPT was not informed about the incorrect responses it had generated, although it is considered standard practice to give chatbots an opportunity to acknowledge their errors.
ChatGPT was only used to obtain the responses for the clinical case vignettes; it was not used to write any part of the manuscript.
The weightage of the clinical cases by chapter is shown in the following table.
|Chapter||Weightage, %||Case numbers|
|Carbohydrate metabolism||20||1 and 6|
|Lipid metabolism||30||4, 8, and 9|
|Amino acid metabolism||20||3 and 7|
|Heme metabolism||10||10|
|Acid-base equilibria||20||2 and 5|
In the first attempt, upon evaluating the answers using our reference material, ChatGPT provided correct answers for 4 of the 10 questions and incorrect answers for 6. ChatGPT-generated answers matched our answer key for 4 questions (cases 4, 6, 7, and 10), and the explanations provided were also in accordance with those in our reference material. There were discrepancies between ChatGPT-generated answers and the original answer key for 6 questions (cases 1, 2, 3, 5, 8, and 9). In the second attempt, ChatGPT provided correct answers for 6 questions and incorrect answers for 4 questions out of the 10 cases used. Questions for which a correct answer was generated in the first attempt had the same correct answer in the second attempt (cases 4, 6, 7, and 10). Of the other 6 questions, for which ChatGPT generated incorrect answers in the first attempt, 2 received correct answers in the second attempt in accordance with our reference material (cases 5 and 9). Three of the questions answered incorrectly in the first attempt again had the same incorrect answers in the second attempt (cases 1, 2, and 8). Surprisingly, in 1 case (case 3), a different answer was obtained on each attempt. This could be attributed to the complexity of the case scenario, stemming from the need to address multiple critical medical facets of amino acid metabolism; this case required a delicate balance of clinical knowledge, surgical expertise, understanding of neonatal nutrition, and awareness of amino acid essentiality to ensure the best treatment outcome. The clinical cases used are listed below, followed by a table of the answers in our reference material and the ChatGPT-generated answers. Screenshots of the answers generated by ChatGPT in the first attempt, and of the differing answers obtained across multiple attempts for case 3, are provided as supplementary PNG files.
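The attempt-by-attempt counts reported above can be re-derived with a short tally. The following Python sketch is illustrative only and is not part of the study's method: the variable and function names are hypothetical, the case sets are transcribed from the paragraph above, and case 3 is assumed to count as incorrect in the second attempt, which is consistent with the reported total of 4 incorrect answers.

```python
# Per-case correctness transcribed from the Results text.
# Assumption: case 3 (different answer on every attempt) counts as incorrect.
ALL_CASES = set(range(1, 11))
FIRST_ATTEMPT_CORRECT = {4, 6, 7, 10}
SECOND_ATTEMPT_CORRECT = {4, 5, 6, 7, 9, 10}


def summarize(correct_cases, all_cases=ALL_CASES):
    """Return (number correct, number incorrect, accuracy in percent)."""
    n_correct = len(correct_cases & all_cases)
    n_total = len(all_cases)
    return n_correct, n_total - n_correct, 100.0 * n_correct / n_total


first = summarize(FIRST_ATTEMPT_CORRECT)
second = summarize(SECOND_ATTEMPT_CORRECT)

# Cases answered incorrectly at first and correctly on regeneration.
improved = sorted(SECOND_ATTEMPT_CORRECT - FIRST_ATTEMPT_CORRECT)
# Cases not answered correctly in either of the two attempts.
never_correct = sorted(ALL_CASES - FIRST_ATTEMPT_CORRECT - SECOND_ATTEMPT_CORRECT)

print(f"First attempt:  {first[0]} correct, {first[1]} incorrect ({first[2]:.0f}%)")
print(f"Second attempt: {second[0]} correct, {second[1]} incorrect ({second[2]:.0f}%)")
print(f"Improved on regeneration: cases {improved}")
print(f"Never correct: cases {never_correct}")
```

With only 10 vignettes, each case shifts the accuracy by 10 percentage points, so the difference between the two attempts rests on just 2 cases.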
Clinical case 1
A teenager is brought in by his parents after his physical education teacher gives him a failing grade. The teacher has scolded him for malingering because he drops out of activities after a few minutes of exercise, complaining of leg cramps and fatigue. A stress test is arranged with sampling of blood metabolites and monitoring of exercise performance. Which of the following results after exercise would support a diagnosis of glycogen storage disease in this teenager?
- Increased oxalate, decreased glucose
- Increased glycerol and glucose
- Increased lactate and glucose
- Increased pyruvate and stable glucose
- Stable lactate and glucose
Clinical case 2
A male infant does well in the nursery but seems to have a reaction to cereal introduced at age 6 weeks. The infant begins vomiting severely, often spewing vomitus across the crib (projectile vomiting). Concern about food allergy persists until an experienced surgeon sits with her hand over the infant's stomach for 20 minutes at the bedside, feeling a small oval shape that has been described as an olive. The surgeon obtains electrolytes and blood gases preparatory to anesthesia. Which of the combinations of laboratory results below and their interpretation is most likely for this infant?
- Low Pco2, normal bicarbonate, normal chloride, high pH – pure respiratory alkalosis
- Low Pco2, low bicarbonate, low pH, low chloride – compensated metabolic acidosis
- Normal Pco2, low bicarbonate, low pH, normal chloride – pure metabolic acidosis
- High Pco2, normal bicarbonate, low pH, normal chloride – pure respiratory acidosis
- Normal Pco2, high bicarbonate, high pH, low chloride – pure metabolic alkalosis
Clinical case 3
A newborn with meconium ileus (plugging of the small intestine with meconium or fetal stool) is found to have air in the bowel wall (pneumatosis intestinalis) and free air in the abdomen. Antibiotics are begun for suspected peritonitis, and emergency surgery is performed to remove the diseased intestinal segment and heal the intestinal perforation that led to air in the abdomen. Because the gut must be kept at rest for healing, meconium peritonitis was usually fatal until parenteral alimentation solutions were developed. Hyperalimentation consists of essential amino acids and other metabolites that provide a positive calorie balance while keeping the bowel at rest. The alimentation solution must be kept to a minimum of metabolites because of its high osmotic load, which necessitates frequent changing of intravenous sites or catheterization of a large vein. Which of the following amino acids could be excluded from the alimentation solution?
Clinical case 4
A 2-year-old girl has been healthy until the past weekend, when she contracted a viral illness at day care with vomiting, diarrhea, and progressive lethargy. She presents to the office on Monday with disorientation, a barely rousable sensorium, cracked lips, sunken eyes, lack of tears, flaccid skin with “tenting” on pinching, weak pulse with low blood pressure, and increased deep tendon reflexes. Laboratory tests show low blood glucose, normal electrolytes, elevated liver enzymes, and (on chest x-ray) a dilated heart. Urinalysis reveals no infection and no ketones. The child is hospitalized and stabilized with 10% glucose infusion, and certain admission laboratories come back 1 week later showing elevated medium chain fatty acyl carnitines in blood and 6- to 8-carbon dicarboxylic acids in the urine. The most likely disorder in this child involves which of the following?
- Defect of medium chain coenzyme A dehydrogenase
- Defect of medium chain fatty acid synthetase
- Mitochondrial defect in the electron transport chain
- Mitochondrial defect in fatty acid transport
- Carnitine deficiency
Clinical case 5
A 2-day-old neonate becomes lethargic and uninterested in breastfeeding. Physical examination reveals hypotonia (low muscle tone), muscle twitching that suggests seizures, and tachypnea (rapid breathing). The child has a normal heartbeat and breath sounds with no indication of cardiorespiratory disease. Initial blood chemistry values include normal glucose, sodium, potassium, chloride, and bicarbonate (HCO3-) levels; initial blood gas values reveal a pH of 7.53, partial pressure of oxygen (PO2) normal at 103 mmHg, and partial pressure of carbon dioxide (PCO2) decreased at 27 mmHg. Which of the following treatment strategies is most appropriate?
- Administer alkali to treat metabolic acidosis
- Administer alkali to treat respiratory acidosis
- Decrease the respiratory rate to treat metabolic acidosis
- Decrease the respiratory rate to treat respiratory alkalosis
- Administer acid to treat metabolic alkalosis
Clinical case 6
After a term uncomplicated gestation, normal delivery, and unremarkable nursery stay, a 10-day-old female is readmitted to the hospital because of poor feeding, weight loss, and rapid heart rate. Antibiotics are started as a precaution against sepsis, and initial testing indicates an unusual electrocardiogram with a very short PR interval and a large heart on x-ray. Initial concern about a cardiac arrhythmia changes when a large tongue is noted, causing concern about glycogen storage disease type 2 (Pompe disease; 232300). Which of the following best explains why Pompe disease is more severe and lethal compared to other glycogen storage diseases?
- The deficiency is a degradative rather than synthetic enzyme
- The deficiency involves a liver enzyme
- The deficiency involves a lysosomal enzyme
- The deficiency causes associated neutropenia
- The deficiency involves a serum enzyme
Clinical case 7
An adolescent female develops hemiballismus (repetitive throwing motion of the arms) after anesthesia for a routine operation. She is tall and lanky, and it is noted that she and her sister both had previous operations for dislocated lenses of the eyes. The symptoms are suspicious for the disease homocystinuria (236300). Which of the following statements is descriptive of this disease?
- Patients may be treated with dietary supplements of vitamin B12
- Patients may be treated with dietary supplements of vitamin C
- There is deficient excretion of homocysteine
- There is increased excretion of cysteine
- There is a defect in the ability to form cystathionine from homocysteine and serine
Clinical case 8
Children with very long or long chain fatty acid oxidation disorders are severely affected from birth, while those with short or medium chain oxidation defects may be asymptomatic until they have an intercurrent illness that causes prolonged fasting. The severe symptoms of longer chain diseases are best explained by which of the following statements?
- Longer chain fatty acids inhibit gluconeogenesis and deplete serum glucose needed for brain metabolism
- Glycogen is the main fuel reserve of the body but is quickly depleted with fasting
- Starch is an important source of glucose and is inhibited by high fatty acid concentration
- Triacylglycerols are the main fuel reserve of the body and are needed for energy production in actively metabolizing tissues
- Longer chain fatty acids form micelles and block synapses
Clinical case 9
A 45-year-old man is found to have an elevated serum cholesterol of 300 mg percent measured by standard conditions after a 12-hour fast. Which of the following lipoproteins would contribute to a measurement of plasma cholesterol in a normal person following a 12-hour fast?
- Very-low-density lipoprotein (VLDL) and low-density lipoproteins (LDL)
- High-density lipoproteins (HDL) and low-density lipoproteins (LDL)
- Chylomicrons and very-low-density lipoproteins (VLDL)
- Chylomicron remnants and very-low-density lipoproteins (VLDL)
- Low-density lipoproteins (LDL) and adipocyte lipid droplets
Clinical case 10
A 35-year-old man presents to the emergency room with an acute abdomen (severe abdominal pain with tightness of muscles, decreased bowel sounds, and vomiting and/or diarrhea). He has been drinking, and a urine sample is unusual because it has a port-wine color. Past history indicates several prior evaluations for abdominal pain, including an appendectomy. The physician notes unusual neurological symptoms with partial paralysis of his arms and legs. At first concerned about food poisons like botulism, the physician recalls that acute intermittent porphyria (176000) may cause these symptoms and consults a gastroenterologist. Elevation of which of the following urinary metabolites would support a diagnosis of porphyria?
- Urobilinogen and bilirubin
- Delta-aminolevulinic acid and porphobilinogen
- Biliverdin and stercobilin
- Urobilin and urobilinogen
- Delta-aminolevulinic acid and urobilinogen
|Clinical case number||Answer in reference material (a)||Answer generated by ChatGPT, first attempt||Answer generated by ChatGPT, second attempt||Correctness of the answer generated by ChatGPT|
|3||A||B||Second attempt: C; third attempt: E; fourth attempt: none||Different answers in multiple attempts|
|5||D||None||D||First attempt: incorrect; second attempt: correct|
|9||B||A||B||First attempt: incorrect; second attempt: correct|
(a) Response options indicated as A through E.
Our evaluation of ChatGPT’s performance in medical biochemistry yielded average results. ChatGPT’s performance cannot be regarded as high owing to numerous discrepancies between ChatGPT-generated answers and the original answer key. Also, the difference between ChatGPT-chosen options in the first and subsequent attempts indicates that as the complexity of the content increased, the precision of the generated answers decreased, emphasizing the need to verify the answers generated by this chatbot before its implementation. Hence, validating the information generated is crucial before we can completely rely on such AI-powered tools.
Large language models such as ChatGPT may enhance student engagement and support web-based learning by generating pertinent and comprehensive content. Assessment of ChatGPT’s knowledge of microbiology in competency-based medical education provided impressive results, with an 80% accuracy rate in answering first-order and second-order knowledge questions. ChatGPT also performed well in diagnosing and interpreting a case scenario in clinical toxicology. However, medicine functions beyond the capacity to provide a correct diagnosis and relevant information. ChatGPT cannot replace the human ability to elicit a history and take prompt action.
ChatGPT’s acceptance as an effective learning tool in medical education is still debated. On comparing the knowledge and interpretation skills of medical students and ChatGPT in a parasitology examination, the correctness of answers and acceptability of explanations were lower for ChatGPT-generated responses than for medical students’ answers. In the context of the development of medical education curricula, the performance of ChatGPT in outlining content for sessions on lipid metabolism and generating learning objectives and evaluation questions was not highly commendable, indicating the need to verify the information and beware of misleading or incorrect information that could be generated by these AI tools.
Thus, the variability in ChatGPT’s performance across medical sciences is a major limitation for AI to be accepted as a productive learning platform for students and educators and to be successfully used to reframe medical education and research. However, ChatGPT is certainly a highly beneficial asset that can be used to achieve several milestones if used with caution and proper authentication. Thus, more studies should focus on testing ChatGPT in various fields of medicine to assess its performance and to frame appropriate regulations for the implementation of AI-based systems in medical education and research.
This study has certain limitations. First, only 10 clinical case vignettes were used to assess ChatGPT’s potential in solving them. Owing to the small sample size, larger and more detailed studies would be required to confirm and generalize the findings of this study. Further, only the publicly available version of ChatGPT (version 3.5) was used. Thus, the findings on ChatGPT’s performance and the quality of its responses are limited to this version.
This study analyzed the performance of ChatGPT in medical biochemistry using clinical case vignettes. The results of this study indicate that before we use the content generated by AI innovations such as ChatGPT, it is important to assess the reliability and accuracy of the information provided. As huge amounts of data are handled by AI tools, misinformation and disinformation are commonly encountered issues. However, ChatGPT has a high potential to enhance teaching, learning, and assessment strategies in the field of medical education. Although AI cannot replace humans, chatbots such as ChatGPT have good prospects for advancing medical education under expert surveillance. As this is a rapidly advancing field, newer and upgraded versions can be expected to be released with higher accuracy and fewer errors. Hence, the scope of future research should be widened with the aim of establishing the validity and reliability of AI-generated content. Once this is achieved, ChatGPT will have the potential to emerge as a rapid and efficient information-generating tool that can transform the medical education system.
The author would like to thank Dr Golder N Wilson, the author of the book Biochemistry and Genetics PreTest™ Self-Assessment and Review, Third Edition (2007), for granting permission to use the clinical cases provided in the book for this study and to generate the responses to the case vignettes in ChatGPT. The author also extends gratitude to OpenAI, a US-based artificial intelligence research laboratory, for providing free access to ChatGPT.
The data that support this study are available upon request from the corresponding author.
Conflicts of Interest
Case 1 ChatGPT performance. PNG File, 612 KB
Case 2 ChatGPT performance. PNG File, 996 KB
Case 3 ChatGPT performance. PNG File, 628 KB
Case 4 ChatGPT performance. PNG File, 647 KB
Case 5 ChatGPT performance. PNG File, 490 KB
Case 6 ChatGPT performance. PNG File, 433 KB
Case 7 ChatGPT performance. PNG File, 381 KB
Case 8 ChatGPT performance. PNG File, 495 KB
Case 9 ChatGPT performance. PNG File, 346 KB
Case 10 ChatGPT performance. PNG File, 320 KB
Case 3 ChatGPT performance – 2nd attempt. PNG File, 694 KB
Case 3 ChatGPT performance – 3rd attempt. PNG File, 717 KB
Case 3 ChatGPT performance – 4th attempt. PNG File, 829 KB
Case 3 ChatGPT performance – 4th attempt (Contd...). PNG File, 418 KB
- Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ 2023 Feb 08;9:e45312 [https://mededu.jmir.org/2023//e45312/]
- Helberger N, Diakopoulos N. ChatGPT and the AI Act. Internet Policy Rev 2023;12(1) [https://doi.org/10.14763/2023.1.1682]
- Kurian N, Cherian JM, Sudharson NA, Varghese KG, Wadhwa S. AI is now everywhere. Br Dent J 2023 Jan;234(2):72 [https://doi.org/10.1038/s41415-023-5461-1]
- Baidoo-Anu D, Owusu Ansah L. Education in the era of generative artificial intelligence (AI): understanding the potential benefits of ChatGPT in promoting teaching and learning. SSRN J 2023 [https://ssrn.com/abstract=4337484]
- Khan R, Jawaid M, Khan A, Sajjad M. ChatGPT - Reshaping medical education and clinical management. Pak J Med Sci 2023;39(2):605-607 [https://europepmc.org/abstract/MED/36950398]
- Kung T, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health 2023 Feb;2(2):e0000198 [https://europepmc.org/abstract/MED/36812645]
- Sallam M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare (Basel) 2023 Mar 19;11(6) [https://www.mdpi.com/resolver?pii=healthcare11060887]
- Wilson GN. Biochemistry and Genetics PreTest™ Self-Assessment and Review, Third Edition. New York, NY: McGraw Hill Professional; 2007.
- Rodwell VW, Murray RK. In: Rodwell VW, Bender DA, Botham KM, Kennelly PJ, Weil P, editors. Harper's Illustrated Biochemistry, 31st edition. New York, NY: McGraw Hill; 2018.
- Abali EE, Cline SD, Franklin DS, Viselli SM. Lippincott Illustrated Reviews: Biochemistry. Philadelphia, PA: Wolters Kluwer Health; 2021.
- Lee H. The rise of ChatGPT: Exploring its potential in medical education. Anat Sci Educ 2023 Mar 14:E [https://doi.org/10.1002/ase.2270]
- Das D, Kumar N, Longjam L, Sinha R, Deb Roy A, Mondal H, et al. Assessing the capability of ChatGPT in answering first- and second-order knowledge questions on microbiology as per competency-based medical education curriculum. Cureus 2023 Mar;15(3):e36034 [https://europepmc.org/abstract/MED/37056538]
- Sabry Abdel-Messih M, Kamel Boulos MN. ChatGPT in clinical toxicology. JMIR Med Educ 2023 Mar 08;9:e46876 [https://mededu.jmir.org/2023//e46876/]
- Huh S. Are ChatGPT’s knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination?: a descriptive study. J Educ Eval Health Prof 2023;20:1 [https://europepmc.org/abstract/MED/36627845]
- Han Z, Battaglia F, Udaiyar A, Fooks A, Terlecky S. An explorative assessment of ChatGPT as an aid in medical education: use it with caution. medRxiv. Preprint posted online February 21, 2023
- Cascella M, Montomoli J, Bellini V, Bignami E. Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. J Med Syst 2023 Mar 04;47(1):33 [https://europepmc.org/abstract/MED/36869927]
- Kitamura F. ChatGPT is shaping the future of medical writing but still requires human judgment. Radiology 2023 Apr;307(2):e230171 [https://doi.org/10.1148/radiol.230171]
|AI: artificial intelligence|
|USMLE: United States Medical Licensing Examination|
Edited by G Eysenbach, T de Azevedo Cardoso; submitted 11.03.23; peer-reviewed by B Meskó, R Fatteh, F Tume; comments to author 25.05.23; revised version received 29.05.23; accepted 21.09.23; published 07.11.23
Copyright
©Krishna Mohan Surapaneni. Originally published in JMIR Medical Education (https://mededu.jmir.org), 07.11.2023.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Education, is properly cited. The complete bibliographic information, a link to the original publication on https://mededu.jmir.org/, as well as this copyright and license information must be included.