JMIR Medical Education (JMIR Med Educ), ISSN 2369-3762

JMIR Publications, Toronto, Canada

Volume 11, Issue 1, e65108

doi: 10.2196/65108

Original Paper

ChatGPT’s Performance on Portuguese Medical Examination Questions: Comparative Analysis of ChatGPT-3.5 Turbo and ChatGPT-4o Mini

Filipe Prazeres, MD, MSc, PhD (1, 2, 3)

1. Faculty of Health Sciences, University of Beira Interior, Av. Infante D. Henrique, Covilhã, Portugal

2. Family Health Unit Beira Ria, Gafanha da Nazaré, Portugal

3. CINTESIS@RISE, Department of Community Medicine, Information and Health Decision Sciences, Faculty of Medicine of the University of Porto, Porto, Portugal

Blake Lesselroth

Bill Thies

Luciano Antonio Digiampietri

Raffaele Pellegrino

Correspondence to Filipe Prazeres, MD, MSc, PhD, Faculty of Health Sciences, University of Beira Interior, Av. Infante D. Henrique, Covilhã, 6201-506, Portugal, 351 234393150; filipeprazeresmd@gmail.com

Received: 5 August 2024. Revised: 30 November 2024. Accepted: 12 December 2024. Published: 5 March 2025 (e65108).

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Education, is properly cited. The complete bibliographic information, a link to the original publication on https://mededu.jmir.org/, as well as this copyright and license information must be included.

Background

Advancements in ChatGPT are transforming medical education by providing new tools for assessment and learning, potentially enhancing evaluations for doctors and improving instructional effectiveness.

Objective

This study evaluates the performance and consistency of ChatGPT-3.5 Turbo and ChatGPT-4o mini in solving European Portuguese medical examination questions (2023 National Examination for Access to Specialized Training; Prova Nacional de Acesso à Formação Especializada [PNA]) and compares their performance to human candidates.

Methods

ChatGPT-3.5 Turbo was tested on the first part of the examination (74 questions) on July 18, 2024, and ChatGPT-4o mini on the second part (74 questions) on July 19, 2024. Each model generated an answer using its natural language processing capabilities. To test consistency, each model was asked, “Are you sure?” after providing an answer. Differences between the first and second responses of each model were analyzed using the McNemar test with continuity correction. A one-sample t test compared the models’ performance to that of human candidates. Frequencies and percentages were used for categorical variables, and means and CIs for numerical variables. Statistical significance was set at P<.05.

Results

ChatGPT-4o mini achieved an accuracy rate of 65% (48/74) on the 2023 PNA examination, surpassing ChatGPT-3.5 Turbo. ChatGPT-4o mini outperformed medical candidates, while ChatGPT-3.5 Turbo had a more moderate performance.

Conclusions

This study highlights the advancements and potential of ChatGPT models in medical education, emphasizing the need for careful implementation with teacher oversight and further research.

Keywords: ChatGPT-3.5 Turbo; ChatGPT-4o mini; medical examination; European Portuguese; AI performance evaluation; Portuguese; evaluation; medical examination questions; examination question; chatbot; ChatGPT; model; artificial intelligence; AI; GPT; LLM; NLP; natural language processing; machine learning; large language model

Introduction

Generative artificial intelligence (AI) represents a branch of AI dedicated to the development of systems that can autonomously generate high-quality digital content on demand, and it can do so across various modalities, such as written text, images, audio, and video [1-3]. Generative AI tools are trained on large datasets, enabling them to produce work that mirrors human-created content [2]. Nowadays, there are several examples of generative AI tools, including ChatGPT (OpenAI Inc), Runway, Gemini (Google Inc), DALL-E (OpenAI Inc), Copilot (Microsoft Inc), Midjourney, NovelAI (Anlatan), Claude (Anthropic), and Jasper AI, among others. ChatGPT, the large language model (LLM) chatbot, developed by OpenAI [4], that started the AI boom in November 2022, became the most popular AI tool of 2023, accounting for over 60.2% of visits between September 2022 and August 2023, with a total of 14.6 billion website visits [5]. ChatGPT’s availability as a free-to-use, low-bandwidth service may reduce disparities compared to paid versions or models by making advanced AI technology accessible to a broader and more diverse global population [6], contributing to making it the most popular generative AI tool [7].

Recent literature reviews regarding AI have shown that this type of technology has potential applications in several fields, spanning from the architecture, engineering, and construction industry to health care [8-11]. The possible applications in medicine are substantial, ranging from diagnostic and treatment support (eg, clinical imaging improvement, classification of diseases, prediction of disease onset, development of treatment, and medication prescriptions) [12] to facilitating communication and engagement between medical professionals and their patients [13] and improving medical education and its accessibility [10,14,15]. For example, ChatGPT can be used as a study tool to clearly explain complex medical concepts [16,17] (eg, radiology reports [18]), create memory aids for challenging topics, clarify medical practice questions, summarize research articles, compile lists of differential diagnoses [17], generate medical examination questions [19], and simulate physician-patient interactions [14].

Medical written examinations are an important part of evaluating the competence and knowledge of medical students and graduates (eg, for physicians’ access to specialized training, as is the case in Portugal). These examinations not only test factual knowledge but also evaluate the critical thinking and problem-solving skills of the candidates. With the recent growing interest in AI, an important question arises: Can AI, specifically ChatGPT, perform at a level comparable to human candidates in medical written examinations? By evaluating ChatGPT’s ability to correctly answer medical questions, its medical proficiency and its potential role as an educational tool can be assessed. Successfully completing this task can demonstrate ChatGPT’s capability to serve as a resource for medical students by providing continuous access to information, particularly benefiting students in remote or under-resourced areas [6].

ChatGPT is known for performing near the passing threshold of 60% accuracy on the United States Medical Licensing Examination (USMLE) [20] and for having knowledge approximately equivalent to that of a third-year medical student [21]. ChatGPT’s performance on medical examinations has been analyzed across different countries and questions. A 2023 systematic review with a meta-analysis of 19 articles found a mean performance of ChatGPT of around 61% [22], and a more recent review published in 2024 concluded that, despite ChatGPT’s satisfactory performance in examinations, further studies are necessary to fully explore its potential in medical education [23].

Furthermore, ChatGPT struggles with non-English language assessments, possibly due to a limited understanding of linguistic nuances and to Western-centric internet data, which may not fully represent the clinical and disease differences in some populations, such as those of African and Asian countries [24]. More studies in other languages are therefore warranted to better understand ChatGPT’s accuracy in diverse cultural contexts. For example, ChatGPT scored considerably lower on a medical examination in Chinese (45.8% correct answers on the Chinese National Medical Licensing Examination) [25], and even lower on the French examination, with 22% correct answers [26].

In July 2024, OpenAI launched GPT-4o mini, a smaller version of its latest GPT-4o (“o” for “omni”) AI language model. This new model replaced GPT-3.5 Turbo in ChatGPT, making this an ideal time to study the performance of both free models in resolving written medical examinations.

This study aims to evaluate the performance and consistency of 2 AI models, ChatGPT-3.5 Turbo and ChatGPT-4o mini, in solving the questions of a non-English language (European Portuguese) written medical examination, with a format of multiple-choice with one best answer—the 2023 National Examination for Access to Specialized Training (Prova Nacional de Acesso à Formação Especializada [PNA])—and compare their performance to that of human candidates.

Methods

Study Design

The PNA examination is part of the requirements for entering specialized medical training in Portugal. Its purpose is to rank candidates for accessing specialized training vacancies, so no minimum passing grade is needed [27].

The PNA questions used in this study were from the actual 2023 Portuguese PNA examination, which is publicly available on the web [27]. This examination includes 150 questions with 5 multiple-choice answers each, with only a single best answer, similar to the USMLE. The questions are based on clinical vignettes and divided into 2 parts with 75 questions each. The examination emphasizes clinical reasoning and the application and integration of clinical knowledge and is scored on a scale from 0 to 150 points, with no penalties for blank or incorrect answers. It covers various medical disciplines, including medicine, surgery, pediatrics, gynecology and obstetrics, and psychiatry. The examination duration is 240 minutes, divided into 2 parts of 120 minutes each [27].

ChatGPT-3.5 Turbo was provided with the first part of the examination (74 non-image-based multiple-choice questions [MCQs]) on July 18, 2024, and ChatGPT-4o mini with the second part of the examination (74 non-image-based MCQs) on July 19, 2024. The questions were entered into the models in European Portuguese and in a format similar to how they are presented to human candidates, and each model was requested to provide a single-letter answer, just like human candidates. For each question, the models generated an answer using their natural language processing capabilities. Following each model’s response, the follow-up question “Are you sure?” was asked to test for consistency; this technique was previously used by Brin et al [28]. An example of the input format of the questions and the respective responses by ChatGPT in European Portuguese is depicted in Table 1, with corresponding translations to English performed by ChatGPT-4o mini. Each question was addressed in a new chat session to reduce potential memory retention bias in ChatGPT.

Table 1. Example of the input format of the questions and the respective responses by ChatGPT.

Item	Examination question in European Portuguese	Examination question translated to English (performed by ChatGPT-4o mini)
Question	Um homem de 73 anos vem à consulta hospitalar para reavaliação de doença pulmonar obstrutiva crónica. Na consulta prévia, há seis meses, apresentava-se em estadio GOLD B. Refere agora, desde há três meses, agravamento da dispneia para esforços médios, sem alteração do padrão habitual de tosse ou de expetoração. Nega febre, perda de apetite ou outras queixas de novo. A história médica revela ainda obesidade. A medicação habitual inclui brometo de tiotrópio e salmeterol. É ex-fumador de 40 UMA desde há 10 anos. Os sinais vitais são temperatura 36 °C, frequência respiratória 18/min, frequência cardíaca 78/min e pressão arterial 115/89 mm Hg; SpO2 94% (ar ambiente). Ao exame físico apresenta cianose labial, com aparência confortável e atrofia muscular na área temporal. A auscultação pulmonar revela crepitações raras dispersas, sem prolongamento do tempo expiratório. O restante exame físico, incluindo a auscultação cardíaca, encontra-se dentro dos parâmetros de normalidade. Qual das seguintes alternativas mais provavelmente reduz o risco de morte relacionada com esta doença? Azitromicina. Montelucaste. Oxigenoterapia. Prednisolona. Teofilina. Indique apenas a letra da resposta, sem justificação.	A 73-year-old man comes to the hospital for reevaluation of chronic obstructive pulmonary disease (COPD). At the previous visit six months ago, he was in GOLD stage B. He now reports worsening shortness of breath with moderate exertion over the past three months, without changes in his usual cough or sputum production. He denies fever, loss of appetite, or any new complaints. His medical history also reveals obesity. His usual medications include tiotropium bromide and salmeterol. He is a former smoker with a 40 pack-year history, quit 10 years ago. Vital signs are temperature 36 °C, respiratory rate 18/min, heart rate 78/min, and blood pressure 115/89 mm Hg; SpO2 94% (room air). On physical examination, he shows lip cyanosis, appears comfortable, and has muscle atrophy in the temporal area. Lung auscultation reveals scattered rare crackles, with no prolonged expiratory phase. The rest of the physical examination, including cardiac auscultation, is within normal limits. Which of the following is most likely to reduce the risk of death related to this condition? Azithromycin. Montelukast. Oxygen therapy. Prednisolone. Theophylline. Please indicate only the letter of the answer, without justification.
ChatGPT-3.5 Turbo answer	A resposta correta é: (C) Oxigenoterapia.	The correct answer is: (C) Oxygen therapy.
Follow-up question	Tem a certeza?	Are you sure?
ChatGPT-3.5 Turbo answer	Peço desculpa pelo erro anterior. Vamos rever a situação: Dadas as características do paciente com DPOC grave e hipoxemia (SpO2 de 94% em ar ambiente), a intervenção que mais provavelmente reduzirá o risco de morte relacionada com esta doença é a oxigenoterapia. Portanto, a resposta correta é: (C) Oxigenoterapia.	I apologize for the previous error. Let’s review the situation: Given the characteristics of the patient with severe COPD and hypoxemia (SpO2 of 94% on room air), the intervention that is most likely to reduce the risk of death related to this condition is oxygen therapy. Therefore, the correct answer is: (C) Oxygen therapy.
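The prompting protocol described in the Study Design (a new chat session per question, a single-letter answer, then the follow-up “Tem a certeza?”) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the author’s tooling: the source does not say how the questions were submitted, so `ChatFn`, `extract_letter`, and `run_question` are hypothetical names, and the model is represented by any callable that maps a conversation history to a reply.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: receives the (role, text) turns so far, returns the reply.
ChatFn = Callable[[List[Tuple[str, str]]], str]


def extract_letter(reply: str) -> Optional[str]:
    """Return the answer letter (A-E) found in a reply, or None if there is none."""
    text = reply.strip()
    # Case 1: the reply is only the letter, e.g. "C", "(C)" or "C."
    bare = re.fullmatch(r"\(?([A-E])\)?[.:]?", text)
    if bare:
        return bare.group(1)
    # Case 2: a parenthesised letter such as "(C) Oxigenoterapia"; keep the last
    # one, because a follow-up reply may restate its final answer at the end.
    found = re.findall(r"\(([A-E])\)", text)
    return found[-1] if found else None


def run_question(
    question: str, chat: ChatFn, follow_up: str = "Tem a certeza?"
) -> Tuple[Optional[str], Optional[str]]:
    """Ask one question in a fresh session, then challenge the answer once."""
    history = [("user", question)]  # a new history stands in for a new chat session
    first = chat(history)
    history += [("assistant", first), ("user", follow_up)]
    second = chat(history)
    return extract_letter(first), extract_letter(second)
```

Comparing the two returned letters with the answer key gives, for each question, one cell of the paired correct/incorrect table that the consistency analysis works on.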

Human candidates’ scores were obtained from the official records of the PNA 2023 examination, which are also publicly available on the web [27].

To compare the mean scores of human candidates to those of the language models, all questions of the PNA 2023 examination had to be answered. Since the examination included 2 questions using images (one in the first part and another one in the second part; both with electrocardiogram strips), these questions were answered by GPT-4o, as it can handle images in addition to text.

Ethical Considerations

This study exclusively used data that had been previously published online and did not involve direct interaction with human participants. As a result, ethical guidelines pertaining to human participants are not applicable.

Statistical Analysis

Analyses were performed using IBM SPSS Statistics (Version 21). The McNemar test [29] with continuity correction [30] was used to determine differences between the first and second responses of ChatGPT-3.5 Turbo and ChatGPT-4o mini. A one-sample t test was used to compare the performance of ChatGPT-3.5 Turbo and ChatGPT-4o mini with that of human candidates. Frequencies and percentages were used for categorical variables and means and CIs for numerical variables. Statistical significance was considered at P<.05.
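The comparison with human candidates treats the LLM score as a fixed reference value and tests whether the mean human score differs from it. A minimal Python sketch of such a one-sample t test is given below. It is not the SPSS procedure used in the study: the function name `one_sample_t` is hypothetical, the scores in the usage example are made up, and the confidence interval uses a normal approximation to the t distribution, which is only reasonable for large samples.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def one_sample_t(
    scores: Sequence[float], reference: float, level: float = 0.95
) -> Tuple[float, float, Tuple[float, float]]:
    """Return (mean difference, t statistic, approximate CI of the difference)."""
    n = len(scores)
    diff = mean(scores) - reference          # human mean minus the fixed LLM score
    se = stdev(scores) / math.sqrt(n)        # standard error of the sample mean
    t_stat = diff / se
    # Assumption: normal quantile in place of the exact t quantile (large n).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, t_stat, (diff - z * se, diff + z * se)


# Made-up scores for illustration only; the real candidate records are in [27].
example_scores = [45, 47, 49, 51, 43]
diff, t_stat, ci = one_sample_t(example_scores, reference=41)
print(f"mean difference={diff:.2f}, t={t_stat:.2f}, CI={ci[0]:.2f} to {ci[1]:.2f}")
```

With the thousands of candidates who sit a national examination, the normal and t quantiles are practically identical, which is why the approximation is acceptable for a sketch.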

Results

Overall Performance and Consistency

In the initial response with ChatGPT-3.5 Turbo, of the 74 questions, 40 (54%) answers were correct and 34 (46%) answers were incorrect. After the follow-up question, “Are you sure?,” the number of correct answers decreased to 28 (38%), while the number of incorrect answers increased to 46 (62%). This change occurred because ChatGPT-3.5 Turbo corrected 12 originally incorrect answers, but also changed 24 originally correct answers to incorrect. This pattern of change approached, but did not reach, significance (χ²₁=3.361, P=.067).

Initially, of the 74 questions, ChatGPT-4o mini produced 48 (65%) correct answers and 26 (35%) incorrect answers. After being asked, “Are you sure?,” the correct answers dropped to 42 (57%), while incorrect answers rose to 32 (43%). This change occurred because ChatGPT-4o mini fixed 12 previously wrong answers but also changed 18 previously correct answers to incorrect. This pattern of change was not statistically significant (χ²₁=0.833, P=.361).
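The two test statistics above can be checked from the discordant counts alone, because the McNemar test with continuity correction uses only the answers that changed: χ² = (|b − c| − 1)² / (b + c), with 1 degree of freedom. The Python sketch below applies this formula to the reported counts. The function name `mcnemar_cc` is hypothetical, and the P value is obtained from the standard identity between the chi-square distribution with 1 degree of freedom and the complementary error function rather than from SPSS.

```python
import math
from typing import Tuple


def mcnemar_cc(b: int, c: int) -> Tuple[float, float]:
    """McNemar chi-square (1 df) with continuity correction, plus its P value.

    b and c are the two discordant counts (correct -> incorrect and
    incorrect -> correct); answers that did not change do not enter the test.
    """
    if b + c == 0:
        raise ValueError("no discordant pairs")
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return chi2, p_value


# Discordant counts reported in the text: (correct -> incorrect, incorrect -> correct)
reported = {"ChatGPT-3.5 Turbo": (24, 12), "ChatGPT-4o mini": (18, 12)}
for model, (to_incorrect, to_correct) in reported.items():
    chi2, p_value = mcnemar_cc(to_incorrect, to_correct)
    print(f"{model}: chi2={chi2:.3f}, P={p_value:.3f}")
```

The same counts also account for the change in correct answers: 40 − 24 + 12 = 28 for ChatGPT-3.5 Turbo and 48 − 18 + 12 = 42 for ChatGPT-4o mini.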

The 2 questions using images (one in the first part and another one in the second part) were answered correctly by GPT-4o.

LLM Chatbot Versus Human

When evaluating AI capabilities in relation to human abilities, LLM responses in part 1 of PNA (74 questions resolved by ChatGPT-3.5 Turbo plus 1 by GPT-4o) showed lower accuracy than human respondents. The human mean score was statistically significantly higher by 6.04 (95% CI 5.65-6.43) than the LLM score of 41 (P<.001).

In part 2 of PNA (74 questions resolved by ChatGPT-4o mini plus 1 question by GPT-4o), the LLM score showed higher accuracy than human respondents. The human mean score was statistically significantly lower by 5.58 (95% CI 5.25-5.90) than the LLM score of 49 (P<.001).
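The part scores and the human means implied by the reported differences follow from simple arithmetic, sketched below in Python. The human means are not stated in the text; they are derived here from the LLM scores and the reported mean differences, and the variable names are illustrative only.

```python
QUESTIONS_PER_PART = 75

# LLM part score = correct text-only answers + the 1 image question answered by GPT-4o
part1_llm = 40 + 1  # ChatGPT-3.5 Turbo plus GPT-4o, reported as 41
part2_llm = 48 + 1  # ChatGPT-4o mini plus GPT-4o, reported as 49

# Derived, not reported: human means implied by the stated mean differences
part1_human = part1_llm + 6.04  # humans scored higher in part 1
part2_human = part2_llm - 5.58  # humans scored lower in part 2

for label, llm, human in [("Part 1", part1_llm, part1_human),
                          ("Part 2", part2_llm, part2_human)]:
    print(f"{label}: LLM {llm}/{QUESTIONS_PER_PART} ({llm / QUESTIONS_PER_PART:.1%}), "
          f"implied human mean {human:.2f} ({human / QUESTIONS_PER_PART:.1%})")
```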

Discussion

Principal Findings

This study analyzes the performance of 2 ChatGPT models (ChatGPT-3.5 Turbo and ChatGPT-4o mini) on the Portuguese medical written examination, the 2023 National Examination for Access to Specialized Training, revealing important differences in accuracy and consistency. Although both ChatGPT-3.5 Turbo and ChatGPT-4o mini answered the majority of the questions correctly, ChatGPT-4o mini achieved a higher accuracy rate of 65% (48/74) compared to ChatGPT-3.5 Turbo’s 54% (40/74), demonstrating a superior capability in handling medical questions. Additionally, ChatGPT-4o mini showed greater consistency in confirming answers, highlighting its reliability. When evaluated against human respondents, ChatGPT-4o mini outperformed the average human accuracy, while ChatGPT-3.5 Turbo fell short.

Strengths

This study stands out for its innovative approach in analyzing the performance of ChatGPT-3.5 Turbo and ChatGPT-4o mini in a medical examination context. It is the first to evaluate these models using an examination conducted in a less commonly studied language, Portuguese, thereby broadening the scope of language-specific AI assessments. By incorporating the actual scores of human candidates for comparison, the study provides a robust benchmark against real-world performance. Furthermore, the research examines the stability of the AI’s answers by asking “Are you sure?” after each answer, offering valuable insights into the consistency of the responses.

Comparison to Prior Work

A recent study evaluated ChatGPT’s performance on medical licensing examinations across multiple countries (United States, Italy, France, Spain, United Kingdom, and India) and found variable accuracy, ranging from 22% on the French examination to 73% on the Italian examination [26]. In this study, ChatGPT answered more than 50% of the Portuguese medical examination questions correctly, positioning it next to the countries with better performance. For example, in a Turkish study, ChatGPT reached 70.9% accuracy in the medical specialty examination [31]. In the Iranian medical licensing examination, ChatGPT answered 68.5% of the questions correctly [32]. In Poland, ChatGPT achieved a 67.1% correct response rate on the Polish medical specialization licensing examination [33].

When analyzing the differences between the 2 ChatGPT versions, ChatGPT-4o mini outperformed ChatGPT-3.5 Turbo in this study: 65% (48/74) vs 54% (40/74) correct response rate. This suggests that advancements in the underlying architecture and training data of ChatGPT-4o mini (knowledge up to October 2023) have improved its capability to understand and respond to medical questions with more accuracy. Previous studies evaluating the performance of different ChatGPT models found that ChatGPT-4 consistently performed better compared to ChatGPT-3.5. For example, ChatGPT-4 outperformed ChatGPT-3.5 on the Polish Medical Final Examination [34], the Spanish Medical Residency Entrance Examination (Médico Interno Residente) [35], the 2023 Japanese Nursing Examination [36], the Peruvian National Licensing Medical Examination (Examen Nacional de Medicina) [37], and in the USMLE soft skill assessments [28], to name a few. Nonetheless, ChatGPT-4 is a paid model and thus not accessible to everyone, which is not the case for the most recent free-to-use ChatGPT-4o mini.

Another important aspect is consistency. The results of this study revealed that ChatGPT-3.5 Turbo was less stable when asked to confirm its original answers. These results are consistent with those of Brin et al [28], who found that ChatGPT-3.5 altered its answers 82.5% of the time in the USMLE assessments. In this study, changing its original answers did not improve ChatGPT-3.5 Turbo’s accuracy. This contrasts with studies on human students, which have shown that changing their answers usually improves their test scores [38]. One can wonder, since the “awareness of what one knows and does not know depends in part on how much one knows” [39], does ChatGPT-3.5 Turbo change its answers because it does not know, or does it simply change answers to satisfy the user when prompted?

When evaluating the AI models against human respondents, it was found that in part 2 of the PNA examination (74 questions resolved by ChatGPT-4o mini plus 1 question by GPT-4o), the LLM outperformed the average accuracy of human participants. In contrast, in part 1 of the PNA examination (74 questions resolved by ChatGPT-3.5 Turbo plus 1 question by GPT-4o), the LLM showed lower accuracy than human respondents. This indicates that while earlier versions, like ChatGPT-3.5 Turbo, may have required a high degree of human oversight, more recent and advanced versions, like ChatGPT-4o mini, have the potential to match or exceed human performance in medical domains. Although no previous studies have analyzed the performance of ChatGPT-4o mini, and no direct comparisons can be made, some studies have already noted that LLMs outperformed human candidates in several medical examinations (eg, the German Medical State Examinations of 2022 [40], part 1 of the Fellowship of the Royal College of Ophthalmologists MCQ examination [41], and the University of Toronto Family Medicine Residency Progress Test [42]).

Limitations

This study has several limitations regarding the performance evaluation of ChatGPT-3.5 Turbo and ChatGPT-4o mini. The analysis was based solely on ChatGPT’s indication of the correct answer, which, while aligning with expectations for human candidates, does not consider other aspects of examination performance. Additionally, the grading did not account for the complexity or length of the questions, providing an incomplete assessment of the models’ performance. Further studies should incorporate a more comprehensive evaluation framework that considers the reasoning process and evaluates performance across a broader range of question types and difficulties.

Future Perspectives

This study highlights the importance of continuous improvement in ChatGPT models to further enhance their reliability and accuracy. The superior performance of ChatGPT-4o mini compared to its predecessor offers promising applications in medical education. Its higher accuracy and consistency suggest that it could serve as an effective tool for training medical students. However, a broader assessment of ChatGPT-4o mini across various tests and real-world scenarios is required, as good performance on a specific test may not indicate abilities for general and reliable medical education usage. Additionally, there are known drawbacks and ethical considerations when using AI applications, including the potential for fabricated, incorrect, or biased information [43]. Other issues include limited training periods and the possibility of providing different answers to the same question depending on how the question is phrased [43]. A recent systematic scoping review by Xu et al [44] advises medical students to use ChatGPT cautiously, cross-checking information with reliable sources and disclosing AI-generated content in their work. Teachers should guide students on the effective and ethical use of ChatGPT, assess its reliability, and explore mixed assessment methods to evaluate student abilities while considering its impact on traditional assignments [44].

Conclusion

On the 2023 Portuguese National Examination for Access to Specialized Training, ChatGPT-4o mini achieved an accuracy rate of 65% (48/74), surpassing ChatGPT-3.5 Turbo. This demonstrates a superior capability in handling medical questions. ChatGPT-4o mini outperformed medical candidates, while ChatGPT-3.5 Turbo had a more moderate performance. This study highlights the advancements and potential of ChatGPT models in medical education, emphasizing the importance of careful implementation with teacher oversight and further research.

Acknowledgments

This study was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia, I.P.) within CINTESIS R&D Unit (reference UIDB/4255/2020) and within the scope of the project RISE, Associated Laboratory (reference LA/P/0053/2020). During the preparation of this manuscript, the author used ChatGPT-4o mini in order to improve the language of the manuscript and correct grammatical errors. After using this tool, the author reviewed and edited the content as needed and takes full responsibility for the content of the publication.

Conflicts of Interest

None declared.

Abbreviations

AI: artificial intelligence

LLM: large language model

MCQ: multiple-choice question

PNA: Prova Nacional de Acesso à Formação Especializada

USMLE: United States Medical Licensing Examination

References

1. Feuerriegel, Hartmann, Janiesch, Zschech. Generative AI. Bus Inf Syst Eng. 2024 Feb;66(1):111-126. doi: 10.1007/s12599-023-00834-7

2. Ramdurai, Adhithya. The impact, advancements and applications of generative AI. Int J Comput Sci Eng. 2023;10(6):1-8. doi: 10.14445/23488387/IJCSE-V10I6P101

3. Cao, Liu. A comprehensive survey of AI-generated content (AIGC): a history of generative AI from GAN to ChatGPT. arXiv. Preprint posted online on Mar 7, 2023. doi: 10.48550/arXiv.2303.04226

4. Introducing ChatGPT. OpenAI. 2022 Nov 30. Accessed 2024-07-30. https://openai.com/index/chatgpt

5. Conte. Ranked: the most popular AI tools. Visual Capitalist. 2024 Jan 24. Accessed 2025-02-19. https://www.visualcapitalist.com/ranked-the-most-popular-ai-tools

6. Wang, Sanders, Liu. ChatGPT: promise and challenges for deployment in low- and middle-income countries. Lancet Reg Health West Pac. 2023 Dec;41:100905. doi: 10.1016/j.lanwpc.2023.100905. PMID: 37731897

7. Aydin, Karaarslan. Is ChatGPT leading generative AI? What is beyond expectations? Acad Platform J Eng Smart Sys. 2023;11(3):118-134. doi: 10.21541/apjess.1293702

8. BuHamdan, Alwisy, Bouferguene. Generative systems in the architecture, engineering and construction industry: a systematic review and analysis. Int J Archit Comput. 2021 Sep;19(3):226-249. doi: 10.1177/1478077120934126

9. Ali, Abdelbaki, Shrestha, Elbasi, Alryalat MAA, Dwivedi. A systematic literature review of artificial intelligence in the healthcare sector: benefits, challenges, methodologies, and functionalities. J Innov Knowl. 2023 Jan;8(1):100333. doi: 10.1016/j.jik.2023.100333

10. Younis, Eisa TAE, Nasser. A systematic review and meta-analysis of artificial intelligence tools in medicine and healthcare: applications, considerations, limitations, motivation and challenges. Diagnostics (Basel). 2024 Jan 4;14(1):109. doi: 10.3390/diagnostics14010109. PMID: 38201418

11. Ruksakulpiwat, Thorngthip, Niyomyart. A systematic review of the application of artificial intelligence in nursing care: where are we, and what’s next? J Multidiscip Healthc. 2024;17:1603-1616. doi: 10.2147/JMDH.S459946. PMID: 38628616

12. Bitkina, Park, Kim. Application of artificial intelligence in medical technologies: a systematic review of main trends. Digit Health. 2023;9. doi: 10.1177/20552076231189331. PMID: 37485326

13. Sedaghat. Early applications of ChatGPT in medical practice, education and research. Clin Med (Lond). 2023 May;23(3):278-279. doi: 10.7861/clinmed.2023-0078. PMID: 37085182

14. Gandomani. ChatGPT in medical education: how we can use in medical education: challenges and opportunities. J Multidiscip Care. 2023. doi: 10.34172/jmdc.1232

15. Sani. Enhancing medical education with ChatGPT: a promising tool for the future. Can J Med. 2024 Apr. doi: 10.33844/cjm.2024.6032

16. Hosseini, Gao, Liebovitz. An exploratory survey about using ChatGPT in education, healthcare, and research. PLoS One. 2023;18(10):e0292216. doi: 10.1371/journal.pone.0292216. PMID: 37796786

17. Guo. Harnessing the power of ChatGPT in medical education. Med Teach. 2023 Sep;45(9):1063. doi: 10.1080/0142159X.2023.2198094. PMID: 37036161

18. Jeblick, Schachtner, Dexl. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Eur Radiol. 2024 May;34(5):2817-2825. doi: 10.1007/s00330-023-10213-1. PMID: 37794249

19. Cheung BHH, Lau GKK, Wong GTC. ChatGPT versus human in generating medical graduate exam multiple choice questions—a multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom). PLoS One. 2023;18(8):e0290691. doi: 10.1371/journal.pone.0290691. PMID: 37643186

20. Kung, Cheatham, Medenilla. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023 Feb;2(2):e0000198. doi: 10.1371/journal.pdig.0000198. PMID: 36812645

21. Gilson, Safranek, Huang. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312. PMID: 36753318

22. Levin, Horesh, Brezinov, Meyer. Performance of ChatGPT in medical examinations: a systematic review and a meta-analysis. BJOG. 2024 Feb;131(3):378-380. doi: 10.1111/1471-0528.17641. PMID: 37604703

23. Sumbal, Amir. Can ChatGPT-3.5 pass a medical exam? A systematic review of ChatGPT’s performance in academic testing. J Med Educ Curric Dev. 2024;11. doi: 10.1177/23821205241238641. PMID: 38487300

24. Cherif, Moussa, Missaoui, Salouage, Mokaddem, Dhahri. Appraisal of ChatGPT’s aptitude for medical education: comparative analysis with third-year medical students in a pulmonology examination. JMIR Med Educ. 2024 Jul 23;10:e52818. doi: 10.2196/52818. PMID: 39042876

25. Wang, Gong, Wang. ChatGPT performs on the Chinese National Medical Licensing Examination. J Med Syst. 2023 Aug 15;47(1):86. doi: 10.1007/s10916-023-01961-0. PMID: 37581690

26. Alfertshofer, Hoch, Funk. Sailing the seven seas: a multinational comparison of ChatGPT’s performance on medical licensing examinations. Ann Biomed Eng. 2024 Jun;52(6):1542-1545. doi: 10.1007/s10439-023-03338-3. PMID: 37553555

27. Prova Nacional de Acesso à Formação Especializada 2023 Perguntas Frequentes. ACSS. 2023. Accessed 2024-08-01. https://www.acss.min-saude.pt/wp-content/uploads/2018/09/FAQ_PNA2023.pdf

28. Brin, Sorin, Vaid. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep. 2023 Oct 1;13(1):16492. doi: 10.1038/s41598-023-43436-9. PMID: 37779171

29. McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika. 1947 Jun;12(2):153-157. doi: 10.1007/BF02295996. PMID: 20254758

30. Edwards. Note on the correction for continuity in testing the significance of the difference between correlated proportions. Psychometrika. 1948 Sep;13(3):185-187. doi: 10.1007/BF02289261. PMID: 18885738

31. Oztermeli. ChatGPT performance in the medical specialty exam: an observational study. Medicine (Baltimore). 2023 Aug 11;102(32):e34673. doi: 10.1097/MD.0000000000034673. PMID: 37565917

32. Ebrahimian, Behnam, Ghayebi, Sobhrakhshankhah. ChatGPT in Iranian medical licensing examination: evaluating the diagnostic accuracy and decision-making capabilities of an AI-based model. BMJ Health Care Inform. 2023 Dec 11;30(1):e100815. doi: 10.1136/bmjhci-2023-100815. PMID: 38081765

33. Wójcik, Rulkiewicz, Pruszczyk, Lisik, Poboży, Domienik-Karłowicz. Reshaping medical education: performance of ChatGPT on a PES medical examination. Cardiol J. 2024;31(3):442-450. doi: 10.5603/cj.97517. PMID: 37830257

34. Rosoł, Gąsior, Łaba, Korzeniewski, Młyńczak. Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination. Sci Rep. 2023 Nov 22;13(1):20512. doi: 10.1038/s41598-023-46995-z. PMID: 37993519

35. Guillen-Grima, Guillen-Aguinaga. Evaluating the efficacy of ChatGPT in navigating the Spanish Medical Residency Entrance Examination (MIR): promising horizons for AI in clinical medicine. Clin Pract. 2023 Nov 20;13(6):1460-1487. doi: 10.3390/clinpract13060130. PMID: 37987431

36. Kaneda, Takahashi, Kaneda. Assessing the performance of GPT-3.5 and GPT-4 on the 2023 Japanese nursing examination. Cureus. 2023 Aug;15(8):e42924. doi: 10.7759/cureus.42924. PMID: 37667724

37. Flores-Cohaila, García-Vicente, Vizcarra-Jiménez. Performance of ChatGPT on the Peruvian National Licensing Medical Examination: cross-sectional study. JMIR Med Educ. 2023 Sep 28;9:e48039. doi: 10.2196/48039. PMID: 37768724

38. Kruger, Wirtz, Miller. Counterfactual thinking and the first instinct fallacy. J Pers Soc Psychol. 2005 May;88(5):725-735. doi: 10.1037/0022-3514.88.5.725. PMID: 15898871

39. Coutinho MVC, Thomas, Fredricks-Lowman, Alkaabi, Couchman. Unskilled and unaware: second-order judgments increase with miscalibration for low performers. Front Psychol. 2024;15:1252520. doi: 10.3389/fpsyg.2024.1252520. PMID: 38952836

40. Roos, Kasapovic, Jansen, Kaczmarczyk. Artificial intelligence in medical education: comparative analysis of ChatGPT, Bing, and medical students in Germany. JMIR Med Educ. 2023 Sep 4;9:e46482. doi: 10.2196/46482. PMID: 37665620

41. Fowler, Pullen, Birkett. Performance of ChatGPT and Bard on the official part 1 FRCOphth practice questions. Br J Ophthalmol. 2024 Oct;108(10):1379-1383. doi: 10.1136/bjo-2023-324091

42. Huang KJQ, Meaney, Kemppainen, Punnett, Leung. Assessment of resident and AI chatbot performance on the University of Toronto Family Medicine Residency Progress Test: comparative study. JMIR Med Educ. 2023 Sep 19;9:e50514. doi: 10.2196/50514. PMID: 37725411

43. Wong RSY, Ming, Raja Ali. The intersection of ChatGPT, clinical medicine, and medical education. JMIR Med Educ. 2023 Nov 21;9:e47274. doi: 10.2196/47274. PMID: 37988149

44. Chen, Miao. Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review. J Educ Eval Health Prof. 2024;21:6. doi: 10.3352/jeehp.2024.21.6. PMID: 38486402