Published on 18.Sep.2025 in Vol 11 (2025)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/76925.
Performance Evaluation of 18 Generative AI Models (ChatGPT, Gemini, Claude, and Perplexity) in 2024 Japanese Pharmacist Licensing Examination: Comparative Study


1Department of Pharmacy, Abashiri-Kosei General Hospital, Abashiri, Japan

2Graduate School of Pharmacy, Hokkaido University of Science, 7-Jo 15-4-1 Maeda, Teine, Sapporo, Japan

3Graduate School of Health Sciences, Hokkaido University, Sapporo, Japan

4Graduate School of Engineering, Muroran Institute of Technology, Muroran, Japan

Corresponding Author:

Hidehiko Sakurai, PhD


Background: Generative artificial intelligence (AI) has shown rapid advancements and increasing applications in various domains, including health care. Previous studies have evaluated AI performance on medical license examinations, primarily focusing on ChatGPT. However, the availability of new online chat-based large language models (OC-LLMs) and their potential utility in pharmacy licensing examinations remain underexplored. Considering that pharmacists require a broad range of expertise in physics, chemistry, biology, and pharmacology, verifying the knowledge base and problem-solving abilities of these new models in Japanese pharmacy examinations is necessary.

Objective: This study aimed to assess the performance of 18 OC-LLMs released in 2024 in the 107th Japanese National License Examination for Pharmacists (JNLEP). Specifically, the study compared their accuracy and identified areas of improvement relative to earlier models.

Methods: The 107th JNLEP, comprising 345 questions in Japanese, was used as a benchmark. Each OC-LLM was prompted with the original Japanese question text, and images were uploaded where the service permitted. No additional prompt engineering or English translation was performed. For questions that included diagrams or chemical structures, models incapable of image input were scored as incorrect. The model outputs were compared with the publicly available correct answers. Accuracy rates were calculated overall, by subject area (eg, pharmacology and chemistry), and by question type (text-only, diagram-based, calculation, and chemical structure). Fleiss’ κ was used to measure answer consistency among the top-performing models.

Results: Four flagship models (ChatGPT o1, Gemini 2.0 Flash, Claude 3.5 Sonnet [new], and Perplexity Pro) achieved over 80% accuracy, surpassing the official passing threshold and the average examinee score. A significant improvement in overall accuracy was observed between the early and the latest 2024 models. Marked improvements were noted in text-only and diagram-based questions compared with earlier versions. However, accuracy on chemistry-related and chemical structure questions remained relatively low. Fleiss’ κ among the 4 flagship models was 0.334, which suggests moderate consistency but highlights variability in more complex questions.

Conclusions: OC-LLMs have substantially improved their capacity to handle Japanese pharmacists’ examination content, with several newer models achieving accuracy rates of >80%. Despite these advancements, even the best-performing models exhibit an error rate exceeding 10%, underscoring the ongoing need for careful human oversight in clinical settings. Overall, the 107th JNLEP will serve as a valuable benchmark for current and future generative AI evaluations in pharmacy licensing examinations.

JMIR Med Educ 2025;11:e76925

doi:10.2196/76925

Keywords



Generative artificial intelligence (AI) has advanced remarkably in recent years and has been adopted in many fields, including education and health care. Generative AI has been used to summarize clinical texts [1-4] and has been introduced into clinical practice [5,6]. Furthermore, its potential benefits in medical education have been explored [7-10], and its usefulness has been demonstrated in the writing and publishing of medical research [11].

In the United States, generative AI has been implemented in 86% of health care organizations [12]. Moreover, approximately 40% of health care professionals use generative AI at their workplaces at least once a week [13]. Correspondingly, online chat-based large language models (OC-LLMs) have attracted the attention of many users because of their ease of use. In health care, however, the use of OC-LLMs can have serious consequences if their performance is inadequate. Therefore, verifying the knowledge base and problem-solving capabilities of OC-LLMs in health care settings is essential.

A wealth of information is available on the web in the medical and health care domains, and OC-LLMs acquire a substantial amount of knowledge during pretraining. In addition to general medical knowledge, pharmacists must have expertise in fields such as physics and chemistry, which differ from those required by other health care professionals. However, few studies have evaluated the performance of OC-LLMs in the field of pharmacy. The performance of ChatGPT (GPT-3.5 and GPT-4V models) in the Japanese National License Examination for Pharmacists (JNLEP) was evaluated by Sato and Ogasawara [14]. Since then, numerous new OC-LLM services and models have been released in 2024, but their performance in the field of pharmacy has not been sufficiently evaluated. Furthermore, it was hypothesized that each OC-LLM service (ie, ChatGPT, Gemini, Claude, and Perplexity) has distinct strengths and limitations.

Accordingly, the purpose of this study was to evaluate the performance of various OC-LLMs introduced in 2024 in the field of pharmacy using the JNLEP and to assess performance improvements in the latest models.


Services and Models

The following 18 OC-LLMs, all available as of 2024, were evaluated (Table 1): ChatGPT (7 models), Gemini (4 models), Claude (5 models), and Perplexity (2 models). Claude 3.5 Sonnet (new) is the October 2024 update of Claude 3.5 Sonnet, which was originally released in June 2024. As of January 2025, these are the most commonly used OC-LLMs. Microsoft Copilot, one of the most popular OC-LLMs [15], was excluded because its underlying engine, GPT-4 (released in 2023), had already been evaluated in a previous study as the model used in ChatGPT, and Copilot is mainly used for tasks other than browser-based chat. Although Copilot has continued to improve in functionality and performance, the details of its current model and update history remain undisclosed. Consequently, it was excluded from the 2024 OC-LLM performance evaluation in this study.

Table 1. Characteristics, release dates, and evaluation dates of the OC-LLM^a services and models used in this study.

| Service and model | Deprecated or active^b | Uploadable image | Release date^c | Evaluation date^d |
|---|---|---|---|---|
| ChatGPT | | | | |
| GPT-3.5 | Deprecated | No | November 2022 | May 2024 |
| GPT-4 | Active | Yes | September 2023 | November 2023 |
| GPT-4o mini | Active | No | July 2024 | July 2024 |
| GPT-4o | Active | Yes | May 2024 | May 2024 |
| o1 mini | Active | No | September 2024 | October 2024 |
| o1 preview | Deprecated | No | September 2024 | September 2024 |
| o1 | Active | Yes | December 2024 | December 2024 |
| Gemini | | | | |
| 1.0 Pro | Deprecated | Yes | February 2024 | May 2024 |
| 1.5 Pro | Active | Yes | May 2024 | May 2024 |
| 1.5 Flash | Active | Yes | May 2024 | August 2024 |
| 2.0 Flash Experimental | Active | Yes | December 2024 | December 2024 |
| Claude | | | | |
| 3 Haiku | Deprecated | Yes | March 2024 | June 2024 |
| 3 Sonnet | Deprecated | Yes | March 2024 | May 2024 |
| 3 Opus | Active | Yes | March 2024 | June 2024 |
| 3.5 Sonnet | Active | Yes | June 2024 | June 2024 |
| 3.5 Sonnet (new) | Active | Yes | October 2024 | November 2024 |
| Perplexity | | | | |
| Standard | Active | No | June 2024 | November 2024 |
| Pro | Active | Yes | June 2024 | December 2024 |

^a OC-LLM: online chat-based large language model.

^b Status of each model, whether deprecated or active, as of January 1, 2025.

^c Release date in Japan of each model used.

^d Performance evaluation date of each model used in this study.

Japanese National License Examination for Pharmacists

This study used 345 questions from the 107th JNLEP held in February 2022. This dataset is the same as that used by Sato and Ogasawara [14]. The questions in the 107th JNLEP are organized into the following 9 subject categories: physics, chemistry, biology, hygiene, pharmacology, pharmaceuticals, pathophysiology, regulations, and practice. All questions were presented in a multiple-choice format, requiring the selection of 1 or 2 correct answers from the 5 options. The passing criteria for the 107th JNLEP included an overall accuracy of at least 62.9% along with 2 additional conditions. The details of the 107th JNLEP were extensively covered by Sato and Ogasawara [14].

Data Measurement

All OC-LLMs, except for ChatGPT GPT-4, were evaluated for their performance from May to December 2024. The data outcomes for ChatGPT GPT-4 were collected from a preliminary study conducted in November 2023 [14]. For ChatGPT GPT-3.5, a preliminary test was conducted in February 2023. However, a new evaluation was conducted in May 2024 to assess the potential performance improvements in the same model.

For all models, the complete set of questions from the 107th JNLEP was input in Japanese in order of question number. Although response performance can be improved through prompt engineering [16-18], no additional prompt engineering was applied in this study.

For questions that included diagrams or charts, the questions and options were input as text, whereas the diagram or chart portion was input as an image. Some early models could not process the diagrams (Table 1); therefore, these questions were omitted and marked as incorrect.

Data Analysis

The output from each OC-LLM was compared with the publicly available correct answers [19] to determine whether each response was correct or incorrect. An incorrect answer (ie, a hallucination) was defined as a response in which the selected answer differed from the published correct answer, the specified number of answers was not selected, or no answer was provided. Even when the correct option number could not be explicitly identified in the output, a response was considered correct if the selected content matched the correct answer choice. The accuracy of each model was calculated overall and by subject and question type (text only, diagram, calculation, chemical structure, and graph). Question-type classification was determined subjectively by the researcher based on question content. Diagram-based questions included those containing graphs or chemical structures, and calculation questions included both text-only and diagram-based questions.
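To make this grading and aggregation procedure concrete, the following is a minimal R sketch (R was used for the statistical analyses in this study). The data frame and column names are hypothetical stand-ins for the item-level data in Multimedia Appendix 1, and the scoring rule mirrors the definition above: a response counts as correct only when the selected option set exactly matches the published answer key.

```r
# Minimal sketch of the grading and aggregation logic.
# Column names (model, subject, q_type, key, selected) are hypothetical;
# the actual item-level data are provided in Multimedia Appendix 1.
library(dplyr)

results <- tibble::tribble(
  ~model,           ~question_id, ~subject,    ~q_type,   ~key,  ~selected,
  "ChatGPT o1",     1,            "Physics",   "text",    "2",   "2",
  "ChatGPT o1",     2,            "Chemistry", "diagram", "1,4", "1,3",
  "Perplexity Pro", 1,            "Physics",   "text",    "2",   "2",
  "Perplexity Pro", 2,            "Chemistry", "diagram", "1,4", "1,4"
)

# A response is correct only if the selected option set exactly matches the
# answer key; wrong options, the wrong number of options, or no answer all
# count as incorrect (ie, a hallucination).
norm_set <- function(x) paste(sort(trimws(strsplit(x, ",")[[1]])), collapse = ",")
results <- results |>
  mutate(correct = sapply(selected, norm_set) == sapply(key, norm_set))

# Overall accuracy per model, checked against the 62.9% passing threshold.
results |>
  group_by(model) |>
  summarise(accuracy = mean(correct), .groups = "drop") |>
  mutate(passed = accuracy >= 0.629)

# Accuracy by subject and by question type.
results |> group_by(model, subject) |> summarise(accuracy = mean(correct), .groups = "drop")
results |> group_by(model, q_type)  |> summarise(accuracy = mean(correct), .groups = "drop")
```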

To assess improvements in model accuracy, statistical comparisons were performed between the outputs of 3 models available in early 2024 (ChatGPT GPT-4, Gemini 1.0 Pro, and Claude 3 Sonnet) and those of the 4 latest flagship models (ChatGPT o1, Gemini 2.0 Flash Experimental, Claude 3.5 Sonnet [new], and Perplexity Pro).

Answer consistency was assessed to determine whether the tasks at which generative AI models excelled or struggled showed similar trends across services, using the highest-accuracy model of each service.

Statistical Analysis

A generalized linear mixed model (GLMM) was used to evaluate the accuracy improvements. The correctness of the responses to each question was set as the dependent variable. The model type (early or latest), question type (text-based or diagram-based), and their interactions were specified as fixed effects. Models and questions were included as random effects. Fleiss’ κ [20] was used to assess the consistency of responses. All statistical analyses were performed using R (version 4.4.2; R Foundation for Statistical Computing).
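As a rough illustration of this analysis, the sketch below fits a binomial GLMM with the lme4 package and computes Fleiss’ κ with the irr package on simulated stand-in data. The object names, the simulated values, and the exact model syntax are illustrative assumptions, not the authors’ code.

```r
# Minimal sketch of the statistical analysis on simulated stand-in data;
# object names and simulated values are illustrative only.
library(lme4)  # glmer() for the binomial generalized linear mixed model
library(irr)   # kappam.fleiss() for Fleiss' kappa

set.seed(1)
early  <- c("GPT-4", "Gemini 1.0 Pro", "Claude 3 Sonnet")
latest <- c("ChatGPT o1", "Gemini 2.0 Flash", "Claude 3.5 Sonnet (new)", "Perplexity Pro")

# One row per question x model, mimicking the item-level correctness data.
glmm_data <- expand.grid(q_id = factor(1:345), model = c(early, latest))
glmm_data$generation <- ifelse(glmm_data$model %in% early, "early", "latest")
glmm_data$q_type <- rep(sample(c("text", "diagram"), 345, replace = TRUE,
                               prob = c(284, 61) / 345), times = 7)
glmm_data$correct <- rbinom(nrow(glmm_data), 1,
                            ifelse(glmm_data$generation == "latest", 0.85, 0.60))

# Correctness ~ model group, question type, and their interaction (fixed effects),
# with random intercepts for model and question.
fit <- glmer(correct ~ generation * q_type + (1 | model) + (1 | q_id),
             data = glmm_data, family = binomial)
summary(fit)  # Wald tests for the fixed effects and the interaction term

# Fleiss' kappa: a 345 x 4 matrix of the option(s) selected by each flagship model.
flagship_answers <- matrix(sample(1:5, 345 * 4, replace = TRUE), ncol = 4,
                           dimnames = list(NULL, latest))
kappam.fleiss(flagship_answers)
```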


Performance Statistics of AI Models

The performance of the 18 generative AI models from 2024 was evaluated in the pharmaceutical field (Table 2). The 4 flagship models (ChatGPT o1, Gemini 2.0 Flash Experimental, Claude 3.5 Sonnet [new], and Perplexity Pro) each exceeded 80% accuracy, markedly surpassing the passing criterion. When reassessed, ChatGPT GPT-3.5 recorded an overall accuracy of 38.8% (134/345), showing no substantial improvement over its earlier performance of 35.4% (122/345).

Table 2. Overall accuracy of each OC-LLM^a on the 107th JNLEP^b.

| Service and model | Correct answers^c | Overall accuracy | Passing criteria^d |
|---|---|---|---|
| ChatGPT | | | |
| GPT-3.5 | 134 | 0.388 | Failed |
| GPT-4o mini | 215 | 0.623 | Failed |
| o1 mini | 234 | 0.678 | Passed |
| GPT-4^e | 250 | 0.724 | Passed |
| o1 preview | 272 | 0.788 | Passed |
| GPT-4o | 294 | 0.852 | Passed |
| o1 | 299 | 0.866 | Passed |
| Gemini | | | |
| 1.0 Pro | 171 | 0.495 | Failed |
| 1.5 Flash | 242 | 0.701 | Passed |
| 1.5 Pro | 246 | 0.713 | Passed |
| 2.0 Flash | 288 | 0.834 | Passed |
| Claude | | | |
| 3 Sonnet | 194 | 0.562 | Failed |
| 3 Haiku | 213 | 0.617 | Failed |
| 3 Opus | 260 | 0.753 | Passed |
| 3.5 Sonnet | 293 | 0.849 | Passed |
| 3.5 Sonnet (new) | 297 | 0.860 | Passed |
| Perplexity | | | |
| Standard | 228 | 0.660 | Passed |
| Pro | 301 | 0.872 | Passed |

^a OC-LLM: online chat-based large language model.

^b JNLEP: Japanese National License Examination for Pharmacists.

^c Number of correct answers out of all 345 questions in the 107th JNLEP.

^d Overall accuracy of at least 62.9%.

^e GPT-4 results were obtained from Sato and Ogasawara [14].

For all services, model upgrades were confirmed to increase accuracy. All models released after September 2024, whether light, medium, or high tier, met the passing criteria (Figure 1). All GPT-4 results were obtained from Sato and Ogasawara [14]. The raw, item-by-item correctness data for each model are presented in Multimedia Appendix 1.

Figure 1. Relationship between release date and accuracy in each online chat-based large language model service.

Performance of AI Models According to Subject and Question Type

By subject, pathophysiology and pharmacology showed high accuracy for all models except for the ChatGPT GPT-3.5 model. In the most recent models, the accuracy in pharmaceuticals and biology improved substantially, whereas in physics and chemistry, only minor improvements were observed (Table 3). In the 4 flagship models, the average accuracy based on subject was lowest for chemistry (10.3/20, 51.3%), followed by physics (15.3/20, 76.3%). All the other subjects achieved an accuracy exceeding 80%.

For text-only questions, most models exhibited high accuracy, with a few exceptions. Three models (ChatGPT o1 preview, ChatGPT o1, and Perplexity Pro) showed correct answer rates of over 90%. Accuracy decreased greatly for questions that included diagrams; the average across all 18 models was 36.7% (22.4/61), and 50.8% (31.0/61) when models that could not accept image input were excluded. Among diagram-based questions, Claude 3.5 Sonnet (new) showed the highest accuracy (47/61, 77%). For calculation questions, the most recent models showed improved accuracy, but accuracy remained low for questions that included chemical structures (Table 4).

Table 3. Accuracy and number of correct answers according to subject for each OC-LLM^a in the 107th JNLEP^b,c. Values are presented as correct answers, n (accuracy), for each subject^d.

| OC-LLM^a | Physics (n=20) | Chemistry (n=20) | Biology (n=20) | Hygiene (n=40) | Pharmacology (n=40) | Pharmaceuticals (n=40) | Pathophysiology (n=40) | Regulations (n=30) | Practice (n=95) |
|---|---|---|---|---|---|---|---|---|---|
| ChatGPT | | | | | | | | | |
| GPT-3.5 | 4 (0.200) | 3 (0.150) | 5 (0.250) | 13 (0.325) | 19 (0.475) | 11 (0.275) | 27 (0.675) | 10 (0.333) | 42 (0.442) |
| GPT-4^c | 11 (0.550) | 7 (0.350) | 13 (0.650) | 28 (0.700) | 36 (0.900) | 25 (0.625) | 28 (0.700) | 20 (0.667) | 82 (0.863) |
| GPT-4o | 13 (0.650) | 12 (0.600) | 19 (0.950) | 33 (0.825) | 39 (0.975) | 31 (0.775) | 35 (0.875) | 25 (0.833) | 87 (0.916) |
| GPT-4o mini | 9 (0.450) | 2 (0.100) | 7 (0.350) | 23 (0.575) | 36 (0.900) | 18 (0.450) | 31 (0.775) | 21 (0.700) | 68 (0.716) |
| o1 mini | 12 (0.600) | 5 (0.250) | 9 (0.450) | 26 (0.650) | 32 (0.800) | 25 (0.625) | 34 (0.850) | 16 (0.533) | 75 (0.789) |
| o1 preview | 14 (0.700) | 6 (0.300) | 10 (0.500) | 27 (0.675) | 39 (0.975) | 28 (0.700) | 36 (0.900) | 24 (0.800) | 88 (0.926) |
| o1 | 14 (0.700) | 8 (0.400) | 18 (0.900) | 34 (0.850) | 39 (0.975) | 34 (0.850) | 38 (0.950) | 25 (0.833) | 89 (0.937) |
| Gemini | | | | | | | | | |
| 1.0 Pro | 6 (0.300) | 6 (0.300) | 12 (0.600) | 21 (0.525) | 22 (0.550) | 15 (0.375) | 24 (0.600) | 19 (0.633) | 46 (0.484) |
| 1.5 Pro | 14 (0.700) | 4 (0.200) | 11 (0.550) | 25 (0.625) | 32 (0.800) | 25 (0.625) | 36 (0.900) | 23 (0.767) | 76 (0.800) |
| 1.5 Flash | 11 (0.550) | 9 (0.450) | 17 (0.850) | 29 (0.725) | 32 (0.800) | 24 (0.600) | 32 (0.800) | 22 (0.733) | 66 (0.695) |
| 2.0 Flash | 17 (0.850) | 11 (0.550) | 13 (0.650) | 35 (0.875) | 37 (0.925) | 33 (0.825) | 35 (0.875) | 27 (0.900) | 80 (0.842) |
| Claude | | | | | | | | | |
| 3 Sonnet | 9 (0.450) | 7 (0.350) | 14 (0.700) | 30 (0.750) | 28 (0.700) | 3 (0.075) | 32 (0.800) | 18 (0.600) | 53 (0.558) |
| 3 Haiku | 8 (0.400) | 8 (0.400) | 11 (0.550) | 26 (0.650) | 32 (0.800) | 21 (0.525) | 31 (0.775) | 21 (0.700) | 55 (0.579) |
| 3 Opus | 11 (0.550) | 3 (0.150) | 16 (0.800) | 30 (0.750) | 35 (0.875) | 27 (0.675) | 34 (0.850) | 24 (0.800) | 80 (0.842) |
| 3.5 Sonnet | 15 (0.750) | 8 (0.400) | 16 (0.800) | 36 (0.900) | 37 (0.925) | 32 (0.800) | 36 (0.900) | 28 (0.933) | 85 (0.895) |
| 3.5 Sonnet (new) | 14 (0.700) | 10 (0.500) | 19 (0.950) | 35 (0.875) | 38 (0.950) | 33 (0.825) | 36 (0.900) | 25 (0.833) | 87 (0.916) |
| Perplexity | | | | | | | | | |
| Standard | 11 (0.550) | 2 (0.100) | 6 (0.300) | 23 (0.575) | 38 (0.950) | 25 (0.625) | 34 (0.850) | 21 (0.700) | 68 (0.716) |
| Pro | 16 (0.800) | 12 (0.600) | 18 (0.900) | 35 (0.875) | 40 (1.000) | 33 (0.825) | 38 (0.950) | 28 (0.933) | 81 (0.853) |

^a OC-LLM: online chat-based large language model.

^b JNLEP: Japanese National License Examination for Pharmacists.

^c GPT-4 results were obtained from Sato and Ogasawara [14].

^d The mean (SD) accuracy for physics is 0.581 (0.172), chemistry 0.342 (0.162), biology 0.650 (0.223), hygiene 0.707 (0.151), pharmacology 0.849 (0.147), pharmaceuticals 0.615 (0.211), pathophysiology 0.829 (0.095), regulations 0.735 (0.150), and practice 0.765 (0.157).

Table 4. Accuracy and number of correct answers according to question type for each OC-LLM^a in the 107th JNLEP^b,c,d. Values are presented as correct answers, n (accuracy), for each question type.

| OC-LLM^a | Text (n=284) | Diagram (n=61) | Calculation (n=18) | Graph (n=16) | Chemical structure (n=19) |
|---|---|---|---|---|---|
| ChatGPT | | | | | |
| GPT-3.5 | 134 (0.472) | 0 (0.000) | 5 (0.278) | 0 (0.000) | 0 (0.000) |
| GPT-4^c | 227 (0.799) | 22 (0.361) | 9 (0.500) | 5 (0.313) | 4 (0.211) |
| GPT-4o | 254 (0.894) | 39 (0.639) | 12 (0.667) | 7 (0.438) | 10 (0.526) |
| GPT-4o mini | 215 (0.757) | 0 (0.000) | 8 (0.444) | 0 (0.000) | 0 (0.000) |
| o1 mini | 231 (0.813) | 0 (0.000) | 15 (0.833) | 0 (0.000) | 0 (0.000) |
| o1 preview | 271 (0.954) | 0 (0.000) | 15 (0.833) | 0 (0.000) | 0 (0.000) |
| o1 | 260 (0.915) | 39 (0.639) | 16 (0.889) | 9 (0.563) | 7 (0.368) |
| Gemini | | | | | |
| 1.0 Pro | 158 (0.556) | 13 (0.213) | 4 (0.222) | 4 (0.250) | 2 (0.105) |
| 1.5 Pro | 233 (0.820) | 13 (0.213) | 9 (0.500) | 2 (0.125) | 3 (0.158) |
| 1.5 Flash | 211 (0.743) | 31 (0.508) | 7 (0.389) | 7 (0.438) | 7 (0.368) |
| 2.0 Flash | 246 (0.866) | 42 (0.689) | 13 (0.722) | 9 (0.563) | 10 (0.526) |
| Claude | | | | | |
| 3 Sonnet | 182 (0.641) | 28 (0.459) | 9 (0.500) | 9 (0.563) | 5 (0.263) |
| 3 Haiku | 190 (0.669) | 27 (0.443) | 12 (0.667) | 7 (0.438) | 5 (0.263) |
| 3 Opus | 228 (0.803) | 23 (0.377) | 5 (0.278) | 6 (0.375) | 4 (0.211) |
| 3.5 Sonnet | 250 (0.880) | 42 (0.689) | 15 (0.833) | 12 (0.750) | 8 (0.421) |
| 3.5 Sonnet (new) | 250 (0.880) | 47 (0.770) | 16 (0.889) | 11 (0.688) | 11 (0.579) |
| Perplexity | | | | | |
| Standard | 226 (0.796) | 0 (0.000) | 8 (0.444) | 0 (0.000) | 0 (0.000) |
| Pro | 264 (0.930) | 37 (0.607) | 10 (0.556) | 8 (0.500) | 11 (0.579) |

^a OC-LLM: online chat-based large language model.

^b JNLEP: Japanese National License Examination for Pharmacists.

^c GPT-4 results were obtained from Sato and Ogasawara [14].

^d The mean (SD) accuracy for text is 0.788 (0.131), diagram 0.367 (0.279), calculation 0.580 (0.219), graph 0.333 (0.257), and chemical structure 0.254 (0.213).

Statistical Analysis of Improved Accuracy and Response Consistency

The GLMM analysis demonstrated that the accuracy of the latest flagship models was significantly higher than that of the earlier models (P<.001). In addition, questions containing diagrams had significantly lower accuracy than text-only questions (P<.001). The interaction between flagship status and question type was not significant (P=.53); therefore, the accuracy difference between the early and most recent models was observed consistently, regardless of whether questions included diagrams or were text only, and the flagship models did not show a disproportionately greater improvement on diagram-based questions. Fleiss’ κ, calculated across the 345 questions to assess answer consistency among the 4 flagship models, was 0.334.


Overview

This study evaluated the performances of 18 generative AI models in the pharmacy field by applying the same prompt to the identical input task of the 107th JNLEP. Although previous studies have evaluated the performance of generative AI in health care using several models, this study is the first to directly compare multiple OC-LLMs under identical conditions on the same task. Recent meta-analyses have evaluated the performance of generative AI in health care [21,22]. However, because individual studies differ in language, prompts, and input tasks, inherent limitations exist in interpreting these results.

Among these models, Perplexity Pro achieved the highest overall accuracy (301/345, 87.2%). When restricted to text-only questions, the ChatGPT o1 preview demonstrated the highest accuracy (271/284, 95.4%). For questions including diagrams, Claude 3.5 Sonnet (new) demonstrated the best performance (47/61, 77%). In early multimodal models, such as ChatGPT GPT-4 and Gemini 1.0 Pro, the accuracy for questions with diagrams was low. However, the accuracy of the latest versions of the flagship models has significantly improved. These findings indicate that the ability to recognize diagrams advanced markedly over the past year.

In terms of overall accuracy, the 4 flagship models exceeded not only the passing criteria but also the average examinee score of 68.2% [14]. This suggests that current generative AI may possess a more extensive knowledge base than that of novice human pharmacists. However, even the best models gave incorrect answers (ie, hallucinations) to more than 10% of the questions; therefore, their outputs must be interpreted with caution, especially in health care.

Subject-specific analysis demonstrated accuracy improvements across all subjects with the latest 4 flagship models. Performance in hygiene and regulations has tended to be weaker in previous reports [16,23,24], likely because country-specific health care systems and social contexts may not be fully covered by pretraining data. In addition, the low accuracy observed in basic science subjects (physics, chemistry, and biology) is consistent with trends reported in previous studies [25]. However, even in these subjects, improvements in accuracy were observed with the 2024 flagship models; hence, previous weaknesses may have been overcome. This improvement is likely attributable to enhanced training data, increased model parameters, and the implementation of multimodal and reasoning capabilities. Nevertheless, the flagship models still showed relatively low accuracy in chemistry (10.3/20, 51.3%) and physics (15.3/20, 76.3%). This may be because these subjects include many questions that require abilities beyond factual knowledge, such as calculation and image recognition. Low accuracy in chemistry has also been reported in previous studies [26].

Question type–specific analysis revealed lower accuracy for items requiring image recognition or calculation than for text-only questions. Because image recognition and calculation are abilities that conventional large language models were not designed to handle and are acquired later through multimodal integration, the insufficient performance in this domain may reflect learning that is not yet fully mature.

Among the diagram-based questions, those involving chemical structures exhibited the lowest accuracy. The small mean and SD across all models for chemical structure questions indicate that current models have not yet shown considerable improvement. This may be because of two factors: (1) chemical structures are foundational scientific knowledge needed exclusively by pharmacists, leading to limited web-based availability (ie, reduced opportunities for large language model pretraining); and (2) interpreting chemical structures requires more sophisticated image recognition skills than interpreting tables or graphs.

Claude 3.5 Sonnet (new) demonstrated the highest accuracy across all 3 types of questions—computation, graph interpretation, and chemical structure recognition. However, Claude’s flagship model showed lower accuracy for text-based questions than that of the ChatGPT and Perplexity flagship models. Therefore, a novel finding of this study is that the top-performing model differed according to the question type.

The GLMM analysis demonstrated a significant increase in overall accuracy between the early and the latest 2024 models. Although improvements in the accuracy of questions containing diagrams were observed in individual models, the additional gain on diagram-based questions was not statistically significant, and the tendency toward lower accuracy on diagram-based questions persisted even in the flagship models.

According to Landis and Koch [27], a Fleiss’ κ of 0.334 among the 4 flagship models indicates a certain degree of consistency. This result suggests that although these models handle simpler questions similarly, their incorrect answers differ across more challenging questions, indicating variation in their strengths and weaknesses. It was initially hypothesized that the types of questions with which each OC-LLM service struggles would differ, and the observation that even flagship models with high overall accuracy failed to achieve substantial response agreement, as measured by the κ coefficient, supports this hypothesis. Therefore, identifying the specific domains in which each OC-LLM service underperforms remains an important subject for future research, including meta-analysis.

In this study, each model was evaluated using the same task to compare their performance directly. Some models included in this study have been deprecated and are no longer available. Although many new OC-LLMs are expected to emerge in the future, evaluating their performance using the 107th JNLEP will enable their comparison with previous models. Ultimately, the 107th JNLEP can serve as a benchmark for evaluating the performance of generative AI models in the field of pharmacy in Japan.

In this study, questions from the Japanese National Pharmacist Examination were input in Japanese in their original format. Translating non-English tasks into English has been reported to improve AI accuracy [28-31]; because Japanese input was used here, higher accuracy might have been achieved had the questions been translated into English. In addition, the accuracy of each model reflects its performance at the time of evaluation, and the same model may show improved performance after subsequent upgrades. Perplexity, for example, has been upgraded multiple times; however, the available models remain labeled only as Standard and Pro, and version information is not disclosed to users.

Although the highest-performing model among the 18 OC-LLMs in this study achieved an accuracy of 87.2% (301/345), it still answered 12.8% of the questions (n=44) incorrectly (ie, hallucinations). As the performance of OC-LLMs improves, their use by medical and pharmacy students to input national examination questions for self-study is expected to increase. However, because the latest models generate logical and fluent answers, identifying hallucinations has become increasingly difficult. Even with the 2024 flagship models, reducing the risk of hallucinations in medical applications requires limiting use to cases in which users can independently judge the correctness of the output or can confirm the supporting sources through the links provided.

The performance improvements of the OC-LLMs observed in this study may facilitate their broader integration into routine pharmacy practice in the near future. In clinical pharmacy practice, responding to inquiries from patients and health care professionals regarding drug information is a frequent task; these inquiries include questions about adverse drug reactions, drug interactions, dosage adjustments, and contraindications. Support from high-performance OC-LLMs is expected to improve the quality of responses and reduce the time required to address such inquiries. However, the use of OC-LLMs in direct medical support, for example, in selecting personalized pharmacological treatments, requires careful consideration of ethical issues such as explainability, responsibility, privacy, and patient rights.

Principal Findings

This study evaluated the performance of 18 OC-LLMs available in 2024 using questions from the Japanese National License Examination for Pharmacists. As the models were upgraded, their accuracy improved, and the flagship models exceeded both the passing criteria and the examinees’ average score. In the latest OC-LLMs, enhanced multimodal capabilities markedly improved accuracy in interpreting charts and figures and in solving calculation-based questions. Furthermore, the answer consistency of the flagship models was not robust, suggesting that each model has different strengths and weaknesses.

Limitations

In this study, only a single set of examination questions was tested, and each question was entered only once. Generative AI models have a parameter known as temperature, which controls the randomness of their output; consequently, a model can generate different answers to the same question, and repeated testing could yield different accuracy for each OC-LLM. Several studies have evaluated OC-LLM performance by testing questions from multiple years [32-34] or by conducting multiple rounds of testing [35]. However, similar to many previous studies, to evaluate 18 models within a limited time frame, only 1 set of questions was administered once to each model. Human examinees also take the national pharmacist examination only once, rendering the testing conditions comparable. Moreover, the 107th JNLEP comprises 345 questions, with multiple items allocated to each subject and question type, which is considered sufficient to permit a reasonable degree of interpretation.

As the models improved over time, the top-performing service shifted from ChatGPT to Claude, and subsequently to Gemini. Across the OC-LLM services (ChatGPT, Gemini, Claude, and Perplexity), no consistent patterns were observed across subjects or question types. Because key information, such as the volume of pretraining data, the number of parameters, and the tuning strategies of these OC-LLMs, is not publicly disclosed, it is difficult to fully discuss the factors that contributed to their improved performance in the pharmaceutical field, including their improved handling of diagrams, chemical structures, and calculation-based questions.

Comparison With Prior Work

Numerous studies have evaluated the performance of OC-LLMs on health care licensing examinations (Table 5). Early OC-LLMs failed national medical licensing examinations; however, subsequently released high-performance OC-LLMs have met the passing criteria.

The OC-LLMs reported in Table 5 are biased toward ChatGPT, and the tasks and conditions vary across reports of medical performance. Moreover, the performance of OC-LLMs declines in languages other than English because of the smaller volume of training data [18,25,34-36]. Therefore, verifying OC-LLM performance in non-English languages is important. An important contribution of this study is its demonstration that multiple flagship OC-LLMs substantially outperform the passing criteria in areas where prior evidence is scarce, specifically in a non-English language and in the pharmaceutical field. Achieving high response accuracy from OC-LLMs with non-English prompts has considerable implications for clinical implementation in health care settings in Japan and other non-English–speaking regions.

One of the major strengths of this study is its systematic evaluation of multiple OC-LLMs released in 2024 under identical input conditions, such as the same prompt text and image resolution or size. To the best of our knowledge, this is the first study to evaluate the longitudinal improvement in generative AI performance in medical examinations.

Table 5. Studies evaluating the performance of generative AI^a in health care licensing examinations.

| Health care license examination study | Country or region | OC-LLM^b | Accuracy (%) |
|---|---|---|---|
| Medical license | | | |
| Gilson et al (2023) [36] | United States | GPT-3 | 25.3 |
| Gilson et al (2023) [36] | United States | ChatGPT (unknown) | 64.4, 57.8 |
| Flores-Cohaila et al (2023) [37] | Peru | ChatGPT GPT-3.5 | 77 |
| Flores-Cohaila et al (2023) [37] | Peru | ChatGPT GPT-4 | 86 |
| Jung et al (2023) [38] | Germany | ChatGPT (unknown) | 60.1, 66.7 |
| Shang et al (2023) [28] | China | ChatGPT GPT-3.5 | 57 |
| Wang et al (2023) [39] | China | ChatGPT GPT-3.5 | 56 |
| Wang et al (2023) [39] | China | ChatGPT GPT-4 | 84 |
| Yanagita et al (2023) [40] | Japan | ChatGPT GPT-3.5 | 42.8 |
| Yanagita et al (2023) [40] | Japan | ChatGPT GPT-4 | 81.5 |
| Takagi et al (2023) [41] | Japan | ChatGPT GPT-3.5 | 50.8 |
| Takagi et al (2023) [41] | Japan | ChatGPT GPT-4 | 79.9 |
| Tanaka et al (2024) [16] | Japan | ChatGPT GPT-3.5 | 52.9, 63.6 |
| Tanaka et al (2024) [16] | Japan | ChatGPT GPT-4 | 85.6 |
| Liu et al (2025) [42] | Japan | ChatGPT GPT-4 | 77 |
| Liu et al (2025) [42] | Japan | ChatGPT GPT-4o | 89 |
| Liu et al (2025) [42] | Japan | Gemini 1.5 Pro | 80 |
| Liu et al (2025) [42] | Japan | Claude 3 Opus | 82 |
| Oztermeli and Oztermeli (2023) [43] | Turkey | ChatGPT GPT-3.5 | 64.7, 67.1, 70.9, 60.8, 54.3 |
| Siebielec et al (2024) [32] | Poland | ChatGPT GPT-3.5 | 59.5, 57.5, 63.5, 62.0, 61.0 |
| Wójcik et al (2024) [31] | Poland | ChatGPT GPT-4 | 67.1 |
| Pharmacist license | | | |
| Wang et al (2023) [44] | Taiwan | ChatGPT (unknown) | 54.5, 63.5 |
| Wang et al (2025) [45] | Taiwan | ChatGPT GPT-3.5 | 59 |
| Wang et al (2025) [45] | Taiwan | ChatGPT GPT-4 | 73 |
| Kunitsu (2023) [46] | Japan | ChatGPT GPT-4 | 78.2, 75.3 |
| Sato and Ogasawara (2024) [14] | Japan | ChatGPT GPT-3.5 | 45.5 |
| Sato and Ogasawara (2024) [14] | Japan | ChatGPT GPT-4 | 72.5 |
| Jin and Kim (2024) [47] | Korea | ChatGPT GPT-3.5 | 61 |
| Jin and Kim (2024) [47] | Korea | ChatGPT GPT-4 | 87 |
| Nurse license | | | |
| Taira et al (2023) [33] | Japan | ChatGPT GPT-3.5 | 71, 71, 63, 63, 63 |
| Kaneda et al (2023) [48] | Japan | ChatGPT GPT-3.5 | 59.9 |
| Kaneda et al (2023) [48] | Japan | ChatGPT GPT-4 | 79.7 |
| Wu et al (2024) [49] | China | ChatGPT GPT-3.5 | 51.7 |
| Wu et al (2024) [49] | China | ChatGPT GPT-4 | 70.5 |
| Wu et al (2024) [49] | China | Google Bard | 48.3 |
| Hiwa et al (2024) [50] | Unknown | ChatGPT GPT-3.5 | 77 |
| Hiwa et al (2024) [50] | Unknown | Gemini (unknown) | 75 |
| Hiwa et al (2024) [50] | Unknown | Microsoft Copilot | 84 |
| Hiwa et al (2024) [50] | Unknown | Llama2 | 68 |

^a AI: artificial intelligence.

^b OC-LLM: online chat-based large language model.

Conclusions

This study revealed that the performance of OC-LLMs in the pharmaceutical field improved greatly during 2024, with a particularly notable increase in accuracy for questions containing diagrams. In the most recent models evaluated in 2024, overall accuracy reached approximately 85%, markedly exceeding the average examinee score and indicating their potential as valuable support tools. Although caution is necessary because hallucinations can have serious consequences in health care, the benefits of OC-LLMs outweigh the associated risks. Accordingly, health care professionals and medical educators must acquire the skills necessary to use OC-LLMs effectively, particularly the ability to recognize and manage hallucinations.

Data Availability

The datasets analyzed during this study are available in the Ministry of Health, Labour and Welfare in Japan repository [19].

Conflicts of Interest

None declared.

Multimedia Appendix 1

The responses of all the online chat-based large language models to each question.

XLSX File, 41 KB

  1. Kruse M, Hu S, Derby N, et al. Zero-shot large language models for long clinical text summarization with temporal reasoning. medRxiv. Preprint posted online on Jul 23, 2025. [CrossRef] [Medline]
  2. Fraile Navarro D, Coiera E, Hambly TW, et al. Expert evaluation of large language models for clinical dialogue summarization. Sci Rep. Jan 7, 2025;15(1):1195. [CrossRef] [Medline]
  3. Van Veen D, Van Uden C, Blankemeier L, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med. Apr 2024;30(4):1134-1142. [CrossRef] [Medline]
  4. Lee C, Vogt KA, Kumar S. Prospects for AI clinical summarization to reduce the burden of patient chart review. Front Digit Health. 2024;6:1475092. [CrossRef] [Medline]
  5. UiPath announces AI partnership with Google Cloud to transform medical processes. UiPath. 2025. URL: https://www.uipath.com/newsroom/uipath-announces-medical-summarization-agent-google-cloud [Accessed 2025-08-27]
  6. OpenBots Gen AI automates patient referral, improves productivity by 30% and reduces errors by 80%. OpenBots. 2024. URL: https:/​/openbots.​ai/​streamlining-patient-referral-handling-and-emr-integration-through-openbots-gen-ai-automation/​ [Accessed 2025-03-05]
  7. Hale J, Alexander S, Wright ST, Gilliland K. Generative AI in undergraduate medical education: a rapid review. Journal of Medical Education and Curricular Development. Jan 2024;11. [CrossRef]
  8. Parente DJ. Generative artificial intelligence and large language models in primary care medical education. Fam Med. Oct 2024;56(9):534-540. [CrossRef] [Medline]
  9. Janumpally R, Nanua S, Ngo A, Youens K. Generative artificial intelligence in graduate medical education. Front Med (Lausanne). 2024;11:1525604. [CrossRef] [Medline]
  10. Preiksaitis C, Rose C. Opportunities, challenges, and future directions of generative artificial intelligence in medical education: scoping review. JMIR Med Educ. Oct 20, 2023;9:e48785. [CrossRef] [Medline]
  11. Biswas S. ChatGPT and the future of medical writing. Radiology. Apr 2023;307(2):e223312. [CrossRef] [Medline]
  12. AI adoption in health systems report 2024. Medscape & HIMSS. URL: https://cdn.sanity.io/files/sqo8bpt9/production/68216fa5d161adebceb50b7add5b496138a78cdb.pdf/ [Accessed 2025-03-05]
  13. Bruce G. How many healthcare employees use generative AI at work. Becker’s Hospital Review. 2024. URL: https:/​/www.​beckershospitalreview.com/​rankings-and-ratings/​how-many-healthcare-employees-use-generative-ai-at-work.html/​ [Accessed 2025-03-05]
  14. Sato H, Ogasawara K. ChatGPT (GPT-4) passed the Japanese National License Examination for Pharmacists in 2022, answering all items including those with diagrams: a descriptive study. J Educ Eval Health Prof. 2024;21:4. [CrossRef] [Medline]
  15. Top generative AI chatbots by market share. FirstPageSage. Jan 2025. URL: https://firstpagesage.com/reports/top-generative-ai-chatbots/ [Accessed 2025-03-05]
  16. Tanaka Y, Nakata T, Aiga K, et al. Performance of generative pretrained transformer on the National Medical Licensing Examination in Japan. PLOS Digit Health. Jan 2024;3(1):e0000433. [CrossRef] [Medline]
  17. Wada A, Akashi T, Shih G, et al. Optimizing GPT-4 Turbo diagnostic accuracy in neuroradiology through prompt engineering and confidence thresholds. Diagnostics (Basel). Jul 17, 2024;14(14):1541. [CrossRef] [Medline]
  18. Yan S, Knapp W, Leong A, et al. Prompt engineering on leveraging large language models in generating response to InBasket messages. J Am Med Inform Assoc. Oct 1, 2024;31(10):2263-2270. [CrossRef] [Medline]
  19. The 107th Japanese National License Examination for Pharmacist [Article in Japanese]. The Ministry of Health, Labour and Welfare in Japan. URL: https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/0000198924.html [Accessed 2025-08-27]
  20. Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76(5):378-382. [CrossRef]
  21. Jin HK, Lee HE, Kim E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC Med Educ. Sep 16, 2024;24(1):1013. [CrossRef] [Medline]
  22. Takita H, Kabata D, Walston SL, et al. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. npj Digit Med. 2025;8(1). [CrossRef]
  23. Hideo D, Hidekazu I, Hiroki N, et al. Performance of generative pretrained transformer on the national licensing examination for medical technologist in Japan [Article in Japanese]. Jpn J Med Technol. 2024;73(2):323-331. URL: https://doi.org/10.14932/jamt.23-80 [Accessed 2025-09-04]
  24. Meo SA, Alotaibi M, Meo MZS, Meo MOS, Hamid M. Medical knowledge of ChatGPT in public health, infectious diseases, COVID-19 pandemic, and vaccines: multiple choice questions examination based performance. Front Public Health. 2024;12:1360597. [CrossRef] [Medline]
  25. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. Feb 2023;2(2):e0000198. [CrossRef] [Medline]
  26. Fergus S, Botha M, Ostovar M. Evaluating academic answers generated using ChatGPT. J Chem Educ. Apr 11, 2023;100(4):1672-1675. [CrossRef]
  27. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. Mar 1977;33(1):159-174. [CrossRef] [Medline]
  28. Shang L, Xue M, Hou Y, Tang B. Can ChatGPT pass China’s national medical licensing examination? Asian J Surg. Dec 2023;46(12):6112-6113. [CrossRef] [Medline]
  29. Cohen A, Alter R, Lessans N, Meyer R, Brezinov Y, Levin G. Performance of ChatGPT in Israeli Hebrew OBGYN national residency examinations. Arch Gynecol Obstet. Dec 2023;308(6):1797-1802. [CrossRef] [Medline]
  30. Lewandowski M, Łukowicz P, Świetlik D, Barańska-Rybak W. ChatGPT-3.5 and ChatGPT-4 dermatological knowledge level based on the Specialty Certificate Examination in Dermatology. Clin Exp Dermatol. Jun 25, 2024;49(7):686-691. [CrossRef] [Medline]
  31. Wójcik D, Adamiak O, Czerepak G, Tokarczuk O, Szalewski L. A comparative analysis of the performance of Chatgpt4, Gemini and Claude for the Polish Medical Final Diploma Exam and Medical-Dental Verification Exam. medRxiv. Preprint posted online on Jul 29, 2024. [CrossRef]
  32. Siebielec J, Ordak M, Oskroba A, Dworakowska A, Bujalska-Zadrozny M. Assessment study of ChatGPT-3.5’s performance on the final Polish medical examination: accuracy in answering 980 questions. Healthcare (Basel). Aug 16, 2024;12(16):1637. [CrossRef] [Medline]
  33. Taira K, Itaya T, Hanada A. Performance of the large language model ChatGPT on the National Nurse Examinations in Japan: evaluation study. JMIR Nurs. Jun 27, 2023;6:e47305. [CrossRef] [Medline]
  34. Zong H, Li J, Wu E, Wu R, Lu J, Shen B. Performance of ChatGPT on Chinese National Medical Licensing Examinations: a five-year examination evaluation study for physicians, pharmacists and nurses. BMC Med Educ. Feb 14, 2024;24(1):143. [CrossRef] [Medline]
  35. Fujimoto M, Kuroda H, Katayama T, et al. Evaluating large language models in dental anesthesiology: a comparative analysis of ChatGPT-4, Claude 3 Opus, and Gemini 1.0 on the Japanese Dental Society of Anesthesiology Board Certification Exam. Cureus. Sep 2024;16(9):e70302. [CrossRef] [Medline]
  36. Gilson A, Safranek CW, Huang T, et al. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. Feb 8, 2023;9:e45312. [CrossRef] [Medline]
  37. Flores-Cohaila JA, García-Vicente A, Vizcarra-Jiménez SF, et al. Performance of ChatGPT on the Peruvian National Licensing Medical Examination: cross-sectional study. JMIR Med Educ. Sep 28, 2023;9:e48039. [CrossRef] [Medline]
  38. Jung LB, Gudera JA, Wiegand TLT, Allmendinger S, Dimitriadis K, Koerte IK. ChatGPT passes German State Examination in Medicine with picture questions omitted. Dtsch Arztebl Int. May 30, 2023;120(21):373-374. [CrossRef] [Medline]
  39. Wang H, Wu W, Dou Z, He L, Yang L. Performance and exploration of ChatGPT in medical examination, records and education in Chinese: pave the way for medical AI. Int J Med Inform. Sep 2023;177:105173. [CrossRef] [Medline]
  40. Yanagita Y, Yokokawa D, Uchida S, Tawara J, Ikusaka M. Accuracy of ChatGPT on medical questions in the National Medical Licensing Examination in Japan: evaluation study. JMIR Form Res. Oct 13, 2023;7:e48023. [CrossRef] [Medline]
  41. Takagi S, Watari T, Erabi A, Sakaguchi K. Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: comparison study. JMIR Med Educ. Jun 29, 2023;9:e48002. [CrossRef] [Medline]
  42. Liu M, Okuhara T, Dai Z, et al. Performance of advanced large language models (GPT-4o, GPT-4, Gemini 1.5 pro, Claude 3 opus) on Japanese Medical Licensing Examination: a comparative study. Int J Med Inform. 2025:105673. [CrossRef] [Medline]
  43. Oztermeli AD, Oztermeli A. ChatGPT performance in the medical specialty exam: an observational study. Medicine (Abingdon). 2023;102(32):e34673. [CrossRef]
  44. Wang YM, Shen HW, Chen TJ. Performance of ChatGPT on the pharmacist licensing examination in Taiwan. J Chin Med Assoc. Jul 1, 2023;86(7):653-658. [CrossRef] [Medline]
  45. Wang YM, Shen HW, Chen TJ, Chiang SC, Lin TG. Performance of ChatGPT-3.5 and ChatGPT-4 in the Taiwan National Pharmacist Licensing Examination: comparative evaluation study. JMIR Med Educ. Jan 17, 2025;11:e56850. [CrossRef] [Medline]
  46. Kunitsu Y. The potential of GPT-4 as a support tool for pharmacists: analytical study using the Japanese National Examination for Pharmacists. JMIR Med Educ. Oct 30, 2023;9:e48452. [CrossRef] [Medline]
  47. Jin HK, Kim E. Performance of GPT-3.5 and GPT-4 on the Korean Pharmacist Licensing Examination: comparison study. JMIR Med Educ. Dec 4, 2024;10:e57451. [CrossRef] [Medline]
  48. Kaneda Y, Takahashi R, Kaneda U, et al. Assessing the performance of GPT-3.5 and GPT-4 on the 2023 Japanese Nursing Examination. Cureus. Aug 2023;15(8):e42924. [CrossRef] [Medline]
  49. Wu Z, Gan W, Xue Z, Ni Z, Zheng X, Zhang Y. Performance of ChatGPT on Nursing Licensure Examinations in the United States and China: cross-sectional study. JMIR Med Educ. Oct 3, 2024;10:e52746. [CrossRef] [Medline]
  50. Hiwa DS, Abdalla SS, Muhialdeen AS, Karim SO. Assessment of nursing skill and knowledge of ChatGPT, Gemini, Microsoft Copilot, and Llama: a comparative study. Barw Med J. 2024;2(3). [CrossRef]


AI: artificial intelligence
GLMM: generalized linear mixed model
JNLEP: Japanese National License Examination for Pharmacists
OC-LLM: online chat-based large language model


Edited by Joshua Moen; submitted 04.May.2025; peer-reviewed by Ian Murray, Reinhard Chun Wang Chau; final revised version received 16.Jul.2025; accepted 31.Jul.2025; published 18.Sep.2025.

Copyright

© Hiroyasu Sato, Katsuhiko Ogasawara, Hidehiko Sakurai. Originally published in JMIR Medical Education (https://mededu.jmir.org), 18.Sep.2025.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Education, is properly cited. The complete bibliographic information, a link to the original publication on https://mededu.jmir.org/, as well as this copyright and license information must be included.