
JMIR Medical Education (JMIR Med Educ)

ISSN 2369-3762

JMIR Publications, Toronto, Canada

v12i1e75516

doi: 10.2196/75516

Original Paper

Comparing the Weighted Gain Score and a Rasch-Based Approach for Estimating Learning Outcomes in Medical Education: Quantitative Study

Rauf Aliyev, MD¹; Joy Backhaus, MSc¹; Silke Hammer, MD²; Sarah König, MME, MD¹

¹Institute of Medical Teaching and Medical Education Research, University Hospital Würzburg, Josef-Schneider-Str. 2/D6, Würzburg, Germany

²Institute of Diagnostic and Interventional Radiology, University Hospital Würzburg, Würzburg, Germany

Awsan Bahattab; Muhammad Saeed Shafi; Shaikha Alzaabi; T A Valencia-Perez

Correspondence to Sarah König, MME, MD, Institute of Medical Teaching and Medical Education Research, University Hospital Würzburg, Josef-Schneider-Str. 2/D6, Würzburg, 97080, Germany, +49 931 201 55210, +49 931 201 655213; Koenig_Sarah@ukw.de

2026

16 June 2026

e75516

18 May 2025; 19 March 2026; 22 April 2026

2026

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Education, is properly cited. The complete bibliographic information, a link to the original publication on https://mededu.jmir.org/, as well as this copyright and license information must be included.

Background

Pretest-posttest designs are widely used to estimate learning gain in studies evaluating educational interventions in medical education. The Weighted Gain Score (WGS) was proposed to reduce bias associated with differences in baseline performance.

Objective

This study evaluated the statistical and inferential properties of the WGS by comparing it to Rasch Learning Gain (RLG) across 3 datasets.

Methods

The WGS implements a weighting coefficient that includes the parameter µ, which linearly rescales the difference between pretest and posttest percentage scores. We examined the effect of varying µ (30, 50, and 70) on learning gain calculations and compared the results with those obtained using RLG. The following three datasets were analyzed: (1) a small illustrative dataset demonstrating the mathematical behavior of the WGS, (2) an empirical dataset from a previous educational evaluation study, and (3) a randomly generated binomial dataset designed to examine the metric under larger sample conditions.

Results

Changing the parameter µ in the WGS affected the magnitude of the calculated learning gains: lower µ-values produced larger gain estimates, whereas higher µ-values produced smaller estimates. Despite these differences in scale, the WGS and RLG correlated strongly in both the empirical dataset (r=0.93; P<.001) and the simulated dataset (r=0.92; P<.001); variation in µ did not alter the inferential results. Both methods identified the same interaction effect in the empirical dataset.

Conclusions

The WGS produced results highly consistent with those of RLG while requiring substantially lower computational complexity. The metric can be applied to both small and large datasets and allows µ to function as an adjustment coefficient for calibrating learning gain estimates across cohorts without altering inferential conclusions.

Keywords: medical education; teaching quality; curriculum evaluation; learning gain; pretest-posttest design; Rasch model; Weighted Gain Score

Introduction

Teaching quality in medical education is a complex construct encompassing curriculum design, instructional methods, teaching expertise, learner engagement, and assessment practices [1-4]. High-quality teaching in this context contributes to the development of competent physicians and thereby influences the quality of patient care [5,6].

Among the various aspects of teaching quality in medical education, student learning outcomes represent one measurable indicator frequently used in program evaluation and educational research [7-11]. However, interpreting learning outcomes as indicators of teaching effectiveness requires caution, as they are influenced by multiple factors beyond instructional quality. These include student motivation, prior knowledge, learning strategies, teacher enthusiasm, and learning activities occurring outside the formal curriculum [12,13].

To account for these influences, educational research often focuses not only on absolute performance but also on changes in performance over time. Learning gain is a widely used concept for capturing students’ learning progress. It is commonly operationalized by assessing students before (pretest) and after (posttest) an educational intervention, and the difference between pretest and posttest scores is interpreted as an indicator of learning gain attributable, at least in part, to the intervention [14-16]. Calculating learning gain is not trivial, however, because simple difference scores may yield biased estimates depending on students’ baseline knowledge. The simplest approach is raw gain, calculated as the arithmetic difference between posttest and pretest scores. Raw gain scores correlate negatively with baseline performance (ie, pretest scores) and are affected by ceiling effects: students with lower pretest scores may appear to exhibit larger gains simply because they have more room for improvement [17-19].

To address these limitations, several modified gain metrics have been proposed. One widely used approach is the normalized gain introduced by Hake [20], which expresses the observed pretest-posttest gain relative to the maximum possible gain. Although this metric has been applied extensively in educational research [21,22], it also has important methodological limitations. It remains dependent on baseline performance, may inflate gains for students with high pretest scores, and behaves inconsistently when posttest scores fall below pretest scores or when pretest scores approach the maximum value [16,23].
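The behavior of these two metrics can be illustrated with a minimal Python sketch. It assumes percentage scores on a 0 to 100 scale and the common single-student form of the normalized gain, (post − pre)/(100 − pre); the function names and example scores are illustrative and are not taken from the cited studies.

```python
def raw_gain(pre: float, post: float) -> float:
    """Raw gain: arithmetic difference between posttest and pretest scores."""
    return post - pre


def normalized_gain(pre: float, post: float, max_score: float = 100.0) -> float:
    """Normalized gain: observed gain relative to the maximum possible gain."""
    if pre >= max_score:
        # No room for improvement: the denominator is zero, so the metric is undefined.
        raise ValueError("normalized gain is undefined when pre equals the maximum score")
    return (post - pre) / (max_score - pre)


# The same raw gain of 5 points gives very different normalized gains.
low_baseline = normalized_gain(40, 45)        # 5 / 60
high_baseline = normalized_gain(90, 95)       # 5 / 10
# A loss from a high baseline produces a large negative value.
loss_high_baseline = normalized_gain(90, 80)  # -10 / 10
```

In this sketch, a student starting at 90% receives a normalized gain six times larger than a student starting at 40% for the same 5-point improvement, which illustrates the dependence on baseline performance described above.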

Taken together, existing gain metrics may distort estimates of learning gain, particularly in cohorts with heterogeneous baseline knowledge. Many of these metrics either remain strongly dependent on baseline performance or require complex psychometric modeling. This highlights the need for approaches that provide statistically robust yet practically applicable estimates of learning gain in educational evaluation.

A recently proposed metric developed by our workgroup, the “Weighted Gain Score” (WGS), aims to address these limitations by applying a weighting coefficient that adjusts gain calculations according to students’ baseline performance [16]. However, the statistical and inferential properties of this metric have not yet been systematically investigated. To address this gap, we evaluated the WGS by comparing it with Rasch Learning Gain (RLG), a Rasch model–based approach for estimating learning gain that served as the benchmark in our study [24]. Specifically, we addressed the following research questions:

(1) Does the WGS produce inferential results comparable to those produced by RLG?

(2) Can the parameter µ in the WGS be adjusted for different cohorts to calibrate learning gain calculations without altering inferential conclusions?

Through this analysis, we aimed to clarify the statistical behavior of the WGS and explore its potential applicability for the evaluation of educational interventions.

Methods

Metric WGS

The mathematical foundation of the WGS lies in the use of the weighting coefficient “pre/µ,” which linearly transforms the difference between pretest and posttest percentage scores (denoted as “pre” and “post” in equation 1), thereby adjusting for pretest variability [16]. Formally, the WGS is defined as:

$$\mathrm{WGS} = (\mathrm{post} - \mathrm{pre}) \times \frac{\mathrm{pre}}{\mu} \tag{1}$$

To illustrate the computation, consider a hypothetical student with a pretest score of 40% and a posttest score of 70%.

For µ=50, the WGS is calculated as: WGS = (70 − 40) × (40/50) = 30 × 0.8 = 24.

If µ is increased to 70, the same performance yields: WGS = (70 − 40) × (40/70) ≈ 30 × 0.571 ≈ 17.14.

This example illustrates that increasing µ reduces the magnitude of the calculated gain while preserving the relative ordering of observations. When posttest scores fall below pretest scores, the WGS assumes negative values, indicating a decrease in performance.
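The calculation can be reproduced with a few lines of Python. This is a minimal sketch of equation 1 applied to percentage scores; the function name and the guard against nonpositive µ are illustrative choices, not part of the original formulation.

```python
def weighted_gain_score(pre: float, post: float, mu: float = 50.0) -> float:
    """WGS = (post - pre) * (pre / mu), with pre and post given as percentages."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    return (post - pre) * (pre / mu)


wgs_mu50 = weighted_gain_score(40, 70, mu=50)  # 30 * (40 / 50)
wgs_mu70 = weighted_gain_score(40, 70, mu=70)  # 30 * (40 / 70)
wgs_loss = weighted_gain_score(70, 40, mu=50)  # posttest below pretest gives a negative value
```

Rounded to 2 decimal places, `wgs_mu50` and `wgs_mu70` correspond to the values 24 and 17.14 in the example above.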

Originally, the parameter µ used in the weighting coefficient was defined as the average pretest score of a cohort. It was constrained to integer values between 1 and 100, consistent with the percentage format of “pre” and “post.” In the original formulation, its value was set at 50 as a default reference value [16]. In this study, µ is interpreted as an adjustment coefficient that functions as a scaling parameter for learning gain calculations. Because µ enters the formula only as the divisor of the weighting coefficient, changing its value rescales every calculated gain score by the same constant factor (1/µ): higher values of µ lead to smaller gain estimates, whereas lower values produce larger gain estimates. Importantly, as long as the same µ is applied to all observations in an analysis, this modification is a linear transformation of the calculated values and therefore does not alter the underlying statistical relationships among observations.
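The scaling argument can be written out in a short derivation, under the assumption that a single µ is applied to every observation within a given analysis. For two values µ₁ and µ₂:

$$\mathrm{WGS}_{\mu_2} = (\mathrm{post} - \mathrm{pre}) \times \frac{\mathrm{pre}}{\mu_2} = \frac{\mu_1}{\mu_2}\,\mathrm{WGS}_{\mu_1}$$

With $c = \mu_1/\mu_2 > 0$, the Pearson correlation with any other variable $Y$ satisfies

$$r(c\,\mathrm{WGS}_{\mu_1}, Y) = \frac{c\,\mathrm{Cov}(\mathrm{WGS}_{\mu_1}, Y)}{c\,\sigma_{\mathrm{WGS}_{\mu_1}}\,\sigma_Y} = r(\mathrm{WGS}_{\mu_1}, Y)$$

In an ANOVA, all sums of squares are multiplied by $c^2$, so F ratios and partial η² remain unchanged, whereas mean differences and CI bounds are multiplied by $c$. For example, moving from µ=50 to µ=30 multiplies every WGS value by 50/30 ≈ 1.67.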

To examine the influence of this parameter on the stability of the WGS, we tested 3 calibration levels in our datasets: µ=30, µ=50, and µ=70. These values represent 3 nonextreme points within the possible range of 1 to 100, allowing us to evaluate the behavior of the WGS across low, moderate, and high scaling conditions.

Rasch Model and RLG

The Rasch model is a fundamental concept in modern psychometric measurement. The probability that a student answers a specific item correctly depends on 2 key factors: the student’s ability and the difficulty of the item. In the Rasch framework, a student’s latent ability is denoted by θ, whereas item difficulty is represented by β. When a student’s ability exceeds the difficulty of an item, the probability of answering correctly increases, and vice versa [25]. Because the Rasch model allows the estimation of individual students’ abilities independently of the specific test items used, it is widely applied in educational measurement and medical education research [26]. With this in mind, we selected RLG as a reference method for evaluating the WGS.
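The relationship between ability and difficulty can be sketched in Python using the standard logistic form of the dichotomous Rasch model, P(correct) = exp(θ − β) / (1 + exp(θ − β)). The function name and the example values of θ and β are illustrative assumptions.

```python
import math


def rasch_probability(theta: float, beta: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


p_equal = rasch_probability(theta=0.0, beta=0.0)   # ability equals difficulty
p_easier = rasch_probability(theta=1.0, beta=0.0)  # ability exceeds difficulty
p_harder = rasch_probability(theta=0.0, beta=1.0)  # difficulty exceeds ability
```

When ability equals difficulty, the probability of a correct response is 0.5; it rises above 0.5 when θ exceeds β and falls below 0.5 otherwise.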

We applied the dichotomous 1-parameter logistic Rasch model. Item parameters were estimated using conditional maximum likelihood estimation. On the basis of the fitted model, person abilities were subsequently calculated using maximum likelihood estimation separately for the pretest (θ_pre) and posttest (θ_post) data [27].

As indicated in equation 2, RLG was defined as the difference between the estimated posttest and pretest abilities. This difference represents the change in latent ability on the Rasch measurement scale and serves as an estimate of individual learning gain across the instructional intervention [24,28].

$$\mathrm{RLG} = \theta_{\mathrm{post}} - \theta_{\mathrm{pre}} \tag{2}$$
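To make the two-step logic concrete, the following Python sketch estimates a person ability by maximum likelihood for fixed item difficulties and then takes the difference between posttest and pretest abilities. It is not the eRm implementation used in this study: the Newton-Raphson routine, the function names, and the item difficulties (all set to 0 logits) are assumptions chosen for illustration, and item difficulties are treated as already estimated.

```python
import math
from typing import Sequence


def person_ability_mle(raw_score: int, betas: Sequence[float],
                       tol: float = 1e-8, max_iter: int = 100) -> float:
    """Maximum likelihood estimate of theta for a raw score and fixed item difficulties."""
    n_items = len(betas)
    if raw_score <= 0 or raw_score >= n_items:
        # Zero and perfect scores have no finite maximum likelihood estimate.
        raise ValueError("no finite ML estimate for extreme raw scores")
    theta = 0.0
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in betas]
        expected = sum(probs)                            # expected raw score at theta
        information = sum(p * (1.0 - p) for p in probs)  # test information at theta
        step = (raw_score - expected) / information      # Newton-Raphson update
        theta += step
        if abs(step) < tol:
            break
    return theta


def rasch_learning_gain(pre_score: int, post_score: int, betas: Sequence[float]) -> float:
    """RLG = theta_post - theta_pre, using the same item difficulties at both time points."""
    return person_ability_mle(post_score, betas) - person_ability_mle(pre_score, betas)


# Hypothetical difficulties for 9 items, all set to 0 logits for simplicity.
example_betas = [0.0] * 9
example_rlg = rasch_learning_gain(pre_score=3, post_score=6, betas=example_betas)
```

With equal item difficulties, the ability estimate reduces to the log-odds of the proportion correct, so a change from 3 to 6 correct answers out of 9 corresponds to an RLG of 2 × ln 2 logits. The sketch also shows why extreme scores need special handling in Rasch-based approaches.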

To ensure the validity of Rasch-based ability estimates, we examined global model fit indicators. Item infit and outfit statistics ranged between 0.7 and 1.3, which is generally considered acceptable for the Rasch model. In addition, person reliability exceeded 0.8, and separation indices were >2, indicating satisfactory measurement precision.

Datasets

Three datasets were used to examine the behavior of the WGS under different analytical conditions:

(1) The illustrative dataset (n=10): a small artificial dataset designed to illustrate the mathematical behavior of the parameter µ within the WGS metric

(2) The empirical dataset (n=170): a dataset consisting of real-world data derived from a previously published educational evaluation study [29], used to examine the behavior of the WGS under authentic educational conditions and to perform inferential statistical analyses

(3) The simulated dataset (n=1000): a randomly generated binomial dataset designed to mirror the structure of the empirical dataset while providing a larger sample size, allowing the behavior of the parameter µ to be examined independently of the empirical data

The Illustrative Dataset

Following the design of the simulated dataset in our previous study [16], we created an artificial dataset by combining different pretest scores with varying levels of raw gain in test performance, defined as the absolute difference between posttest and pretest scores. Pretest scores ranged from 1 to 10 points, and the gain in performance was simulated by increasing test scores by 1 to 4 points. To avoid potential ceiling effects, the analysis included only combinations in which the sum of pretest scores and the simulated gains did not exceed the maximum of 10 points. The sample size of the illustrative dataset was set at 10. RLG was not applicable here, as Rasch model–based estimation requires larger sample sizes to obtain stable parameter estimates [30].
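The construction described above can be approximated with the following Python sketch. It enumerates all admissible combinations of pretest score and raw gain, converts points to percentages of the 10-point maximum, and computes the WGS for the 3 calibration levels. The variable names are illustrative, the conversion to percentages is an assumption, and the sketch makes no claim about how the combinations map onto the reported sample size of 10.

```python
MAX_POINTS = 10           # maximum achievable test score
MU_VALUES = (30, 50, 70)  # calibration levels examined in the study


def weighted_gain_score(pre_pct, post_pct, mu):
    """WGS on percentage scores: (post - pre) * (pre / mu)."""
    return (post_pct - pre_pct) * (pre_pct / mu)


grid = []
for gain in range(1, 5):                  # raw gain of 1 to 4 points
    for pre in range(1, MAX_POINTS + 1):  # pretest score of 1 to 10 points
        post = pre + gain
        if post > MAX_POINTS:             # skip combinations above the maximum score
            continue
        pre_pct = 100 * pre / MAX_POINTS
        post_pct = 100 * post / MAX_POINTS
        row = {"pre": pre, "gain": gain}
        for mu in MU_VALUES:
            row[f"wgs_mu{mu}"] = weighted_gain_score(pre_pct, post_pct, mu)
        grid.append(row)

# One combination for inspection: pretest score of 6 points, gain of 1 point.
example = next(r for r in grid if r["pre"] == 6 and r["gain"] == 1)
```

For each gain scenario, the WGS increases linearly with the pretest score, and the slope of that line is inversely related to µ.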

The Empirical Dataset

The empirical dataset originated from a prospective educational study conducted at the University Medical Center Göttingen in Göttingen, Germany [29]. The study compared the learning gain of students attending a traditional lecture on goiter with that of students using a corresponding video podcast (vodcast) within the teaching module “Operative Medicine.” The study was conducted over 2 consecutive semesters using a pretest-posttest design based on 9 multiple-choice test items. A total of 170 students participated. Students were additionally surveyed regarding their learning dispositions, which resulted in the classification of participants into 2 groups: “traditional learners” and “digital natives.” A total of 35 students (20.59%) could not be clearly assigned to either group and were therefore excluded from group-based analyses. Consequently, 135 (79.41%) students were included in the 2-way ANOVA examining the interaction between teaching format and learning disposition (Table 1).

Table 1.

Distribution of students according to teaching format and learning disposition in the empirical and simulated datasets.

Dataset	Teaching format	Traditional learners, n (%)	Digital natives, n (%)
Empirical (N=135)	Lecture	38 (28.15)	34 (25.19)
Empirical (N=135)	Vodcast	28 (20.74)	35 (25.93)
Simulated (N=1000)	Lecture	259 (25.9)	210 (21.0)
Simulated (N=1000)	Vodcast	250 (25.0)	281 (28.1)

The Simulated Dataset

The simulated dataset was generated from a binomial distribution, assuming a 50% probability of correctly answering each hypothetical examination item. This probability was applied to 9 multiple-choice items in both the pretest and the posttest, mirroring the structure of the empirical dataset. Apart from the larger sample size, the primary difference between the empirical and simulated datasets was the random allocation of group variables. Two variables were simulated: teaching format and learning disposition. Both variables were coded dichotomously. For consistency in labeling, the simulated variables were named analogously to those in the empirical dataset, although they represent random group assignments rather than actual instructional formats or learning characteristics. Each simulated student had a 50% probability of being assigned to each category (Table 1). The sample size for the simulated dataset was 1000.
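A data-generating process of this kind can be sketched in Python as follows. The study itself used R; the function name, the dictionary keys, and the random seed are assumptions for illustration, and the sketch does not reproduce the actual simulated values.

```python
import random


def simulate_dataset(n_students: int = 1000, n_items: int = 9,
                     p_correct: float = 0.5, seed: int = 2026):
    """Simulate binomial pretest and posttest scores with random group labels."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for reproducibility
    students = []
    for _ in range(n_students):
        # Each score is the number of correct answers among n_items independent items.
        pre = sum(rng.random() < p_correct for _ in range(n_items))
        post = sum(rng.random() < p_correct for _ in range(n_items))
        students.append({
            "pre": pre,
            "post": post,
            # Group labels are assigned at random with equal probability.
            "teaching_format": rng.choice(["lecture", "vodcast"]),
            "learning_disposition": rng.choice(["traditional learner", "digital native"]),
        })
    return students


simulated = simulate_dataset()
p_zero_pretest = (1 - 0.5) ** 9  # probability of answering none of the 9 items correctly
```

Under these assumptions, the probability of a pretest score of zero is 0.5⁹ ≈ 0.002, which corresponds to about 2 of 1000 simulated students on average.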

Statistical Analysis

All simulations and statistical analyses were conducted using the R software suite (version 4.1.2; R Foundation for Statistical Computing) [31]. Rasch modeling was performed using the eRm package [32].

To examine the relationship between the 2 learning gain metrics, Pearson correlation coefficients were calculated between the WGS and RLG scores.

To investigate potential interaction effects between teaching format and learning disposition, we conducted a 2-way ANOVA. Post hoc comparisons were performed using Bonferroni-adjusted contrasts. Effect sizes were reported as partial η², and 95% CIs were calculated where appropriate.

Normality of the dependent variables was assessed using the Shapiro-Wilk test and visual inspection of Q-Q plots. Minor deviations from normality were observed, which are common in bounded percentage scores (0% to 100%) frequently used in educational assessments. Given the present sample sizes and the absence of influential outliers, ANOVA was considered sufficiently robust to moderate violations of the normality assumption.

Homogeneity of variances across groups was evaluated using the Levene test and the Brown-Forsythe test, both of which indicated no statistically significant differences in variance between groups. All statistical tests were 2-sided, and a significance level of P<.05 was applied.

Ethical Considerations

The empirical data analyzed within this work were reviewed and judged by the local institutional review and ethics board (Medical Ethics Committee, University Medical Center Göttingen) as not representing medical or epidemiological research on human participants and, therefore, were assessed using a simplified assessment protocol. The project was approved without any reservation under proposal number 1-11-14.

Results

Effect of the Parameter µ on WGS Learning Gain Estimates

The illustrative dataset demonstrates the mathematical effect of varying µ (30, 50, and 70) on the WGS. Changes in µ systematically altered the slope of the WGS learning gain plots (Figure 1). Each subplot represents a different raw gain scenario, ranging from 1 to 4 points. As the µ-value increased, the slope of the learning gain curve decreased, resulting in smaller WGS values for the same pretest score. For example, with a gain of 1 point and a pretest score of 6, the WGS was approximately 20% for µ=30 and decreased to <10% for µ=70. This pattern remained consistent across all 4 gain scenarios, illustrating that increasing µ reduces the magnitude of the calculated learning gain while preserving the relative ordering of observations.

Figure 1.

Effect of varying µ (30, 50, and 70) on Weighted Gain Score (WGS) learning gain estimates in the illustrative dataset. (A) Gain of 1 point, (B) gain of 2 points, (C) gain of 3 points, and (D) gain of 4 points.

Correlation Analysis Between WGS and RLG

The WGS demonstrated a strong positive correlation with RLG across all tested µ-values (Figure 2). In the empirical dataset, the Pearson correlation coefficient was consistently high (r=0.93; P<.001). A similarly strong relationship was observed in the simulated dataset (r=0.92; P<.001). The correlation coefficients remained identical across the tested µ-values (30, 50, and 70) in both datasets.
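The invariance of the correlation coefficient across µ-values follows from the scaling property of the WGS and can be checked numerically. The Python sketch below uses a small set of hypothetical scores and a hypothetical comparison variable standing in for RLG; none of these values are taken from the study datasets.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def weighted_gain_score(pre, post, mu):
    """WGS = (post - pre) * (pre / mu) on percentage scores."""
    return (post - pre) * (pre / mu)


# Hypothetical percentage scores and a hypothetical comparison metric.
pre_scores = [20, 40, 50, 60, 80]
post_scores = [50, 70, 60, 90, 85]
reference = [1.2, 1.4, 0.4, 1.9, 0.5]

r_by_mu = {
    mu: pearson_r(
        [weighted_gain_score(a, b, mu) for a, b in zip(pre_scores, post_scores)],
        reference,
    )
    for mu in (30, 50, 70)
}
```

The 3 entries of `r_by_mu` agree up to floating-point rounding, because dividing every WGS value by a different constant does not change its correlation with another variable.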

Figure 2.

Correlation between Weighted Gain Score (WGS) and Rasch Learning Gain (RLG) in the empirical and simulated datasets. (A) Empirical dataset with µ=30, (B) empirical dataset with µ=50, (C) empirical dataset with µ=70, (D) simulated dataset with µ=30, (E) simulated dataset with µ=50, and (F) simulated dataset with µ=70.

Analysis of Interaction Effects Using WGS and RLG

All 3 calibrations of the WGS (µ=30, µ=50, and µ=70) detected a significant interaction effect between teaching format and learning disposition in the empirical dataset (Figure 3). Traditional learners displayed higher learning gains in the lecture format than digital natives (F(1,131)=6.51; P=.01; partial η²=0.05). For µ=50, the mean difference was −11.64 (95% Bonferroni-adjusted CI −21.46 to −1.83; P=.01). Corresponding estimates were −19.41 for µ=30 (95% Bonferroni-adjusted CI −35.80 to −3.04; P=.01) and −8.32 for µ=70 (95% Bonferroni-adjusted CI −15.33 to −1.31; P=.01).

Figure 3.

Learning gain estimates calculated using Weighted Gain Score (WGS) and Rasch Learning Gain (RLG), depicting the interaction between teaching format and learning disposition in the empirical and simulated datasets. (A) Empirical dataset with WGS (µ=30), (B) empirical dataset with WGS (µ=50), (C) empirical dataset with WGS (µ=70), (D) empirical dataset with RLG, (E) simulated dataset with WGS (µ=30), (F) simulated dataset with WGS (µ=50), (G) simulated dataset with WGS (µ=70), and (H) simulated dataset with RLG. **Indicates statistical significance at P=.01.

RLG also detected this interaction effect (F(1,131)=6.75; P=.01; partial η²=0.05) with a mean difference of −19.91 (95% Bonferroni-adjusted CI −36.80 to −3.05; P=.01), confirming the interaction pattern observed in the original study from which our empirical dataset was derived [16,29].

In the simulated dataset, no significant interaction between teaching format and learning disposition was observed when learning gains were calculated using the WGS, regardless of the µ-value applied (F(1,996)=0.39; P=.53; partial η²<0.001; Figure 3). Similarly, RLG did not reveal any significant difference in performance between the groups (F(1,996)=1.10; P=.29; partial η²=0.001). Because teaching format and learning disposition were randomly assigned in the simulated dataset, no interaction effect was expected.
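The reported effect sizes can be cross-checked against the F statistics using the standard identity partial η² = (F × df_effect) / (F × df_effect + df_error). The Python sketch below applies this identity to the values reported above; the function and variable names are illustrative.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


eta_wgs_empirical = partial_eta_squared(6.51, 1, 131)  # WGS, empirical dataset
eta_rlg_empirical = partial_eta_squared(6.75, 1, 131)  # RLG, empirical dataset
eta_wgs_simulated = partial_eta_squared(0.39, 1, 996)  # WGS, simulated dataset
eta_rlg_simulated = partial_eta_squared(1.10, 1, 996)  # RLG, simulated dataset
```

The recomputed values are consistent with the reported effect sizes after rounding. Because the identity depends only on F and the degrees of freedom, it also shows why partial η² does not change when µ is varied.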

Discussion

Inferential Behavior of WGS Compared With RLG

A robust method for calculating learning gain is essential for capturing students’ learning progress following an educational intervention and for providing interpretable indicators of educational effectiveness. Such a method should be statistically sound, transparent, and practically applicable within evaluation processes.

This study evaluated the statistical behavior of the WGS, a method designed to estimate learning gain in a way that is both methodologically robust and straightforward to implement. The first research question examined whether the WGS yields inferential results comparable to those obtained with RLG. Our findings demonstrated a strong inferential correspondence between the 2 methods. The WGS produced learning gain estimates that correlated highly with those derived from RLG, while also identifying the same interaction effect in the empirical dataset as the Rasch model–based approach. Importantly, these inferential conclusions remained stable across all tested µ-values (30, 50, and 70). The identical correlation coefficients between the WGS and RLG and the unchanged ANOVA results indicate that modifying the parameter µ linearly rescales learning gain estimates. Consequently, varying µ changes the magnitude of WGS values but does not affect statistical inference.

Robustness of WGS Under Nonnormally Distributed Data

Neither the empirical nor the simulated dataset fully satisfied the assumption of normality, although no substantial skewness was observed. In medical education research, deviations from normality are common, particularly in pretest-posttest designs [17,33]. A ceiling effect occurs when pretest scores approach the maximum possible value, limiting the measurable improvement, whereas a floor effect arises when pretest performance is concentrated near the minimum score in a difficult test. Very easy items tend to produce ceiling effects, whereas very difficult items may lead to floor effects. Despite deviations from normality, the WGS demonstrated stable inferential behavior across the empirical and simulated datasets, suggesting a degree of robustness. This finding is consistent with previous research indicating that parametric methods such as ANOVA and correlation analyses are generally robust to moderate violations of normality, particularly in samples of the size examined in this study [34-36]. Nevertheless, future research is needed to examine the behavior of the WGS across a broader range of distributional scenarios to better establish its reliability.

Applicability of WGS in Small Samples

The illustrative dataset demonstrates that the WGS yields interpretable results even with very small sample sizes. In contrast, Rasch model–based approaches typically require substantially larger samples to ensure stable estimation of item parameters and person abilities [24,26]. This distinction is particularly relevant in educational settings with small cohorts, such as specialized teaching modules, pilot courses, or resource-intensive instructional interventions. In such contexts, the WGS may represent a practical alternative method for estimating learning gain because it does not rely on complex parameter estimation.

More broadly, transparent feedback on learning outcomes supports the continuous development of teaching practices, as evidence suggests that feedback on educational performance encourages educators to engage in reflective improvement of their teaching [37-39].

The Role of µ as an Adjustment Coefficient

The second research question examined whether the parameter µ in the WGS can be adjusted across different cohorts to calibrate learning gain calculations without altering inferential outcomes. In the original study introducing the WGS, µ was defined as the average pretest score of a cohort. Our findings suggest that the role of µ can be understood more broadly. Rather than representing solely the cohort mean, µ functions as a scaling parameter that allows calibration of the learning gain metric. To reflect this role more accurately, we interpret µ in this study as an adjustment coefficient that can be modified depending on the analytical purpose of the evaluation. On the basis of the results of this study, 3 conceptual adjustment strategies can be distinguished: absolute adjustment, relative adjustment, and routine evaluation. A decision framework for selecting µ is provided in Multimedia Appendix 1.

Absolute Adjustment: Monitoring of Cohort Learning

Absolute adjustment refers to the use of a fixed µ-value to estimate learning gain within a stable scaling framework. When µ remains constant, differences in learning gain across courses, time points within the curriculum, or different cohorts can be interpreted without recalibration of the metric, thereby ensuring cross-cohort comparability. This approach supports standardized monitoring of educational outcomes, for example, when evaluating curricular developments over time or comparing modules within a program. Observed differences in learning gain may arise from multiple factors, including instructional design, assessment characteristics, or cohort composition. Maintaining a fixed µ ensures that such differences remain visible and can be attributed to substantive factors.

Relative Adjustment: Evaluation of Teaching Interventions

Relative adjustment enables comparison of teaching interventions across cohorts with heterogeneous characteristics. In educational practice, cohorts often differ in characteristics such as demographics, motivation, workload, or external contextual influences [13,40]. When learning gain is used to compare instructional formats, such heterogeneity may affect the interpretation of outcomes. Under a relative adjustment strategy, a µ-value may be calibrated separately for each cohort, allowing the scaling of learning gain calculations to reflect cohort-specific baseline conditions. Although this approach does not eliminate potential confounding factors, it may reduce systematic bias associated with heterogeneous starting conditions. This strategy is particularly useful when learning gain is evaluated without strict requirements for cross-cohort comparability, but with a focus on fair comparison of teaching interventions within specific cohorts or instructional contexts.

Routine Evaluation: Selecting µ in Practice

In routine applications, when learning gain is estimated without strict requirements for cross-cohort comparability or cohort-specific calibration, µ may be selected pragmatically based on the cohort’s mean pretest performance. For example, cohorts with mean pretest scores approximately 50% of the maximum achievable score may be assigned µ=50, whereas cohorts with substantially higher or lower baseline knowledge may be assigned correspondingly higher (eg, ≥70) or lower (eg, ≤30) µ-values. This pragmatic approach enables straightforward estimation of learning gain while preserving a transparent and easily interpretable scaling of the WGS metric.
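One possible reading of this pragmatic rule is sketched below in Python: µ is set to the cohort’s mean pretest percentage, rounded to an integer and restricted to the range of 1 to 100. The function name, the rounding, and the clamping are assumptions for illustration and do not reproduce the decision framework in Multimedia Appendix 1.

```python
from statistics import mean


def pragmatic_mu(pretest_percentages) -> int:
    """Choose mu as the rounded cohort mean pretest percentage, clamped to 1-100."""
    cohort_mean = mean(pretest_percentages)
    # Python's round() uses banker's rounding for exact .5 values.
    return min(100, max(1, round(cohort_mean)))


mu_average_cohort = pragmatic_mu([45, 50, 55])  # cohort mean of 50%
mu_weak_cohort = pragmatic_mu([20, 30, 40])     # cohort mean of 30%
mu_floor_cohort = pragmatic_mu([0, 0])          # clamped to the lower bound of 1
```

Clamping to a minimum of 1 keeps the weighting coefficient defined even for a cohort with a mean pretest score of zero.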

Limitations

One limitation of the WGS arises when a student obtains a pretest score of zero: the weighting coefficient pre/µ then equals zero, so the calculated learning gain is zero regardless of posttest performance. In practice, such cases are unlikely in multiple-choice assessments because guessing and prior knowledge increase the probability of obtaining at least 1 correct response [41]. In the empirical dataset, no student recorded a pretest score of zero, and in the simulated dataset, a negligible number (3 of 1000 students) scored zero points on the pretest. One possible strategy is to exclude such observations from the analysis. However, this may reduce statistical power and introduce bias if students with low baseline scores are systematically underrepresented. Alternatively, a small positive offset (pseudocount) could be added to the pretest score so that the weighting coefficient does not collapse to zero, analogous to continuity corrections used in categorical data analysis [42,43]. The implications of such adjustments should be examined in future methodological studies, for example, through sensitivity analyses comparing different handling strategies for zero-baseline observations [44].
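The zero-baseline case and the offset idea can be illustrated with a short Python sketch. The offset variant is hypothetical: the way the offset enters the formula and its default value of 1 percentage point are assumptions made here for illustration and have not been evaluated in this study.

```python
def weighted_gain_score(pre: float, post: float, mu: float = 50.0) -> float:
    """WGS = (post - pre) * (pre / mu) on percentage scores."""
    return (post - pre) * (pre / mu)


def weighted_gain_score_offset(pre: float, post: float, mu: float = 50.0,
                               offset: float = 1.0) -> float:
    """Hypothetical variant: add a small positive offset to the pretest weight."""
    return (post - pre) * ((pre + offset) / mu)


zero_baseline = weighted_gain_score(0, 60)                # weight is 0, so the gain is 0
zero_baseline_offset = weighted_gain_score_offset(0, 60)  # 60 * (1 / 50)
```

Note that such an offset changes every WGS value, not only those with a zero pretest score, which is one reason why a sensitivity analysis would be needed before adopting it.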

A further limitation concerns the sample size of the empirical dataset (n=170). Although cohort sizes of this magnitude are common in single-semester cohorts at German medical faculties, they lie at the lower end of commonly cited recommendations for stable Rasch parameter estimation, which often suggest sample sizes of approximately 150 to 200 participants or more [45]. Nevertheless, global model fit indicators in the empirical dataset (infit and outfit statistics, person reliability, and separation indices) were within acceptable ranges, supporting the interpretability of the RLG-based estimates despite the moderate sample size.

Another limitation relates to the test length used in the simulated dataset, which consisted of 9 multiple-choice items to mirror the empirical dataset. Because measurement reliability generally increases with test length [46-50], the limited number of items may reduce measurement precision and restrict the generalizability of the findings. Therefore, future research should examine the performance of the WGS in assessments with larger item sets that more closely reflect the scope of medical examinations.
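The relationship between test length and reliability is often approximated with the Spearman-Brown prophecy formula from classical test theory, ρ_k = kρ / (1 + (k − 1)ρ), where k is the factor by which the test is lengthened. The Python sketch below applies this formula to a hypothetical reliability of 0.60 for a 9-item test; the formula is not part of the analyses in this study, it assumes that added items are parallel to the existing ones, and the reliability value is invented for illustration.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability after changing test length by the given factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical 9-item test with reliability 0.60, lengthened to 36 items (factor 4).
predicted_36_items = spearman_brown(0.60, 4)
unchanged = spearman_brown(0.60, 1)  # a factor of 1 leaves reliability unchanged
```

Under these assumptions, quadrupling the number of items would raise the predicted reliability from 0.60 to approximately 0.86.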

Finally, both datasets exhibited deviations from normality, although homogeneity of variances across groups was supported by the Levene and Brown-Forsythe tests, and no influential outliers were observed. Previous methodological research indicates that ANOVA and Pearson correlation are generally robust to moderate violations of normality, particularly in samples of the present size [34-36]. Therefore, we consider the impact of nonnormality on the inferential conclusions to be limited.

Conclusions and Future Research

This study evaluated the WGS as a method for estimating learning gain in pretest-posttest educational designs. Our findings indicate that the WGS provides robust and easily interpretable estimates while remaining computationally simple. Rather than replacing established psychometric models, the WGS may complement existing approaches, particularly in routine educational evaluations.

Future research should further develop the WGS as a broadly applicable evaluation instrument. In particular, establishing a methodologically sound calibration framework for µ will be essential, including empirically grounded decision models that guide µ-selection according to the evaluation purpose, such as cohort monitoring or comparative evaluation of teaching interventions. In addition, integrating the WGS into structured program evaluations, including longitudinal monitoring across courses, will be important for assessing its generalizability across educational contexts.

Future work may also explore the integration of the WGS within Bayesian test-theoretical frameworks [51]. By incorporating prior information and updating gain estimates as new data become available, Bayesian approaches could further improve the precision and contextual sensitivity of WGS-based learning gain estimates. Further studies should also examine the behavior of the WGS under different distributional conditions to better establish its robustness.

Acknowledgments

The authors sincerely thank Simone Kann and Michael Schuler for their valuable insights and thoughtful suggestions, as well as Andrew Entwistle for his contribution to the revision of this manuscript.

Funding

This research did not receive funding from any specific grant provided by public, commercial, or not-for-profit agencies.

Data Availability

The data supporting the findings of this study are provided as a multimedia appendix to facilitate full reproducibility.

Authors' Contributions

All authors were involved in the conception and/or design of the study and contributed critically to the final preparation of this study, including approving the final version of the manuscript. In particular, SK conceived and designed the study, wrote the final study protocol, and drafted the manuscript. RA conducted the study, collected the results, and analyzed the data. SH and JB analyzed the data and performed and verified the statistical analyses.

Conflicts of Interest

None declared.

Abbreviations

WGS: Weighted Gain Score

RLG: Rasch Learning Gain

References

1. Charalambous, Praetorius, Sammons, Walkowiak, Jentsch, Kyriakides. Working more collaboratively to better understand teaching and its quality: challenges faced and possible solutions. Stud Educ Eval. Dec 2021;71:101092. doi: 10.1016/j.stueduc.2021.101092

2. Gibson, Boyle, Black, Cunningham, Grimm, McNeil. Enhancing evaluation in an undergraduate medical education program. Acad Med. Aug 2008;83(8):787-793. doi: 10.1097/ACM.0b013e31817eb8ab. PMID: 18667897

3. Litzelman, Stratos, Marriott, Skeff. Factorial validation of a widely disseminated educational framework for evaluating clinical teachers. Acad Med. Jun 1998;73(6):688-695. doi: 10.1097/00001888-199806000-00016. PMID: 9653408

4. Noor, Hozan, Vîlceanu, Bonțea. A review of the effectiveness of the role of various components in medical education. Arch Pharm Pract. 2023;14(4):155-159. doi: 10.51847/LrElkFGJAO

5. McGaghie, Issenberg, Cohen, Barsuk, Wayne. Medical education featuring mastery learning with deliberate practice can lead to better health for individuals and populations. Acad Med. Nov 2011;86(11):e8-e9. doi: 10.1097/ACM.0b013e3182308d37. PMID: 22030671

6. Gould, Grey, Huntington. Improving patient care outcomes by teaching quality improvement to medical students in community-based practices. Acad Med. Oct 2002;77(10):1011-1018. doi: 10.1097/00001888-200210000-00014. PMID: 12377677

7. Schiekirka-Schwake, Anders, von Steinbüchel, Becker, Raupach. Facilitators of high-quality teaching in medical school: findings from a nation-wide survey among clinical teachers. BMC Med Educ. Sep 29, 2017;17(1):178. doi: 10.1186/s12909-017-1000-6. PMID: 28962568

8. Schiekirka, Reinhardt, Beißbarth, Anders, Pukrop, Raupach. Estimating learning outcomes from pre- and posttest student self-assessments: a longitudinal study. Acad Med. Mar 2013;88(3):369-375. doi: 10.1097/ACM.0b013e318280a6f6. PMID: 23348083

9. Gruppen. Outcome-based medical education: implications, opportunities, and challenges. Korean J Med Educ. Dec 2012;24(4):281-285. doi: 10.3946/kjme.2012.24.4.281. PMID: 25813324

10. Harden. AMEE guide no. 14: outcome-based education: part 1-an introduction to outcome-based education. Med Teach. Jan 1999;21(1):7-14. doi: 10.1080/01421599979969

11. Haverkamp, Barth, Schmidt, Dahmen, Keis, Raupach. Position statement of the GMA committee “teaching evaluation”. GMS J Med Educ. 2024;41(2):Doc19. doi: 10.3205/zma001674. PMID: 38779701

12. Fraenkel, Wallen, Hyun. How to Design and Evaluate Research in Education. 8th ed. McGraw-Hill; 2012.

13. Cook, Beckman. Reflections on experimental research in medical education. Adv Health Sci Educ Theory Pract. Aug 2010;15(3):455-464. doi: 10.1007/s10459-008-9117-3. PMID: 18427941

14. Colt, Davoudi, Murgu, Zamanian Rohani. Measuring learning gain during a one-day introductory bronchoscopy course. Surg Endosc. Jan 2011;25(1):207-216. doi: 10.1007/s00464-010-1161-4. PMID: 20585964

15. McGrath, Guerin, Harte, Frearson, Manville. Learning gain in higher education. RAND Corporation; 2015. URL: https://www.rand.org/pubs/research_reports/RR996.html. Accessed 2026-05-25.

16. Westphale, Backhaus, Koenig. Quantifying teaching quality in medical education: the impact of learning gain calculation. Med Educ. Mar 2022;56(3):312-320. doi: 10.1111/medu.14694. PMID: 34767274

17. Šimkovic, Träuble. Robustness of statistical methods when measure is affected by ceiling and/or floor effect. PLoS One. 2019;14(8):e0220889. doi: 10.1371/journal.pone.0220889. PMID: 31425561

18. Bereiter. Some persisting dilemmas in the measurement of change. In: Harris, editor. Problems in Measuring Change. University of Wisconsin Press; 1963:3-20.

19. Prieler. Problems in the measurement of change (with particular reference to individual change [gain] scores) and their potential solution using IRT. In: Raven, editor. Uses and Abuses of Intelligence: Studies Advancing Spearman and Raven’s Quest for Non-Arbitrary Metrics. Royal Fireworks Press; 2008:173-210.

20. Hake. Interactive-engagement versus traditional methods: a six-thousand-student survey of mechanics test data for introductory physics courses. Am J Phys. Jan 1998;66(1):64-74. doi: 10.1119/1.18809

21. Coletta, Phillips. Interpreting FCI scores: normalized gain, preinstruction scores, and scientific reasoning ability. Am J Phys. Dec 2005;73(12):1172-1182. doi: 10.1119/1.2117109

22. Nissen, Talbot, Thompson, Van Dusen. Comparison of normalized gain and Cohen’s d for analyzing gains on concept inventories. Phys Rev Phys Educ Res. 2018;14:010115. doi: 10.1103/PhysRevPhysEducRes.14.010115

23. Marx, Cummings. Normalized change. Am J Phys. 2007;75(1):87-91. doi: 10.1119/1.2372468

24. Pentecost, Barbera. Measuring learning gains in chemical education: a comparison of two methods. J Chem Educ. Jul 2013;90(7):839-845. doi: 10.1021/ed400018v

25. Bond, Fox. Applying the Rasch Model: Fundamental Measurement in the Human Sciences. 3rd ed. Routledge; 2015.

26. Downing. Item response theory: applications of modern test theory in medical education. Med Educ. Aug 2003;37(8):739-745. doi: 10.1046/j.1365-2923.2003.01587.x. PMID: 12945568

27. Embretson, Reise. Item Response Theory. Psychology Press; 2013.

28. Wallace, Bailey. Do concept inventories actually measure anything? Astron Educ Rev. 2010;9(1). doi: 10.3847/AER2010024

29. Backhaus, Huth, Entwistle, Homayounfar, Koenig. Digital affinity in medical students influences learning outcome: a cluster analytical design comparing vodcast with traditional lecture. J Surg Educ. 2019;76(3):711-719. doi: 10.1016/j.jsurg.2018.12.001. PMID: 30833205

30. Chen, Lenderking, Jin, Wyrwich, Gelhorn, Revicki. Is Rasch model analysis applicable in small sample size pilot studies for assessing item characteristics? An example using PROMIS pain behavior item bank data. Qual Life Res. Mar 2014;23(2):485-493. doi: 10.1007/s11136-013-0487-5. PMID: 23912855

31. R Core Team. The R Project for Statistical Computing. R Foundation for Statistical Computing; 2013. URL: http://www.R-project.org. Accessed 2026-05-25.

32. Mair, Hatzinger. Extended Rasch modeling: the eRm package for the application of IRT models in R. J Stat Soft. 2007;20(9):1-20. doi: 10.18637/jss.v020.i09

33. Micceri. The unicorn, the normal curve, and other improbable creatures. Psychol Bull. 1989;105(1):156-166. doi: 10.1037/0033-2909.105.1.156

34. Bishara, Hittner. Testing the significance of a correlation with nonnormal data: comparison of Pearson, Spearman, transformation, and resampling approaches. Psychol Methods. Sep 2012;17(3):399-417. doi: 10.1037/a0028087. PMID: 22563845

35. Knief, Forstmeier. Violating the normality assumption may be the lesser of two evils. Behav Res Methods. Dec 2021;53(6):2576-2590. doi: 10.3758/s13428-021-01587-5. PMID: 33963496

36. Havlicek, Peterson. Robustness of the Pearson correlation against violations of assumptions. Percept Mot Skills. Dec 1976;43(3_suppl):1319-1334. doi: 10.2466/pms.1976.43.3f.1319

37. Boerboom TBB, Stalmeijer, Dolmans DHJM, Jaarsma DADC. How feedback can foster professional growth of teachers in the clinical workplace: a review of the literature. Studies in Educational Evaluation. Sep 2015;46:47-52. doi: 10.1016/j.stueduc.2015.02.001

38. Scheeler, Ruhl, McAfee. Providing performance feedback to teachers: a review. Teach Educ Spec Educ. 2004;27(4):396-407. doi: 10.1177/088840640402700407

39. Evans, Howson, Forsythe. Making sense of learning gain in higher education. Higher Education Pedagogies. 2018;3(1):1-45. doi: 10.1080/23752696.2018.1508360

40. Ewert, Sibthorp. Creating outcomes through experiential education: the challenge of confounding variables. Journal of Experiential Education. Jan 1, 2009;31(3):376-389. doi: 10.5193/JEE.31.3.376

41. Kubinger, Gottschall. Item difficulty of multiple choice tests dependant on different item response formats—an experiment in fundamental research on psychological assessment. Psychol Sci. 2007;49(4):361-374.

42. Weber, Knapp, Ickstadt, Kundt, Glass. Zero-cell corrections in random-effects meta-analyses. Res Synth Methods. Nov 2020;11(6):913-919. doi: 10.1002/jrsm.1460. PMID: 32991790

43. Sweeting, Sutton, Lambert. What to add to nothing? Use and avoidance of continuity corrections in meta-analysis of sparse data. Stat Med. May 15, 2004;23(9):1351-1375. doi: 10.1002/sim.1761. PMID: 15116347

44. Aung, Jurak, Mehmood, Axon. Sensitivity analysis in meta-analysis: a tutorial. Cochrane Evid Synth Methods. Jan 2026;4(1):e70067. doi: 10.1002/cesm.70067. PMID: 41497796

45. O’Neill, Gregg, Peabody. Effect of sample size on common item equating using the dichotomous Rasch model. Appl Meas Educ. Jan 2020;33(1):10-23. doi: 10.1080/08957347.2019.1674309

46. Downing. Reliability: on the reproducibility of assessment data. Med Educ. Sep 2004;38(9):1006-1012. doi: 10.1111/j.1365-2929.2004.01932.x. PMID: 15327684

47. Tavakol, Dennick. Making sense of Cronbach’s alpha. Int J Med Educ. Jun 27, 2011;2:53-55. doi: 10.5116/ijme.4dfb.8dfd. PMID: 28029643

48. de Vet, Mokkink, Mosmuller, Terwee. Spearman-Brown prophecy formula and Cronbach’s alpha: different faces of reliability and opportunities for new applications. J Clin Epidemiol. May 2017;85:45-49. doi: 10.1016/j.jclinepi.2017.01.013. PMID: 28342902

49. Brown. Some experimental results in the correlation of mental abilities. Br J Psychol 1904-1920. Oct 1910;3(3):296-322. doi: 10.1111/j.2044-8295.1910.tb00207.x

50. Spearman. Correlation calculated from faulty data. Br J Psychol 1904-1920. Oct 1910;3(3):271-295. doi: 10.1111/j.2044-8295.1910.tb00206.x

51. Rindskopf. Overview of Bayesian statistics. Eval Rev. Aug 2020;44(4):225-237. doi: 10.1177/0193841X19895623. PMID: 31894697

Multimedia Appendix 1

Decision framework for selecting the calibration parameter µ in Weighted Gain Score calculations according to the evaluation objective (absolute adjustment, relative adjustment, or routine evaluation).