Use of Multiple-Select Multiple-Choice Items in a Dental Undergraduate Curriculum: Retrospective Study Involving the Application of Different Scoring Methods

Background: Scoring and awarding credit are more complex for multiple-select items than for single-choice items. Forty-one different scoring methods were retrospectively applied to 2 multiple-select multiple-choice item types (Pick-N and Multiple-True-False [MTF]) from existing examination data. Objective: This study aimed to calculate and compare the mean scores for both item types by applying different scoring methods, and to investigate the effect of item quality on mean raw scores and the likelihood of resulting scores at or above the pass level (≥0.6). Methods: Items and responses from examinees (ie, marking events) were retrieved from previous examinations. Different scoring methods were retrospectively applied to the existing examination data to calculate corresponding examination scores. In addition, item quality was assessed using a validated checklist. Statistical analysis was performed using the Kruskal-Wallis test, Wilcoxon rank-sum test, and multiple logistic regression analysis (P<.05). Results: We analyzed


Introduction
In dentistry, multiple-choice items are often used to test theoretical knowledge in written examinations [1]. Multiple-choice items can be divided into single-choice items (eg, Type A) and multiple-select items. In multiple-select items, examinees are required to judge multiple answer options/statements independently within a single item. The correctness of an answer option/statement does not affect the other answer options/statements within the same item. Therefore, knowledge reproduction is more active, as examinees cannot identify the correct answer option at first glance and cannot ignore the remaining answer options. In contrast to single-choice items, scoring of multiple-select items is more complex. While examinees' responses to single-choice items might be either correct (1 full credit point is awarded) or incorrect (no credit points are awarded or a penalty score is given), multiple-select items might result in partially correct responses (ie, some answer options/statements are marked correctly while others are marked incorrectly).
Within electronic written examinations of dental undergraduate students at the University Medical Center Göttingen, Type A single-choice items and 2 kinds of multiple-select multiple-choice items, known as Pick-N [2,3] and Multiple-True-False (MTF) [4], are used. Examples of the item types are shown in Figure 1. Since the first mention of these item types, various scoring methods for scoring multiple-select items have been described in the literature. A summary of different scoring methods and their corresponding mathematical scoring algorithms as identified by 2 recent systematic reviews [5,6] is shown in Multimedia Appendix 1.
Pick-N items consist of a variable number of answer options (with the number [n] ranging from 5 to 26 [7-9]), and examinees are asked to select all true answer options. The total number of true answer options (t) within each item is disclosed to examinees and might vary between 2 and n-1 [3,7,9-11]. In recent years, Pick-N items were described to typically consist of 1 circumscribed question and a number of very short answer options (ie, a single word or very short phrases) [7,10]. This item type has also been named k from n and n out of many in the literature [8,9].
MTF items consist of a question stem and a variable number of statements (ie, complex statements as opposed to very short answer options used in Pick-N items), which need to be labeled independently as true or false by examinees. Any number of statements (including zero and n) might be correct, and the number of true statements is not disclosed. This item type has also been named true-false format, cluster-true-false, cluster (multiple true-false) variety, cluster-type true-false, Kprim, Kprime, K', and Type X in the literature [12-16]. Based on the above-mentioned definitions of Pick-N and MTF items, the example shown in Figure 1 should be employed as a Pick-N item instead of an MTF item.
Although indications for the use of multiple-select multiple-choice items and corresponding instructions for examinees vary between both item types [7,10], it is unknown whether educators employ Pick-N and MTF items according to the above-mentioned recommendations. Moreover, the relation between examinees' true ability (ie, true knowledge) and expected scoring results differs between both item types [5,6]. In case of examinations consisting of single-choice items with 5 answer options only (ie, with a guessing probability amounting to 20%), a pass mark of 60% tests examinees for a level of 50% true knowledge, as examinees with 50% true knowledge achieve 60% of the possible total score on average due to the possibility of guessing (using an all-or-nothing scoring method without applying a penalty for incorrect responses). Depending on the employed multiple-select item type, the number of answer options/statements per item, and the used scoring method, examinees might require either more or less true knowledge to gain 60% of the possible total score on average. Therefore, this study aimed to (1) retrospectively apply different scoring methods to existing examination data from multiple-select multiple-choice items and analyze the obtained results from examinees (ie, scores) and (2) investigate the impact of item characteristics (ie, selection of appropriate item type and presence of cues) on scoring results (ie, mean raw scores and the likelihood of resulting scores at or above pass level when using different scoring methods).
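The arithmetic behind the 60% pass mark example above can be sketched as follows. This is an illustrative calculation only (not taken from the study's analyses), assuming all-or-nothing scoring without penalties and uniform random guessing on unknown items:

```python
# Illustrative sketch: expected score on single-choice items under
# all-or-nothing scoring without penalties. "knowledge" is the hypothetical
# share of items an examinee truly knows; unknown items are answered by
# guessing uniformly among n_options answer options.
def expected_score(knowledge: float, n_options: int) -> float:
    guess_probability = 1 / n_options
    return knowledge + (1 - knowledge) * guess_probability

# An examinee with 50% true knowledge on 5-option single-choice items
# reaches the 60% pass mark on average due to guessing:
print(round(expected_score(0.5, 5), 3))  # 0.6
```

With 5 answer options, half of the unknown items are worth an extra 20% each on average, which lifts 50% true knowledge to an expected 60% score.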
The null hypotheses were as follows: (1) scoring results for Pick-N and MTF items do not differ between different scoring methods and (2) item characteristics do not impact scoring results.

Ethical Considerations
Owing to the retrospective design of the study and the fact that only anonymized item scores at the level of previous examinations (ie, not at the level of identifiable students) were available from the examination software, no ethical approval was required.

Multiple-Select Multiple-Choice Items
At the University Medical Center Göttingen, both Pick-N and MTF multiple-select multiple-choice items are used. While Pick-N items might contain a variable number of answer options (up to 26), local examination guidelines recommend 5, 6, 7, or 8 answer options. According to local examination guidelines, MTF items might contain 4, 5, or 6 statements.
For Pick-N items, a total of 24 different scoring methods have been described in the literature [6]. Moreover, for MTF items, a large variety of scoring methods exist, and a total of 27 scoring methods have been described in the literature [5]. After removing duplicate scoring algorithms, 41 distinct scoring algorithms remained and were retrospectively applied to examinees' markings of both multiple-select multiple-choice item types.
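To illustrate how differently such algorithms can credit the same marking event, the following is a hedged sketch of three commonly described scoring families (dichotomous, equal partial credit, and penalty-style difference scoring). The function names and the exact floor-at-zero variant are illustrative simplifications; the precise formulas of the 41 methods are given in Multimedia Appendix 1:

```python
# A marking event is modeled as two boolean tuples: the answer key and the
# examinee's markings (True = keyed/marked as true).

def all_or_nothing(key, marks):
    """Dichotomous family: full credit only if every option/statement is
    marked correctly, otherwise no credit."""
    return 1.0 if key == marks else 0.0

def partial_scoring_1_n(key, marks):
    """Equal partial credit for each correctly marked option/statement
    (the 'PS 1/n' idea)."""
    correct = sum(k == m for k, m in zip(key, marks))
    return correct / len(key)

def difference_scoring(key, marks):
    """Penalty-style family: correct minus incorrect markings, floored at
    zero here for illustration; published variants differ in detail."""
    correct = sum(k == m for k, m in zip(key, marks))
    incorrect = len(key) - correct
    return max(correct - incorrect, 0) / len(key)

key = (True, True, False, False, False)     # 2 of 5 statements are true
marks = (True, False, False, False, False)  # one true statement missed
print(partial_scoring_1_n(key, marks))  # 0.8
print(all_or_nothing(key, marks))       # 0.0
print(difference_scoring(key, marks))   # 0.6
```

The same marking event yields 0.0, 0.6, or 0.8 credit points depending on the family, which previews the score spread reported in the Results.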

Electronic Examinations
Prior to their use, all items were subjected to a review process at the department responsible for the respective examination. During electronic examinations, answer options/statements were displayed in a permuted order for each examinee using UCAN's CAMPUS Examination software [17]. Until the end of the examination, examinees were able to modify their markings. Total examination time was calculated based on 90 seconds per item.
For Pick-N items, examinees had to mark only the true answer options (t). For each item, the number of true answer options was displayed to the examinees. Marking more answer options as true than the given number of t was technically impossible. If examinees marked fewer answer options than t as true, a warning message was shown reminding them to select t answer options. Despite the warning message, examinees were allowed to continue without selecting t answer options. Within the context of MTF items, examinees were required to mark each statement as either true or false, and there was no possibility to omit individual statements.
For all examinations (usually consisting of 20 to 30 items), a uniform pass mark of 60% (ie, 0.6 credit points) was used irrespective of the included item types according to local examination guidelines.

Examination Data
Written examinations of the Department of Preventive Dentistry, Periodontology and Cariology and the Department of Prosthodontics of the undergraduate dental curriculum (1st to 10th semester) at the University Medical Center Göttingen were retrospectively screened for multiple-select multiple-choice items. Due to the overall lower number of Pick-N items, Pick-N items and examination data were retrieved from all examinations with at least 5 participants between 2016 and 2020. For Pick-N items used in multiple examinations, only the version and marking events from the examination with the most examinees or from the first examination (in cases of the same number of examinees) were assessed. MTF items and corresponding examination data were retrieved from a previous publication [18] containing items from examinations with at least 5 participants during winter term 2016/2017 only. If MTF items were used in multiple eligible examinations, marking events from all examinations were combined. To allow for comparison, MTF items from the previous publication were limited to the fields of Operative Dentistry and Prosthodontics.

Quality Criteria of Items
Judgement regarding the use of an appropriate item type was based on the definition by Krebs [10]. In order to further evaluate the quality of identified items, a validated checklist regarding formal quality criteria, presence of cues, and content correctness was used (Table 1) [18]. Formal quality and presence of cues were jointly assessed by 3 authors (PK, MH, and TR) to classify items for the subsequent analyses. Content validity was assessed by 2 expert clinicians (AW for items within the field of Operative Dentistry; TW for items regarding Prosthodontics).

Table 1. Checklist for the quality assessment of items as described previously [18].

Statistical Analysis
Scoring results for all marking events (ie, individual student entries on a single item) of identified Pick-N and MTF items were calculated according to the identified scoring algorithms shown in Multimedia Appendix 1, using Excel for Mac (version 16.39; Microsoft Corp). Based on these results, a mean score across all examinees and items was calculated for each scoring algorithm and item type. Separately for Pick-N and MTF items, differences between the mean scores of all scoring methods were assessed by the Kruskal-Wallis test.
The effect of item quality (use of an appropriate item type [yes vs no] and absence of cues [yes vs no]) on mean raw scores was assessed by the Wilcoxon rank-sum test. Raw scores were derived from method 10 (Partial Scoring 1/n, PS1/n), which awards partial credit equally for each correctly marked answer option/statement. Separately for each scoring method, the likelihood of achieving a score of ≥0.6 was assessed by multiple logistic regression analyses. The use of an inappropriate item type (yes vs no) and presence of cues (yes vs no) were simultaneously entered as predictor variables. A dichotomous outcome was defined as a score at or above pass mark (≥0.6 credit points) versus below pass mark (<0.6 credit points).
All calculations were performed using the software R [19] (version 4.0.4) and the package "PMCMR" (version 4.3). The level of significance was set at α=.05.

Marking Events
A total of 48 Pick-N and 18 MTF items were included. Items presented 5, 6, or 7 answer options (Pick-N), or 5 or 6 statements (MTF). A total of 1931 (Pick-N) and 828 (MTF) marking events were investigated. On average, each Pick-N item was answered by 40.2 (SD 5.7) examinees and each MTF item by 46.0 (SD 30.7) examinees.

Scoring Results
Except for method 9 (Monash Medical School Scheme), which has only been described for cases of n=4, all identified scoring methods were applied to all included items.
For both item types, mean scores differed significantly between scoring methods (P<.001). For Pick-N items, mean scores per item varied between 0.50, when applying method 16 (Guessing Penalty), and 0.98, when applying method 2 (Dichotomized MTF) or method 32 (Formula 3 by Blasberg et al [8]). Overall, mean scores of ≥0.90 per item were achieved when using method 2 (Dichotomized MTF), method 32 (Formula 3 by Blasberg et al [8]), method 15 (Guessing Fair Penalty), or method 29 (Formula 6 by Duncan and Milton [20]). Within Pick-N items, the presence of cues was associated with a greater likelihood of achieving a score of ≥0.6 (ie, at or above the pass mark of 60% of the total score) for a minority of scoring methods only.

Principal Findings
When retrospectively applying the described scoring methods to examination items, the applied scoring method, presence of cues, and use of an inappropriate item type impacted the credit assignment. Therefore, both null hypotheses must be rejected.
Averaged scores differed significantly between different scoring methods for both item types. For Pick-N items, mean scores ranged from 0.50 (method 16) to 0.98 (method 2) credit points for the same markings, while MTF items showed an even wider range from 0.02 (method 16) to 0.96 (method 2) credit points. Both the use of an inappropriate item type and presence of cues significantly impacted the scoring results. Inappropriately used Pick-N items resulted in lower mean raw scores (mean 0.88, SD 0.20 vs mean 0.93, SD 0.16), while inappropriately used MTF items resulted in higher mean raw scores (mean 0.88, SD 0.19 vs mean 0.85, SD 0.17). The mean raw score from MTF items with cues was 0.91 (SD 0.15), while items without cues resulted in a lower mean raw score of 0.84 (SD 0.18). These differences emphasize the effects of different scoring methods, presence of cues, and inappropriately used item types, as examinees might either pass or fail the examination based on an assumed fixed pass mark of 60% (ie, 0.6 credit points on average). For most scoring methods, item quality impacted the likelihood of scores ≥0.6. Inappropriately used Pick-N items showed a lower likelihood of scores ≥0.6, while inappropriately used MTF items showed a higher likelihood of scores ≥0.6. MTF items containing at least one cue showed a higher likelihood of scores ≥0.6 than items without cues.
Two different types of multiple-select multiple-choice items were used in this study. Between Pick-N and MTF items, examinees' decision-making and response behaviors are fundamentally different. In Pick-N items, the number of true answer options to be selected is disclosed to examinees. Therefore, marking answer options within Pick-N items is dependent on the marking of all other answer options within the same item [6]. The expected chance score [23] from random guessing amounts to 1/(n choose t), that is, the reciprocal of the number of possible ways to select t out of n answer options. In contrast, every statement within an MTF item might be either true or false (including zero or even all statements). Thereby, examinees are forced to independently assess each statement as true or false, and the expected chance score amounts to 0.5^n [5]. Based on these theoretical implications, lower mean scores can be expected if examinees are not aware of the total number of correct answer options/statements (such as in MTF items). To address these differences regarding the relative item difficulty between both item types, local examination guidelines might suggest different scoring methods or pass marks for both item types. This study found scores resulting from both Pick-N and MTF items to vary based on the selected scoring methods. Therefore, examination results should only be interpreted in light of the employed scoring method or methods.
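The two chance-score expressions above can be computed directly. This sketch assumes all-or-nothing credit (ie, the chance score equals the probability that a random guesser marks the whole item correctly), which is one common reading of the expected chance score; partial-credit methods would yield different values:

```python
from math import comb

# Expected chance score under all-or-nothing credit:
# Pick-N: a random guesser picks one of C(n, t) possible selections of the
# disclosed t true options out of n, so full credit occurs with 1/C(n, t).
def chance_score_pick_n(n: int, t: int) -> float:
    return 1 / comb(n, t)

# MTF: each of n statements is guessed true/false independently, so all n
# markings are correct with probability 0.5 ** n.
def chance_score_mtf(n: int) -> float:
    return 0.5 ** n

print(chance_score_pick_n(5, 2))  # 0.1 (1 of 10 possible selections)
print(chance_score_mtf(5))        # 0.03125
```

For 5 answer options/statements, the Pick-N chance score (0.1) is more than 3 times the MTF chance score (0.03125), illustrating why lower mean scores can be expected when the number of true statements is not disclosed.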
Within this study, items were extracted from different examinations covering a broad range of topics and learning objectives. Therefore, no direct comparison of the item difficulty between MTF and Pick-N items was made. Instead, the effect of item quality was assessed. Inappropriately used MTF items resulted in higher mean raw scores, while inappropriately used Pick-N items resulted in lower mean raw scores. This observation might be attributed to the definitions regarding the correct use of Pick-N and MTF items. MTF items require more complex statements than Pick-N items [7,10]. As a result, MTF items are likely to be overall more complex, requiring higher cognitive skills from examinees. If local examination guidelines suggest different scoring methods or pass marks for both item types to overcome the above-mentioned differences between both item types, the use of an inappropriate item type might result in either an inflation (in case of inappropriately used MTF items) or deduction (in case of inappropriately used Pick-N items) of scores at or above the pass mark.
Besides item types used inappropriately, cues were found to impact scoring results. While the mean raw scores of Pick-N items with and without cues did not differ, the presence of cues in MTF items resulted in a higher proportion of correctly marked statements. Thus, MTF items showed a higher susceptibility to cues. As examinees are likely to consider cues during their decision-making process, educators should carefully evaluate each item using a checklist for quality assessment and cues (eg, grammar hints, diametrical statements, or absolute formulations) to eliminate cues prior to use in an examination.
Besides selecting an appropriate item type, educators need to select an adequate scoring method. In contrast to single-choice items, scoring of multiple-select items is complicated as examinees might give partially correct responses. In recent systematic reviews, a total of 41 scoring methods for MTF and Pick-N items were described [5,6]. Scoring methods focusing on the number of correct responses instead of the number of true answer options/statements marked as true (tm) and accurately discriminating between different levels of knowledge are most frequently recommended [5]. Scoring methods yielding negative scores should not be used because of jurisdictional reasons [5,18,24]. However, available item types and scoring methods are often set by local examination guidelines.
Overall, the results of this retrospective assessment of real examination data confirm the assumption that credit assignment on MTF and Pick-N items differs between varying scoring methods. Furthermore, it was shown that item quality characteristics like selection of an appropriate item type and avoidance of cues have a significant effect on scoring results in the case of most scoring methods.

Strengths and Limitations
The strengths of this assessment include the use of up to 41 scoring methods and a high number of marking events (Pick-N items: 1931; MTF items: 828). Previous studies on this topic were based on theoretical calculations only [5,6] or used a smaller number of different scoring methods/item types [18]. For each item, quality was assessed based on a validated checklist. However, a number of limitations are present. First, items were derived from previous examinations, which resulted in an unequal distribution of both item types. While 48 Pick-N items were included, only 18 MTF items were assessed. Second, all items were extracted from different examinations covering a broad range of topics. Therefore, no direct comparison of the item difficulty between MTF and Pick-N items was possible. Third, no further predictor variables (eg, student-related variables such as age and gender) were available due to the retrospective and anonymous design.

Future Directions
To address these limitations, further prospective studies should evaluate different scoring methods and item types by employing matched items on the same learning objectives. Moreover, further predictor variables (eg, student-related variables such as age and gender) should be considered.

Conclusion
Educators should exercise care when using multiple-select multiple-choice items. Scoring and awarding credit are more complex for multiple-select multiple-choice items than for single-choice items. This manuscript may guide educators to make informed decisions regarding the use of multiple-select multiple-choice items.
Different item types, different scoring methods, and presence of cues are likely to impact examinees' scores and overall examination results. Therefore, educators should carefully select the most appropriate item type. Moreover, cues should be avoided as far as possible. Finally, examination results should be interpreted in light of the used item type and applied scoring method.