%0 Journal Article
%@ 2369-3762
%I JMIR Publications
%V 8
%N 2
%P e30537
%T Harnessing Natural Language Processing to Support Decisions Around Workplace-Based Assessment: Machine Learning Study of Competency-Based Medical Education
%A Yilmaz,Yusuf
%A Jurado Nunez,Alma
%A Ariaeinejad,Ali
%A Lee,Mark
%A Sherbino,Jonathan
%A Chan,Teresa M
%+ Division of Emergency Medicine, Department of Medicine, Faculty of Health Sciences, McMaster University, McMaster Clinics, Room 255, 237 Barton St E, Hamilton, ON, L8L 2X2, Canada, 1 905 525 9140, teresa.chan@medportal.ca
%K natural language processing
%K machine learning algorithms
%K competency-based medical education
%K assessment
%K medical education
%K medical residents
%K machine learning
%K work performance
%K prediction models
%D 2022
%7 27.5.2022
%9 Original Paper
%J JMIR Med Educ
%G English
%X Background: Residents receive a numeric performance rating (eg, on a 1-7 scale) along with narrative (ie, qualitative) feedback based on their performance in each workplace-based assessment (WBA). Aggregated qualitative data from WBAs can be overwhelming to process and fairly adjudicate as part of a global decision about learner competence. Current approaches to qualitative data require a human rater to maintain attention and appropriately weigh various data inputs within the constraints of working memory before rendering a global judgment of performance. Objective: This study explores natural language processing (NLP) and machine learning (ML) applications for identifying at-risk trainees using a large data set of WBA narrative comments associated with numeric ratings. Methods: NLP was performed retrospectively on a complete data set of narrative comments (ie, text-based feedback given to residents based on their performance on a task) derived from WBAs completed by faculty members at multiple hospitals associated with a single large residency program at McMaster University, Canada. Narrative comments were vectorized using the bag-of-n-grams technique with 3 input types (unigrams, bigrams, and trigrams) and paired with the corresponding quantitative ratings. Supervised ML models based on linear regression were trained on the quantitative ratings and performed binary classification, predicting whether a resident fell into the at-risk or not-at-risk category. Sensitivity, specificity, and accuracy metrics are reported. Results: The database comprised 7199 unique direct observation assessments, each containing narrative comments and a rating between 3 and 7 in an imbalanced distribution (scores 3-5: 726 ratings; scores 6-7: 4871 ratings). A total of 141 unique raters from 5 different hospitals and 45 unique residents participated over the course of 5 academic years. When comparing the 3 input types for predicting whether a trainee would be rated low (ie, 1-5) or high (ie, 6 or 7), accuracy was 87% for trigrams, 86% for bigrams, and 82% for unigrams. All 3 input types also had better prediction accuracy when using a binary cut (ie, lower or higher) than when predicting performance along the full 7-point rating scale (50%-52%). Conclusions: ML models can accurately identify underperforming residents from the narrative comments provided in WBAs. The narrative comments generated in WBAs are a worthy data set for augmenting the decisions of educators tasked with processing large volumes of narrative assessments.
%M 35622398
%R 10.2196/30537
%U https://mededu.jmir.org/2022/2/e30537
%U https://doi.org/10.2196/30537
%U http://www.ncbi.nlm.nih.gov/pubmed/35622398
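
The Methods field describes a concrete pipeline: narrative comments vectorized with bag-of-n-grams (unigrams, bigrams, or trigrams), a linear regression model fit against the numeric ratings, and the predicted rating thresholded into an at-risk vs not-at-risk label. A minimal sketch of that pipeline in Python with scikit-learn is below; the toy comments and ratings, the variable names, and the exact placement of the cutoff at 5 are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the pipeline described in the Methods section,
# assuming scikit-learn. Toy data, names, and cutoff handling are
# illustrative assumptions, not the authors' released code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy stand-ins for WBA narrative comments and their numeric ratings.
comments = [
    "excellent history taking and a clear management plan",
    "struggled to prioritize and needed prompting throughout the shift",
    "safe and efficient, communicated well with the nursing team",
    "documentation was incomplete and the differential too narrow",
    "strong clinical reasoning, managed a busy department independently",
    "missed a critical lab value, required close supervision",
]
ratings = [7, 4, 6, 3, 7, 3]

# Ratings of 1-5 are treated as "at risk" and 6-7 as "not at risk",
# mirroring the binary cut reported in the abstract.
AT_RISK_CUTOFF = 5.0

train_texts, test_texts, y_train, y_test = train_test_split(
    comments, ratings, test_size=0.33, random_state=0
)

# Compare the 3 input types from the study: unigrams, bigrams, trigrams.
for name, ngram_range in [("unigrams", (1, 1)),
                          ("bigrams", (2, 2)),
                          ("trigrams", (3, 3))]:
    # Bag-of-n-grams vectorization of the narrative comments.
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)

    # Linear regression against the numeric ratings, as in the paper;
    # the predicted rating is then thresholded into a binary label.
    model = LinearRegression()
    model.fit(X_train, y_train)
    pred_at_risk = model.predict(X_test) <= AT_RISK_CUTOFF
    true_at_risk = [r <= AT_RISK_CUTOFF for r in y_test]

    print(name, "accuracy:", accuracy_score(true_at_risk, pred_at_risk))
```

With a toy sample this small the printed accuracies are meaningless; the study's reported 82%-87% figures come from training on thousands of real assessments, where higher-order n-grams carried enough context to outperform single words.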