Background: Clinical reasoning is a core component of medical training, yet learners receive little formative feedback on their clinical reasoning documentation. We hypothesize that this is related to the lack of a shared assessment rubric and to faculty time constraints.

Purpose: Here we describe the development of a machine learning algorithm that provides feedback on clinical reasoning documentation, with the goal of increasing the frequency and quality of feedback in this domain.

Description: To create this algorithm, note quality first had to be established by “gold standard” human rating. After conducting a literature review, we selected the IDEA assessment tool as the gold standard. The IDEA assessment tool offers a framework for assessing clinical reasoning documentation using a 3-point Likert scale to rate documentation in four domains: I=Interpretive summary that is concise and uses semantic qualifiers, D=Differential diagnosis with commitment to the most likely diagnosis, E=Explanation of reasoning in choosing the most likely diagnosis, A=Alternative diagnoses with explanation of reasoning. However, this tool does not have detailed descriptive anchors for its Likert scale. To develop descriptive anchors, we conducted an iterative review of notes written by internal medicine residents in the NYU Langone EHR and validated the revised IDEA assessment tool using components of Messick's framework: content validity, response process, and internal structure. The revised IDEA assessment tool scale ranged from 1 to 10, with a cutoff of 7, independently determined by 3 raters, deemed to indicate high-quality clinical reasoning.

Using this human rating tool, we then created a training dataset of expert-rated notes to train and validate the machine learning algorithm. A total of 252 notes were rated, and note text was highlighted for language that conveyed clinical reasoning. Next, keywords that conveyed clinical reasoning were identified by rapid automatic keyword extraction (RAKE), an algorithm that extracts key phrases from a body of text. Finally, each of the 252 notes was divided into 10-word chunks, and we determined which chunks contained clinical reasoning by calculating the Levenshtein distance (a metric of the difference between two sequences) between each chunk and the RAKE keywords. Chunks with >=40% identity were considered to contain clinical reasoning. Using this training dataset, we built machine learning models to identify the percentage of clinical reasoning in each note.

Twenty percent of the 252 notes were rated by 3 raters, with a high intraclass correlation of 0.84 (95% CI 0.74-0.90). The mean note rating was 5.75 (SD 2.01). Sixty-nine percent of the notes were rated as low quality. The best-performing machine learning algorithm was the logistic regression model, with an AUC of 0.88. In the model's predictions, low-performing residents were easy to identify, with clinical reasoning percentages in the 0-15% range, whereas higher performers showed more variability in their clinical reasoning documentation.
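As a concrete illustration of the chunking and labeling step described above, the Python sketch below divides a note into 10-word chunks and flags a chunk as containing clinical reasoning when its normalized Levenshtein similarity to any key phrase is at least 40%. The key phrases and note text are hypothetical stand-ins for the RAKE output and resident notes, and the normalization of "percent identity" (here, 1 minus edit distance divided by the longer string's length) is an assumption, since the abstract does not specify it.

```python
# Sketch of the chunk-labeling step (assumptions noted inline).
# Key phrases would come from RAKE (e.g., via the rake_nltk package);
# a small hypothetical list stands in for that output here.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def identity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def chunk_note(text: str, size: int = 10) -> list[str]:
    """Split a note into consecutive chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def label_chunks(note_text: str, key_phrases: list[str], threshold: float = 0.40) -> list[bool]:
    """Flag each chunk whose best similarity to any key phrase meets the threshold."""
    labels = []
    for chunk in chunk_note(note_text):
        best = max(identity(chunk.lower(), kp.lower()) for kp in key_phrases)
        labels.append(best >= threshold)
    return labels

if __name__ == "__main__":
    # Hypothetical RAKE output and note text, for illustration only.
    key_phrases = ["most likely diagnosis", "differential diagnosis includes",
                   "alternative diagnoses considered"]
    note = ("Patient presents with acute chest pain the most likely diagnosis "
            "is unstable angina given risk factors alternative diagnoses "
            "considered include pulmonary embolism and aortic dissection")
    labels = label_chunks(note, key_phrases)
    pct_reasoning = 100 * sum(labels) / len(labels)
    print(f"{pct_reasoning:.0f}% of chunks flagged as clinical reasoning")
```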

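The abstract does not describe the features or structure of the logistic regression model; one plausible minimal setup, sketched below with scikit-learn for illustration only, classifies individual chunks using TF-IDF features, reports chunk-level discrimination by AUC, and derives a note's clinical reasoning percentage as the fraction of chunks predicted positive. The toy data, feature choices, and helper names are assumptions rather than the study's implementation.

```python
# Illustrative sketch only: chunk-level logistic regression with TF-IDF
# features, evaluated by AUC, plus a note-level "% clinical reasoning"
# computed as the fraction of chunks predicted positive. The actual
# features and model configuration are not specified in the abstract.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical training data: 10-word chunks and their RAKE/Levenshtein-derived
# labels (1 = clinical reasoning present), e.g. produced by label_chunks() above.
chunks = [
    "the most likely diagnosis is unstable angina given risk factors",
    "alternative diagnoses considered include pulmonary embolism and aortic dissection",
    "patient ambulating in hallway without assistance tolerating regular diet",
    "vital signs stable afebrile blood pressure within normal limits today",
]
labels = [1, 1, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    chunks, labels, test_size=0.5, stratify=labels, random_state=0)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Chunk-level discrimination, measured by AUC (the metric reported in the study).
probs = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, probs))

def note_reasoning_pct(note_chunks: list[str]) -> float:
    """Percentage of a note's chunks predicted to contain clinical reasoning."""
    preds = model.predict(note_chunks)
    return 100 * preds.mean()

print(f"Example note: {note_reasoning_pct(chunks):.0f}% of chunks predicted positive")
```
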
Conclusions: Next steps are to conduct manual validation of the machine learning algorithm by comparing its output to the human rating gold standard. This validation process will help determine how to interpret the model's results and whether further training of the model is required. Subsequently, we will pilot the use of the machine learning model output to provide feedback on residents' clinical reasoning documentation and assess documentation quality pre- and post-feedback.