Background: The personal statement (PS) is a crucial factor in a residency program’s evaluation of applicants and has traditionally been helpful for understanding applicants’ journeys in medicine, their values, and their written communication skills. Now, with accessible large language models (e.g., ChatGPT) designed for language composition, there is interest in how applicants may use these models to generate a PS. This study evaluated whether faculty experienced in reading PS from Internal Medicine residency applicants could distinguish between authentic and GPT-4 generated PS.

Methods: A total of 223 applicants from the 2019 application year were randomly selected from an ACGME-accredited Internal Medicine residency program. The PS of 100 of these applicants were extracted from their application files and de-identified. The CVs of the remaining 123 applicants were also extracted and de-identified; of these, 23 were used for prompt development and excluded from evaluation. To develop a prompt for GPT-4, we used an iterative process known as prompt engineering. Our initial prompt consisted of simple instructions to generate a PS based on a de-identified CV; the generated PS was evaluated and the prompt revised repeatedly over 5 rounds of engineering. The final prompt was then used to generate 100 PS from the 100 de-identified CVs not used for prompt engineering. Four faculty with extensive application review experience each blindly evaluated a set of 100 statements (50 authentic and 50 GPT-4 generated) via a secure survey, ensuring each PS was evaluated twice. The primary outcome was accuracy in distinguishing authentic vs GPT-4 generated PS. Secondary outcomes included PS favorability (1 to 10 scale). Qualitative responses from reviewers regarding PS authenticity and favorability were also collected. Statistical significance was defined as 2-sided P < 0.05. We calculated 95% confidence intervals using generalized estimating equations (GEE) to account for potential correlation between the two evaluations of the same PS.
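
As a rough illustration of the generation step, the following is a minimal sketch of how a PS might be produced from a de-identified CV through the OpenAI chat completions API; the prompt wording, model name, and generate_ps helper are hypothetical placeholders, not the study’s final engineered prompt.

```python
# Hypothetical sketch: generate a PS from one de-identified CV with the OpenAI API.
# Assumes OPENAI_API_KEY is set in the environment; prompt text is illustrative only.
from openai import OpenAI

client = OpenAI()

def generate_ps(cv_text: str) -> str:
    """Return a GPT-4 generated personal statement for one de-identified CV."""
    prompt = (
        "Using only the de-identified CV below, write a one-page personal "
        "statement for an Internal Medicine residency application:\n\n" + cv_text
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Likewise, a minimal sketch of the GEE analysis using statsmodels, assuming a long-format dataset with one row per review, a binary indicator of a correct call, and a PS identifier as the clustering variable; the file and column names (reviews.csv, ps_id, source, correct) are hypothetical.

```python
# Minimal GEE sketch: binary "correct" outcome, clustered on PS (each PS reviewed twice),
# exchangeable working correlation. Column names are hypothetical.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per (reviewer, PS) evaluation: ps_id, source (authentic vs GPT-4), correct (0/1)
reviews = pd.read_csv("reviews.csv")

model = smf.gee(
    "correct ~ source",
    groups="ps_id",
    data=reviews,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = model.fit()
print(result.summary())  # coefficients with robust standard errors and 95% CIs
```

The favorability outcome (1 to 10 scale) could be modeled analogously by replacing the binary correctness indicator with the rating and using sm.families.Gaussian().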

Results: Faculty reviewers correctly identified PS as authentic or AI-generated 75% of the time [95% CI: 70% to 79%]. A reviewer learning trend was identified, with an average improvement in accuracy of 3.2% for every additional 10 PS reviewed [slope β: 0.032, 95% CI: 0.011 to 0.052, p=0.008]. The GPT-4 generated statements were rated 0.73 points lower than authentic PS [95% CI: -1.0 to -0.45, p < 0.001]. PS perceived by reviewers to be AI-generated were rated 1.4 points lower than PS perceived to be authentic [95% CI: -1.7 to -1.2, p < 0.001]. The 20 highest-rated statements overall were all authentic.

Conclusions: Experienced faculty reviewers were able to reliably distinguish between authentic and GPT-4 generated personal statements. Furthermore, authentic statements were rated significantly higher than those generated by GPT-4. It is important to note that after GPT-4 generated a PS, it was not prompted further to modify its response in any way. Real-world use will likely involve a more collaborative process between an applicant and the AI, resulting in a PS of hybrid authorship that may be more difficult to distinguish from a completely authentic PS. Additionally, the use of a single prompt may have limited GPT-4 responses to certain themes and structures, which reviewers may have learned to recognize over multiple reviews. Regardless, when comparing authentic personal statements with those entirely generated by GPT-4, the highest-rated statements remained those of human authorship.

IMAGE 1: Figure 1: Comparison of Favorability Rating Score (1-10) Between GPT-4 and Human-Generated PS

IMAGE 2: Figure 2: Comparison of Favorability Rating Score (1-10) Between PS Perceived to be AI-Generated and Human-Generated