Background: Accurate and timely discharge summaries are critical to safe transitions of hospitalized patients to the outpatient setting and reduce readmissions1–6. However, composing the hospital course (HC) section of the discharge summary is laborious and time-consuming, contributing to increased documentation burden7,8, work outside of work9, and clinician burnout10,11. Recent retrospective studies demonstrated that large language models (LLMs) can generate HCs of quality comparable to those written by physicians12,13.

Purpose: In this study, we address the pressing need for prospective, real-world evaluation of such tools14,15 by implementing and evaluating the safety and utility of MedAgentBrief, an LLM-based HC generator. The primary outcome was physician-reported potential for, and severity of, harm from the unedited HC; secondary outcomes included the utilization rate of generated HCs in the final discharge summary, time spent per discharge summary, time from discharge to completion of the discharge summary (chart closure time), and change in cognitive burden (NASA Task Load Index [TLX]) and burnout (Stanford PFI Work Exhaustion Scale).

Description: MedAgentBrief is a custom LLM pipeline leveraging Gemini 2.5 Pro (Google) that takes all signed physician notes from the admission as input and generates a narrative, problem-based HC. MedAgentBrief was piloted from 8/1/2025 to 10/11/2025 at a single medical inpatient unit affiliated with an academic medical center and staffed exclusively by attending hospitalists. MedAgentBrief HCs were batch-generated at midnight and securely e-mailed daily to the attending physician along with a feedback form. Pre- and post-pilot surveys were also administered. Baseline time metrics were calculated during a pre-pilot period from 4/9/2025 to 7/31/2025.

Over the 10-week pilot, MedAgentBrief generated 384 HCs for 11 physicians, of which 219 (57%) were used, with edits, in the final discharge summary. Physicians provided feedback on 100 (40.2%) MedAgentBrief HCs, including 12 that were not incorporated into the final discharge summary. Omissions were reported in 25% of summaries with feedback, inaccuracies in 20%, and hallucinations in 2%. Despite this, physicians rated 88 (88%) unedited HCs as having no harm potential, 9 (9%) as unlikely to cause mild harm, 1 (1%) as likely to cause mild harm, 1 (1%) as extremely likely to cause mild harm, and 1 (1%) as likely to cause moderate harm (Figure 1). Of the 7 physicians with baseline time data available, 5 (71%) saw reductions of up to 2.9 minutes in median time spent per discharge summary, whereas 2 (29%) had increases of up to 1.5 minutes (Figure 2). Median chart closure time did not change significantly for 6 (86%) of these physicians, but 1 (14%) had a significant increase of 4.3 days. Across 10 pre-/post-survey pairs (91% response rate), user-reported mean NASA-TLX did not change significantly (57.5 to 52.3, p = 0.3), but burnout score decreased from 1.75 to 1.2 (p = 0.03) following the pilot. Limitations include the small sample size, lower patient acuity, and single-site design.

Conclusions: Our results from this prospective, real-world implementation of MedAgentBrief, an LLM-based hospital course summarizer, demonstrate its ability to provide safe and useful summaries and highlight its potential to mitigate clinician documentation burden and burnout.

IMAGE 1: Figure 1: User-rated severity of potential harm, and likelihood of that harm, from the unedited MedAgentBrief HC.

IMAGE 2: Figure 2: Change in median time spent per discharge summary before and during pilot.