Sepsis Detection ML: Features to Clinical Trials

A technical roadmap for building, explaining, and prospectively validating sepsis ML models from EHR features to clinical trials.

Early sepsis detection is one of the most clinically meaningful—and technically unforgiving—problems in healthcare AI. The opportunity is clear: identify deterioration before shock, organ failure, or ICU transfer, and you can materially improve outcomes while reducing costs. But the path from a promising model to a trusted bedside tool is not just about AUC. It requires thoughtful feature engineering from EHRs, robust time-series handling, careful threshold tuning to control false positives, and validation designs that survive the reality of clinical workflows.

This guide is written for ML engineers who need a technical roadmap, not a marketing pitch. We’ll cover how to choose features from structured EHR data and NLP notes, how to model irregular vitals and labs over time, how to reduce alert fatigue, and how to design prospective validation studies with clinicians. Along the way, we’ll borrow practical lessons from adjacent engineering domains like reliable cross-system automation testing, CI pipelines for simulation-heavy systems, and local development environments with reproducible dependencies—because clinical ML succeeds or fails on the same fundamentals: traceability, determinism, and safe deployment.

1) Why early sepsis detection is a hard ML problem

1.1 Sepsis is a moving target, not a clean label

Sepsis definitions have evolved, and in practice the label is often delayed, noisy, or inconsistently documented. That means your target can be contaminated by documentation timing, clinician behavior, billing patterns, and retrospective abstraction choices. A model that appears strong during offline evaluation may simply be learning how hospitals document deterioration rather than predicting physiological decline.

This is why the first design decision is not architecture but label strategy. Decide whether you are predicting sepsis onset, septic shock, ICU transfer, or a composite endpoint based on clinical need, and write down which consensus definition the onset timestamp follows. For example, if your operational goal is to trigger an escalation review early enough for antibiotics and fluids, your prediction horizon and target window must reflect that workflow rather than a generic mortality endpoint. A well-grounded approach here resembles how teams in other data-heavy domains distinguish signal from administrative noise, as described in building robust bots when third-party feeds can be wrong.

1.2 False positives are not a side issue; they are the product

Sepsis models are often evaluated as if sensitivity alone were the prize. In reality, a high false-positive rate can destroy trust, increase alarm fatigue, and reduce clinical adoption even if the offline metrics look impressive. At the bedside, every unnecessary alert competes with dozens of other demands, and clinicians will quickly downgrade a system that disrupts workflow without helping decision-making.

Think of threshold tuning as a product decision, not just an ML calibration step. Your operating point should be selected with clinicians, by unit type, and by capacity to respond. A medical-surgical ward, an emergency department, and an ICU all have different prevalence, staffing, and alert tolerance. For a broader perspective on matching solution maturity to operational context, see this stage-based framework for workflow automation.

1.3 Prospective value depends on measurable clinical action

The question is not whether the model can rank patients by risk; it is whether the system changes care in a beneficial way. If a prediction never leads to earlier reassessment, labs, antibiotics, fluids, or a higher level of monitoring, it is just an elegant dashboard. That is why sepsis programs must be built with clinician stakeholders from day one and tested as interventions, not just classifiers.

Market growth reflects this reality. The decision-support category is expanding because hospitals want earlier detection, real-time EHR integration, and lower mortality, but adoption depends on trust, interoperability, and demonstrated workflow impact. That aligns with trends seen in systems that optimize for downstream ranking and usage signals: performance matters, but distribution and usability matter too.

2) Building the prediction target and labeling strategy

2.1 Start with a clinically meaningful prediction horizon

The ideal sepsis model predicts deterioration early enough to intervene, but not so early that risk becomes too diffuse to act on. Common horizons include 3, 6, and 12 hours before a target event, each with tradeoffs. Short horizons often improve specificity and actionable precision, while longer horizons provide more lead time but increase ambiguity in the physiology.

A practical pattern is to create multiple horizons and compare them in a retrospective development study. Then choose the one that matches response capacity: for example, a rapid response team might benefit from a 3–6 hour horizon, while a ward-based surveillance system might need 6–12 hours. This is similar to how teams in transport and logistics tune operations to lead time, as in strategies to mitigate delivery delays.
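As a minimal sketch of the multi-horizon pattern, the snippet below labels a single prediction time against several candidate horizons. The function names and the choice to exclude rows at or after onset are assumptions made for illustration, not a prescribed labeling scheme.

```python
from datetime import datetime, timedelta
from typing import Optional

HORIZONS_HOURS = (3, 6, 12)  # the horizons discussed above


def label_for_horizon(prediction_time: datetime,
                      onset_time: Optional[datetime],
                      horizon_hours: int) -> Optional[int]:
    """Label one prediction time for one horizon.

    Returns 1 if onset occurs within the next `horizon_hours`,
    0 if it does not, and None if the row should be excluded
    because the prediction time is at or after onset.
    """
    if onset_time is None:
        return 0  # encounter never met the target definition
    if prediction_time >= onset_time:
        return None  # post-onset rows would leak the outcome
    return int(onset_time - prediction_time <= timedelta(hours=horizon_hours))


def labels_all_horizons(prediction_time, onset_time):
    """Build one label per candidate horizon for side-by-side comparison."""
    return {h: label_for_horizon(prediction_time, onset_time, h)
            for h in HORIZONS_HOURS}


onset = datetime(2024, 1, 1, 12, 0)
example = labels_all_horizons(datetime(2024, 1, 1, 7, 0), onset)
print(example)  # onset is 5 hours away: negative at 3 h, positive at 6 h and 12 h
```

Training one model per horizon on the same feature rows makes the lead-time versus precision tradeoff visible before anyone commits to an operating point.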

2.2 Use event windows and exclusion rules intentionally

Good labels are not just “positive” and “negative.” You need pre-event windows, observation windows, washout periods, and exclusions for patients with obviously confounding trajectories. If you include data after sepsis recognition, you risk leakage. If you keep patients with incomplete stays or atypical transfer patterns without accounting for censoring, you can distort the observed class balance.

One strong pattern is to define an index time and only use features available before that timestamp. This avoids inadvertent use of documentation that happens after escalation. It also mirrors best practices in assessment design that distinguishes polished answers from real understanding: what matters is whether the model knew it before the outcome, not whether it can explain the outcome after the fact.
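One way to enforce the index-time rule is to filter on the moment a value became visible in the chart, not the moment the specimen was drawn. The sketch below assumes a hypothetical `Observation` record with a `charted_at` field; your EHR extract will name these columns differently.

```python
from datetime import datetime
from typing import NamedTuple


class Observation(NamedTuple):
    name: str
    value: float
    charted_at: datetime  # when the result became visible, not when it was collected


def visible_before(observations, index_time):
    """Keep only observations a clinician could have seen before index_time."""
    return [o for o in observations if o.charted_at < index_time]


obs = [
    Observation("lactate", 3.1, datetime(2024, 1, 1, 9, 30)),
    Observation("lactate", 4.2, datetime(2024, 1, 1, 11, 15)),  # resulted after the index time
]
usable = visible_before(obs, datetime(2024, 1, 1, 10, 0))
print([o.value for o in usable])
```

Applying the same filter in training and in production is what keeps offline performance comparable to what the bedside system can achieve.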

2.3 Label adjudication should include clinical review

If feasible, create a clinician-adjudicated gold subset. Even a few hundred manually reviewed encounters can expose major issues: mislabeled infection timing, charting artifacts, and inconsistent onset definitions. This subset can also be used to estimate the ceiling for model performance and to diagnose whether errors are caused by weak features or weak labels.

In high-stakes healthcare ML, human review is not optional polish—it is an epistemic check. Similar concerns apply in trust-centered AI workflows, where validation requires more than automated scoring. For sepsis, the goal is not to eliminate judgment; it is to encode enough clinical reality to make the model useful.

3) Feature engineering from EHRs: structured data, temporal context, and NLP

3.1 Core structured EHR features: useful, but never enough alone

Vitals and labs remain the backbone of sepsis prediction: heart rate, respiratory rate, temperature, blood pressure, SpO2, WBC, creatinine, lactate, platelets, bilirubin, and mental status proxies. But the useful signal is rarely a single absolute value. It is often the trend, slope, deviation from baseline, variability, and the interaction between metrics over time. A heart rate of 110 means something different when paired with falling blood pressure, rising lactate, and increasing oxygen requirement.

Feature sets should include both static demographics and dynamic aggregates. Static features may include age, sex, comorbidities, prior admissions, immunosuppression, recent surgery, and unit location. Dynamic features can include rolling means, mins, maxes, slopes, deltas over 1/3/6/12 hours, and missingness indicators, because missingness itself often reflects care intensity or measurement cadence. For data collection pipelines, it helps to think like teams optimizing physical systems under constraints, as in secure IoT integration, where device state, update timing, and telemetry completeness all matter.
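The dynamic aggregates above can be sketched in plain Python as follows. The function name and the endpoint-based slope (last value minus first value, divided by elapsed hours) are simplifying assumptions; a least-squares slope is a common alternative.

```python
from datetime import datetime, timedelta
from statistics import mean


def window_features(readings, now, hours):
    """Summarize (timestamp, value) readings in the trailing window (now - hours, now].

    `readings` is assumed to be sorted by timestamp.
    """
    start = now - timedelta(hours=hours)
    window = [(t, v) for t, v in readings if start < t <= now]
    if not window:
        # An empty window is itself a signal, so expose it as a missingness flag.
        return {"n": 0, "missing": 1, "mean": None, "min": None,
                "max": None, "delta": None, "slope_per_hour": None}
    values = [v for _, v in window]
    (first_t, first_v), (last_t, last_v) = window[0], window[-1]
    elapsed_h = (last_t - first_t).total_seconds() / 3600
    return {
        "n": len(values),
        "missing": 0,
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "delta": last_v - first_v,
        "slope_per_hour": (last_v - first_v) / elapsed_h if elapsed_h > 0 else None,
    }


hr = [
    (datetime(2024, 1, 1, 8, 0), 88),
    (datetime(2024, 1, 1, 10, 0), 96),
    (datetime(2024, 1, 1, 12, 0), 110),
]
now = datetime(2024, 1, 1, 12, 0)
features = {f"hr_{h}h": window_features(hr, now, h) for h in (1, 3, 6, 12)}
print(features["hr_6h"])
```

A window with a single reading returns a slope of `None` rather than zero, so the model is not told that a sparsely measured patient is stable.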

3.2 Time-series vitals: irregular sampling is the norm

Clinical time series are messy. Measurements are irregular, some variables are checked frequently only when the patient is unstable, and some labs arrive in batches with delayed timestamps. If your model assumes evenly spaced samples, you may introduce artifacts or discard the very complexity that carries predictive signal. The safer route is to preserve timing explicitly rather than force everything into a tidy but unrealistic grid.

Common approaches include forward filling with decay features, time-aware recurrent networks, temporal convolutional networks, transformers with masking, and gradient-boosted trees fed by engineered temporal summaries. The right choice depends on latency, deployment environment, and interpretability requirements. In many hospital implementations, a strong tabular baseline using well-designed rolling windows can outperform a more complex sequence model that is harder to calibrate and explain.

3.3 NLP from clinical notes can add context, but only if carefully governed

Clinical notes often contain highly valuable context: suspected infection source, clinician concern, subtle deterioration, culture orders, or documentation of altered mental status. NLP can extract features from admission notes, progress notes, nursing notes, and radiology impressions, but note timing must be respected. If a note is written after the event, it cannot be used to predict the event.

Use NLP to augment, not replace, structured signals. Practical text features include keyword flags, concept extraction with negation handling, temporal cues, and note embeddings filtered by timestamp. For production-readiness, you need an approach that is explainable enough for clinicians to trust. This is analogous to how teams in developer playbooks for major platform shifts must support new users without breaking compatibility.
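To make the keyword-flag idea concrete, here is a deliberately crude sketch of timestamp-filtered flags with a simple negation check. It is not NegEx or any production clinical NLP library; the term list, negation cues, and 40-character look-back are hypothetical choices for illustration.

```python
import re
from datetime import datetime

# Hypothetical concept dictionary; a real system would use a curated vocabulary.
CONCEPTS = {"infection": r"\b(infection|sepsis|septic)\b"}

# A negation cue followed by up to 40 characters with no sentence break,
# anchored at the end of the text that precedes the matched term.
NEGATION = re.compile(r"\b(no|denies|without|negative for)\b[^.;]{0,40}$",
                      re.IGNORECASE)


def note_flags(notes, index_time):
    """Return 0/1 flags per concept, using only notes signed before index_time."""
    flags = {name: 0 for name in CONCEPTS}
    for signed_at, text in notes:
        if signed_at >= index_time:
            continue  # a note written after the index time would leak
        for name, pattern in CONCEPTS.items():
            for match in re.finditer(pattern, text, re.IGNORECASE):
                if not NEGATION.search(text[:match.start()]):
                    flags[name] = 1
    return flags


notes = [
    (datetime(2024, 1, 1, 8, 0), "Patient afebrile. No evidence of infection."),
    (datetime(2024, 1, 1, 9, 0), "Concern for sepsis; cultures ordered."),
    (datetime(2024, 1, 1, 13, 0), "Septic shock, transferred to ICU."),
]
index_time = datetime(2024, 1, 1, 12, 0)
print(note_flags(notes[:1], index_time))  # negated mention only
print(note_flags(notes, index_time))      # affirmed mention before the index time
```

Flags of this kind are easy to show to a clinician alongside the sentence that triggered them, which is part of why they are a reasonable first step before note embeddings.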

3.4 Missingness can be informative, but model it explicitly

In EHR data, “not measured” often means “not clinically concerning enough to measure” or, paradoxically, “too unstable for routine charting to keep up.” Missingness patterns can therefore be predictive, but only if modeled carefully. A missing lab indicator might capture the workflow of the ward, while an imputed value without a companion mask can mislead the model.

Best practice is to include missingness masks and time-since-last-observation features. This allows the model to differentiate a genuinely normal patient from a patient who has simply not been measured recently. If you want a model that works outside a single dataset, this discipline matters as much as feature choice itself, similar to how structured product data improves recommendation reliability across surfaces.
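A small sketch of the mask and time-since-last-observation pattern, assuming an hourly grid and integer hour offsets (both simplifications; the function and field names are illustrative):

```python
def grid_with_mask(readings, grid_hours):
    """Forward-fill a sparse series onto a grid while keeping the evidence trail.

    `readings` maps hour offset -> measured value for hours with a measurement.
    Each output row carries the carried-forward value, an observed mask, and
    the number of hours since the last real measurement.
    """
    rows = []
    last_value, last_hour = None, None
    for hour in grid_hours:
        observed = hour in readings
        if observed:
            last_value, last_hour = readings[hour], hour
        rows.append({
            "hour": hour,
            "value_ffill": last_value,
            "observed": int(observed),
            "hours_since_last": None if last_hour is None else hour - last_hour,
        })
    return rows


lactate = {1: 1.8, 4: 3.2}  # measured at hour 1 and hour 4 only
rows = grid_with_mask(lactate, range(0, 6))
for row in rows:
    print(row)
```

With the mask and the staleness counter in place, a carried-forward 1.8 at hour 3 is distinguishable from a fresh 1.8, which is the distinction the paragraph above asks the model to make.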

4) Modeling strategies for sepsis time series

4.1 Baselines first: regression and gradient-boosted trees

Before reaching for deep learning, build a strong baseline with logistic regression or gradient-boosted trees on engineered temporal features. These models are fast, usually easier to calibrate, and more defensible in a clinical review. They also give you a floor to measure against: if a complex model barely beats a well-tuned tree ensemble, the extra complexity may not be worth the risk.

Use baseline models to probe feature utility. Evaluate whether the model depends primarily on a few high-signal variables, whether it overfits to missingness, and whether it collapses under hospital-site changes. If a baseline captures most of the signal, it can also be a great production candidate because maintenance burden is lower and explainability is easier to communicate.

4.2 Sequence models are powerful, but they demand disciplined evaluation

RNNs, temporal CNNs, and transformers can exploit longer context and nonlinear temporal interactions that tabular models may miss. They are especially useful when rich multi-channel time series are available at fine granularity. However, these models are also more sensitive to irregular sampling, label leakage, and distribution drift across sites.

If you use a sequence model, define the input sequence carefully: fixed observation windows, masking rules, and explicit timestamps. Test against site-holdout splits, not only random splits, because random row-level partitioning can put the same patient, or the same site's workflow habits, in both training and test data and so inflate measured performance. The same principle appears in scalable SDK design: an elegant interface is not enough if the system fails under real integration constraints.
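A site-holdout split can be sketched in a few lines. The row schema (`patient_id`, `site`, `label`) is assumed for illustration, and the overlap check is one possible guard against patients who were seen at more than one site.

```python
def site_holdout_split(rows, holdout_sites):
    """Split rows so that whole sites are held out for testing.

    Raises if any patient appears on both sides, since that would
    reintroduce the leakage the split is meant to prevent.
    """
    holdout_sites = set(holdout_sites)
    train = [r for r in rows if r["site"] not in holdout_sites]
    test = [r for r in rows if r["site"] in holdout_sites]
    shared = {r["patient_id"] for r in train} & {r["patient_id"] for r in test}
    if shared:
        raise ValueError(f"patients in both splits: {sorted(shared)}")
    return train, test


rows = [
    {"patient_id": "p1", "site": "A", "label": 0},
    {"patient_id": "p1", "site": "A", "label": 1},
    {"patient_id": "p2", "site": "B", "label": 0},
    {"patient_id": "p3", "site": "C", "label": 1},
]
train, test = site_holdout_split(rows, {"C"})
print(len(train), len(test))
```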

4.3 Ensemble and hybrid approaches often win in practice

A pragmatic clinical architecture may combine a tabular risk model, an NLP signal extractor, and a rules-based safety layer. For example, the tabular model estimates baseline risk, the NLP pipeline adds evidence of infection concern, and a post-processing layer suppresses low-confidence alerts when recent charting suggests a transient spike rather than sustained decline. This hybrid approach can improve both sensitivity and operational precision.

Hybrid systems also make sense when you need guardrails. A rules-based “minimum criteria” gate can reduce absurd alerts, while the ML model focuses on ranking within plausible candidates. For implementation teams, this is closer to the layered approach recommended in safe rollback and observability patterns than to a pure black-box deployment.
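The gate-then-rank idea might look like the sketch below. The vital-sign cutoffs, the two-sign rule, and the 0.6 risk threshold are placeholders chosen for the example, not clinical guidance; any real gate should be set with clinicians.

```python
def minimum_criteria_met(vitals):
    """Hypothetical safety gate: require at least two abnormal signs."""
    abnormal = [
        vitals["heart_rate"] > 100,
        vitals["resp_rate"] > 22,
        vitals["temp_c"] > 38.3 or vitals["temp_c"] < 36.0,
        vitals["sbp"] < 100,
    ]
    return sum(abnormal) >= 2


def should_alert(risk, vitals, threshold=0.6):
    """Fire only when the rules gate passes and the model risk is high enough."""
    return minimum_criteria_met(vitals) and risk >= threshold


stable = {"heart_rate": 82, "resp_rate": 16, "temp_c": 36.8, "sbp": 124}
unwell = {"heart_rate": 118, "resp_rate": 24, "temp_c": 38.6, "sbp": 96}

print(should_alert(0.9, stable))  # high score but implausible physiology: suppressed
print(should_alert(0.7, unwell))  # gate passes and score clears the threshold
print(should_alert(0.3, unwell))  # gate passes but the model is not concerned
```

Keeping the gate separate from the model means either layer can be tightened or relaxed without retraining the other.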

5) Reducing false positives and tuning the operating threshold

5.1 False positives should be measured at the unit and shift level

Absolute false-positive rate is not enough. You should also measure alerts per 100 patient-hours, alerts per nurse shift, and alert burden by unit type. A model that looks acceptable in aggregate may still be unusable if it floods a single unit during predictable periods like post-op recovery or morning lab review. Clinical usefulness depends on the timing and distribution of alerts, not just their count.
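To illustrate why aggregate numbers hide unit-level burden, the sketch below computes alerts per 100 patient-hours and per nurse shift for two units. The unit names and counts are invented for the example.

```python
def alert_burden(unit_stats):
    """Compute per-unit alert burden from alert counts and exposure."""
    report = {}
    for unit, s in unit_stats.items():
        report[unit] = {
            "alerts_per_100_patient_hours": 100.0 * s["alerts"] / s["patient_hours"],
            "alerts_per_nurse_shift": s["alerts"] / s["nurse_shifts"],
        }
    return report


# Invented counts for one week of silent-mode data.
unit_stats = {
    "med_surg": {"alerts": 12, "patient_hours": 2400, "nurse_shifts": 60},
    "pacu": {"alerts": 9, "patient_hours": 300, "nurse_shifts": 10},
}
report = alert_burden(unit_stats)

total_alerts = sum(s["alerts"] for s in unit_stats.values())
total_hours = sum(s["patient_hours"] for s in unit_stats.values())
overall = 100.0 * total_alerts / total_hours

print(report)
print(round(overall, 2))
```

In this made-up data the pooled rate is under one alert per 100 patient-hours, while the post-anesthesia unit sees six times the med-surg rate, which is exactly the pattern an aggregate metric would miss.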

This is where threshold tuning becomes a clinical systems problem. You may intentionally choose lower sensitivity in an environment where response capacity is constrained, or higher sensitivity in a high-acuity unit with established rapid-response workflows. Calibration curves, decision curves, and cost-sensitive analysis are more informative than ROC AUC alone. It is similar to the tradeoff analysis in utility-first procurement: the highest spec is not always the best value.
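Decision curves are built from net benefit, which weighs true positives against false positives at the odds implied by the chosen threshold. A minimal sketch, assuming binary labels and a threshold strictly between 0 and 1 (the data below is invented):

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of alerting when predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_alert_all(y_true, threshold):
    """Reference strategy: alert on every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.8, 0.3, 0.1, 0.6, 0.7, 0.2, 0.1, 0.05, 0.4, 0.15]

nb_model = net_benefit(y_true, y_prob, threshold=0.5)
nb_all = net_benefit_alert_all(y_true, threshold=0.5)
print(nb_model, nb_all)
```

Plotting these two quantities (plus the alert-on-nobody line at zero) across a range of thresholds gives the decision curve; a model is only worth deploying at thresholds where it sits above both reference strategies.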

5.2 Use alert suppression logic and context windows

A recurring issue in sepsis detection is repeated alerts from the same patient over a short period. Without suppression logic, the model can generate a stream of nearly identical warnings, each technically justified but operationally redundant. A practical fix is a cooldown window or stateful alerting policy that escalates only when the risk meaningfully changes.
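A stateful alerting policy with a cooldown could be sketched as follows. The class name, the 4-hour cooldown, and the 0.15 escalation margin are assumptions for illustration and would need tuning per unit.

```python
from datetime import datetime, timedelta


class AlertPolicy:
    """Fire on threshold crossing, then stay quiet unless risk escalates."""

    def __init__(self, threshold=0.6, cooldown=timedelta(hours=4),
                 escalation_delta=0.15):
        self.threshold = threshold
        self.cooldown = cooldown
        self.escalation_delta = escalation_delta
        self._last_fired = {}  # patient_id -> (time, risk) of the last alert

    def decide(self, patient_id, now, risk):
        if risk < self.threshold:
            return False
        last = self._last_fired.get(patient_id)
        if last is not None:
            last_time, last_risk = last
            in_cooldown = now - last_time < self.cooldown
            escalated = risk - last_risk >= self.escalation_delta
            if in_cooldown and not escalated:
                return False  # redundant repeat of a recent alert
        self._last_fired[patient_id] = (now, risk)
        return True


policy = AlertPolicy()
t0 = datetime(2024, 1, 1, 8, 0)
decisions = [
    policy.decide("p1", t0, 0.65),                       # first crossing
    policy.decide("p1", t0 + timedelta(hours=1), 0.68),  # same picture, suppressed
    policy.decide("p1", t0 + timedelta(hours=2), 0.85),  # meaningful escalation
    policy.decide("p1", t0 + timedelta(hours=7), 0.86),  # cooldown has expired
]
print(decisions)
```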

You can also suppress alerts in known noisy contexts, such as immediately after procedures, when charting is incomplete, or during transfers where vital sign artifacts are common. This does not mean hiding true danger; it means preventing the model from shouting during periods when clinicians already know the patient is unstable for non-sepsis reasons. For systems engineering analogies, see predictive maintenance and continuous self-checks.

5.3 Calibrate probabilities before exposing them to clinicians

Many teams train a good classifier and then expose raw probabilities that are poorly calibrated. That is a mistake in healthcare because clinicians interpret risk scores as meaningful estimates, not arbitrary model outputs. Temperature scaling, isotonic regression, and Platt scaling can improve calibration, but calibration should be rechecked by site and by subgroup.

Good calibration supports threshold tuning, shared decision-making, and communication with clinicians. It also helps avoid “all-or-nothing” interactions where a score slightly above threshold causes an alert that appears disproportionate to the clinical picture. A calibrated 0.72 is far more actionable than an uncalibrated 0.91 that merely reflects training-set quirks. Think of it as the reliability layer described in micro-answer design: precision in the output shape matters as much as the underlying content.
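Rechecking calibration by site can start with something as simple as a binned calibration error per group. The sketch below uses equal-width bins and invented data; the row schema (`site`, `y`, `p`) is an assumption.

```python
from collections import defaultdict


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean gap between predicted and observed risk across bins."""
    bins = defaultdict(list)
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins.values():
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += len(members) / n * abs(observed - predicted)
    return ece


def ece_by_group(rows, group_key, n_bins=10):
    """Compute calibration error separately for each value of group_key."""
    grouped = defaultdict(list)
    for r in rows:
        grouped[r[group_key]].append(r)
    return {
        g: expected_calibration_error([r["y"] for r in m], [r["p"] for r in m], n_bins)
        for g, m in grouped.items()
    }


rows = (
    [{"site": "A", "p": 0.2, "y": y} for y in (1, 0, 0, 0, 0)]    # 20% predicted, 20% observed
    + [{"site": "B", "p": 0.8, "y": y} for y in (1, 1, 0, 0, 0)]  # 80% predicted, 40% observed
)
by_site = ece_by_group(rows, "site")
print(by_site)
```

Swapping `"site"` for a subgroup column gives the per-subgroup check with no other changes.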

6) Explainability: making the model clinically legible

6.1 Clinicians need reasons, not only scores

Explainability in sepsis detection should answer a simple question: why is this patient flagged now? A useful explanation references recent physiologic changes, lab trends, and relevant note evidence rather than generic feature importance. If a clinician cannot map the explanation to bedside reality, trust will erode quickly even if the model is technically strong.

At minimum, provide patient-specific contributions at the time of alert, trend visualizations, and a plain-language summary of the major drivers. In a practical workflow, that might mean: rising heart rate over six hours, new hypotension, increased respiratory rate, elevated lactate, and note evidence of suspected infection. This approach mirrors the need for visual explainers in high-stakes reporting workflows, where the audience needs context, not abstraction.
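For a linear or logistic baseline, patient-specific contributions can be read directly off the coefficients: each feature's contribution to the log-odds, relative to a reference patient, is its weight times its deviation from the reference value. The coefficients and reference values below are hypothetical, and tree or sequence models would need a dedicated attribution method instead.

```python
def local_contributions(weights, reference, patient):
    """Rank features by their contribution to log-odds versus a reference patient."""
    contribs = {name: w * (patient[name] - reference[name])
                for name, w in weights.items()}
    return sorted(contribs.items(), key=lambda kv: abs(kv[1]), reverse=True)


def plain_language(ranked, top_k=3):
    """Turn the top drivers into short statements for the alert display."""
    lines = []
    for name, value in ranked[:top_k]:
        direction = "raises" if value > 0 else "lowers"
        lines.append(f"{name} {direction} risk (log-odds {value:+.2f})")
    return lines


weights = {"heart_rate": 0.03, "sbp": -0.04, "lactate": 0.6}  # hypothetical coefficients
reference = {"heart_rate": 80, "sbp": 120, "lactate": 1.0}    # hypothetical cohort means
patient = {"heart_rate": 115, "sbp": 92, "lactate": 3.4}

ranked = local_contributions(weights, reference, patient)
for line in plain_language(ranked):
    print(line)
```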

6.2 Choose explainability methods that fit the model and the users

For tree-based models, SHAP values are often a good starting point because they provide local contributions and aggregate feature patterns. For sequence models, saliency maps, attention summaries, and feature attribution over time can be useful, though they can be misleading if interpreted naively. Whatever method you choose, validate that explanations are stable and clinically plausible across representative cases.

Do not present explanations as proof of causality. They are support tools, not causal evidence. A strong explanation strategy often includes multiple layers: global feature importance for governance, local patient-level explanations for bedside use, and cohort-level error analysis for quality improvement. This kind of layered trust is similar to how AI trust systems in community contexts balance transparency with utility.

6.3 Audit explanations for subgroup bias and failure modes

Explainability can fail silently if it behaves differently across subgroups. For example, the model may overemphasize age or comorbidity in one population while relying on acute physiologic change in another. That can lead to skewed alerting behavior that looks sensible on paper but embeds inequity in deployment. Audit feature attributions by age, sex, race/ethnicity, service line, and site.

Also inspect the explanations for shortcut learning. If the model repeatedly cites ICU transfer indicators, antibiotic orders, or late-stage documentation, it may be predicting the response to sepsis rather than sepsis onset itself. The point of explainability is not just interpretability; it is model debugging.

7) Validation design: retrospective, silent, and prospective studies

7.1 Use a layered validation ladder

A serious sepsis program should not jump from retrospective AUC to live deployment. Build a validation ladder: internal retrospective testing, temporal validation, site-external validation, silent prospective monitoring, and then clinician-interventional evaluation. Each stage answers a different question about generalization, calibration, and workflow impact.

Temporal validation is especially important because model drift is inevitable in healthcare. Coding practices change, lab order sets evolve, sepsis guidelines shift, and patient mix varies by season and staffing. A model that performs well on 2022 data may decay in 2026 if the practice environment has changed. This is one reason the broader market increasingly values integrations that support ongoing monitoring and re-calibration, rather than one-time deployment.

7.2 Silent mode is not optional

Before any bedside alerting, run the model in silent mode and compare predictions against actual outcomes while hiding outputs from clinicians. This phase helps surface calibration issues, alert burden, and per-site behavior without affecting care. It also reveals whether the model is unstable in certain wards, at certain times of day, or during periods of missing data.

Silent mode should be treated as a production rehearsal, not a checkbox. Track what the model would have alerted on, how often, and whether those alerts would have been actionable. For engineering teams, this is akin to staging and canary patterns in major software rollouts—you want evidence that the system behaves before users rely on it.

7.3 Prospective validation should measure process and outcome

A prospective validation study should not only report AUROC or AUPRC. It should also measure time to recognition, time to antibiotics, rapid response activation, ICU transfer, clinician response rate, alert acknowledgment rate, and false-alarm burden. If possible, include patient-centered outcomes such as length of stay and mortality, but be realistic about power and confounding.

Design the study with a clinical champion and an operations lead. Define who sees the alert, what they are expected to do, what escalation pathways exist, and how adherence will be tracked. If the model triggers an alert but the unit has no clear response protocol, the trial may fail for operational reasons rather than predictive ones.

8) Model drift, monitoring, and lifecycle management

8.1 Drift will happen: plan for it explicitly

Model drift in sepsis detection can come from changes in EHR schemas, lab panels, documentation habits, clinical protocols, coding incentives, or patient population. Because healthcare is dynamic, drift monitoring must be continuous rather than periodic. Track calibration drift, feature distribution drift, alert rate drift, and outcome drift separately, because they do not always move together.

Monitoring should be tied to action thresholds. For example, if calibration error rises beyond a predefined bound or if alert burden increases sharply without outcome improvement, trigger review and possible retraining. This is conceptually similar to observability in cross-system automations: if you can’t observe behavior clearly, you can’t safely automate decisions.
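One common way to put a number on feature distribution drift is the population stability index (PSI), shown here as a sketch. PSI is a technique swapped in for illustration, not something the text prescribes; the bin edges, the sample values, and the 0.2 review threshold (a widely used rule of thumb) are all assumptions.

```python
import math


def population_stability_index(reference, current, edges):
    """PSI between two samples using shared bin edges (edges must be sorted)."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1
        floor = 1e-6  # avoid log(0) for empty bins
        return [max(c / len(values), floor) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


REVIEW_THRESHOLD = 0.2  # rule-of-thumb trigger; set your own with governance

lactate_train = [1.0, 1.2, 1.5, 1.8, 2.1, 2.4, 1.1, 1.3, 3.0, 1.6]
lactate_live = [2.2, 2.8, 3.5, 4.1, 1.9, 2.6, 3.1, 4.4, 2.0, 3.3]
edges = [1.5, 2.0, 4.0]

psi_same = population_stability_index(lactate_train, lactate_train, edges)
psi_shift = population_stability_index(lactate_train, lactate_live, edges)
needs_review = psi_shift > REVIEW_THRESHOLD
print(psi_same, round(psi_shift, 2), needs_review)
```

The same pattern, a metric plus a predefined bound that triggers human review, applies to calibration error and alert rate; only the metric changes.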

8.2 Governance requires retraining policies and versioning

Every model version should be tied to a feature specification, label definition, training window, validation cohort, and deployment date. Without rigorous versioning, post-hoc analysis becomes nearly impossible when clinicians ask why the alert pattern changed. Governance should also specify who approves retraining, when revalidation is required, and what thresholds for rollback are acceptable.

One effective pattern is to maintain a model registry with reproducible training artifacts and evaluation reports. Pair that with scheduled review meetings involving data science, clinical leadership, informatics, and quality improvement. This kind of formal lifecycle management is not unlike the guardrails in enterprise feature prioritization frameworks, where decisions must align with business and risk constraints.
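A registry entry does not need to be elaborate to be useful. The sketch below ties a version to the items listed above and derives a short fingerprint so that any change to the definition is detectable; the field names and example values are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class ModelVersion:
    version: str
    feature_spec: str        # e.g. path or hash of the feature dictionary
    label_definition: str
    training_window: tuple   # (start_date, end_date)
    validation_cohort: str
    deployed_on: date

    def fingerprint(self) -> str:
        """Short, stable hash of everything that defines this version."""
        payload = json.dumps(asdict(self), sort_keys=True, default=str)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


v1 = ModelVersion(
    version="1.0.0",
    feature_spec="features_v3.yaml",
    label_definition="onset within 6 h of index time",
    training_window=(date(2021, 1, 1), date(2022, 12, 31)),
    validation_cohort="site-holdout C, 2023",
    deployed_on=date(2024, 3, 1),
)
v1_copy = ModelVersion(**asdict(v1))
v2 = ModelVersion(**{**asdict(v1), "version": "1.1.0",
                     "label_definition": "onset within 12 h of index time"})
print(v1.fingerprint(), v2.fingerprint())
```

Logging the fingerprint with every alert makes it possible to answer "which model produced this?" during a post-incident review.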

8.3 Monitor subgroup performance and workflow impact

A model can drift unevenly. It may remain accurate overall while becoming worse for certain units, age groups, or admission pathways. Monitoring subgroup metrics is essential for safety and for maintaining fairness in alerting. If a model benefits one population more than another, that imbalance should be visible quickly.

Also monitor downstream workflow effects: number of alerts accepted, override rate, median response time, clinician satisfaction, and alert fatigue. Clinical utility is a systems property, not a point metric. The best monitoring programs borrow from resilient content operations: feedback loops, trend analysis, and active adaptation matter more than isolated numbers.

9) A practical comparison of modeling and validation choices

The right approach depends on your stage, data quality, and clinical goal. The table below summarizes common tradeoffs across model families and validation strategies. Use it as a planning tool, not a doctrine. In many programs, the best answer is a hybrid of approaches over time.

Approach	Strengths	Weaknesses	Best Use Case	Operational Risk
Logistic regression	Simple, fast, highly interpretable	Limited nonlinear modeling	Baseline risk scoring and governance-friendly deployment	Low
Gradient-boosted trees	Strong tabular performance, handles missingness well	Less intuitive than regression, still needs calibration	Structured EHR features with engineered time windows	Low to medium
RNN / temporal CNN	Captures sequential patterns and trend dynamics	Harder to explain, sensitive to missingness and drift	High-frequency vitals with rich temporal history	Medium to high
Transformer-based time-series model	Flexible long-range context and masking	Compute-heavy, data-hungry, complex validation	Large multi-site datasets with consistent streaming inputs	High
Hybrid ML + rules layer	Better operational control, suppresses obvious noise	More system complexity and maintenance	Production alerting where false positives are expensive	Medium
Silent prospective validation	Measures real-world behavior without influencing care	Doesn’t prove outcome benefit alone	Pre-launch rehearsal and drift assessment	Low
Interventional clinical trial	Tests true workflow and outcome impact	Requires coordination, power, and governance	Evidence generation for adoption and scale-up	Medium

10) Designing a clinician-partnered prospective trial

10.1 Define the intervention clearly

A clinical trial is not just “turning on the model.” Specify who receives the alert, when it fires, what evidence is displayed, whether alerts are interruptive or passive, and what action is expected. If the intervention is vague, outcome interpretation will be impossible because no one will know what exactly changed.

It helps to write the alert pathway as a protocol artifact: detection, presentation, acknowledgment, triage, escalation, and follow-up. That level of clarity is what turns a prediction model into a clinical tool. Borrowing from safety-critical operator workflows, the system should be designed around coordinated roles, not just a technology event.

10.2 Choose a trial design that matches adoption risk

Common designs include stepped-wedge cluster randomized trials, cluster randomized trials by unit, interrupted time series, and before/after quasi-experiments. If clinical buy-in is strong and units can be randomized, cluster designs provide stronger evidence. If operational realities make randomization difficult, carefully controlled phased rollouts can still produce useful evidence, provided you model secular trends.
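For a sense of what a stepped-wedge rollout looks like as data, the sketch below assigns units to crossover steps in random order and returns each unit's exposure by period (0 for control, 1 for alerts live). The unit names, the seed, and the round-robin assignment are illustrative; a real trial's randomization should be designed with a statistician.

```python
import random


def stepped_wedge_schedule(units, n_steps, seed=0):
    """Return {unit: [exposure per period]} for a stepped-wedge rollout.

    Period 0 is all-control; one group of units crosses over at each step,
    so every unit is exposed by the final period.
    """
    rng = random.Random(seed)  # fixed seed keeps the allocation reproducible
    order = list(units)
    rng.shuffle(order)
    crossover = {unit: 1 + i % n_steps for i, unit in enumerate(order)}
    n_periods = n_steps + 1
    return {unit: [int(period >= crossover[unit]) for period in range(n_periods)]
            for unit in order}


schedule = stepped_wedge_schedule(["4E", "4W", "5E", "5W", "6E", "6W"], n_steps=3)
for unit, exposure in schedule.items():
    print(unit, exposure)
```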

Whatever design you choose, involve statisticians and clinical informatics experts early. Decide how alert exposure, crossover, and adherence will be handled. If your study is underpowered or confounded by protocol changes, it may fail to answer the central question even if the model is excellent.

10.3 Pre-register endpoints and safety outcomes

Pre-registration is a trust multiplier. Define primary endpoints, secondary endpoints, subgroup analyses, and safety metrics before the study begins. Safety outcomes might include delayed care, excess alert burden, and inappropriate escalation. If the model could plausibly cause harm through over-alerting or missed events, those risks must be monitored explicitly.

Clinical teams are more likely to engage when the trial shows respect for operational burden and patient safety. This is one reason products with transparent governance and evaluation tend to win adoption, much like well-structured budget planning wins over ad hoc spending.

11) A deployment checklist for ML engineers

11.1 Data and feature checklist

Confirm that every feature is timestamped, available at prediction time, and reproducible from the source tables. Validate units, ranges, outliers, and duplicate events. Build a feature dictionary that specifies whether each feature is static, rolling, delta-based, or note-derived, and whether it is available across all deployment sites.
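Range checks are a cheap way to catch unit mix-ups before they reach the model. In the sketch below the plausible ranges are illustrative bounds, not clinical reference ranges, and the feature names are assumed.

```python
# Illustrative plausibility bounds; agree on real ones with clinical informatics.
PLAUSIBLE_RANGES = {
    "heart_rate": (20, 300),
    "temp_c": (25.0, 45.0),
    "lactate_mmol_l": (0.0, 30.0),
}


def range_violations(row):
    """List every feature in the row that falls outside its plausible range."""
    problems = []
    for name, (low, high) in PLAUSIBLE_RANGES.items():
        value = row.get(name)
        if value is None:
            continue  # missingness is handled separately, not treated as an error
        if not low <= value <= high:
            problems.append(f"{name}={value} outside [{low}, {high}]")
    return problems


row = {"heart_rate": 92, "temp_c": 98.6, "lactate_mmol_l": None}
problems = range_violations(row)
print(problems)  # a temperature of 98.6 suggests Fahrenheit in a Celsius field
```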

Also document missingness handling and leakage controls. If your training pipeline joins future information, you may create a model that looks excellent in validation and then fails in production. This is a classic failure mode in data-rich systems, and one reason teams should borrow practices from structured metadata pipelines.

11.2 Model and calibration checklist

Evaluate discrimination, calibration, and decision-curve utility. Use site-holdout and temporal holdout splits, not just random splits. Then calibrate per site if necessary, and document whether the threshold is fixed or adaptive. If your alert policy changes by service line or unit, make that explicit in the deployment spec.

Before launch, run a usability review with bedside clinicians. Ask whether the alert is understandable, whether the suggested action is appropriate, and whether the frequency is tolerable. A technically good model that clinicians ignore is not a successful model.

11.3 Monitoring and rollback checklist

Implement dashboards for calibration drift, alert volume, acknowledgment rate, and outcome trends. Create rollback criteria in advance and ensure a human can disable or downgrade alerts quickly if the system misbehaves. Retain model version history and feature definitions so post-incident review is possible.

For resilient release management, use the same mindset recommended in simulation-backed CI pipelines: test, observe, and only then scale.

Pro Tip: In sepsis detection, the highest-value improvement is often not a fancier model but a better alert policy. A well-calibrated threshold, a suppression window, and a clear clinician action path can outperform a more complex model that floods the unit with noise.

12) The bottom line: useful sepsis ML is a clinical system, not a leaderboard

12.1 Build for actionability, not just accuracy

Sepsis detection systems succeed when they change care early enough to matter. That means your model must be trained on clinically meaningful data, calibrated for the actual environment, explainable enough to earn trust, and embedded into workflows that enable action. If any of those pieces are missing, the system may still produce impressive offline scores while failing in practice.

The most effective teams treat model development as one component of a larger intervention: data pipeline, feature design, alerting policy, clinician UX, monitoring, and continuous revalidation. That systems view is how robust products are built across industries, from automation to AI-assisted publishing.

12.2 Treat validation as an ongoing contract

Prospective validation is not the final box to check; it is the start of a lifecycle. Clinical environments change, and the model must be monitored, recalibrated, and sometimes retired. If you plan for drift from the outset and create governance that includes clinicians, your chances of sustained utility rise dramatically.

For teams building toward production, the key is to keep one question front and center: would a clinician act differently, and would that action likely help the patient? If you can answer yes with evidence, you are on the right track.

Related reading

Building reliable cross-system automations - Great reference for observability, testing, and rollback discipline.
Mitigating bad data in automated systems - Useful analogies for noisy, incomplete healthcare feeds.
Integrating simulators into CI - Practical model for pre-deployment validation.
Design micro-answers for discoverability - Helpful for structuring concise clinician-facing outputs.
Prioritizing enterprise features with market intelligence - Strong governance framework for product roadmaps.

FAQ

What is the best prediction horizon for sepsis detection?

It depends on the clinical workflow and response capacity. Short horizons like 3–6 hours often improve specificity and actionability, while longer horizons provide more lead time but can be less precise.

Should we use deep learning or gradient-boosted trees?

Start with strong baselines such as logistic regression or gradient-boosted trees on engineered features. Move to deep learning only if sequence complexity, data scale, and deployment requirements justify it.

How do we reduce false positives without missing true cases?

Use calibrated probabilities, threshold tuning with clinicians, alert suppression windows, and unit-specific operating points. Also measure alert burden per shift, not just aggregate accuracy.

How important is NLP in sepsis models?

NLP can add valuable context from notes, especially for suspected infection, deterioration, and clinician concern. But it must be timestamped carefully to avoid leakage and should complement structured vitals and labs.

What should a prospective validation study include?

It should include a clear intervention protocol, clinician involvement, pre-registered endpoints, safety metrics, and a design that measures workflow and outcome impact—not just prediction performance.