Provider Education · April 22, 2026

HealthBench Professional

Evaluating Large Language Models on Real Clinician Chats

Context

AI Use Among Clinicians Has Doubled

Millions
Clinicians using ChatGPT weekly
2×
Growth in clinician usage over the past year
Limited
Benchmarks reflecting real multi-turn clinical workflows
Gap: Most existing benchmarks rely on single-turn, multiple-choice, or synthetic vignettes — not real clinician-model conversations.
The Benchmark

Introducing HealthBench Professional

📋
Meaningful
Tasks reflect real clinician workflows at the point of care, directly impacting care delivery.
Trustworthy
Every example adjudicated by three or more physicians across three review phases.
🎯
Challenging
Difficult cases enriched ~3.5× via stratified sampling (sketched below); preserves headroom for future models.
525
Final benchmark tasks
15,079
Candidate examples reviewed
190
Physician contributors across 50 countries
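The "Challenging" card above mentions ~3.5× enrichment of difficult cases via stratified sampling but does not spell out the procedure. Below is a minimal sketch of one way such enrichment could work; the candidate pool, the likert field, and the 3.5× weighting of the hard stratum are illustrative assumptions, not the paper's released code.

```python
import random

# Hypothetical candidate pool: each reviewed example carries the authoring
# physician's Likert 1-7 difficulty rating (low = harder for the model).
candidates = [{"id": i, "likert": random.randint(1, 7)} for i in range(15_079)]

def stratified_enrich(pool, n_final=525, hard_weight=3.5):
    """Draw a benchmark in which the hard stratum (Likert 1-2) is sampled
    roughly hard_weight times more often than its natural share."""
    hard = [ex for ex in pool if ex["likert"] <= 2]
    easy = [ex for ex in pool if ex["likert"] > 2]
    w_hard = hard_weight * len(hard)
    w_easy = 1.0 * len(easy)
    n_hard = min(round(n_final * w_hard / (w_hard + w_easy)), len(hard))
    return random.sample(hard, n_hard) + random.sample(easy, n_final - n_hard)

benchmark = stratified_enrich(candidates)
print(sum(ex["likert"] <= 2 for ex in benchmark) / len(benchmark))  # enriched hard share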
Benchmark Structure

Three Core Clinical Use Cases

Use Case | Description | Share of Benchmark
Care Consult | Reasoning through differentials, management, and treatment decisions | 48.8% (n = 256)
Writing & Documentation | Note generation, summarization, coding, patient messaging | 27.0% (n = 142)
Medical Research | Finding and synthesizing evidence relevant to clinical questions | 14.9% (n = 78)
28 medical specialties represented — top 10 specialties account for 57.7% of examples.
Data Collection

Good Faith & Red Teaming

🩺
Good Faith
Physicians using ChatGPT for Clinicians in routine clinical, academic, and research work. Provides broad coverage of realistic professional workflows.
⚠️
Red Teaming
Deliberate adversarial stress-testing to surface errors, edge cases, and failure modes. Approximately one-third of the final benchmark.
1. Physician Authorship: Creates the conversation, writes the rubric, and assigns a Likert 1–7 difficulty rating
2. Physician Review: Independent review for realism, difficulty, and rubric quality
3. Final Adjudication: Resolves ambiguities; confirms clinical accuracy and relevance
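Each phase adds information to a task. A hypothetical record shape is sketched below; all field names are illustrative assumptions, not the released dataset schema.

```python
from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    text: str           # physician-written criterion, graded independently
    points: int         # -10 to +10; negative for unsafe or harmful behavior

@dataclass
class BenchmarkTask:
    # Phase 1: physician authorship
    conversation: list[dict]              # multi-turn clinician-model chat
    rubric: list[RubricCriterion] = field(default_factory=list)
    author_likert: int = 4                # 1-7 difficulty rating
    use_case: str = "care_consult"        # or "writing_docs" / "medical_research"
    specialty: str = ""
    source: str = "good_faith"            # or "red_team"
    # Phase 2: independent physician review
    review_passed: bool = False           # realism, difficulty, rubric quality
    # Phase 3: final adjudication
    adjudication_notes: str = ""
```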
Baseline

The Human Physician Reference Standard

🔬
Specialty-Matched
Each task assigned to a physician with relevant specialty or subspecialty expertise.
⏱️
Unbounded Time
No time constraint; physicians could use internet and standard medical databases.
🚫
No AI Assistance
Responses written without AI tools to represent a true human upper bound.
Purpose: Establishes what an ideal, well-resourced physician response looks like — the gold standard against which AI systems are measured.
Methodology

Rubric-Based Scoring

Component | Detail
Rubric Criteria | Physician-written; each criterion independently evaluated
Point Range | −10 to +10 per criterion; negative points for unsafe or harmful behavior
Length Adjustment | Penalty of 1.47 pts per 500 characters above 2,000 characters, to prevent verbosity gaming
Grader Model | GPT-5.4 at low reasoning effort (more capable than prior HealthBench grader)
Samples per Example | 8 samples averaged to reduce variability in main comparisons
Score interpretation: Scores reported ×100 for readability. A score of 45 on HealthBench Professional can coexist with high real-world performance in typical usage.
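Assembled from the table above, a minimal sketch of the scoring arithmetic (signed criterion points, the 1.47-point penalty per 500 characters beyond 2,000, and averaging 8 samples per example) might look like the following. The criterion_met callback stands in for the grader model's per-criterion judgment, and both the normalization to a 0–100 scale and the point at which the length penalty is applied are assumptions, since the page does not spell them out.

```python
def length_penalty(response: str, free_chars: int = 2000, rate: float = 1.47) -> float:
    # 1.47 points per 500 characters beyond the first 2,000 characters
    excess = max(0, len(response) - free_chars)
    return rate * excess / 500

def score_response(response: str, rubric: list[dict], criterion_met) -> float:
    # rubric: [{"text": ..., "points": -10..+10}]
    # criterion_met(criterion, response) -> bool is a placeholder for the
    # grader model's independent evaluation of each criterion.
    earned = sum(c["points"] for c in rubric if criterion_met(c, response))
    achievable = sum(c["points"] for c in rubric if c["points"] > 0)
    raw = 100 * earned / achievable           # assumed mapping to the reported x100 scale
    return raw - length_penalty(response)     # assumed to apply after scaling

def task_score(responses: list[str], rubric: list[dict], criterion_met) -> float:
    # 8 sampled responses per example, averaged to reduce grading variability
    return sum(score_response(r, rubric, criterion_met) for r in responses) / len(responses)
```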
Results

Overall Benchmark Performance

GPT-5.4 in ChatGPT for Clinicians: 59.0
GPT-5.4 (base): 48.1
GPT-5.4 + Browsing: 45.8
Human Physicians: 43.7
Claude Opus 4.7: ~40
Key finding: GPT-5.4 in ChatGPT for Clinicians significantly outperforms the human physician baseline (p = 3.7 × 10⁻¹⁰).
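The page reports p-values but not the statistical test used. One common choice for this kind of per-example paired comparison is a bootstrap over score differences; the sketch below illustrates that approach only and is not a claim about the paper's actual method.

```python
import random

def paired_bootstrap_p(model_scores, physician_scores, n_boot=10_000, seed=0):
    """Two-sided bootstrap p-value for the mean per-example score difference."""
    rng = random.Random(seed)
    diffs = [m - p for m, p in zip(model_scores, physician_scores)]
    observed = sum(diffs) / len(diffs)
    extreme = 0
    for _ in range(n_boot):
        resample = [rng.choice(diffs) for _ in diffs]
        # Recenter the resampled mean at zero (the null of no difference) and
        # count how often it is at least as extreme as the observed mean.
        if abs(sum(resample) / len(resample) - observed) >= abs(observed):
            extreme += 1
    return extreme / n_boot
```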
Results by Domain

Performance by Use Case

Use Case | ChatGPT for Clinicians | Human Physicians | Base GPT-5.4 | Significance
Care Consult | 51.0 | 42.7 | ~51.0 | p = 0.025
Writing & Docs | 64.1 | 32.1 | 34.6 | p < 10⁻⁸
Medical Research | 67.0 | 56.3 | 58.1 | p = 0.0015
Largest gain: Writing & Documentation — ChatGPT for Clinicians scored nearly 2× the physician baseline (64.1 vs. 32.1).
Adversarial Robustness

Performance on Hardest Cases

ChatGPT for Clinicians (RT Likert 1–2): 55.8
Human Physicians (RT Likert 1–2): 30.0
Base GPT-5.4 (RT Likert 1–2): 26.2
Good Faith Typical
ChatGPT for Clinicians: 69.0 vs. Physicians: 55.7 (p = 1.0 × 10⁻⁷)
Good Faith Difficult
Statistically tied with physicians (p = 0.60) — a genuine challenge for all systems.
Test-Time Scaling

Reasoning Effort Improves Scores

+5.6 to +7.3

Points gained by increasing reasoning effort, depending on verbosity setting

Key insight: Gains are driven by higher-quality responses — not simply longer ones. Length adjustment confirms this.
+3.3

Average score improvement across GPT-5 through GPT-5.4 models

Verbosity effect: Low → medium verbosity adds ~1,300 characters but only 3.9 unadjusted points, which the length penalty largely neutralizes.
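That verbosity claim can be checked against the length-penalty rate directly: if the ~1,300 extra characters all fall above the 2,000-character allowance, the penalty is 1.47 × (1300 / 500) ≈ 3.8 points, which almost exactly cancels the 3.9 unadjusted points gained. A one-line check under that assumption:

```python
extra_chars = 1300  # low -> medium verbosity; assumed entirely above the 2,000-char allowance
penalty = 1.47 * extra_chars / 500
print(f"{penalty:.1f} penalty points vs. 3.9 unadjusted points gained")  # ~3.8
```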
Critical Appraisal

Limitations & Interpretation

📊
Scores ≠ Real-World Performance
Adversarial enrichment means benchmark scores are intentionally lower than typical clinical performance.
🏥
EHR Workflows Not Covered
Institution-specific constraints and EHR-integrated workflows are absent from the current benchmark.
🔬
OpenAI Model Selection Bias
Difficulty ratings assigned against OpenAI frontier models; may partially confound cross-model comparisons.
Recommendation: Institutions should conduct pre-deployment evaluations for their specific use cases and monitor post-deployment performance.
Conclusion

Raising the Ceiling of Care

Today
On HealthBench Professional, GPT-5.4 in ChatGPT for Clinicians outperformed the physician baseline, with the largest gains in documentation and research tasks.
Benchmark Value
HealthBench Professional provides a rigorous, unsaturated measure to track frontier model progress in clinical AI.
Guardrail
Benchmark wins are not deployment clearance. Local evaluation, workflow testing, and physician accountability still determine whether a model belongs in clinical practice.

Sources: OpenAI, “Making ChatGPT better for clinicians,” April 22, 2026, plus the HealthBench Professional paper and dataset release.
