Provider Education · April 22, 2026
HealthBench Professional
Evaluating Large Language Models on Real Clinician Chats
OpenMFM review of Soskin Hicks, Trofimov, Singhal et al. (OpenAI)
Context
AI Use Among Clinicians Has Doubled
Millions
Clinicians using ChatGPT weekly
2×
Growth in clinician usage over the past year
Limited
Benchmarks reflecting real multi-turn clinical workflows
Gap: Most existing benchmarks rely on single-turn, multiple-choice, or synthetic vignettes — not real clinician-model conversations.
The Benchmark
Introducing HealthBench Professional
📋
Meaningful
Tasks reflect real clinician workflows at the point of care, directly impacting care delivery.
✅
Trustworthy
Every example adjudicated by three or more physicians across three review phases.
🎯
Challenging
Difficult cases enriched ~3.5× via stratified sampling (see the sketch below); preserves headroom for future models.
525
Final benchmark tasks
15,079
Candidate examples reviewed
190
Physician contributors across 50 countries
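The release does not include its sampling code, but the ~3.5× enrichment of difficult cases can be pictured as stratified sampling that over-weights the hardest-rated candidates when drawing the final 525 tasks from the 15,079 reviewed. A minimal sketch, assuming a hypothetical `candidates` list with a per-example Likert field in which low values mark the hardest cases:

```python
import random
from collections import defaultdict

def enrich_difficult(candidates, n_final=525, hard_boost=3.5, seed=0):
    """Hypothetical stratified sampler that over-represents hard cases.

    candidates: list of dicts with a physician-assigned 'likert' field
                (1-7, where 1-2 marks the cases models handled worst).
    hard_boost: relative sampling weight for the hard stratum (~3.5x).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for ex in candidates:
        strata["hard" if ex["likert"] <= 2 else "other"].append(ex)

    # Allocate slots in proportion to stratum size times its weight,
    # so hard cases end up roughly 3.5x over-represented.
    weights = {"hard": hard_boost, "other": 1.0}
    mass = {k: len(v) * weights[k] for k, v in strata.items()}
    total = sum(mass.values())

    picked = []
    for k, pool in strata.items():
        n_k = min(len(pool), round(n_final * mass[k] / total))
        picked.extend(rng.sample(pool, n_k))
    return picked
```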
Benchmark Structure
Three Core Clinical Use Cases
| Use Case | Description | Share of Benchmark |
| --- | --- | --- |
| Care Consult | Reasoning through differentials, management, and treatment decisions | 48.8% (n = 256) |
| Writing & Documentation | Note generation, summarization, coding, patient messaging | 27.0% (n = 142) |
| Medical Research | Finding and synthesizing evidence relevant to clinical questions | 14.9% (n = 78) |
28 medical specialties represented — top 10 specialties account for 57.7% of examples.
Data Collection
Good Faith & Red Teaming
🩺
Good Faith
Physicians using ChatGPT for Clinicians in routine clinical, academic, and research work. Provides broad coverage of realistic professional workflows.
⚠️
Red Teaming
Deliberate adversarial stress-testing to surface errors, edge cases, and failure modes. Approximately one-third of the final benchmark.
1
Physician Authorship
Creates the conversation, writes the rubric, and assigns a Likert 1–7 rating
2
Physician Review
Independent review for realism, difficulty, and rubric quality
3
Final Adjudication
Resolves ambiguities; confirms clinical accuracy and relevance
Baseline
The Human Physician Reference Standard
🔬
Specialty-Matched
Each task assigned to a physician with relevant specialty or subspecialty expertise.
⏱️
Unbounded Time
No time constraint; physicians could use internet and standard medical databases.
🚫
No AI Assistance
Responses written without AI tools to represent a true human upper bound.
Purpose: Establishes what an ideal, well-resourced physician response looks like — the gold standard against which AI systems are measured.
Methodology
Rubric-Based Scoring
| Component | Detail |
| --- | --- |
| Rubric Criteria | Physician-written; each criterion independently evaluated |
| Point Range | −10 to +10 per criterion; negative points for unsafe or harmful behavior |
| Length Adjustment | Penalty of 1.47 points per 500 characters above 2,000 to prevent verbosity gaming |
| Grader Model | GPT-5.4 at low reasoning effort (more capable than the prior HealthBench grader) |
| Samples per Example | 8 samples averaged to reduce variability in main comparisons |
Score interpretation: Scores are reported ×100 for readability. Because difficult cases are deliberately enriched, a score of 45 on HealthBench Professional can coexist with strong real-world performance in typical usage.
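The grading harness itself is not reproduced here, but the scoring rule described in the table above (±10-point criteria, a 1.47-point length penalty per 500 characters beyond 2,000, and averaging over 8 samples) can be written down directly. A minimal sketch; the normalization to the ×100 reporting scale follows the original HealthBench convention (earned points over maximum positive points) and is an assumption, as is applying the length penalty on that reported scale:

```python
def score_response(earned_points, max_positive_points, response_chars,
                   char_budget=2_000, penalty_per_500=1.47):
    """Length-adjusted score for one sampled response (hypothetical layout).

    earned_points:       points earned per rubric criterion, each in [-10, +10];
                         negatives flag unsafe or harmful behavior.
    max_positive_points: total positive points available on the rubric
                         (assumed normalizer for the x100 reporting scale).
    """
    normalized = sum(earned_points) / max_positive_points * 100
    excess_chars = max(0, response_chars - char_budget)
    return normalized - penalty_per_500 * (excess_chars / 500)

def example_score(sample_scores):
    """Average over the 8 responses sampled per example in the main comparisons."""
    return sum(sample_scores) / len(sample_scores)
```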
Results
Overall Benchmark Performance
GPT-5.4 in ChatGPT for Clinicians
Key finding: GPT-5.4 in ChatGPT for Clinicians significantly outperforms the human physician baseline (p = 3.7 × 10⁻¹⁰).
Results by Domain
Performance by Use Case
| Use Case | ChatGPT for Clinicians | Human Physicians | Base GPT-5.4 | Significance |
| --- | --- | --- | --- | --- |
| Care Consult | 51.0 | 42.7 | ~51.0 | p = 0.025 |
| Writing & Docs | 64.1 | 32.1 | 34.6 | p < 10⁻⁸ |
| Medical Research | 67.0 | 56.3 | 58.1 | p = 0.0015 |
Largest gain: Writing & Documentation — ChatGPT for Clinicians scored nearly 2× the physician baseline (64.1 vs. 32.1).
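The review above reports p-values without naming the exact test. One standard way to compare two systems graded on the same task set is a paired bootstrap over per-task score differences; the sketch below illustrates that approach and is not necessarily the authors' procedure:

```python
import random

def paired_bootstrap_p(scores_a, scores_b, n_boot=10_000, seed=0):
    """Approximate two-sided p-value for a difference in mean per-task scores.

    scores_a, scores_b: per-task scores for two systems on the same tasks.
    Resamples task-level differences with replacement and counts how often
    the resampled mean difference flips sign relative to the observed one.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)

    flips = 0
    for _ in range(n_boot):
        resample = [rng.choice(diffs) for _ in range(len(diffs))]
        if (sum(resample) / len(resample)) * observed <= 0:
            flips += 1
    return min(1.0, 2 * flips / n_boot)  # two-sided estimate, capped at 1
```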
Adversarial Robustness
Performance on Hardest Cases
(Chart: scores on red-team cases rated Likert 1–2, the hardest examples, for ChatGPT for Clinicians, human physicians, and base GPT-5.4.)
Good Faith Typical
ChatGPT for Clinicians: 69.0 vs. Physicians: 55.7 (p = 1.0 × 10⁻⁷)
Good Faith Difficult
Statistically tied with physicians (p = 0.60) — a genuine challenge for all systems.
Test-Time Scaling
Reasoning Effort Improves Scores
GPT-5.4: Low → X-High Reasoning
+5.6 to +7.3
Points gained by increasing reasoning effort, depending on verbosity setting
Key insight: Gains are driven by higher-quality responses — not simply longer ones. Length adjustment confirms this.
Average Gain: Low → High Reasoning
+3.3
Average score improvement across GPT-5 through GPT-5.4 models
Verbosity effect: Low → medium verbosity adds ~1,300 characters but only 3.9 unadjusted points, which the length penalty largely neutralizes.
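The near-cancellation follows directly from the stated penalty rate, assuming the ~1,300 extra characters fall above the 2,000-character budget: 1,300 × (1.47 / 500) ≈ 3.8 points of penalty against 3.9 unadjusted points gained. A quick check:

```python
# Rough check that the length penalty offsets the verbosity gain.
extra_chars = 1_300        # ~characters added going from low to medium verbosity
penalty_per_500 = 1.47     # stated length-adjustment rate
unadjusted_gain = 3.9      # unadjusted points gained from the extra verbosity

penalty = penalty_per_500 * (extra_chars / 500)   # ~3.8 points
adjusted_gain = unadjusted_gain - penalty         # ~0.1 points
print(f"penalty ~ {penalty:.2f} pts, adjusted gain ~ {adjusted_gain:.2f} pts")
```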
Critical Appraisal
Limitations & Interpretation
📊
Scores ≠ Real-World Performance
Adversarial enrichment means benchmark scores are intentionally lower than typical clinical performance.
🏥
EHR Workflows Not Covered
Institution-specific constraints and EHR-integrated workflows are absent from the current benchmark.
🔬
OpenAI Model Selection Bias
Difficulty ratings assigned against OpenAI frontier models; may partially confound cross-model comparisons.
Recommendation: Institutions should conduct pre-deployment evaluations for their specific use cases and monitor post-deployment performance.
Conclusion
Raising the Ceiling of Care
Today
On HealthBench Professional, GPT-5.4 in ChatGPT for Clinicians outperformed the physician baseline, with the largest gains in documentation and research tasks.
Benchmark Value
HealthBench Professional provides a rigorous, unsaturated measure to track frontier model progress in clinical AI.
Guardrail
Benchmark wins are not deployment clearance. Local evaluation, workflow testing, and physician accountability still determine whether a model belongs in clinical practice.
Sources: OpenAI, “Making ChatGPT better for clinicians,” April 22, 2026, plus the HealthBench Professional paper and dataset release.