Provider Education · April 22, 2026
HealthBench Professional
Evaluating Large Language Models on Real Clinician Chats
OpenMFM review of Soskin Hicks, Trofimov, Singhal et al. (OpenAI)
Context
AI Use Among Clinicians Has Doubled
Millions
Clinicians using ChatGPT weekly
2×
Growth in clinician usage over the past year
Limited
Benchmarks reflecting real multi-turn clinical workflows
Gap: Most existing benchmarks rely on single-turn, multiple-choice, or synthetic vignettes — not real clinician-model conversations.
The Benchmark
Introducing HealthBench Professional
📋
Meaningful
Tasks reflect real clinician workflows at the point of care, directly impacting care delivery.
✅
Trustworthy
Every example adjudicated by three or more physicians across three review phases.
🎯
Challenging
Difficult cases enriched ~3.5× via stratified sampling (see the sketch below); preserves headroom for future models.
525
Final benchmark tasks
15,079
Candidate examples reviewed
190
Physician contributors across 50 countries
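The release does not include its sampling code, but the ~3.5× enrichment of difficult cases can be pictured as stratified sampling that over-weights the hardest-rated candidates when drawing the final 525 tasks from the 15,079 reviewed. A minimal sketch, assuming a hypothetical `candidates` list with a per-example Likert field in which low values mark the hardest cases:

```python
import random
from collections import defaultdict

def enrich_difficult(candidates, n_final=525, hard_boost=3.5, seed=0):
    """Hypothetical stratified sampler that over-represents hard cases.

    candidates: list of dicts with a physician-assigned 'likert' field
                (1-7, where 1-2 marks the cases models handled worst).
    hard_boost: relative sampling weight for the hard stratum (~3.5x).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for ex in candidates:
        strata["hard" if ex["likert"] <= 2 else "other"].append(ex)

    # Allocate slots in proportion to stratum size times its weight,
    # so hard cases end up roughly 3.5x over-represented.
    weights = {"hard": hard_boost, "other": 1.0}
    mass = {k: len(v) * weights[k] for k, v in strata.items()}
    total = sum(mass.values())

    picked = []
    for k, pool in strata.items():
        n_k = min(len(pool), round(n_final * mass[k] / total))
        picked.extend(rng.sample(pool, n_k))
    return picked
```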
Benchmark Structure
Three Core Clinical Use Cases
| Use Case | Description | Share of Benchmark |
| --- | --- | --- |
| Care Consult | Reasoning through differentials, management, and treatment decisions | 48.8% (n = 256) |
| Writing & Documentation | Note generation, summarization, coding, patient messaging | 27.0% (n = 142) |
| Medical Research | Finding and synthesizing evidence relevant to clinical questions | 14.9% (n = 78) |
28 medical specialties represented — top 10 specialties account for 57.7% of examples.
Data Collection
Good Faith & Red Teaming
🩺
Good Faith
Physicians using ChatGPT for Clinicians in routine clinical, academic, and research work. Provides broad coverage of realistic professional workflows.
⚠️
Red Teaming
Deliberate adversarial stress-testing to surface errors, edge cases, and failure modes. Approximately one-third of the final benchmark.
1
Physician Authorship
Creates the conversation, writes the rubric, and assigns a Likert 1–7 rating
2
Physician Review
Independent review for realism, difficulty, and rubric quality
3
Final Adjudication
Resolves ambiguities; confirms clinical accuracy and relevance
Baseline
The Human Physician Reference Standard
🔬
Specialty-Matched
Each task assigned to a physician with relevant specialty or subspecialty expertise.
⏱️
Unbounded Time
No time constraint; physicians could use internet and standard medical databases.
🚫
No AI Assistance
Responses written without AI tools to represent a true human upper bound.
Purpose: Establishes what an ideal, well-resourced physician response looks like — the gold standard against which AI systems are measured.
Methodology
Rubric-Based Scoring
| Component | Detail |
| --- | --- |
| Rubric Criteria | Physician-written; each criterion independently evaluated |
| Point Range | −10 to +10 per criterion; negative points for unsafe or harmful behavior |
| Length Adjustment | Penalty of 1.47 points per 500 characters above 2,000 to prevent verbosity gaming |
| Grader Model | GPT-5.4 at low reasoning effort (more capable than the prior HealthBench grader) |
| Samples per Example | 8 samples averaged to reduce variability in main comparisons |
Score interpretation: Scores are reported ×100 for readability. Because difficult cases are deliberately enriched, a score of 45 on HealthBench Professional can coexist with strong real-world performance in typical usage.
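The grading harness itself is not reproduced here, but the scoring rule described in the table above (±10-point criteria, a 1.47-point length penalty per 500 characters beyond 2,000, and averaging over 8 samples) can be written down directly. A minimal sketch; the normalization to the ×100 reporting scale follows the original HealthBench convention (earned points over maximum positive points) and is an assumption, as is applying the length penalty on that reported scale:

```python
def score_response(earned_points, max_positive_points, response_chars,
                   char_budget=2_000, penalty_per_500=1.47):
    """Length-adjusted score for one sampled response (hypothetical layout).

    earned_points:       points earned per rubric criterion, each in [-10, +10];
                         negatives flag unsafe or harmful behavior.
    max_positive_points: total positive points available on the rubric
                         (assumed normalizer for the x100 reporting scale).
    """
    normalized = sum(earned_points) / max_positive_points * 100
    excess_chars = max(0, response_chars - char_budget)
    return normalized - penalty_per_500 * (excess_chars / 500)

def example_score(sample_scores):
    """Average over the 8 responses sampled per example in the main comparisons."""
    return sum(sample_scores) / len(sample_scores)
```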
Results
Overall Benchmark Performance
GPT-5.4 in ChatGPT for Clinicians
Key finding: GPT-5.4 in ChatGPT for Clinicians significantly outperforms the human physician baseline (p = 3.7 × 10⁻¹⁰).
Results by Domain
Performance by Use Case
| Use Case | ChatGPT for Clinicians | Human Physicians | Base GPT-5.4 | Significance |
| --- | --- | --- | --- | --- |
| Care Consult | 51.0 | 42.7 | ~51.0 | p = 0.025 |
| Writing & Docs | 64.1 | 32.1 | 34.6 | p < 10⁻⁸ |
| Medical Research | 67.0 | 56.3 | 58.1 | p = 0.0015 |
Largest gain: Writing & Documentation — ChatGPT for Clinicians scored nearly 2× the physician baseline (64.1 vs. 32.1).
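The review above reports p-values without naming the exact test. One standard way to compare two systems graded on the same task set is a paired bootstrap over per-task score differences; the sketch below illustrates that approach and is not necessarily the authors' procedure:

```python
import random

def paired_bootstrap_p(scores_a, scores_b, n_boot=10_000, seed=0):
    """Approximate two-sided p-value for a difference in mean per-task scores.

    scores_a, scores_b: per-task scores for two systems on the same tasks.
    Resamples task-level differences with replacement and counts how often
    the resampled mean difference flips sign relative to the observed one.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)

    flips = 0
    for _ in range(n_boot):
        resample = [rng.choice(diffs) for _ in range(len(diffs))]
        if (sum(resample) / len(resample)) * observed <= 0:
            flips += 1
    return min(1.0, 2 * flips / n_boot)  # two-sided estimate, capped at 1
```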
Adversarial Robustness
Performance on Hardest Cases
(Chart: scores on red-team cases rated Likert 1–2, the hardest examples, for ChatGPT for Clinicians, human physicians, and base GPT-5.4.)
Good Faith Typical
ChatGPT for Clinicians: 69.0 vs. Physicians: 55.7 (p = 1.0 × 10⁻⁷)
Good Faith Difficult
Statistically tied with physicians (p = 0.60) — a genuine challenge for all systems.
Test-Time Scaling
Reasoning Effort Improves Scores
GPT-5.4: Low → X-High Reasoning
+5.6 to +7.3
Points gained by increasing reasoning effort, depending on verbosity setting
Key insight: Gains are driven by higher-quality responses — not simply longer ones. Length adjustment confirms this.
Average Gain: Low → High Reasoning
+3.3
Average score improvement across GPT-5 through GPT-5.4 models
Verbosity effect: Low → medium verbosity adds ~1,300 characters but only 3.9 unadjusted points, which the length penalty largely neutralizes.
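The near-cancellation follows directly from the stated penalty rate, assuming the ~1,300 extra characters fall above the 2,000-character budget: 1,300 × (1.47 / 500) ≈ 3.8 points of penalty against 3.9 unadjusted points gained. A quick check:

```python
# Rough check that the length penalty offsets the verbosity gain.
extra_chars = 1_300        # ~characters added going from low to medium verbosity
penalty_per_500 = 1.47     # stated length-adjustment rate
unadjusted_gain = 3.9      # unadjusted points gained from the extra verbosity

penalty = penalty_per_500 * (extra_chars / 500)   # ~3.8 points
adjusted_gain = unadjusted_gain - penalty         # ~0.1 points
print(f"penalty ~ {penalty:.2f} pts, adjusted gain ~ {adjusted_gain:.2f} pts")
```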
Critical Appraisal
Limitations & Interpretation
📊
Scores ≠ Real-World Performance
Adversarial enrichment means benchmark scores are intentionally lower than typical clinical performance.
🏥
EHR Workflows Not Covered
Institution-specific constraints and EHR-integrated workflows are absent from the current benchmark.
🔬
OpenAI Model Selection Bias
Difficulty ratings assigned against OpenAI frontier models; may partially confound cross-model comparisons.
Recommendation: Institutions should conduct pre-deployment evaluations for their specific use cases and monitor post-deployment performance.
Conclusion
Raising the Ceiling of Care
Today
On HealthBench Professional, GPT-5.4 in ChatGPT for Clinicians outperformed the physician baseline, with the largest gains in documentation and research tasks.
Benchmark Value
HealthBench Professional provides a rigorous, unsaturated measure to track frontier model progress in clinical AI.
Guardrail
Benchmark wins are not deployment clearance. Local evaluation, workflow testing, and physician accountability still determine whether a model belongs in clinical practice.
Sources: OpenAI, “Making ChatGPT better for clinicians,” April 22, 2026, plus the HealthBench Professional paper and dataset release.