Confident and wrong: what a hand fracture study reveals about AI’s most dangerous failure mode

Evidence-reviewed. Citations throughout.

The most dangerous thing about a wrong answer is not that it is wrong. It is that it sounds confident. In orthopaedic practice, this problem has a name everyone recognises: the missed scaphoid. A normal-looking X-ray. An unremarkable report. A patient who returns six weeks later with avascular necrosis and a question nobody wants to answer. The reason scaphoid fractures generate claims and end careers is not that they are easy to miss — it is that they look easy to exclude.

A recent study testing multimodal large language models on hand fracture detection has surfaced exactly this failure mode at the AI level, and the findings are worth examining carefully before these tools get anywhere near your on-call workflow.


The study

Güler et al. (2026, Diagnostics; PMID 41681742) evaluated four multimodal LLMs (GPT-5 Pro, Gemini 2.5 Pro, Claude Sonnet 4.5, and Mistral Medium 3.1) on their ability to detect hand and wrist fractures from plain radiographs. The task is clinically relevant: hand and wrist imaging makes up a significant proportion of emergency and on-call orthopaedic workload, and multimodal models that process images alongside text are now good enough to be discussed seriously as clinical decision support tools.

Sixty-five adult patients with confirmed hand fractures were included: 30 phalangeal, 30 metacarpal, and 5 scaphoid. Each image was independently analysed five times per model using identical zero-shot prompts, giving 65 × 5 × 4 = 1,300 total inferences. Performance varied significantly across models and, critically, the relationship between accuracy and confidence differed in ways that matter clinically.


The results — and why they require careful reading

GPT-5 Pro performed best: 64.3% accuracy with strong intra-model consistency (Fleiss’ κ = 0.71). Gemini 2.5 Pro followed: 56.9% accuracy, κ = 0.57. These figures are modest, but the failure pattern is at least partially coherent — what the model gets wrong, it gets wrong consistently enough that the errors might be characterised and anticipated.

Mistral Medium 3.1 showed a different and more alarming pattern: accuracy of only 38.5%, but extremely high intra-model agreement (κ = 0.88). The model consistently gave the same answers, and those answers were consistently wrong. The paper describes this explicitly as “confident hallucination”: systematic bias rather than random error. A tool that confidently reproduces the same incorrect answer every time it is asked is not just unhelpful; it creates a false impression of reliability.

Claude Sonnet 4.5 showed a third pattern: low accuracy (33.8%) combined with low consistency (κ = 0.33), reflecting instability — different wrong answers on repeated runs of the same image. Neither accurate nor reliable.
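To make the difference between accuracy and consistency concrete, here is a minimal Python sketch of the two measures. It is an illustration under stated assumptions, not the study's analysis code: the four category labels, the helper names `fleiss_kappa` and `accuracy`, and the tiny four-image dataset are all invented for this example. It shows how a model can agree with itself perfectly while being wrong on most images, and how an unstable model scores poorly on both measures.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for repeated runs of one model.

    ratings: one list per image, each holding one label per run.
    Every image must have the same number of runs (at least two).
    """
    n_items = len(ratings)
    n_runs = len(ratings[0])
    counts = [Counter(runs) for runs in ratings]

    # Observed agreement: for each image, the share of run pairs that agree.
    p_obs = sum(
        (sum(c * c for c in cnt.values()) - n_runs) / (n_runs * (n_runs - 1))
        for cnt in counts
    ) / n_items

    # Chance agreement, from how often each label is used overall.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_exp = sum((t / (n_items * n_runs)) ** 2 for t in totals.values())

    if p_exp == 1.0:
        raise ValueError("kappa is undefined when only one label is ever used")
    return (p_obs - p_exp) / (1 - p_exp)


def accuracy(ratings, truth):
    """Share of all individual runs whose label matches the ground truth."""
    correct = sum(
        label == t for runs, t in zip(ratings, truth) for label in runs
    )
    return correct / (len(ratings) * len(ratings[0]))


# Hypothetical ground truth for four images (labels are assumptions).
truth = ["phalangeal", "metacarpal", "scaphoid", "metacarpal"]

# "Stubborn" model: identical answer on all five runs, wrong on 3 of 4 images.
stubborn = [
    ["phalangeal"] * 5,
    ["phalangeal"] * 5,
    ["none"] * 5,
    ["none"] * 5,
]

# "Unstable" model: the answer changes from run to run.
unstable = [
    ["phalangeal", "metacarpal", "none", "phalangeal", "scaphoid"],
    ["metacarpal", "none", "phalangeal", "metacarpal", "none"],
    ["none", "scaphoid", "metacarpal", "none", "phalangeal"],
    ["metacarpal", "phalangeal", "none", "scaphoid", "metacarpal"],
]

for name, runs in [("stubborn", stubborn), ("unstable", unstable)]:
    print(f"{name}: accuracy={accuracy(runs, truth):.2f}, "
          f"kappa={fleiss_kappa(runs):.2f}")
```

On this toy data the stubborn model reaches κ = 1.0 at 25% accuracy, while the unstable model reaches 35% accuracy with κ below zero. The point of the sketch is that κ measures repeatability only; it carries no information about whether the repeated answer is correct.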


Scaphoid performance specifically

Scaphoid fractures were challenging across all four models. This is unsurprising technically: scaphoid waist fractures on plain film can appear as a subtle lucency, and a proportion are genuinely radiographically occult at presentation, requiring MRI for definitive diagnosis. The challenge for a model integrating an image with a zero-shot text prompt is precisely the challenge for a human: the finding is subtle and context-dependent, and it requires the kind of probabilistic clinical reasoning (“mechanism of injury plus anatomical snuffbox tenderness plus borderline film equals treat as fracture”) that current LLMs are not reliably equipped to replicate.

The difference is that an experienced clinician knows when to hedge. The models, especially Mistral, frequently did not.


Why the confidence calibration problem matters

A model that says “I cannot confidently exclude a scaphoid fracture — clinical correlation and MRI recommended” is a useful safety net. A model that says “no acute bony injury identified, carpal alignment maintained” — with κ = 0.88 consistency — provides false reassurance in language that mimics a reliable radiological report. The format and certainty of the output are indistinguishable from a correct one, which is precisely what makes it dangerous in a clinical context.

This failure mode maps directly onto the orthopaedic scenarios that cause the most harm when missed. Scaphoid fractures in young adults. Occult neck of femur fractures in elderly patients with a painful hip and a normal film. Stress fractures in athletes. These are not rare edge cases — they are the diagnostic problems where the consequence of a missed diagnosis is most severe, and where a confident-sounding wrong answer does the most damage.


The practical position

With overall accuracy under 65% even from the best-performing model, and with confidence calibration failing in distinct and clinically dangerous ways across the others, multimodal LLMs are not ready for fracture detection in high-stakes clinical settings. The study is clear: these should be regarded as experimental diagnostic reasoning tools, not reliable standalone systems.

If a multimodal AI tool reports no fracture in a patient with anatomical snuffbox tenderness and a consistent mechanism, the clinical answer remains what it has always been: treat as a scaphoid fracture until proven otherwise by MRI. The AI output adds nothing to that decision and, if weighted inappropriately, risks subtracting from it.

Confident and wrong is the failure mode to watch for. This study shows the current generation of models has not solved it — and that some have made it worse.


References

  1. Güler I, et al. Diagnostic Accuracy and Stability of Multimodal Large Language Models for Hand Fracture Detection: A Multi-Run Evaluation on Plain Radiographs. Diagnostics (Basel). 2026;16(3):424. PMID 41681742. https://doi.org/10.3390/diagnostics16030424
