The fracture that got lost in translation: LLMs and the limits of classification from text

Most trauma data is locked inside free-text radiology reports. Fractures get described in plain language — comminuted, displaced, intra-articular — but rarely formally classified. Outside academic centres, systematic OTA/AO coding at the point of reporting is the exception rather than the rule. Most registry work, audit, and outcome research ends up either manually re-coded or missing the classification altogether.

An LLM that could read those reports and apply the classification retrospectively would be a significant capability. A study published in OTA International in May 2026 tested whether current models can do it.


The study

The research team asked an LLM to classify 109 fracture descriptions taken from real, deidentified radiology reports. Its output was compared with ground truth set by expert traumatologists from the actual radiographs, not from the reports alone. Three prompting strategies were evaluated: zero-shot (no specific guidance), zero-shot chain-of-thought (prompted to reason step by step), and retrieval-augmented generation (given access to the 2018 OTA/AO Classification Compendium as a reference).
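As a rough illustration of how the three strategies differ, here is a minimal Python sketch. The prompt wording, the function names, and the caller-supplied retrieve function are all assumptions made for illustration; the study's actual prompts and retrieval setup are not reproduced here.

```python
from typing import Callable, List

# Hypothetical task instruction; the study's real prompt wording is not known here.
TASK = "Classify the fracture in this radiology report using the 2018 OTA/AO classification."


def zero_shot(report: str) -> str:
    """Task plus report, with no further guidance."""
    return f"{TASK}\n\nReport:\n{report}\n\nOTA/AO code:"


def zero_shot_cot(report: str) -> str:
    """Same task and report, plus a request to reason step by step first."""
    return (
        f"{TASK}\n\nReport:\n{report}\n\n"
        "Think step by step, then give the final OTA/AO code on the last line."
    )


def rag(report: str, retrieve: Callable[[str], List[str]]) -> str:
    """Prepend reference passages returned by a caller-supplied retriever."""
    passages = retrieve(report)  # e.g. excerpts from a classification reference
    context = "\n---\n".join(passages)
    return f"Reference excerpts:\n{context}\n\n{zero_shot(report)}"
```

The only difference between the three is what surrounds the report text: nothing, a reasoning instruction, or retrieved reference material. The report itself is the same in every case, which is why imprecise source text affects all three.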


Where it works and where it doesn’t

The results split clearly by granularity. At the type level — the broadest category, identifying which bone segment is fractured — all three strategies achieved what the study describes as “almost perfect agreement” with expert classification. At the group level (the next tier down), performance remained strong. At the subgroup level — the detailed morphological descriptor that shapes surgical planning — all three strategies fell to “slight agreement.”
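The quoted phrases match the Landis and Koch labels commonly attached to Cohen’s kappa, where 0.81 to 1.00 reads as “almost perfect” and 0.00 to 0.20 as “slight”. Assuming kappa is the statistic behind them (an inference from the wording, not something confirmed here), a minimal sketch of the calculation looks like this. The function names are made up for the example.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Proportion of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # formally undefined, and this sketch reports 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


def landis_koch(kappa: float) -> str:
    """Map a kappa value to its Landis and Koch descriptive band."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"
```

Kappa corrects raw agreement for chance, so a model can match the expert on a fair share of cases and still land in the “slight” band if the matches are no better than label frequencies alone would produce.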

The LLM could reliably identify a proximal femur fracture. It could not reliably tell you whether it was a 31A1.1, a 31A1.2, or a 31A1.3. That distinction matters for surgical decision-making. It is the level at which the classification carries its clinical content.
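To make the tiers concrete: in the 2018 compendium a full code such as 31A1.2 reads as bone 3 (femur), segment 1 (proximal), type A, group 1, subgroup .2. A minimal sketch of cutting a code back to a chosen level, so that model and expert labels can be compared tier by tier, might look like this. The function name is made up, and the pattern is a simplifying assumption: it covers only the plain two-digit location form with types A to C, and ignores qualifiers, universal modifiers, and bones whose codes carry a letter.

```python
import re

# Plain pattern only (an assumption of this sketch): bone digit, segment
# digit, type letter, then optionally a group digit that may itself be
# followed by ".subgroup".
_CODE = re.compile(r"^(\d)(\d)([A-C])(?:(\d)(?:\.(\d))?)?$")

LEVELS = ("type", "group", "subgroup")


def truncate(code: str, level: str) -> str:
    """Return the code cut back to the requested level."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    # Accept the older hyphenated style (31-A1) as well as 31A1.
    m = _CODE.match(code.replace("-", "").strip().upper())
    if m is None:
        raise ValueError(f"unrecognised code: {code}")
    bone, segment, type_, group, subgroup = m.groups()
    out = f"{bone}{segment}{type_}"
    if level == "type":
        return out
    if group is None:
        raise ValueError(f"{code} has no group")
    out += group
    if level == "group":
        return out
    if subgroup is None:
        raise ValueError(f"{code} has no subgroup")
    return f"{out}.{subgroup}"
```

For example, truncate("31A1.2", "group") gives "31A1". Two labels that disagree at subgroup level can still agree once both are cut back to group or type, which is how agreement can be high at one tier and low at the next.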


Three failure modes

The study documented three specific categories of error: imprecise radiology descriptions that didn’t contain enough detail for classification; hallucinated information the LLM introduced that wasn’t in the report or the radiograph; and incorrect application of factually correct classification rules.

That third category is worth dwelling on. The model knew the rules. It applied them wrongly. This is a reasoning deficit, not a knowledge deficit. It is harder to detect than outright factual error, because the output looks structured and internally consistent even when the conclusion is wrong.

When the researchers replaced the actual radiology reports with structured, precisely written ideal fracture descriptions, performance improved at all levels. So the subgroup failure is not all the model’s doing. Some of it sits in the LLM’s reasoning. A significant portion sits in the imprecision of real-world radiology text, the source material the model has to work with.


What this means in practice

There are two implications pulling in different directions.

At the type and group level, current LLM performance may be adequate for bulk retrospective coding of large radiology report archives — generating searchable, structured datasets for research without requiring expert reclassification of every report. For that purpose, the performance demonstrated here is genuinely useful.

For individual clinical decisions, the picture is different. Subgroup classification from free text is unreliable. Hallucinations occur and are documented. A coarse OTA/AO code, one that says only that this is a femoral neck fracture, does not help a surgeon choosing between a cannulated screw and a total hip replacement. The clinical decision lives at the subgroup level, which is exactly the level the model cannot reliably provide.

LLMs for retrospective data labelling at scale: the evidence now supports cautious exploration. LLMs for individual case classification feeding clinical decisions: not yet, and this study gives specific evidence-based reasons why.

Know what level of classification you need before reaching for the tool. The gap between type-level and subgroup-level accuracy is where the clinical value either exists or doesn’t.


References

  1. Hu S, et al. Deriving the OTA/AO fracture classification from routinely collected radiology reports using a large language model. OTA Int. 2026;9(2):e482. PMID 42078173. https://doi.org/10.1097/OI9.0000000000000482
