What do we actually mean when we say AI in orthopaedics?

A paper lands in your inbox. The headline: a deep learning algorithm achieves 94% accuracy in fracture detection, outperforming both radiologists and orthopaedic surgeons. A colleague forwards it with a one-line message: “thoughts?”

Whether your reaction is excitement, scepticism, or indifference, a useful reply depends on understanding what kind of system this actually is, because “AI in orthopaedics” currently covers everything from a rule-based scoring tool to a neural network trained on half a million radiographs. The term has become near-useless as a descriptor, and evaluating the claims attached to it requires more than reading the abstract.

Narrow AI is what clinical orthopaedics actually has

Every AI system currently deployed in orthopaedic care is narrow AI: trained to perform one specific task, on a specific type of data, in a specific context. It does not generalise. A fracture detection model trained on anteroposterior pelvis radiographs from a single tertiary centre has learned patterns from that dataset. Put it in front of a different image type, a different patient population, or a different clinical question, and its performance degrades — sometimes significantly.

General artificial intelligence — systems capable of flexible reasoning across domains — does not exist in clinical practice. What we have is a collection of narrow tools, each performing its assigned task with varying degrees of competence.

A model described as AI may be a convolutional neural network performing image classification, a random forest predicting postoperative complications from structured registry data, or a large language model summarising discharge letters. These are architecturally different, trained differently, and fail differently. The word AI does not tell you which you are dealing with.

How machine learning works — enough to be useful

Machine learning models learn patterns from data rather than following rules written by a programmer. A deep learning model for fracture detection is not programmed with rules about what a fracture looks like. It is shown thousands of labelled radiographs — fracture or no fracture — and adjusts its internal parameters until it can reliably distinguish between the two.
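To make "adjusts its internal parameters" concrete, here is a minimal sketch in Python. It is a toy under loud assumptions, not a fracture detector: each "radiograph" is reduced to one invented number, the data are simulated, and the model is a one-feature logistic regression rather than a deep network. The names `simulate_cases`, `train`, and `accuracy` are hypothetical. What it illustrates is the loop itself: predict, compare with the label, nudge the parameters, repeat.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate_cases(n, rng):
    """Invented data: one numeric feature per case, higher on average when label is 1."""
    cases = []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        feature = rng.gauss(2.0 if label else 0.0, 1.0)
        cases.append((feature, label))
    return cases

def train(cases, epochs=500, learning_rate=0.5):
    """Gradient descent on the logistic loss; w and b are the 'internal parameters'."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in cases:
            error = sigmoid(w * x + b) - y  # predicted probability minus true label
            grad_w += error * x
            grad_b += error
        # Move each parameter a small step in the direction that reduces the error.
        w -= learning_rate * grad_w / len(cases)
        b -= learning_rate * grad_b / len(cases)
    return w, b

def accuracy(w, b, cases):
    correct = sum(int(sigmoid(w * x + b) >= 0.5) == y for x, y in cases)
    return correct / len(cases)

rng = random.Random(42)
cases = simulate_cases(200, rng)
untrained_acc = accuracy(0.0, 0.0, cases)  # parameters before any learning
w, b = train(cases)
trained_acc = accuracy(w, b, cases)        # measured on the same cases used for training
print(f"before training: {untrained_acc:.2f}, after training: {trained_acc:.2f}")
```

Nobody wrote a rule saying "call it a fracture above this value". The threshold emerges from the labelled examples, which is why the quality of those examples matters so much.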

The consequence is that the model’s performance is determined by what it was trained on. A model trained on high-quality, well-annotated data from a large, diverse population will usually generalise better than one trained on a single institution’s data collected over eighteen months. It is also only as good as its labels: if the training data contains labelling errors, the model learns those errors too.

Kuo et al. (2021) observed that many early clinical AI papers failed to clearly distinguish between performance on training data and performance on genuinely unseen data, a distinction that matters enormously in practice. A model that appears to perform perfectly may simply have memorised its training set.
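The gap between training performance and performance on unseen data can be shown with a deliberately extreme sketch. Everything here is an assumption made for illustration: the cases are random numbers, the labels are coin flips (so there is nothing real to learn), and the hypothetical `Memoriser` class simply stores what it was shown.

```python
import random

def make_cases(n, rng):
    """Random 'images' (8 numbers each) with coin-flip labels: no true signal exists."""
    return [
        (tuple(rng.random() for _ in range(8)), rng.randint(0, 1))
        for _ in range(n)
    ]

class Memoriser:
    """Stores every training case verbatim; guesses the majority label for anything new."""

    def fit(self, cases):
        self.lookup = dict(cases)
        labels = [label for _, label in cases]
        self.default = int(2 * sum(labels) >= len(labels))
        return self

    def predict(self, features):
        return self.lookup.get(features, self.default)

def accuracy(model, cases):
    return sum(model.predict(x) == y for x, y in cases) / len(cases)

rng = random.Random(0)
train_cases = make_cases(400, rng)
unseen_cases = make_cases(400, rng)

model = Memoriser().fit(train_cases)
train_acc = accuracy(model, train_cases)    # evaluated on cases the model has stored
unseen_acc = accuracy(model, unseen_cases)  # evaluated on cases it has never seen
print(f"training accuracy: {train_acc:.2f}, unseen accuracy: {unseen_acc:.2f}")
```

The training figure is perfect by construction, while the unseen figure hovers around chance. Real models are far less crude than a lookup table, but the lesson carries over: only the number measured on unseen data says anything about clinical use.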

The same label, very different conclusions

A 2023 systematic review and meta-analysis in JAMA Network Open examined 39 studies of AI applied to hip fractures, covering both radiographic diagnosis and postoperative outcome prediction. The diagnostic findings were reasonable: AI models achieved a mean sensitivity of 89.3% and a mean specificity of 87.5%, comparable to expert clinicians. The outcome prediction findings were more instructive. Machine learning (ML) models for mortality prediction achieved a mean area under the receiver operating characteristic curve (AUC) of 0.84, compared with 0.79 for traditional multivariable regression. The difference was not statistically significant (Lex et al., 2023).

For fracture detection: a narrow AI performing at a clinically useful level. For outcome prediction: barely outperforming a logistic regression that a statistician could build in an afternoon. Same label, very different conclusions.
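For readers who want to see what sits behind these metrics, the sketch below computes sensitivity, specificity, and AUC from first principles. The counts and risk scores are invented for illustration and are not taken from Lex et al. or any other study; the function names are likewise hypothetical.

```python
def sensitivity(tp, fn):
    """Proportion of true fractures that the model flags."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of fracture-free images that the model correctly clears."""
    return tn / (tn + fp)

def auc(positive_scores, negative_scores):
    """Probability that a randomly chosen positive case receives a higher score
    than a randomly chosen negative case (ties count as half a win)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical diagnostic counts: 100 fractures, 100 normal radiographs.
sens = sensitivity(tp=90, fn=10)
spec = specificity(tn=85, fp=15)

# Hypothetical predicted mortality risks for patients who died and who survived.
died = [0.9, 0.8, 0.4]
survived = [0.7, 0.3, 0.2]
model_auc = auc(died, survived)

print(f"sensitivity: {sens:.2f}, specificity: {spec:.2f}, AUC: {model_auc:.2f}")
```

An AUC of 0.5 corresponds to ranking patients no better than chance and 1.0 to perfect ranking, so a move from 0.79 to 0.84 is a modest improvement in how well patients are ordered by risk, not a step change.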

Shah et al. (2022) provide a surgeon-oriented guide to reading AI studies in orthopaedics, noting that external validation — testing a model on data from a different institution entirely — is frequently absent from the published literature. Internal validation, where a portion of the original dataset is held back for testing, is not the same thing.
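A small simulation shows why the two kinds of validation can disagree. This is a sketch under stated assumptions: each case is one invented measurement, "institution B" differs from "institution A" only by a constant offset (standing in for something like a different scanner or imaging protocol), and the model is a single learned threshold. The names `simulate_site`, `fit_threshold`, and `accuracy` are hypothetical.

```python
import random
from statistics import mean

def simulate_site(n, offset, rng):
    """Invented cases: one measurement each, shifted by a site-specific offset."""
    cases = []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        value = rng.gauss(2.0 if label else 0.0, 1.0) + offset
        cases.append((value, label))
    return cases

def fit_threshold(cases):
    """Learn a cut-off halfway between the mean of each class."""
    positives = [v for v, y in cases if y == 1]
    negatives = [v for v, y in cases if y == 0]
    return (mean(positives) + mean(negatives)) / 2

def accuracy(threshold, cases):
    return sum(int(v > threshold) == y for v, y in cases) / len(cases)

rng = random.Random(7)

site_a = simulate_site(2000, offset=0.0, rng=rng)
train_cases = site_a[:1500]       # used to fit the model
internal_test = site_a[1500:]     # held back from the same institution
external_test = simulate_site(500, offset=1.5, rng=rng)  # a different institution

threshold = fit_threshold(train_cases)
internal_acc = accuracy(threshold, internal_test)
external_acc = accuracy(threshold, external_test)
print(f"internal validation: {internal_acc:.2f}, external validation: {external_acc:.2f}")
```

The held-out cases from institution A share its quirks, so the internal figure looks reassuring. The same threshold applied to institution B performs noticeably worse, even though the underlying disease signal is identical in both simulated sites.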

Three questions

When you encounter an AI claim — in a journal, at a conference, in a vendor pitch — three questions cut through the noise:

What was this trained on, and how different is that population from your patients?
Was it externally validated, or only tested on a held-out portion of the training dataset?
What is the comparator? Better than chance is not the bar. Better than a clinician with the same information is.

The next time a headline claims AI outperforms surgeons, those three questions will tell you more than the abstract.

References

Kuo RYL, Harrison CJ, Jones BE, Geoghegan L, Furniss D. Perspectives: A surgeon’s guide to machine learning. Int J Surg. 2021;94:106133. https://doi.org/10.1016/j.ijsu.2021.106133
Lex JR, Di Michele J, Koucheki R, Pincus D, Whyne C, Ravi B. Artificial Intelligence for Hip Fracture Detection and Outcome Prediction: A Systematic Review and Meta-analysis. JAMA Netw Open. 2023;6(3):e233391. https://doi.org/10.1001/jamanetworkopen.2023.3391
Shah RM, Wong C, Arpey NC, Patel AA, Divi SN. A Surgeon’s Guide to Understanding Artificial Intelligence and Machine Learning Studies in Orthopaedic Surgery. Curr Rev Musculoskelet Med. 2022;15(2):121–132. https://doi.org/10.1007/s12178-022-09738-7