98% accurate. 120 patients. Why you should be more sceptical, not less.

Evidence-reviewed. Citations throughout.

A paper lands in your inbox. A machine learning model for distinguishing spinal tuberculosis from pyogenic vertebral infection and spinal metastasis on MRI. Accuracy: 98.3%. The abstract is confident. The supplementary figures are polished. Your procurement team is interested. Before you endorse it, do some arithmetic.

This paper (PMID 42082966) is an instructive case study in what near-perfect AI performance in a small dataset should actually tell you — and it is not what it looks like.


The clinical problem is real

The diagnostic question matters. Distinguishing between spinal tuberculosis, pyogenic infection, and neoplastic vertebral involvement on MRI is genuinely difficult. The imaging features overlap in ways that challenge experienced spinal surgeons and radiologists. Treatment pathways diverge fundamentally: anti-tubercular chemotherapy for TB, appropriate antibiotics and possibly surgical decompression for pyogenic infection, oncological management for metastasis. Getting it wrong has serious consequences, and the diagnostic challenge is particularly acute in high-prevalence TB settings and in patients presenting with non-specific back pain without a clear infective history.

A reliable imaging-based AI tool for this differentiation would have genuine clinical value. That is not the question. The question is whether this model actually provides it.


The numbers

Dataset: approximately 120 patients across three diagnostic categories. Method: a convolutional neural network applied to MRI sequences, with data augmentation used to expand the training set. Accuracy: 98.3%. AUC: reported above 0.97 across categories.

These are extraordinary numbers. In clinical medicine, extraordinary claims warrant scrutiny proportional to how extraordinary they are.


Why data augmentation with small datasets is a warning sign

Data augmentation is a standard and legitimate technique in deep learning. It involves generating modified versions of existing images — rotations, flips, brightness adjustments, zooms — to artificially expand the training dataset and improve the model’s ability to generalise. Used appropriately with datasets of adequate size, it works.

The problem arises when augmentation is used to address a fundamentally insufficient dataset rather than to improve an adequate one. With 120 patients split across three diagnostic categories, you have roughly 40 cases per class. Once those are divided into training, validation, and test sets, the model is learning from a small number of genuinely distinct cases. Augmented copies of those cases are not new data — they are the same 40 patients with their brightness adjusted and their images rotated.
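To make that arithmetic concrete, here is a rough sketch in Python. The 70/15/15 split is an assumption chosen purely for illustration, not the split reported in the paper, and the helper name `split_sizes` is hypothetical.

```python
def split_sizes(n_patients, n_classes, fractions=(0.70, 0.15, 0.15)):
    """Distinct patients per class in each split, assuming balanced classes.

    The default fractions are an illustrative assumption, not the paper's split.
    """
    per_class = n_patients // n_classes
    train = round(per_class * fractions[0])
    val = round(per_class * fractions[1])
    test = per_class - train - val  # whatever remains goes to the test split
    return {"per_class": per_class, "train": train, "val": val, "test": test}


sizes = split_sizes(120, 3)
test_total = sizes["test"] * 3

# Change in accuracy (percentage points) caused by one misclassified test patient
one_error_swing = 100 / test_total

print(sizes, test_total, round(one_error_swing, 1))
```

Under that assumed split, each class contributes about 28 distinct patients to training and 6 to testing. The test set holds 18 patients in total, so a single misclassification moves accuracy by more than five percentage points.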

A model that sees the same patients repeatedly during training, in various augmented forms, will learn features of those specific patients very well. When tested on a held-out set that also contains augmented versions of the same original cases, performance will look excellent. This is not generalisation. It is a sophisticated form of overfitting, and the accuracy figure it produces is not informative about how the model will perform on the next 40 patients from a different scanner at a different institution.
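One common safeguard against this kind of leakage is to split at the patient level first and augment only the training portion afterwards. The sketch below illustrates the idea under stated assumptions: the patient and image identifiers are invented, `split_by_patient` and `augment` are hypothetical helpers, and nothing here reflects how the paper's authors actually built their pipeline.

```python
import random


def split_by_patient(patient_ids, test_fraction=0.2, seed=0):
    """Split at the patient level so no patient appears on both sides."""
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = max(1, round(len(unique_ids) * test_fraction))
    test_ids = set(unique_ids[:n_test])
    train_ids = set(unique_ids[n_test:])
    return train_ids, test_ids


def augment(image_id, n_copies=5):
    """Stand-in for real augmentation: returns labels for derived copies."""
    return [f"{image_id}_aug{k}" for k in range(n_copies)]


# Hypothetical records: (patient_id, image_id), three images per patient
records = [(f"P{p:03d}", f"P{p:03d}_img{i}") for p in range(120) for i in range(3)]

train_ids, test_ids = split_by_patient([pid for pid, _ in records])

# Augment ONLY the training images, and only after the split
train_images = [aug for pid, img in records if pid in train_ids for aug in augment(img)]
# The test set stays as original, unaugmented images
test_images = [img for pid, img in records if pid in test_ids]

# Leakage check: no patient contributes to both sets
assert train_ids.isdisjoint(test_ids)
print(len(train_ids), len(test_ids), len(train_images), len(test_images))
```

The design point is the order of operations. If augmentation happens before the split, or the split is made per image rather than per patient, copies of the same patient can land on both sides and the test score stops measuring generalisation.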


The absence of external validation

The definitive test for any clinical AI model is external validation: does it perform on a new dataset, collected at a different centre, from patients who had no relationship to the training process?

This study did not report external validation. Without it, the 98.3% figure is not a claim about clinical performance; it describes performance on data drawn from the same source as the training set, and it cannot be distinguished from memorisation. At best, the model has demonstrated that it can classify patients closely resembling those it was trained on. That is a necessary but wholly insufficient condition for clinical deployment.

This is not unusual in published AI literature. A systematic review by Nagendran et al. in the BMJ (2020; PMID 32213531) found that of 81 non-randomised deep learning studies in medical imaging, only 9 were prospective and just 6 were tested in a real-world clinical setting. Risk of bias was high in 58 of 81 studies, and the majority made claims that AI performance was at least comparable to clinicians — yet few were designed with the methodological rigour needed to support that claim. External validation was the exception rather than the rule.


What 98% accuracy in 120 patients actually means

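Taken over the whole cohort, it means two misclassified patients out of 120. The difference between 98.3% and 96% accuracy in a cohort of 120 is about three patients, and if the figure was computed on a smaller held-out test subset, a single patient shifts it further still. These figures are not statistically distinguishable at this sample size: depending on the size of the test set, the confidence intervals are wide enough to be consistent with true performance anywhere between approximately 90% and 99%.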

The number 98.3% implies a precision that the dataset cannot support. It is a product of the reporting convention, not of the underlying evidence.
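
The width of those intervals is easy to check. The following is a minimal sketch using the Wilson score interval, one standard method for a binomial proportion. The two scenarios (118 of 120 correct, and 59 of 60 correct) are assumptions chosen because both round to 98.3%. The size of the paper's actual test set is not stated here.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives ~95% coverage)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Two illustrative scenarios that both round to 98.3% accuracy
for correct, total in [(118, 120), (59, 60)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total}: {correct / total:.1%} (95% CI {low:.1%} to {high:.1%})")
```

With 118 of 120 correct, the interval runs from about 94% to 99.5%. With 59 of 60 correct, it runs from about 91% to 99.7%. In both cases a true accuracy of 96% sits comfortably inside the interval.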


The principle

When a paper like this comes across your desk — or when a vendor cites it to support a procurement decision — the questions to ask are: how many patients, how many from external sites, and what did performance look like on external validation? If external validation was not performed, the headline accuracy figure is not a basis for clinical adoption.

Near-perfect performance in small datasets is not evidence of a good model. It is, in many cases, evidence of an overfitted one. The more impressive the number, the more important it becomes to ask how it was produced.

Scepticism here is not technophobia. It is the same critical appraisal you would apply to any surgical technique paper with implausible results and a sample size of 120. The standard should not drop because the authors used a neural network.


References

  1. Sangsin A, et al. Classification of spinal tuberculous infection, pyogenic infection and spinal metastasis from magnetic resonance imaging using machine learning. BMC Musculoskelet Disord. 2026. PMID 42082966. https://doi.org/10.1186/s12891-026-09838-2
  2. Nagendran M, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ. 2020;368:m689. PMID 32213531. https://doi.org/10.1136/bmj.m689
