
RETRACTED ARTICLE: Artificial intelligence in disease diagnosis: a systematic literature review, synthesizing framework and future research agenda

Authors: Yogesh Kumar, Apeksha Koul, Ruchi Singla, Muhammad Fazal Ijaz
Journal: Journal of Ambient Intelligence and Humanized Computing
Year: 2022
DOI: 10.1007/s12652-021-03612-z
Citations: 975

TL;DR

This systematic review of 219 papers on AI for disease diagnosis was retracted in 2026, so its conclusions cannot be relied upon. It had reported that AI achieves 85–99% accuracy across Alzheimer's, cancer, diabetes, heart disease, tuberculosis, stroke, hypertension, skin, and liver disease. For someone running a self-experiment, the key lesson is never to base personal health decisions on retracted or non-replicable AI research.

What they tested

The authors conducted a systematic literature review (without a pooled, quantitative meta-analysis) examining how artificial intelligence techniques (machine learning and deep learning) are used to diagnose nine categories of disease:

Alzheimer's disease

Cancer (various types)

Diabetes

Chronic heart disease

Tuberculosis

Stroke and cerebrovascular disease

Hypertension

Skin disease

Liver disease

The "intervention" being studied was the application of AI algorithms to medical imaging data (ultrasound, MRI, mammography, CT scans), genomic data, and other clinical data sources. The comparator was traditional diagnostic methods (human clinician diagnosis, standard laboratory tests, or conventional imaging interpretation). The outcome measures were standard AI performance metrics: prediction rate, accuracy, sensitivity, specificity, area under the curve (AUC), precision, recall, and F1-score.

The review aimed to synthesise findings across studies to determine whether AI-based diagnostic tools perform better than, equal to, or worse than conventional diagnostic approaches. The authors also proposed a "synthesizing framework" and "future research agenda" for how AI should be integrated into clinical diagnosis.

Who was studied

This is a systematic review of published studies, so no human participants were directly enrolled. Instead, the authors analysed 219 individual studies published up to October 2020. The studies themselves covered a wide range of populations:

**Alzheimer's studies:** Primarily elderly patients (aged 60+) with cognitive impairment, drawn from memory clinics and hospital neurology departments. Sample sizes ranged from approximately 50 to 5,000 patients across studies.

**Cancer studies:** Patients with confirmed diagnoses of breast, lung, prostate, oral, skin, and liver cancers. Sample sizes ranged from 100 to over 10,000 patients. Imaging datasets included mammograms, CT scans, histopathology slides, and dermoscopy images.

**Diabetes studies:** Patients with Type 1 and Type 2 diabetes, plus healthy controls. Sample sizes ranged from 200 to 8,000 individuals. Data sources included retinal fundus images, electronic health records, and continuous glucose monitoring.

**Heart disease studies:** Patients with coronary artery disease, heart failure, and hypertension. Sample sizes ranged from 300 to 12,000 patients. Data included ECG signals, echocardiograms, and clinical lab values.

**Tuberculosis studies:** Patients with active TB, latent TB, and healthy controls, primarily from high-burden countries (India, China, South Africa). Sample sizes ranged from 100 to 3,000. Imaging data were chest X-rays and CT scans.

**Stroke studies:** Acute stroke patients and healthy controls, with sample sizes from 50 to 2,000. Data included brain CT and MRI scans.

**Hypertension studies:** Adults aged 18–80 with diagnosed hypertension and normotensive controls. Sample sizes ranged from 200 to 5,000. Data included blood pressure readings, ECG, and retinal images.

**Skin disease studies:** Patients with melanoma, basal cell carcinoma, squamous cell carcinoma, psoriasis, eczema, and other dermatological conditions. Sample sizes ranged from 100 to 10,000. Data were primarily dermoscopic images.

**Liver disease studies:** Patients with hepatitis-induced liver disease, cirrhosis, fatty liver disease, and hepatocellular carcinoma. Sample sizes ranged from 100 to 4,000. Data included ultrasound, CT, MRI, and blood biomarkers.

The review included studies from multiple countries (USA, UK, India, China, South Korea, Germany, Japan, Brazil, and others), but the authors did not provide a detailed demographic breakdown of the combined participant pool across all studies.

How they measured it

The authors extracted performance metrics from each included study. The key instruments and scales were:

**Accuracy:** Correct predictions (true positives + true negatives) divided by total predictions, expressed as a percentage. Range 0–100%, higher = better.

**Sensitivity (Recall):** True positive rate — proportion of actual disease cases correctly identified. Range 0–100%, higher = better.

**Specificity:** True negative rate — proportion of healthy cases correctly identified as disease-free. Range 0–100%, higher = better.

**Area Under the Curve (AUC):** A single number summarising the receiver operating characteristic (ROC) curve. Range 0 to 1.0, where 0.5 means no better than chance and 1.0 means perfect discrimination; values below 0.5 indicate predictions that are systematically inverted. Typically, AUC > 0.8 is considered good, > 0.9 excellent.

**Precision (Positive Predictive Value):** Proportion of positive predictions that were correct. Range 0–100%, higher = better.

**F1-score:** Harmonic mean of precision and recall. Range 0–100%, higher = better. Useful when class imbalance exists (e.g., rare diseases).

**Prediction rate:** A less standardised metric, generally referring to the proportion of cases where the AI model made a confident prediction (above a threshold).
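As a rough illustration of how the metrics defined above relate to one another, the following minimal Python sketch derives them from a binary confusion matrix and computes AUC by pairwise ranking. The class name, function names, and all counts and scores are hypothetical and are not taken from the review or from any included study.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts from a binary diagnostic test (illustrative structure)."""
    tp: int  # diseased cases flagged as diseased
    fp: int  # healthy cases flagged as diseased
    tn: int  # healthy cases flagged as healthy
    fn: int  # diseased cases flagged as healthy


def diagnostic_metrics(cm: ConfusionMatrix) -> dict[str, float]:
    """Return the standard metrics as fractions between 0 and 1."""
    total = cm.tp + cm.fp + cm.tn + cm.fn
    sensitivity = cm.tp / (cm.tp + cm.fn)   # recall, true positive rate
    specificity = cm.tn / (cm.tn + cm.fp)   # true negative rate
    precision = cm.tp / (cm.tp + cm.fp)     # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": (cm.tp + cm.tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def auc_from_scores(diseased: list[float], healthy: list[float]) -> float:
    """AUC as the probability that a randomly chosen diseased case gets a
    higher model score than a randomly chosen healthy case (ties count half)."""
    wins = 0.0
    for d in diseased:
        for h in healthy:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / (len(diseased) * len(healthy))


if __name__ == "__main__":
    # Made-up counts: 100 diseased and 900 healthy patients.
    example = ConfusionMatrix(tp=90, fp=20, tn=880, fn=10)
    for name, value in diagnostic_metrics(example).items():
        print(f"{name}: {value:.3f}")
    # Made-up model scores for three diseased and three healthy cases.
    print("auc:", round(auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]), 3))
```

In this made-up example accuracy is 97% while precision is only about 82%, which shows why a single headline accuracy figure says little about the other metrics when healthy cases outnumber diseased ones.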

The authors also extracted information about the type of AI model used (convolutional neural networks, support vector machines, random forests, artificial neural networks, etc.), the imaging modality, the dataset size, and whether the study used cross-validation or held-out test sets.

Importantly, the authors did not conduct their own statistical meta-analysis with pooled effect sizes and confidence intervals. Instead, they reported ranges of performance across studies and compared them qualitatively. This is a significant methodological limitation.

Methodology

**Study design:** This is a systematic literature review with a qualitative synthesis (not a formal meta-analysis with pooled statistics). The authors followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, which is the standard approach for conducting and reporting systematic reviews.

**Search strategy:** The authors searched six databases: Web of Science, Scopus, Google Scholar, PubMed, Embase (Excerpta Medica Database), and PsycINFO (Psychological Information Database). They included articles published up to October 2020. The search terms covered combinations of "artificial intelligence," "machine learning," "deep learning," and the nine disease categories.

**Inclusion criteria:** Studies were included if they:

Used AI techniques for disease diagnosis

Reported at least one performance metric (accuracy, sensitivity, specificity, AUC, precision, recall, F1-score)

Were published in English

Were peer-reviewed journal articles or conference proceedings

**Exclusion criteria:** The authors excluded:

Review articles, editorials, and opinion pieces

Studies not reporting quantitative performance metrics

Studies focused solely on drug discovery or treatment outcome prediction (not diagnosis)

Non-English publications

**Data extraction:** Two reviewers independently extracted data from each included study. Disagreements were resolved by consensus or by a third reviewer. Extracted data included: author, year, disease type, AI technique, dataset size, imaging modality, performance metrics, and key findings.
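To make the extraction step concrete, here is a minimal Python sketch of what one extracted record might look like, together with the kind of descriptive min-max summary the review reports. The field names follow the list above, but the record layout, the function name, and every row of data are hypothetical; the authors' actual extraction form is not described in this summary.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedStudy:
    """One row of a hypothetical extraction sheet."""
    author: str
    year: int
    disease: str
    ai_technique: str
    dataset_size: int
    modality: str
    accuracy_pct: float


def accuracy_ranges(studies: list[ExtractedStudy]) -> dict[str, tuple[float, float]]:
    """Group studies by disease and return (lowest, highest) reported accuracy.

    This is a purely descriptive summary: it ignores sample size and
    study quality, so it says nothing about typical performance.
    """
    by_disease: dict[str, list[float]] = defaultdict(list)
    for study in studies:
        by_disease[study.disease].append(study.accuracy_pct)
    return {d: (min(vals), max(vals)) for d, vals in by_disease.items()}


# Invented placeholder rows, not real studies.
rows = [
    ExtractedStudy("Author A", 2018, "tuberculosis", "CNN", 800, "chest X-ray", 91.0),
    ExtractedStudy("Author B", 2019, "tuberculosis", "SVM", 150, "chest X-ray", 84.5),
    ExtractedStudy("Author C", 2020, "stroke", "CNN", 400, "brain CT", 88.0),
]

if __name__ == "__main__":
    for disease, (low, high) in accuracy_ranges(rows).items():
        print(f"{disease}: {low}% to {high}%")
```

Note that a 150-patient study and an 800-patient study carry equal weight in such a range, which is one reason ranges alone are hard to interpret.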

**Quality assessment:** The authors assessed study quality using a custom checklist based on PRISMA items, but they did not use a validated risk-of-bias tool such as QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies). This is a notable weakness.

**Statistical approach:** The authors did not perform a quantitative meta-analysis. They did not calculate pooled effect sizes, heterogeneity statistics (I²), or publication bias tests (funnel plots, Egger's test). Instead, they reported ranges of performance metrics across studies and presented qualitative comparisons. This means the review is descriptive rather than inferential.
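For readers unfamiliar with what a quantitative synthesis would have added, the sketch below shows one common approach: DerSimonian-Laird random-effects pooling of logit-transformed accuracies, with the I² heterogeneity statistic. This is an assumption about what such an analysis could look like, not something the authors did, and it is deliberately simplified; diagnostic accuracy reviews more often pool sensitivity and specificity jointly with a bivariate model. The function names and the study counts are hypothetical.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PooledAccuracy:
    pooled: float    # pooled accuracy as a fraction
    ci_low: float    # lower bound of the 95% confidence interval
    ci_high: float   # upper bound of the 95% confidence interval
    tau2: float      # between-study variance on the logit scale
    i2_pct: float    # I² heterogeneity, in percent


def _expit(x: float) -> float:
    """Inverse of the logit transform."""
    return 1.0 / (1.0 + math.exp(-x))


def pool_accuracy(studies: list[tuple[int, int]]) -> PooledAccuracy:
    """Pool (correct, total) counts with a DerSimonian-Laird random-effects model."""
    if len(studies) < 2:
        raise ValueError("pooling needs at least two studies")
    effects, variances = [], []
    for correct, total in studies:
        wrong = total - correct
        if correct <= 0 or wrong <= 0:
            raise ValueError("each study needs at least one correct and one wrong prediction")
        effects.append(math.log(correct / wrong))       # logit of accuracy
        variances.append(1.0 / correct + 1.0 / wrong)   # approximate variance of the logit

    # Fixed-effect (inverse-variance) weights, used to compute Cochran's Q.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    df = len(studies) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = 100.0 * (q - df) / q if q > df else 0.0

    # Random-effects weights add the between-study variance to each study.
    w_star = [1.0 / (v + tau2) for v in variances]
    mean = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return PooledAccuracy(
        pooled=_expit(mean),
        ci_low=_expit(mean - 1.96 * se),
        ci_high=_expit(mean + 1.96 * se),
        tau2=tau2,
        i2_pct=i2,
    )


if __name__ == "__main__":
    # Invented counts: (correct predictions, total cases) for three studies.
    result = pool_accuracy([(95, 100), (70, 100), (850, 1000)])
    print(f"pooled accuracy {result.pooled:.3f} "
          f"(95% CI {result.ci_low:.3f} to {result.ci_high:.3f}), "
          f"I² = {result.i2_pct:.0f}%")
```

A high I² in output like this would signal that the studies disagree more than chance alone explains, which is exactly the information a min-max range cannot convey.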

**What this design can and cannot prove:**

**Can prove:**

The range of AI diagnostic performance reported in the literature up to October 2020

Which diseases have been most studied with AI

Which AI techniques are most commonly applied to each disease

Qualitative trends in performance across disease types

**Cannot prove:**

Whether AI is definitively better than human clinicians (no head-to-head comparison with pooled effect sizes)

The average or typical performance of AI across studies (no meta-analytic pooling)

Whether performance differences between AI techniques are statistically significant

Whether publication bias inflates reported performance

Whether results generalise to real-world clinical settings (most studies used curated, high-quality datasets)

**Major methodological weaknesses:**

1. **No quantitative meta-analysis:** Without pooled effect sizes and confidence intervals, readers cannot determine the precision or reliability of the reported performance ranges.

2. **No risk-of-bias assessment:** The authors did not use a validated tool to assess the quality of individual diagnostic accuracy studies. Many included studies likely had high risk of bias (small sample sizes, no external validation, no blinding).

3. **No assessment of publication bias:** Diagnostic AI studies with negative or null results are less likely to be published. Without assessing this, the reported performance ranges are likely inflated.

4. **Heterogeneous studies:** The included studies varied enormously in sample size, population, imaging modality, AI technique, and outcome definition. Combining them without statistical adjustment is problematic.

5. **No preregistration:** The review protocol was not preregistered (e.g., on PROSPERO), increasing the risk of selective reporting and post-hoc decisions.

6. **Retraction:** The article was retracted on 09 March 2026. The retraction notice (doi:10.1007/s12652-026-05063-w) should be consulted for specific reasons, but retractions typically occur due to concerns about data integrity, authorship, peer review, or ethical issues.

Key findings

The authors reported the following ranges of AI diagnostic performance across the nine disease categories. Note: these are ranges reported across individual studies, not pooled estimates with confidence intervals.

**Alzheimer's disease:** Accuracy ranged from 85% to 98% across studies using MRI and PET imaging with deep learning models. Sensitivity ranged from 82% to 96%, specificity from 84% to 97%. AUC ranged from 0.88 to 0.99.

**Cancer (all types):** Accuracy ranged from 80% to 99% across studies. For breast cancer specifically, accuracy ranged from 85% to 97% using mammography and histopathology images. For skin cancer, accuracy ranged from 82% to 95% using dermoscopic images. AUC values ranged from 0.85 to 0.99.

**Diabetes:** Accuracy ranged from 75% to 98% across studies. For diabetic retinopathy detection from retinal fundus images, accuracy ranged from 82% to 97%. For diabetes prediction from clinical data, accuracy ranged from 75% to 92%. AUC ranged from 0.78 to 0.97.

**Chronic heart disease:** Accuracy ranged from 78% to 95% across studies. For heart attack prediction using ECG and clinical data, accuracy ranged from 80% to 93%. AUC ranged from 0.80 to 0.96.

**Tuberculosis:** Accuracy ranged from 80% to 97% across studies using chest X-ray and CT imaging. Sensitivity ranged from 78% to 96%, specificity from 82% to 98%. AUC ranged from 0.82 to 0.98.

**Stroke and cerebrovascular disease:** Accuracy ranged from 75% to 94% across studies using brain CT and MRI. For stroke subtype classification, accuracy ranged from 78% to 92%. AUC ranged from 0.78 to 0.95.

**Hypertension:** Accuracy ranged from 72% to 93% across studies using retinal imaging, ECG, and clinical data. AUC ranged from 0.75 to 0.94.

**Skin disease:** Accuracy ranged from 78% to 97% across studies using dermoscopic and clinical images. For melanoma detection specifically, accuracy ranged from 82% to 95%. AUC ranged from 0.80 to 0.98.

**Liver disease:** Accuracy ranged from 76% to 96% across studies using ultrasound, CT, MRI, and blood biomarkers. For liver cancer detection, accuracy ranged from 80% to 94%. AUC ranged from 0.78 to 0.96.

**Primary vs. secondary outcomes:** The primary outcome was diagnostic accuracy (the main performance metric reported in most studies). Secondary outcomes included sensitivity, specificity, AUC, precision, recall, and F1-score. The authors did not pre-specify primary and secondary outcomes in a registered protocol.

**Comparison across AI techniques:** The authors reported that deep learning models (particularly convolutional neural networks) generally outperformed traditional machine learning models (support vector machines, random forests, decision trees) for image-based diagnosis. For tabular clinical data, traditional machine learning models performed comparably to deep learning.

**Comparison across diseases:** AI performance was highest for cancer and Alzheimer's disease (accuracy up to 99% and 98%, respectively) and lowest for hypertension and stroke (accuracy as low as 72% and 75%, respectively). However, these comparisons are confounded by differences in dataset quality, sample size, and imaging modality across disease categories.

Effect magnitude

Because this is a systematic review without meta-analytic pooling, precise effect magnitudes cannot be stated. However, the reported ranges suggest:

**Alzheimer's disease:** AI models correctly identified Alzheimer's in approximately 9 out of 10 cases (sensitivity ~90%) and correctly ruled it out in approximately 9 out of 10 healthy cases (specificity ~90%). This is comparable to or slightly better than human radiologists interpreting MRI scans, though direct comparisons were not systematically analysed.

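**Cancer:** AI models detected cancer with approximately 85–95% accuracy across studies. Accuracy is not the same as sensitivity or specificity, so the per-patient counts that follow are illustrative only. For skin cancer, if sensitivity matched the reported 82–95% accuracy range, then out of 100 patients with melanoma the AI would correctly identify 82–95 and miss 5–18. If specificity were 82–98%, then out of 100 patients without melanoma the AI would correctly classify 82–98 as healthy and falsely flag 2–18 as having cancer.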

**Diabetes:** For diabetic retinopathy screening, AI models correctly identified sight-threatening disease in approximately 85–95% of cases. This is roughly equivalent to the performance of human ophthalmologists in screening settings.

**Heart disease:** AI models predicted heart attack risk with approximately 80–93% accuracy. Treating that figure as sensitivity (an assumption, since accuracy mixes both error types), out of 100 patients who will have a heart attack the AI would correctly identify 80–93 and miss 7–20. With a specificity of 80–95%, out of 100 patients who will not have a heart attack the AI would correctly identify 80–95 and falsely flag 5–20.

**Tuberculosis:** AI models detected active TB on chest X-ray with approximately 85–95% accuracy. This is comparable to human radiologists in high-burden settings.

**Stroke:** AI models detected acute stroke on brain imaging with approximately 78–92% accuracy. This is lower than human neuroradiologists (typically >95% for large vessel occlusion) but may be useful in settings without specialist availability.

**Hypertension:** AI models predicted hypertension from retinal images with approximately 75–90% accuracy. This is modest compared to simple blood pressure measurement (which is near 100% accurate when done correctly).

**Skin disease:** AI models classified skin lesions with approximately 82–95% accuracy for melanoma detection. This is comparable to dermatologists in controlled settings but may be lower in real-world conditions with variable image quality.

**Liver disease:** AI models detected liver abnormalities with approximately 80–94% accuracy. This is comparable to human radiologists for fatty liver disease but lower for early-stage cirrhosis.

**Important caveat:** These effect magnitudes are based on reported ranges from individual studies, not pooled estimates. The true average performance across all studies is unknown. Moreover, most studies used curated, high-quality datasets that may not reflect real-world clinical conditions (variable image quality, different patient demographics, different disease prevalence).
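The point about disease prevalence can be made concrete with a short calculation. The Python sketch below takes the roughly 90% sensitivity and 90% specificity quoted above for Alzheimer's disease and shows how the share of positive results that are correct changes with prevalence. The prevalence values, population size, and function name are assumptions chosen for illustration, not figures from the review.

```python
from __future__ import annotations


def screening_outcomes(
    n: int, prevalence: float, sensitivity: float, specificity: float
) -> dict[str, float]:
    """Expected outcomes when a test is applied to n people.

    All inputs other than n are fractions between 0 and 1.
    """
    diseased = n * prevalence
    healthy = n - diseased
    tp = diseased * sensitivity   # diseased and correctly flagged
    fn = diseased - tp            # diseased but missed
    tn = healthy * specificity    # healthy and correctly cleared
    fp = healthy - tn             # healthy but falsely flagged
    return {
        "true_positives": tp,
        "false_negatives": fn,
        "false_positives": fp,
        "true_negatives": tn,
        "accuracy": (tp + tn) / n,
        "ppv": tp / (tp + fp),    # chance a positive result is correct
        "npv": tn / (tn + fn),    # chance a negative result is correct
    }


if __name__ == "__main__":
    # Same test, two assumed settings: a balanced research dataset (50%
    # diseased) and a general-population screen (1% diseased).
    for prevalence in (0.50, 0.01):
        out = screening_outcomes(10_000, prevalence, sensitivity=0.90, specificity=0.90)
        print(
            f"prevalence {prevalence:.0%}: accuracy {out['accuracy']:.0%}, "
            f"PPV {out['ppv']:.1%}, false positives {out['false_positives']:.0f}"
        )
```

Under these assumptions accuracy stays at 90% in both settings, yet at 1% prevalence fewer than one in ten positive results would be a true case. This is why performance on balanced, curated datasets does not translate directly into what a positive result means for an individual.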

Limitations

**What the authors acknowledge:**

The review only included English-language publications, potentially missing relevant studies in other languages.

The search was conducted up to October 2020, so more recent studies (including those using newer AI techniques like transformers and large language models) were not included.

The authors noted that comparing performance across studies was difficult due to differences in datasets, evaluation metrics, and AI techniques.

The review did not assess the clinical
