
Cost-effectiveness of Artificial Intelligence–Based Retinopathy of Prematurity Screening

Authors
S. Morrison, Dmitry Dukhovny, R.V. Paul Chan, Michael F. Chiang, J. Peter Campbell
Journal
JAMA Ophthalmology
Year
2022
Citations
46

TL;DR

An economic model of 52,000 US infants found that autonomous AI screening for retinopathy of prematurity (ROP) could be as effective as and less costly than standard ophthalmoscopy or telemedicine, potentially reducing late treatments from 265 to 40 cases per year if AI achieves 99% sensitivity.

What they tested

The researchers compared four screening strategies for retinopathy of prematurity (ROP), a potentially blinding eye disease in premature infants:

1. **Ophthalmoscopy** – A trained ophthalmologist examines the infant's retina directly with a handheld lens (the current standard of care in many hospitals).

2. **Telemedicine** – A retinal photograph is taken by a trained nurse or technician, then sent to a remote ophthalmologist for interpretation.

3. **Assistive AI with telemedicine review** – An AI algorithm analyzes the retinal image first, flags suspicious cases, and a telemedicine ophthalmologist reviews only the flagged images plus a random sample of normal ones.

4. **Autonomous AI** – The AI analyzes the image and only positive (disease-detected) results are reviewed by a human. Negative results are accepted without human review.

The primary outcome was **cost-effectiveness**, measured as the incremental cost per quality-adjusted life-year (QALY) gained. Secondary outcomes included the number of infants receiving timely treatment, late treatment, or correctly left untreated.

Who was studied

This was a **theoretical cohort** – not actual patients. The model simulated a typical annual US birth cohort of **52,000 infants** who meet criteria for ROP screening:

- Born at ≤30 weeks' gestation, or

- Birth weight ≤1500 grams (3.3 pounds)

The model assumed these infants are screened in US hospital settings, including neonatal intensive care units (NICUs) and outpatient follow-up clinics. No real infants were enrolled; instead, the model used published data on ROP prevalence, screening accuracy, treatment outcomes, and costs from the Early Treatment for Retinopathy of Prematurity (ETROP) study and other published sources.

How they measured it

The study used **decision-tree economic modeling**, a standard method in health economics. Key inputs included:

**Screening accuracy** (sensitivity and specificity) for detecting "type 1 ROP" – the threshold requiring treatment – based on published studies of ophthalmoscopy, telemedicine, and AI algorithms.

**Treatment outcomes** from the ETROP study: timely treatment (good visual outcome), late treatment (worse visual outcome), or correctly untreated (no disease or mild disease that resolves on its own).

**Costs** based on Current Procedural Terminology (CPT) codes for ophthalmoscopy, retinal photography, and AI interpretation, plus estimated opportunity costs for physician time.

**Quality-adjusted life-years (QALYs)** – a standard metric combining length of life with quality of life (0 = death, 1 = perfect health). Visual outcomes were mapped to QALY weights from published literature.

**Willingness-to-pay threshold** of $100,000 per QALY gained – a common benchmark in US cost-effectiveness analysis.
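The relationship between incremental cost per QALY, dominance, and the $100,000 willingness-to-pay threshold can be sketched in a few lines. This is a minimal illustration, not the authors' model: the `Strategy` class, the helper names, and the per-infant cost and QALY figures are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Strategy:
    name: str
    cost: float   # expected cost per infant in USD (hypothetical value)
    qalys: float  # expected QALYs per infant (hypothetical value)


def icer(new: Strategy, old: Strategy) -> Optional[float]:
    """Incremental cost per QALY gained; None when the QALYs are equal."""
    d_qalys = new.qalys - old.qalys
    if d_qalys == 0:
        return None
    return (new.cost - old.cost) / d_qalys


def dominates(a: Strategy, b: Strategy) -> bool:
    """a dominates b if it costs no more, is at least as effective,
    and is strictly better on at least one of the two."""
    no_worse = a.cost <= b.cost and a.qalys >= b.qalys
    strictly_better = a.cost < b.cost or a.qalys > b.qalys
    return no_worse and strictly_better


def net_monetary_benefit(s: Strategy, wtp: float = 100_000.0) -> float:
    """QALYs valued at the willingness-to-pay threshold, minus cost."""
    return wtp * s.qalys - s.cost


# Hypothetical per-infant figures, chosen only to exercise the functions.
ophthalmoscopy = Strategy("ophthalmoscopy", cost=1200.0, qalys=25.000)
autonomous_ai = Strategy("autonomous AI", cost=1100.0, qalys=25.002)

print(dominates(autonomous_ai, ophthalmoscopy))
print(net_monetary_benefit(autonomous_ai) - net_monetary_benefit(ophthalmoscopy))
```

When one strategy dominates another, the incremental cost per QALY is usually not reported; comparing net monetary benefit at the chosen threshold gives the same ranking without dividing by a near-zero QALY difference.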

Methodology

### Study design

This is a **cost-effectiveness analysis using decision-tree modeling** – not a clinical trial. The researchers built a mathematical model that simulates the clinical and economic outcomes of each screening strategy across a hypothetical cohort of 52,000 infants.

### How the model worked

The decision tree started with the entire cohort of infants eligible for ROP screening. For each screening strategy, the model branched based on:

- Whether the screening test correctly identified disease (true positive), missed it (false negative), correctly ruled it out (true negative), or falsely flagged it (false positive)

- Whether treatment was timely, late, or unnecessary

- The visual outcome (good vs. poor) for each treatment timing

- The lifetime costs and QALYs associated with each visual outcome
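The first branching step can be sketched as expected counts in each branch of the tree. This is a simplified illustration that assumes every false negative leads to late treatment and every true positive to timely treatment; the function name and the example inputs (5% prevalence, 99% sensitivity, 90% specificity) are assumptions for demonstration, so the output is not expected to reproduce the paper's figures.

```python
def screening_outcomes(cohort, prevalence, sensitivity, specificity):
    """Expected number of infants in each branch of a one-test decision tree.

    Simplifying assumptions (not from the paper): a true positive is treated
    on time, a false negative is treated late, and a false positive is an
    unnecessary referral.
    """
    diseased = cohort * prevalence
    healthy = cohort - diseased

    true_pos = diseased * sensitivity
    false_neg = diseased - true_pos
    true_neg = healthy * specificity
    false_pos = healthy - true_neg

    return {
        "timely_treatment": true_pos,
        "late_treatment": false_neg,
        "correctly_untreated": true_neg,
        "unnecessary_referral": false_pos,
    }


# Hypothetical inputs for a 52,000-infant cohort.
outcomes = screening_outcomes(52_000, prevalence=0.05,
                              sensitivity=0.99, specificity=0.90)
for branch, count in outcomes.items():
    print(f"{branch}: {count:.0f}")
```

A full model would attach a probability of good or poor visual outcome, a lifetime cost, and a QALY weight to each branch, then sum across branches for every strategy.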

### Sensitivity analyses

The authors performed two types of sensitivity analysis:

1. **One-way sensitivity analysis** – Varying one input at a time (e.g., AI cost, AI sensitivity, prevalence of ROP) to see which factors most influenced the results.

2. **Probabilistic sensitivity analysis** – Running the model 10,000 times with all inputs randomly varied within plausible ranges to estimate the probability that each strategy is cost-effective at different willingness-to-pay thresholds.
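Both kinds of sensitivity analysis can be sketched with a toy incremental model. Everything below is an assumption for illustration: the `incremental_nmb` formula, the default parameter values, and the uniform and beta distributions are placeholders, not the inputs used in the paper. Only the count of 10,000 runs comes from the description above.

```python
import random


def incremental_nmb(wtp, ai_fee, sens_ai, sens_ref, prevalence,
                    ref_fee=150.0, qaly_loss_late=0.5, screens_per_infant=6):
    """Toy incremental net monetary benefit per infant, AI vs. a reference.

    All defaults are hypothetical. QALYs are gained only by converting
    late treatments into timely ones.
    """
    d_cost = (ai_fee - ref_fee) * screens_per_infant
    d_qalys = prevalence * (sens_ai - sens_ref) * qaly_loss_late
    return wtp * d_qalys - d_cost


def one_way(wtp, fees):
    """One-way analysis: vary the AI fee, hold every other input fixed."""
    return {fee: incremental_nmb(wtp, fee, 0.95, 0.85, 0.07) for fee in fees}


def probability_cost_effective(wtp, n_draws=10_000, seed=0):
    """Probabilistic analysis: redraw all inputs on every run."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        value = incremental_nmb(
            wtp,
            ai_fee=rng.uniform(50.0, 250.0),       # hypothetical range
            sens_ai=rng.betavariate(95, 5),        # mean 0.95
            sens_ref=rng.betavariate(85, 15),      # mean 0.85
            prevalence=rng.betavariate(7, 93),     # mean 0.07
        )
        wins += value > 0
    return wins / n_draws


print(one_way(100_000, fees=[100.0, 150.0, 200.0, 250.0]))
for wtp in (0, 50_000, 100_000, 200_000):
    print(wtp, probability_cost_effective(wtp))
```

Plotting `probability_cost_effective` against the willingness-to-pay value gives a cost-effectiveness acceptability curve, which is the usual way a statement such as ">60% likely to be cost-effective" is read off.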

### Secondary analysis

In a separate analysis, the model assumed AI had **99% sensitivity** for detecting severe ROP – higher than typical human examiners (reported at 80–90% in published studies). This simulated a scenario where AI outperforms humans at detecting the most dangerous cases.

### What this design can and cannot prove

**What it can prove:**

Under the assumptions built into the model, autonomous AI is likely to be cost-effective compared to alternatives.

Which variables most strongly influence cost-effectiveness (e.g., AI cost, AI sensitivity).

The potential population-level impact of adopting AI screening.

**What it cannot prove:**

That AI actually achieves the assumed sensitivity/specificity in real-world clinical practice (the model depends on published accuracy data, which may not generalize to all settings).

That the QALY weights assigned to visual outcomes are accurate for all patients.

That the cost estimates reflect actual hospital purchasing prices (CPT codes are standardized, but actual negotiated prices vary).

That the model captures all real-world complexities, such as missed follow-up appointments, image quality failures, or rare complications of treatment.

### Major methodological weaknesses

**No primary data collection** – The model is only as good as its inputs. If published sensitivity/specificity values for AI are optimistic, the results overstate AI's advantage.

**Single-payer perspective** – The analysis uses US Medicare/Medicaid costs, which may not reflect private insurance or international healthcare systems.

**No consideration of implementation barriers** – The model assumes perfect adoption and workflow integration, which rarely occurs in practice.

**Incomplete long-term outcomes** – Visual outcomes were modeled over a lifetime, but the model may not capture all long-term complications of ROP treatment (e.g., myopia, glaucoma).

Key findings

### Primary analysis (base-case assumptions)

**Autonomous AI was the dominant strategy** – it was less costly than ophthalmoscopy, telemedicine, or assistive AI and at least as effective (in QALYs gained).

**Cost-effectiveness thresholds** (the maximum additional cost per screening at which AI remains cost-effective):

- Assistive AI vs. telemedicine: up to **$7 per screening**

- Autonomous AI vs. telemedicine: up to **$34 per screening**

- Assistive AI vs. ophthalmoscopy: up to **$64 per screening**

- Autonomous AI vs. ophthalmoscopy: up to **$91 per screening**

### Probabilistic sensitivity analysis

Autonomous AI was **>60% likely to be cost-effective** at all willingness-to-pay levels (from $0 to $200,000 per QALY) compared to all other modalities.

Telemedicine was the next most likely to be cost-effective, followed by assistive AI, with ophthalmoscopy least likely.

### Secondary analysis (AI with 99% sensitivity)

**Late treatments** (ROP detected too late for optimal treatment):

- Ophthalmoscopy: **265 cases per year**

- Telemedicine: **160 cases per year**

- Autonomous AI: **40 cases per year**

This represents an **85% reduction** in late treatments compared to ophthalmoscopy and a **75% reduction** compared to telemedicine.
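The percentages follow directly from the three late-treatment counts listed above. A quick check (the variable and function names are illustrative):

```python
late_treatments = {"ophthalmoscopy": 265, "telemedicine": 160, "autonomous_ai": 40}


def relative_reduction(baseline, new):
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


for name in ("ophthalmoscopy", "telemedicine"):
    avoided = late_treatments[name] - late_treatments["autonomous_ai"]
    share = relative_reduction(late_treatments[name], late_treatments["autonomous_ai"])
    print(f"vs. {name}: {avoided} fewer late treatments per year ({share:.1%})")
# vs. ophthalmoscopy: 225 fewer late treatments per year (84.9%)
# vs. telemedicine: 120 fewer late treatments per year (75.0%)
```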

### Key drivers of cost-effectiveness

The **cost of the AI algorithm** was the most influential variable. If AI costs more than ~$91 per screening above ophthalmoscopy, it is no longer cost-effective.

**AI sensitivity** for detecting severe ROP was the second most important variable. If AI sensitivity drops below ~90%, its advantage over telemedicine diminishes.

**Prevalence of type 1 ROP** (about 5–10% of screened infants) had moderate influence – AI is more valuable when disease is more common.

Effect magnitude

In plain English: If autonomous AI screening were adopted nationwide for the ~52,000 US infants who need ROP screening each year, the model predicts:

**Cost savings** – Autonomous AI would save money compared to ophthalmoscopy (by reducing physician time) and compared to telemedicine (by reducing the need for human review of normal images).

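**Fewer late treatments** – With 99% sensitivity, the model projects about 40 late treatments for severe ROP per year with autonomous AI, compared to 265 with ophthalmoscopy. That's a little under **1 infant per week** vs. about **5 infants per week** treated late. Late treatment carries a worse expected visual outcome, but it does not necessarily mean blindness.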

**Better visual outcomes** – The QALY gains are modest on a per-infant basis (fractions of a QALY) but substantial across the population (hundreds of QALYs saved per year).

To put the cost numbers in perspective: The model suggests that paying up to **$91 extra per screening** for autonomous AI (compared to ophthalmoscopy) is still cost-effective by US standards. For context, a single ophthalmoscopy screening might cost $100–200, so a $91 premium would raise the per-screening cost by roughly 45–90%, but the model says this is still worth it because of the improved outcomes.

Limitations

### What the authors acknowledge

The model relies on published estimates of AI accuracy, which may not reflect real-world performance across different hospitals, cameras, and infant populations.

Cost estimates are based on CPT codes and may not reflect actual negotiated prices or institutional discounts.

The analysis assumes perfect follow-up and treatment compliance, which may not hold in practice.

QALY weights for visual outcomes are derived from adult populations and may not accurately reflect infant visual development.

### What a critical reader would note

**No real-world validation** – The model has not been tested in an actual clinical implementation. Real-world AI performance often degrades compared to research settings.

**Single-country focus** – Results may not generalize to countries with different healthcare systems, screening protocols, or cost structures.

**No consideration of AI failures** – The model assumes every captured image is usable by the AI. In practice, image quality failures (e.g., poor focus, eyelid artifacts) can affect 5–15% of cases.

**No malpractice or liability costs** – If AI misses a case, who is responsible? These costs are not modeled.

**Financial conflicts of interest** – The study was funded by the National Institutes of Health (NIH) and Research to Prevent Blindness, not by industry, but some authors have financial ties to companies developing AI for ROP screening (disclosed in the paper).

**Theoretical cohort** – No actual infants were studied, so the model cannot account for rare events or unexpected complications.

Practical takeaways

For someone running their own n=1 experiment, which here means a single-site evaluation (e.g., by a hospital administrator, NICU director, or health system innovator considering AI adoption):

### What to test

**Specific intervention:** Implement an autonomous AI system for ROP screening (e.g., i-ROP DL or a similar FDA-cleared algorithm) alongside your current screening protocol.

**Dose/frequency:** Screen all eligible infants at the standard schedule (typically weekly from 31 weeks postmenstrual age until the retina is fully vascularized, usually 4–8 screenings per infant).

### Minimum meaningful duration

**At least 6 months** to capture enough infants (a typical NICU might screen 50–200 infants per year).

**12 months preferred** to account for seasonal variations in NICU admissions and to allow the AI system to be calibrated to your specific camera and population.
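A rough way to judge whether a given duration is long enough is to estimate how many type 1 ROP cases the pilot will actually contain. The sketch below uses only the ranges already quoted in this summary (50–200 screened infants per year, type 1 ROP in about 5–10% of screened infants) and assumes infants arrive evenly through the year; the function name is illustrative.

```python
def expected_cases(infants_per_year, months, prevalence):
    """Expected number of type 1 ROP cases seen during a pilot.

    Assumes a constant screening rate across the year.
    """
    return infants_per_year * (months / 12) * prevalence


for months in (6, 12):
    low = expected_cases(50, months, 0.05)
    high = expected_cases(200, months, 0.10)
    print(f"{months} months: about {low:.1f} to {high:.1f} expected cases")
```

Under these assumptions even a 12-month pilot at a busy unit contains on the order of 20 cases, which limits how precisely sensitivity can be estimated at a single site.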

### What to measure (specific metrics)

1. **Primary outcome:** Sensitivity and specificity of AI vs. ophthalmoscopy for detecting type 1 ROP (confirmed by a masked expert panel).

2. **Secondary outcomes:**

- Number of late treatments (treatment more than 72 hours after diagnosis)

- Screening time per infant (nurse/technician time + physician time)

- Image quality failure rate (percentage of images deemed unreadable by AI)

- Cost per screening (including AI license, camera depreciation, staff time, physician time)

- Rate of false positives (infants unnecessarily referred for examination under anesthesia)
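The primary outcome and the false-positive rate can all be computed from one 2×2 table of AI results against the reference standard. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Accuracy metrics from a 2x2 table (AI result vs. reference standard).

    Returns None for any metric whose denominator is zero.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),          # diseased infants flagged
        "specificity": ratio(tn, tn + fp),          # healthy infants cleared
        "false_positive_rate": ratio(fp, fp + tn),
        "ppv": ratio(tp, tp + fp),                  # flagged infants with disease
        "npv": ratio(tn, tn + fn),                  # cleared infants without disease
    }


# Made-up counts for illustration only.
metrics = diagnostic_accuracy(tp=19, fp=30, fn=1, tn=150)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```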

### Key confounds to control for

**Camera type** – Different retinal cameras produce different image quality. Use the same camera for both AI and telemedicine arms.

**Infant population** – Track gestational age, birth weight, and severity of illness (e.g., need for oxygen, sepsis). These affect ROP risk and may differ between comparison groups.

**Examiner experience** – Ophthalmoscopy accuracy varies by examiner. Use the same examiners throughout the study or randomize them.

**Seasonal effects** – NICU admissions may vary by season. Run the AI and comparison arms concurrently, not sequentially.

**Learning curve** – Staff may get faster at using AI over time. Include a 1-month run-in period before collecting data.

### What a positive result would look like

AI sensitivity ≥95% for type 1 ROP (with the lower bound of the 95% confidence interval at or above 90%)

AI specificity ≥80% (to avoid excessive false positives)

Cost savings of at least $50 per screening compared to ophthalmoscopy (including all implementation costs)

Reduction in late treatments by at least 50% compared to historical baseline

Image quality failure rate <10% (ideally <5%)
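Whether a sensitivity estimate meets a confidence-interval criterion depends heavily on how many true cases were observed. The sketch below uses the Wilson score interval, one common choice for proportions; the choice of method and the example counts are assumptions, not something specified in the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives ~95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Same observed sensitivity (95%), very different certainty.
for detected, cases in ((19, 20), (190, 200)):
    lower, upper = wilson_interval(detected, cases)
    print(f"{detected}/{cases}: 95% CI {lower:.3f} to {upper:.3f}")
```

With 19 of 20 cases detected the lower bound is about 0.76, well short of 0.90, while 190 of 200 gives a lower bound of about 0.91. A single NICU is unlikely to see 200 type 1 cases in a pilot, so a sensitivity criterion with a confidence bound may require pooling data across sites.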

### Additional considerations for your experiment

**Start with a pilot** of 50–100 infants to test workflow integration before scaling.

**Have a backup plan** – If AI fails to produce a readable image, fall back to ophthalmoscopy or telemedicine.

**Track all costs** – Include AI software licensing, camera maintenance, staff training, and any additional IT support.

**Get buy-in from ophthalmology** – AI screening changes workflow for ophthalmologists, who may resist if they feel their role is being diminished.

**Consider medicolegal implications** – Consult with risk management about liability if AI misses a case.

### Bottom line for your n=1

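If you can implement autonomous AI for ROP screening at an incremental cost of less than ~$90 per screening compared to ophthalmoscopy (or ~$34 compared to telemedicine), and if the AI achieves ≥95% sensitivity in your population, the model suggests it will be cost-effective. Scale your expectations to your volume: the modeled drop from 265 to 40 late treatments per year is spread across about 52,000 infants nationally, or roughly 4 per 1,000 infants screened, so a NICU screening 50–200 infants per year should expect fewer than one avoided late treatment per year. Run your pilot for at least 6 months, measure sensitivity and costs rigorously, and compare to a concurrent control group using your standard screening method.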
