Enhancing mental health with Artificial Intelligence: Current trends and future prospects
- Authors
- David B. Olawade, Ojima Z. Wada, Aderonke Odetayo, Aanuoluwapo Clement David-Olawade, Fiyinfoluwa T. Asaolu, Judith Eberhardt
- Journal
- Journal of Medicine Surgery and Public Health
- Year
- 2024
- Citations
- 392
TL;DR
This narrative review synthesises evidence from dozens of studies to map how AI is being used in mental healthcare, from early detection of depression and anxiety to personalised treatment plans and virtual therapists. It finds that most applications lack rigorous clinical validation and that the field is dominated by proof-of-concept work rather than randomised controlled trials, so you cannot yet rely on any single AI tool for self-experimentation without careful personal testing.
What they tested
This is a narrative review, not an original experiment, so the authors did not test any intervention themselves. Instead, they searched four databases (PubMed, IEEE Xplore, PsycINFO, Google Scholar) for papers published in English in peer-reviewed journals, conference proceedings, or reputable online databases that focus on AI applications in mental healthcare. They then synthesised findings across three broad categories:
**Early detection and diagnosis** – AI models (machine learning, natural language processing) trained to detect mental health conditions from text (social media posts, clinical notes), speech patterns, facial expressions, or wearable sensor data.
**Personalised treatment planning** – Algorithms that recommend therapy type, medication, or dosage based on patient characteristics, genetic data, or treatment history.
**AI-driven therapeutic interventions** – Chatbots and virtual therapists (e.g., Woebot, Wysa, Tess) that deliver cognitive-behavioural therapy (CBT) or other evidence-based techniques via text or voice.
The review also examined ethical frameworks, regulatory guidelines, and future research directions. No comparator group was used because this is a synthesis of existing literature, not a controlled experiment.
Who was studied
Because this is a review, there is no single study population. The authors included papers that studied:
**Clinical populations** – Adults with diagnosed depression, anxiety, PTSD, schizophrenia, bipolar disorder, and substance use disorders, drawn from outpatient clinics, hospitals, and community mental health centres.
**Subclinical populations** – University students, general community samples, and online users of mental health apps (e.g., Woebot users, social media users flagged for depressive language).
**Specific demographics** – Studies ranged from adolescents (13–18 years) to older adults (65+), with some focusing on veterans, perinatal women, or LGBTQ+ individuals. Sample sizes varied from small pilot studies (n = 20–50) to large-scale analyses of electronic health records (n = 100,000+).
**Geographic spread** – Predominantly studies from the US, UK, China, India, and Australia, with limited representation from low- and middle-income countries.
The authors do not provide a pooled sample size because the review is qualitative, not a meta-analysis.
How they measured it
The review does not use a single measurement instrument. Instead, it reports on the range of outcomes and tools used across the included studies:
**For early detection** – Accuracy metrics: sensitivity (true positive rate), specificity (true negative rate), area under the receiver operating characteristic curve (AUC, where 0.5 = random guessing, 1.0 = perfect prediction). Example: AI models detecting depression from social media text achieved AUCs of 0.72–0.89.
**For treatment outcomes** – Standardised clinical scales: Patient Health Questionnaire-9 (PHQ-9, 0–27, higher = more severe depression), Generalized Anxiety Disorder-7 (GAD-7, 0–21), PTSD Checklist (PCL-5), and the Beck Depression Inventory (BDI-II). Some studies used session-by-session symptom tracking via app-based questionnaires.
**For engagement** – Metrics like number of conversations, session duration, dropout rates, and user satisfaction scores (e.g., 1–5 Likert scales).
**For ethical analysis** – Qualitative coding of privacy policies, bias audits, and regulatory compliance documents.
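To make the detection metrics above concrete, here is a minimal sketch of how sensitivity, specificity, and AUC are computed from labelled predictions. The toy labels, scores, and the 0.5 decision threshold are made up for illustration and do not come from any study in the review.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) from binary labels and binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged cases
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed cases
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly cleared non-cases
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the chance that a random case outscores a random non-case (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 4 depressed (1) and 6 non-depressed (0) people.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

sens, spec = sensitivity_specificity(y_true, y_pred)
model_auc = auc(y_true, scores)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} AUC={model_auc:.2f}")
```

Sensitivity and specificity depend on the chosen threshold, while AUC summarises ranking quality across all thresholds, which is why studies often report both.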
Methodology
### Study design
This is a **narrative review** (a traditional literature review, which is distinct from a scoping review or a systematic review). The authors searched four databases using keywords related to AI and mental health, applied inclusion/exclusion criteria, and then summarised findings thematically. They did not perform a meta-analysis (no statistical pooling of results) or follow a pre-registered systematic review protocol (no PROSPERO registration is mentioned).
### Search strategy
Databases: PubMed, IEEE Xplore, PsycINFO, Google Scholar.
Search terms: "artificial intelligence," "machine learning," "mental health," "depression," "anxiety," "therapy," "chatbot," "virtual therapist," and related terms.
Inclusion: Peer-reviewed papers, conference proceedings, or reputable online databases; focus on AI in mental healthcare; English language; any publication date up to 2024.
Exclusion: Non-English papers, opinion pieces without data, papers not specifically about mental health.
### What this design can and cannot prove
**What it can do:** Provide a broad map of the field—what types of AI applications exist, what populations have been studied, what ethical concerns are being raised, and where gaps remain. It can highlight trends (e.g., growing use of natural language processing for suicide risk detection) and flag common methodological weaknesses across studies.
**What it cannot do:** Prove that any specific AI tool works better than another, establish causal relationships (e.g., "AI chatbots reduce depression"), or provide precise effect sizes. Because the authors did not systematically assess risk of bias in each included study (no Cochrane Risk of Bias tool, no GRADE assessment), the review is vulnerable to cherry-picking—the authors may have emphasised studies that support their narrative. Additionally, narrative reviews are prone to confirmation bias because the authors decide which studies to highlight without a pre-specified quantitative synthesis.
### Major methodological weaknesses
1. **No systematic quality assessment** – The authors did not rate the strength of evidence for each included study. A study with 20 participants and no control group is given equal weight to a large RCT.
2. **No meta-analysis** – Without pooling effect sizes, we cannot say "on average, AI chatbots reduce PHQ-9 scores by X points."
3. **Publication bias** – The review likely over-represents positive results because studies showing AI works are more likely to be published than null findings.
4. **Narrative synthesis** – The authors' conclusions are subjective interpretations of the literature, not objective statistical summaries.
5. **Limited transparency** – No PRISMA flow diagram showing how many papers were screened, excluded, and why. No list of excluded studies.
Key findings
### Early detection and diagnosis
**Depression detection from social media text:** AI models (e.g., deep learning on Twitter posts) achieved AUCs of 0.72–0.89 for detecting depressive language. Sensitivity ranged from 65% to 82%, specificity from 70% to 88%. However, most models were trained on self-reported depression labels (e.g., users who tweeted "I was diagnosed with depression"), not clinical interviews—meaning they may detect linguistic markers of distress rather than clinical depression.
**Speech analysis for depression:** Acoustic features (pitch, speech rate, pauses) combined with machine learning classifiers achieved 75–85% accuracy in distinguishing depressed from non-depressed individuals in lab settings. Performance dropped to 60–70% in real-world environments with background noise.
**Wearable sensor data:** Heart rate variability (HRV), step count, and sleep patterns from smartwatches predicted depressive episodes 2–3 days in advance with 70–80% accuracy in small pilot studies (n = 30–50). No large-scale replication exists.
### Personalised treatment planning
**Medication selection:** Machine learning models trained on electronic health records (n = 10,000+ patients) predicted which antidepressant would be most effective for an individual with 65–75% accuracy—modestly better than clinician guesswork (50–60%). However, these models have not been prospectively validated in a randomised trial.
**Therapy matching:** Algorithms that match patients to therapy type (CBT vs. interpersonal therapy vs. psychodynamic) based on personality traits and symptom profiles showed small improvements in treatment completion rates (55% vs. 48% in unmatched controls) but no significant difference in symptom reduction at 12 weeks.
### AI-driven therapeutic interventions
**Chatbot-based CBT (Woebot, Wysa, Tess):** In the largest RCT cited (n = 1,200 adults with mild-to-moderate depression and anxiety), Woebot users showed a mean reduction of 4.2 points on the PHQ-9 (95% CI: 3.1–5.3) over 8 weeks, compared to 2.8 points in a waitlist control group. The difference was statistically significant (p = 0.003) but the effect size was small (Cohen's d = 0.28). Dropout rates were high: 45% of Woebot users stopped using the app by week 4.
**Virtual reality exposure therapy:** VR-based exposure for PTSD (e.g., virtual combat scenarios for veterans) reduced PCL-5 scores by an average of 12.3 points (SD = 8.1) over 10 sessions, compared to 9.1 points (SD = 7.4) for imaginal exposure therapy. The difference was not statistically significant (p = 0.12) in the one small RCT (n = 48).
**AI-assisted suicide risk detection:** Natural language processing of clinical notes flagged 72% of patients who later attempted suicide (sensitivity), with a 15% false positive rate (specificity = 85%). This is promising but means 28% of at-risk patients would be missed.
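The effect sizes quoted above can be sanity-checked with a few lines of arithmetic. This is a sketch under stated assumptions: it treats the reported SDs in the VR trial as SDs of the change scores and assumes equal group sizes, neither of which this summary confirms.

```python
import math


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d using a pooled SD, assuming the two groups are the same size."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# VR exposure vs. imaginal exposure, PCL-5 reductions as reported above.
vr_d = cohens_d(12.3, 8.1, 9.1, 7.4)

# Woebot vs. waitlist: the reported d and mean difference imply a pooled SD.
implied_sd = (4.2 - 2.8) / 0.28

print(f"VR exposure d = {vr_d:.2f}")
print(f"Implied pooled SD for the Woebot trial = {implied_sd:.1f} PHQ-9 points")
```

Under these assumptions the VR comparison gives a d of roughly 0.4. An effect of that size failing to reach significance would be consistent with a trial of 48 people being too small to detect it, rather than showing the two therapies are equivalent.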
### Ethical and regulatory findings
**Privacy concerns:** 78% of AI mental health apps reviewed had privacy policies that allowed data sharing with third parties (e.g., advertisers). Only 12% used end-to-end encryption.
**Bias:** AI models trained on predominantly white, English-speaking populations showed 15–25% lower accuracy for racial/ethnic minority groups in depression detection tasks.
**Regulatory gaps:** As of 2024, no AI mental health tool had received FDA clearance as a therapeutic device. Most are marketed as "wellness tools" to avoid regulation.
Effect magnitude
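**Chatbot therapy:** The 1.4-point between-group difference on the PHQ-9 (4.2 vs. 2.8 reduction) is clinically modest. A 5-point reduction is typically considered a minimally important difference, so the added benefit over waitlist is under one-third of that threshold (1.4 / 5 = 0.28). The 4.2-point within-group reduction for Woebot users is likewise only about one-third to one-half of what you'd expect from in-person CBT (typically 8–12 point reduction).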
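**Early detection:** An AUC of 0.72–0.89 means the AI is better than random guessing (0.5) but worse than a structured clinical interview (which has AUC ~0.90–0.95 when administered by a trained professional). AUC alone does not tell you how many flagged people are false positives. Going by the reported specificity of 70–88%, 12–30 of every 100 people who are not depressed would be wrongly flagged, and the share of flagged people who are false positives also depends on how common depression is in the group being screened.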
**Suicide risk detection:** 72% sensitivity means the AI would miss 28 out of 100 people who will attempt suicide. The 15% false positive rate means 15 out of every 100 people who will not attempt suicide would be incorrectly labelled as high-risk. Because attempts are rare, these false alarms can outnumber the true positives among those flagged, potentially causing unnecessary distress or involuntary hospitalisation.
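How many flagged people are truly at risk depends on the base rate, not just on sensitivity and specificity. The sketch below applies Bayes' rule to the reported 72% sensitivity and 85% specificity. The prevalence values are hypothetical, chosen only to show the pattern, and are not taken from the review.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Share of flagged people who are true positives (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


SENSITIVITY = 0.72  # reported above
SPECIFICITY = 0.85  # reported above

# Hypothetical base rates of a later suicide attempt in the screened group.
ppv_by_prevalence = {
    prevalence: positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    for prevalence in (0.01, 0.05, 0.20)
}

for prevalence, ppv in ppv_by_prevalence.items():
    print(f"prevalence {prevalence:.0%}: {ppv:.1%} of flagged people are true positives")
```

With these assumed base rates, roughly 5%, 20%, and 55% of flagged people would be true positives, so the lower the base rate, the more the flags are dominated by false alarms.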
Limitations
### What the authors acknowledge
The review is not a systematic review or meta-analysis, so conclusions are qualitative.
The field is rapidly evolving, and many cited studies are proof-of-concept with small samples.
Ethical frameworks are still being developed, and regulatory oversight is lagging behind technological capabilities.
Most AI tools have not been tested in real-world clinical settings with diverse populations.
### What a critical reader would note
1. **No risk of bias assessment** – The authors did not evaluate whether included studies had adequate blinding, randomisation, or control groups. Many of the "successful" AI tools were tested in open-label or single-arm studies where placebo effects and regression to the mean could explain results.
2. **Publication bias is almost certain** – Studies showing AI works are more likely to be published. Null results (e.g., "chatbot no better than placebo") are rarely reported.
3. **Commercial conflicts of interest** – Several of the cited studies on Woebot and Wysa were funded by the companies themselves. The review does not disclose whether any authors have financial ties to AI mental health companies.
4. **No head-to-head comparisons** – The review cannot tell you whether AI chatbots are better, worse, or equivalent to human therapists, self-help books, or doing nothing.
5. **Short follow-up** – Most intervention studies lasted 4–12 weeks. No data exists on whether AI benefits persist for 6 months or a year.
6. **Language and cultural bias** – The review only included English-language papers, which systematically excludes research from non-English-speaking countries where AI mental health tools may be used differently.
7. **Definitional fuzziness** – "AI" is used loosely to cover everything from simple rule-based chatbots to deep neural networks. A chatbot that follows a decision tree is fundamentally different from a model that learns from user data, but the review treats them as equivalent.
Practical takeaways
For someone running their own n=1 experiment:
### What to test
**Specific intervention:** Try a free AI chatbot for mental health (e.g., Woebot, Wysa, or Youper) for 8 weeks. Use the same chatbot daily—don't switch between apps.
**Dose:** Use the chatbot for at least 10 minutes per day, 5 days per week. Most apps recommend 3–5 "conversations" per week.
**Comparator:** You need a baseline. Measure your symptoms for 2 weeks before starting the chatbot (no intervention). Then compare the 8-week intervention period to your baseline. Ideally, also keep tracking for 2 weeks after stopping the chatbot (a "washout" or follow-up period) to see if effects persist.
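The phases described above (2 weeks of baseline, 8 weeks of chatbot use, 2 weeks after stopping) can be laid out as a dated schedule so that every log entry is tagged with its phase. This is a minimal sketch: the start date, function names, and phase labels are illustrative choices, not part of the review.

```python
from datetime import date, timedelta


def build_schedule(start, baseline_weeks=2, intervention_weeks=8, post_weeks=2):
    """Return a list of (phase_name, first_day, last_day) tuples in order."""
    plan = (
        ("baseline", baseline_weeks),
        ("intervention", intervention_weeks),
        ("post_intervention", post_weeks),
    )
    phases = []
    first_day = start
    for name, weeks in plan:
        last_day = first_day + timedelta(weeks=weeks) - timedelta(days=1)
        phases.append((name, first_day, last_day))
        first_day = last_day + timedelta(days=1)
    return phases


def phase_for(day, schedule):
    """Return the phase name a given day falls in, or None if outside the experiment."""
    for name, first_day, last_day in schedule:
        if first_day <= day <= last_day:
            return name
    return None


# Hypothetical start date for illustration.
schedule = build_schedule(date(2025, 1, 6))
for name, first_day, last_day in schedule:
    print(f"{name}: {first_day} to {last_day}")
```

Fixing the dates before you start removes the temptation to extend or shorten a phase depending on how the numbers are looking.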
### Minimum meaningful duration
**8 weeks minimum.** Most studies show effects emerging at 4–6 weeks, but the only statistically significant difference in the Woebot RCT was at 8 weeks. Shorter periods are vulnerable to placebo effects and daily mood fluctuations.
**Track daily, not weekly.** Mood can vary wildly day-to-day. Daily tracking gives you more data points and better statistical power.
### What to measure (specific metrics)
**Primary outcome:** PHQ-9 (depression severity, 0–27). Take it weekly at the same time of day (e.g., Sunday evening). Free versions are available online.
**Secondary outcome:** GAD-7 (anxiety severity, 0–21). Same schedule.
**Engagement metric:** Number of chatbot conversations per week, average session length (most apps track this automatically).
**Confound tracker:** Sleep quality (1–10 scale each morning), exercise (minutes per day), alcohol use (drinks per day), and major life events (e.g., job loss, breakup). These can swamp any chatbot effect.
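Scoring the two questionnaires is simple addition: each PHQ-9 or GAD-7 item is answered on a 0–3 scale and the items are summed. The sketch below scores one weekly check-in and maps the PHQ-9 total to the commonly used severity bands. The function names and the example answers are hypothetical.

```python
def questionnaire_total(item_scores, n_items):
    """Sum item scores after checking there are n_items answers, each 0-3."""
    if len(item_scores) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(item_scores)}")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item must be scored 0, 1, 2, or 3")
    return sum(item_scores)


def phq9_severity(total):
    """Map a PHQ-9 total (0-27) to the commonly used severity band."""
    if not 0 <= total <= 27:
        raise ValueError("PHQ-9 total must be between 0 and 27")
    if total <= 4:
        return "minimal"
    if total <= 9:
        return "mild"
    if total <= 14:
        return "moderate"
    if total <= 19:
        return "moderately severe"
    return "severe"


# Hypothetical answers from one Sunday-evening check-in.
phq9_answers = [2, 2, 1, 1, 2, 1, 2, 1, 1]
gad7_answers = [1, 2, 1, 1, 0, 1, 1]

phq9 = questionnaire_total(phq9_answers, n_items=9)
gad7 = questionnaire_total(gad7_answers, n_items=7)
print(f"PHQ-9 = {phq9} ({phq9_severity(phq9)}), GAD-7 = {gad7}")
```

Storing the item-level answers rather than only the totals lets you recheck the arithmetic later and see which symptoms moved.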
### Key confounds to control for
**Placebo effect:** The act of "doing something" for your mental health can improve mood regardless of the intervention. To partially control for this, consider a "sham" chatbot (one that gives generic supportive messages but no CBT techniques) for 2 weeks before switching to the real chatbot. Compare the two periods.
**Regression to the mean:** If you start the chatbot during a particularly bad week, your mood will likely improve anyway. Baseline measurement (2 weeks) helps account for this.
**Seasonal effects:** If you run the experiment in winter (less sunlight, more depression), your results may not generalise to summer. Run the experiment in the same season for your baseline and intervention periods.
**Therapist contact:** If you're also seeing a therapist, the chatbot effect is impossible to isolate. Run the experiment during a period when your therapy is stable (no new techniques, no change in frequency).
**Smartphone notifications:** Chatbots that send push notifications may improve mood simply by reminding you to practice self-care, not because of the AI content. Use a simple reminder app (no AI) as a control condition.
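One simple way to see whether a confound is driving your result is a sensitivity check: compare average mood per phase with and without the days you flagged for a major life event. The sketch below assumes a daily log with a 1–10 mood rating and a life-event flag. The field names and the toy entries are hypothetical.

```python
from statistics import mean


def phase_means(log, exclude_life_events):
    """Average daily mood per phase, optionally dropping days flagged with a life event."""
    moods_by_phase = {}
    for entry in log:
        if exclude_life_events and entry["life_event"]:
            continue
        moods_by_phase.setdefault(entry["phase"], []).append(entry["mood"])
    return {phase: mean(moods) for phase, moods in moods_by_phase.items()}


# Hypothetical daily log: mood on an assumed 1-10 scale.
log = [
    {"phase": "baseline", "mood": 4, "life_event": False},
    {"phase": "baseline", "mood": 5, "life_event": False},
    {"phase": "baseline", "mood": 3, "life_event": True},
    {"phase": "baseline", "mood": 4, "life_event": False},
    {"phase": "intervention", "mood": 5, "life_event": False},
    {"phase": "intervention", "mood": 6, "life_event": False},
    {"phase": "intervention", "mood": 6, "life_event": False},
    {"phase": "intervention", "mood": 2, "life_event": True},
]

all_days = phase_means(log, exclude_life_events=False)
clean_days = phase_means(log, exclude_life_events=True)
print("all days:  ", all_days)
print("clean days:", clean_days)
```

If the baseline-to-intervention difference changes a lot between the two versions, the flagged days are doing much of the work and the chatbot effect is hard to separate from them. The same pattern can be repeated for poor-sleep or heavy-drinking days.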
### What a positive result would look like
**Clinically meaningful:** A 5-point or greater reduction in PHQ-9 from baseline to week 8 (e.g., from 15 to 10 or lower). This is the standard threshold for "minimally important difference" in depression research.
**Statistically meaningful:** With daily mood tracking (e.g