A meta systematic review of artificial intelligence in higher education: a call for increased ethics, collaboration, and rigour
- Authors
- Melissa Bond, Hassan Khosravi, Maarten de Laat, Nina Bergdahl, Violeta Negrea, Emily Oxley, Phuong Pham, Sin Wang Chong, George Siemens
- Journal
- International Journal of Educational Technology in Higher Education
- Year
- 2024
- Citations
- 503
TL;DR
This meta-review of 66 systematic reviews on AI in higher education found that most research focuses on adaptive systems and personalisation, but suffers from weak methodology, limited ethical consideration, and a lack of interdisciplinary collaboration — meaning the evidence base for using AI tools in your own learning is currently thin and unreliable.
What they tested
This is a **meta-review (review of reviews)** — it did not test any single intervention itself. Instead, it synthesised the findings from 66 existing systematic reviews that had already summarised primary studies on artificial intelligence in higher education (AIHEd). The authors examined:
**What types of AI applications** were studied (e.g., intelligent tutoring systems, adaptive learning platforms, predictive analytics, automated assessment)
**How the research was conducted** (methodological quality, sample sizes, duration, blinding, etc.)
**What benefits and challenges** were reported across reviews
**What research gaps** existed in the literature
There was no comparator or control condition, because the design synthesises existing reviews rather than testing an intervention. The outcome measures were qualitative themes extracted from the reviews themselves, not quantitative effect sizes from individual studies.
Who was studied
This meta-review analysed **66 secondary studies** (systematic reviews, scoping reviews, meta-analyses) published between 2018 and July 2023. The reviews themselves covered primary studies involving:
**Students in formal higher education** (undergraduate and postgraduate) and continuing education settings
**Educators and administrators** using AI tools
**Institutions** implementing AI systems
The primary studies within those reviews ranged from small pilot studies (e.g., 20–50 students in a single course) to large-scale deployments (e.g., thousands of students across multiple universities). However, the meta-review does not report a single pooled sample size because it synthesised reviews, not individual participants.
Geographically, the largest single share of reviews had first authors from **North America (27.3%)**, followed by Asia, Europe, and Australia. Only **28.8%** of reviews involved international collaboration; most were written by domestic-only teams.
How they measured it
The authors used a structured data extraction framework in **EPPI-Reviewer** software. They coded each of the 66 reviews for:
**Type of evidence synthesis** (systematic review, scoping review, meta-analysis, etc.)
**Thematic focus** using Zawacki-Richter et al.'s (2019) AIEd typology: (1) Profiling and Prediction, (2) Intelligent Tutoring Systems, (3) Assessment and Evaluation, (4) Adaptive Systems and Personalisation
**Methodological quality** assessed using a custom quality appraisal tool (not a standardised instrument like AMSTAR or ROBIS)
**Reported benefits and challenges** (coded qualitatively)
**Research gaps identified** by the original review authors
The authors also tracked publication venues, authorship collaboration patterns, and the software tools used to conduct the reviews.
Methodology
**Study design:** This is a **tertiary review** (also called a review of reviews or umbrella review). It systematically searched for, appraised, and synthesised evidence from secondary studies (systematic reviews) rather than primary studies.
**Search strategy:** The authors searched **seven databases**: Web of Science, Scopus, ERIC, EBSCOHost, IEEE Xplore, ScienceDirect, and ACM Digital Library. They also used **snowballing** in OpenAlex, ResearchGate, and Google Scholar. The search covered publications from **2018 to July 2023**.
**Inclusion criteria:** Reviews were included if they:
Synthesised applications of AI solely in formal higher or continuing education
Were published in English
Were journal articles or full conference papers
Had a methods section
Were published between 2018 and July 2023
**Exclusion criteria:** They excluded opinion pieces, editorials, conceptual papers without systematic methods, and reviews focused on K-12 education.
**Screening and data extraction:** Two reviewers independently screened titles/abstracts, then full texts. Disagreements were resolved through discussion or a third reviewer. Data extraction was performed by one reviewer and checked by another.
**Quality appraisal:** The authors developed their own quality assessment tool rather than using validated instruments like AMSTAR-2 or ROBIS. This is a **major methodological weakness** — it means the quality ratings are not directly comparable to other meta-reviews and may lack established validity.
**Synthesis approach:** The authors used **thematic synthesis** — they identified recurring themes across the 66 reviews and organised them into categories. They did **not** perform a quantitative meta-analysis because the reviews used different outcome measures and reported heterogeneous results.
**What this design can and cannot prove:**
**Can prove:**
The scope and volume of AIHEd research
Common themes, gaps, and methodological patterns across the literature
The types of AI applications being studied
The geographic and collaborative distribution of research teams
**Cannot prove:**
Whether any specific AI intervention actually improves learning outcomes (that would require meta-analysis of primary studies)
The magnitude of any effect (no pooled effect sizes)
Causal relationships between AI use and student achievement
Which AI tool is "best" for a given educational context
**Major methodological weaknesses flagged by the authors themselves:**
Only 66.7% of included reviews were actual systematic reviews — the rest were scoping reviews, narrative reviews, or other less rigorous synthesis types
Quality appraisal was done with a non-validated tool
The reviews themselves often had poor methodological quality (see Key Findings)
Publication bias likely exists — positive results are more likely to be published
The rapid pace of AI development means findings may already be outdated
Key findings
**Primary findings (scope and nature of AIHEd research):**
**47.0%** of reviews focused on AIHEd generally (broad overviews), while **28.8%** focused on **Profiling and Prediction** (e.g., predicting student dropout, performance forecasting)
**Adaptive Systems and Personalisation** was the most commonly reported application area across reviews, followed by **Intelligent Tutoring Systems** and **Assessment and Evaluation**
**66.7%** of included reviews were systematic reviews; the remainder were scoping reviews, narrative reviews, or other types
**89.4%** of reviews were conducted by teams, but **71.2%** were domestic-only collaborations (no international co-authors)
**27.3%** of reviews had first authors from North America
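As a quick sanity check, the percentages above can be converted back into review counts out of 66. This is a minimal sketch: the counts are implied by rounding and are not quoted from the paper, and the dictionary labels are shorthand introduced here.

```python
# Convert the reported percentages into implied review counts (N = 66).
# The counts are derived by rounding; they are not quoted from the paper.
TOTAL_REVIEWS = 66

reported_pct = {
    "general AIHEd focus": 47.0,
    "Profiling and Prediction focus": 28.8,
    "systematic reviews": 66.7,
    "conducted by teams": 89.4,
    "domestic-only collaborations": 71.2,
    "first author from North America": 27.3,
}


def implied_count(pct: float, total: int = TOTAL_REVIEWS) -> int:
    """Nearest whole number of reviews matching a reported percentage."""
    return round(pct / 100 * total)


implied = {label: implied_count(pct) for label, pct in reported_pct.items()}

for label, n in implied.items():
    # Recompute the percentage from the count to confirm it rounds back.
    print(f"{label}: {n}/{TOTAL_REVIEWS} = {n / TOTAL_REVIEWS:.1%}")
```

Each implied count rounds back to the reported one-decimal percentage, so the figures are internally consistent with a total of 66 reviews.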
**Secondary findings (methodological quality and gaps):**
**Quality was generally low to moderate** — the authors report that many reviews lacked:
- Pre-registered protocols
- Dual independent screening
- Explicit quality appraisal of included primary studies
- Clear reporting of effect sizes or confidence intervals
**Ethical considerations were minimal** — most reviews did not discuss privacy, bias, fairness, transparency, or the potential for AI to exacerbate educational inequalities
**Interdisciplinary collaboration was rare** — most reviews were conducted within single disciplines (e.g., computer science or education), not across fields like ethics, psychology, or sociology
**Contextual factors were underreported** — few reviews described the institutional setting, student demographics, or implementation details needed to assess generalisability
**Reported benefits across reviews (qualitative synthesis):**
Personalised learning pathways and adaptive content delivery
Automated grading and feedback reducing instructor workload
Early warning systems for at-risk students
Improved student engagement through chatbots and intelligent tutoring
**Reported challenges across reviews:**
Lack of transparency in AI algorithms ("black box" problem)
Data privacy and security concerns
Potential for algorithmic bias against minority groups
High implementation costs and technical requirements
Insufficient teacher training and support
Lack of rigorous experimental designs (few RCTs, small samples, short durations)
**Research gaps identified:**
Need for more longitudinal studies (most were cross-sectional or short-term)
Need for studies in diverse institutional contexts (most were in well-resourced universities)
Need for ethical frameworks and guidelines specific to AI in education
Need for interdisciplinary research teams
Need for more rigorous study designs (RCTs, quasi-experimental with control groups)
Effect magnitude
This meta-review does **not** report effect sizes because it synthesised qualitative themes across reviews, not quantitative outcomes from primary studies. The authors explicitly state that the evidence base is too heterogeneous and methodologically weak to calculate pooled effects.
In plain English: **we cannot say how much AI tools improve learning, retention, or engagement** because the existing research is not rigorous enough to give reliable numbers. The paper's title frames this as a "call for increased ethics, collaboration, and rigour": the field needs better studies before anyone can trust the claimed benefits.
Limitations
**Limitations acknowledged by the authors:**
Only included English-language publications (may miss important non-English research)
Search period ended July 2023 — given the rapid pace of AI development (especially generative AI like ChatGPT), findings may already be outdated
Quality appraisal used a non-validated tool developed by the authors
Some reviews may have been missed despite the comprehensive search strategy
The meta-review cannot assess the quality of primary studies within the included reviews — it only assesses the reviews themselves
**Critical limitations a reader should note:**
**No quantitative synthesis** — without effect sizes, you cannot compare the effectiveness of different AI approaches
**Publication bias is likely** — reviews that found null or negative effects may not have been published, skewing the apparent benefits
**Heterogeneity is extreme** — the 66 reviews covered different AI tools, different educational contexts, different outcome measures, and different study designs, making meaningful synthesis nearly impossible
**The "garbage in, garbage out" problem** — if the primary studies within the reviews are weak, the reviews themselves (and this meta-review) inherit those weaknesses
**No cost-benefit analysis** — even if AI tools work, the reviews do not address whether they are worth the financial, time, and privacy costs
**Generalisability is limited** — most research comes from well-resourced universities in North America, Europe, and Asia; findings may not apply to under-resourced institutions, developing countries, or non-traditional learners
Practical takeaways
If you are running your own n=1 experiment on using AI in your learning:
### What to test (specific intervention and dose)
**Test one AI tool at a time.** For example: use an AI tutoring system (like Khan Academy's Khanmigo or a custom GPT tutor) for one specific subject for 30 minutes per day, 5 days per week. Do NOT try multiple AI tools simultaneously — you won't know which one caused any effect.
**Test a specific feature.** Instead of "using AI for studying," test "using an AI chatbot to generate practice questions" or "using an AI writing assistant to get feedback on essay drafts." Be as narrow as possible.
**Dose matters.** Try 20–30 minutes per session. The reviews suggest that longer exposure (weeks to months) is needed to see learning gains, not single sessions.
### Minimum meaningful duration
**At least 4 weeks.** Most primary studies in the reviews lasted 2–8 weeks. Shorter durations cannot distinguish novelty effects (excitement about new technology) from genuine learning improvements.
**Run a 2-week baseline phase** where you measure your performance WITHOUT the AI tool, then a 4-week intervention phase WITH the tool, then a 2-week washout phase WITHOUT the tool to see if effects persist.
**Total minimum: 8 weeks** (2 baseline + 4 intervention + 2 washout).
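The 2 + 4 + 2 week plan can be turned into concrete calendar dates. The sketch below assumes the phases run back to back with no gaps; the function name `build_schedule` and the start date are hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta

# Phase lengths in weeks, taken from the plan above.
PHASES = [("baseline", 2), ("intervention", 4), ("washout", 2)]


def build_schedule(start: date) -> list[tuple[str, date, date]]:
    """Return (phase, first_day, last_day) for each phase, back to back."""
    schedule = []
    first_day = start
    for name, weeks in PHASES:
        last_day = first_day + timedelta(weeks=weeks) - timedelta(days=1)
        schedule.append((name, first_day, last_day))
        first_day = last_day + timedelta(days=1)
    return schedule


# Hypothetical start date, chosen only for illustration.
schedule = build_schedule(date(2025, 1, 6))
for name, first_day, last_day in schedule:
    print(f"{name:<12} {first_day} to {last_day}")
```

Fixing the dates before you start makes it harder to extend a phase after the fact because the results look promising.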
### What to measure (specific metrics)
**Primary outcome:** A standardised test score or grade in the subject you're studying. Use the same test before and after (or parallel forms). For example, if studying statistics, take a 20-question multiple-choice test at the end of each phase: week 2 (end of baseline), week 6 (end of intervention), and week 8 (end of washout).
**Secondary outcomes:**
- **Time spent studying** (minutes per day, logged objectively via app timers)
- **Retention** (test the same material 1 week and 4 weeks after learning)
- **Confidence/self-efficacy** (rate on a 1–10 scale: "How confident are you that you understand this topic?")
- **Engagement** (subjective rating 1–10: "How focused did you feel during today's study session?")
- **Frustration/fatigue** (subjective rating 1–10: "How mentally drained do you feel after using the AI tool?")
**Track daily** in a simple spreadsheet or notebook. Do NOT rely on memory.
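If you prefer a script to a hand-kept spreadsheet, the daily log might look like the sketch below. The field names and the `DailyEntry` class are assumptions made for this example; test scores are left out because they are recorded per phase rather than per day.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class DailyEntry:
    day: str            # ISO date, e.g. "2025-01-06"
    phase: str          # "baseline", "intervention", or "washout"
    study_minutes: int  # logged from an app timer
    confidence: int     # 1-10 self-rating
    engagement: int     # 1-10 self-rating
    fatigue: int        # 1-10 self-rating


def validate(entry: DailyEntry) -> None:
    """Reject ratings outside 1-10 and negative study time."""
    for name in ("confidence", "engagement", "fatigue"):
        value = getattr(entry, name)
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    if entry.study_minutes < 0:
        raise ValueError("study_minutes cannot be negative")


def to_csv(entries: list[DailyEntry]) -> str:
    """Serialise validated entries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(DailyEntry)])
    writer.writeheader()
    for entry in entries:
        validate(entry)
        writer.writerow(asdict(entry))
    return buffer.getvalue()


# Hypothetical entry, for illustration only.
print(to_csv([DailyEntry("2025-01-06", "baseline", 30, 6, 7, 4)]))
```

Validating each entry as it is written catches typos on the day they happen rather than at analysis time.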
### Key confounds to control for
**Novelty effect:** The first 1–2 weeks of using any new tool often show artificial improvements just because it's new. This is why you need a 4-week minimum intervention.
**Time-on-task confound:** If you spend more time studying with AI than without, any improvement could be due to more study time, not the AI itself. **Control for this** by keeping total study time constant across phases (e.g., 30 minutes per day regardless of tool).
**Subject difficulty:** Do NOT test AI on an easy topic during the intervention and a hard topic during baseline. Use the same subject material throughout.
**Sleep, stress, diet:** These affect learning more than most AI tools. Track them daily (1–10 ratings) and note any major disruptions.
**Expectation bias:** You might unconsciously work harder because you expect the AI to help. To minimise this, use objective measures (test scores) rather than subjective ones (self-rated learning).
**AI tool changes:** AI tools update frequently. If the tool changes mid-experiment (e.g., a new version of ChatGPT), note the date and version.
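The time-on-task confound listed above is the easiest one to check from a log. The sketch below flags any phase whose average daily study time drifts away from the others; the 10% tolerance and the function name `study_time_balanced` are assumptions, not values from the meta-review.

```python
from statistics import mean


def study_time_balanced(
    minutes_by_phase: dict[str, list[int]], tolerance: float = 0.10
) -> bool:
    """True if each phase's mean daily minutes is within `tolerance`
    (as a fraction) of the average of the phase means."""
    phase_means = [mean(minutes) for minutes in minutes_by_phase.values()]
    overall = mean(phase_means)
    return all(abs(m - overall) <= tolerance * overall for m in phase_means)


# Hypothetical logs, for illustration only.
balanced = study_time_balanced({
    "baseline": [30, 30, 28, 32],
    "intervention": [30, 31, 29, 30],
    "washout": [30, 30, 30, 30],
})
unbalanced = study_time_balanced({
    "baseline": [30, 30],
    "intervention": [45, 50],
})
print(balanced, unbalanced)
```

If the check fails, any score difference between phases could reflect extra study time rather than the tool.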
### What a positive result would look like
**Test score improvement of ≥15 percentage points** from baseline to end of intervention (e.g., from 60% to 75% correct on the same test)
**Retention improvement:** You score ≥10 percentage points higher on a delayed test (1 week later) compared to your baseline retention
**Time efficiency:** You achieve the same test score in less study time (e.g., 20 minutes with AI vs. 30 minutes without)
**Consistency:** The improvement is seen across multiple weeks, not just the first week (rules out novelty)
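**Reversal in washout:** Your scores drop back toward baseline during the 2-week no-AI phase, which suggests the AI was driving the improvement. If the gains persist instead, that is consistent with durable learning, but it is harder to separate from practice effects on a repeated test.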
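The score-based criteria above can be checked mechanically. This is a sketch under stated assumptions: scores are percent correct, the 15% threshold is read as percentage points (as in the 60% to 75% example), "consistency" is read as every week after the first staying above baseline, and the function name `evaluate` is hypothetical.

```python
def evaluate(
    baseline: float,
    intervention_weekly: list[float],
    washout: float,
    min_gain_points: float = 15.0,
) -> dict[str, bool]:
    """Check three of the criteria. All scores are percent correct (0-100)."""
    final_score = intervention_weekly[-1]
    return {
        # Gain from baseline to the end of the intervention, in points.
        "gain_met": final_score - baseline >= min_gain_points,
        # Every week after the first still beats baseline (not just novelty).
        "consistent": all(s > baseline for s in intervention_weekly[1:]),
        # Washout score falls below the end-of-intervention score.
        "reversal": washout < final_score,
    }


# Hypothetical scores, for illustration only.
result = evaluate(baseline=60, intervention_weekly=[68, 70, 73, 75], washout=66)
print(result)
```

Writing the thresholds down as code before the experiment starts is a lightweight form of pre-registration: it stops you from moving the goalposts once you have seen the data.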
**If you see no improvement after 4 weeks, do NOT conclude AI is useless.** The meta-review reports no pooled effect sizes, so there is no reliable benchmark for how large a gain to expect. A null result in your n=1 experiment is still informative: it tells you that this tool did not work for you in that context. Try a different tool, a different dose, or a different subject.
**Bottom line from this meta-review:** The research on AI in higher education is too weak to give reliable recommendations. Your own n=1 experiment, if done rigorously, may tell you more about what works for you than the published evidence currently can. But be sceptical of large claims: the evidence base is thin, and the authors explicitly call for "increased ethics, collaboration, and rigour."