Augmented Reality Learning Experiences: Survey of Prototype Design and Evaluation
- Authors
- Marc Ericson C. Santos, Angie Chen, Takafumi Taketomi, Goshiro Yamamoto, Jun Miyazaki, Hirokazu Kato
- Journal
- IEEE Transactions on Learning Technologies
- Year
- 2014
- Citations
- 459
TL;DR
Augmented reality (AR) learning experiences produce a moderate average improvement in student test performance (effect size = 0.56) compared to traditional instruction, but results vary widely, from small negative effects to very large positive effects, depending on how the AR is designed and what subject is being taught.
What they tested
This is a meta-analysis and systematic review, not a single experiment. The researchers tested whether augmented reality (AR) learning experiences improve student performance on knowledge tests compared to traditional teaching methods (textbooks, lectures, 2D diagrams, or physical models without AR overlay).
**Intervention:** Any AR learning experience where digital information (3D models, text, animations, annotations) is overlaid onto the real world in real time, viewed through a head-mounted display, tablet, smartphone, or projector-based system.
**Comparators:** Traditional curriculum materials — textbooks, static 2D images, physical models, teacher-led demonstrations, or computer-based learning without AR.
**Outcome measures:**
- Primary: Student performance on knowledge tests (standardised test scores, post-test scores, or exam grades)
- Secondary: Qualitative observations about usability, engagement, and design features (analysed thematically, not statistically)
Who was studied
The meta-analysis included **7 studies** that provided enough data to compute effect sizes. These studies collectively involved **K-12 students** (pre-school through high school, approximately ages 4–18). The broader review of 87 articles covered a wider range of K-12 settings, but the quantitative meta-analysis is limited to those 7 studies.
**Specific populations from the 7 studies:**
- Elementary school students (ages 6–12) learning science topics (e.g., astronomy, biology, physics)
- Middle school students (ages 11–14) learning mathematics and geography
- High school students (ages 14–18) learning chemistry and engineering concepts
**Setting:** Classroom environments during regular school hours, with AR experiences delivered via tablet computers, head-mounted displays, or projector-based systems.
**Important limitation:** The authors note that many of the 87 reviewed articles did not report sufficient statistical data (means, standard deviations, sample sizes) to be included in the meta-analysis. This means the 7 studies may not be representative of all AR learning research.
How they measured it
**For the meta-analysis:**
- Standardised mean difference (Cohen's d or Hedges' g) calculated from post-test scores comparing AR group vs. control group
- Effect sizes were computed from reported means, standard deviations, and sample sizes
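As a rough illustration of how a standardised mean difference is obtained from reported means, standard deviations, and sample sizes, here is a minimal Python sketch using the textbook formulas for Cohen's d (pooled standard deviation) and the Hedges' g small-sample correction. The function names and the input numbers are hypothetical; the summary does not state which exact variant the authors applied to each study.

```python
import math


def cohens_d(mean_ar, sd_ar, n_ar, mean_ctrl, sd_ctrl, n_ctrl):
    """Standardised mean difference using the pooled standard deviation."""
    pooled_var = ((n_ar - 1) * sd_ar ** 2 + (n_ctrl - 1) * sd_ctrl ** 2) / (
        n_ar + n_ctrl - 2
    )
    return (mean_ar - mean_ctrl) / math.sqrt(pooled_var)


def hedges_g(d, n_ar, n_ctrl):
    """Apply the usual small-sample correction J = 1 - 3 / (4*df - 1)."""
    df = n_ar + n_ctrl - 2
    return d * (1 - 3 / (4 * df - 1))


# Hypothetical post-test summary statistics (not taken from any reviewed study).
d = cohens_d(mean_ar=14, sd_ar=4, n_ar=30, mean_ctrl=12, sd_ctrl=4, n_ctrl=30)
g = hedges_g(d, n_ar=30, n_ctrl=30)
print(f"d = {d:.3f}, g = {g:.3f}")  # d = 0.500, g = 0.494
```

With small classroom samples the correction matters more: the smaller the combined sample, the more g shrinks relative to d.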
**For the qualitative analysis of design features:**
The authors categorised each of the 87 articles by:
- Display hardware type (head-mounted display, handheld tablet, projector, monitor-based)
- Software libraries used (ARToolKit, Vuforia, custom solutions)
- Content authoring approach (pre-built by researchers vs. tools for teachers to create content)
- Evaluation technique (user study, usability test, expert review, no evaluation)
- Learning theory grounding (multimedia learning theory, experiential learning theory, animate vision theory)
**No standardised psychometric instruments were used across studies** — each study used its own custom knowledge test, making direct comparison difficult. This is a major methodological weakness the authors acknowledge.
Methodology
### Study Design
This is a **meta-analysis combined with a systematic review**. The authors searched IEEE Xplore and other learning technology databases for articles published up to 2013. They identified 87 relevant articles on AR learning experiences for K-12 education. Of these, 43 conducted user studies (i.e., tested with actual students), and only 7 provided enough statistical data to compute effect sizes.
### Search and Selection
- Databases searched: IEEE Xplore, plus unspecified "other learning technology publications"
- Keywords: "augmented reality," "learning," "education," "K-12," "evaluation"
- Inclusion criteria: Must describe an AR learning experience for K-12; must be a prototype or deployed system; must include some form of evaluation
- Exclusion criteria: AR for higher education or professional training; purely technical papers with no learning evaluation
### Statistical Approach
- Effect sizes were computed using Cohen's d (standardised mean difference)
- A random-effects model was used to pool effect sizes (appropriate when studies are expected to have different true effects due to different populations, subjects, and AR designs)
- Heterogeneity was assessed (the authors report "widely variable" effects, indicating high heterogeneity)
- No publication bias assessment (e.g., funnel plot) was reported
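To make the pooling step concrete, the sketch below implements one common random-effects estimator (DerSimonian-Laird) together with the I² heterogeneity statistic. This is an assumption for illustration only: the summary does not say which estimator the authors used, and the effect sizes and variances in the example are invented, not the values from the 7 studies.

```python
import math


def d_variance(d, n_ar, n_ctrl):
    """Approximate sampling variance of Cohen's d."""
    return (n_ar + n_ctrl) / (n_ar * n_ctrl) + d ** 2 / (2 * (n_ar + n_ctrl))


def pool_random_effects(effects, variances):
    """DerSimonian-Laird pooling. Returns (pooled, std_error, tau2, i2_percent)."""
    k = len(effects)
    w = [1 / v for v in variances]
    sum_w = sum(w)
    # Fixed-effect estimate, needed to compute Cochran's Q.
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    std_error = math.sqrt(1 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, std_error, tau2, i2


# Invented effect sizes and variances, for illustration only.
effects = [0.1, 0.6, 1.2]
variances = [0.04, 0.05, 0.08]
pooled, se, tau2, i2 = pool_random_effects(effects, variances)
print(f"pooled d = {pooled:.2f} (SE {se:.2f}), tau^2 = {tau2:.2f}, I^2 = {i2:.0f}%")
```

When the between-study variance `tau2` is large, the random-effects weights become nearly equal across studies, so small studies pull the pooled estimate almost as much as large ones.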
### What This Design Can Prove
- A meta-analysis can estimate the **average effect** of AR learning across multiple studies, increasing statistical power and generalisability
- It can identify **moderators** (e.g., subject matter, age group, hardware type) that influence effectiveness
- The qualitative analysis can reveal **design patterns** that correlate with better outcomes
### What This Design Cannot Prove
**Causality:** The meta-analysis combines correlational and quasi-experimental studies. Without randomised controlled trials, you cannot conclude that AR *causes* better learning — selection bias (teachers who choose AR may be more motivated) could explain results
**Mechanism:** The meta-analysis cannot tell you *why* AR helps (or doesn't). Is it the novelty effect? Better visualisation? Increased engagement? The qualitative analysis attempts to address this but is speculative
**Generalisability:** Only 7 studies with sufficient data. K-12 education varies enormously by country, curriculum, teacher quality, and student background. The average effect may not apply to your specific context
**Long-term retention:** Most studies measured immediate post-test performance only. No data on whether AR improves long-term memory or transfer of learning
### Major Methodological Weaknesses
1. **Only 7 studies in the meta-analysis**: a very small sample for a meta-analysis, making the average effect size imprecise
2. **No standardised outcome measures**: each study used its own test, so effect sizes may reflect test difficulty differences rather than true learning differences
3. **No blinding**: teachers and students knew they were using AR, so Hawthorne effects (performing better because one knows one is being studied) and novelty effects (excitement about new technology) could raise performance regardless of content
4. **Publication bias likely**: studies with null or negative results are less likely to be published, inflating the average effect size
5. **No control for prior knowledge**: without adjusting for baseline differences, pre-existing gaps between groups could explain the results; students who volunteer for AR studies may also be more tech-savvy or motivated
6. **Short intervention durations**: most studies were single-session or one-week interventions, so novelty effects are a major confound
Key findings
### Primary Outcome: Effect on Student Performance
**Mean effect size: 0.56 (moderate effect)** — This means the average student in the AR group scored about half a standard deviation higher than the average student in the control group
**Range of effects: from -0.2 (small negative) to 1.8 (very large positive)** — This extreme variability means AR is not universally beneficial; it works well in some contexts and poorly in others
**Heterogeneity: High** — The authors do not report a specific I² statistic, but describe the effects as "widely variable," indicating that the average effect size is not meaningful for all situations
### Secondary Outcomes: Qualitative Design Insights
**Three inherent advantages of AR for learning:**
1. **Real-world annotation** — AR can label real objects with text, arrows, or highlights (e.g., pointing to a plant and showing its Latin name). This reduces the need for students to mentally map between a diagram and the real object
2. **Contextual visualisation** — AR can show invisible phenomena (e.g., magnetic field lines, air flow, internal organs) overlaid on the real world. This helps students understand abstract concepts that are hard to visualise from 2D diagrams
3. **Vision-haptic visualisation** — AR can combine what you see with what you touch (e.g., a virtual chemical reaction that responds when you tilt a real tablet). This grounds abstract concepts in physical interaction
**Design features associated with positive outcomes:**
Handheld tablets (e.g., iPads) were more effective than head-mounted displays — possibly because tablets are less isolating and allow social learning
AR that required physical movement (walking around, tilting device) produced larger effects than stationary AR — consistent with embodied cognition theory
AR that provided immediate feedback (e.g., showing correct answer when student points at wrong object) outperformed AR that only displayed information passively
Short, focused AR activities (5–15 minutes) were more effective than longer sessions — suggesting AR is best used as a supplement, not a replacement for traditional instruction
**Learning theories that explain AR effectiveness:**
**Multimedia learning theory** (Mayer): AR reduces extraneous cognitive load by presenting information in the same spatial and temporal context as the real object, rather than requiring students to split attention between a textbook and a physical specimen
**Experiential learning theory** (Kolb): AR allows students to "learn by doing" in authentic contexts, rather than reading abstract descriptions
**Animate vision theory** (Ballard): The brain processes visual information differently when the viewer is moving through the environment (as with AR) versus viewing static images — movement enhances spatial memory
Effect magnitude
An effect size of 0.56 means that if you randomly pick a student from the AR group and a student from the control group, the AR student will score higher about **65% of the time** (assuming normal distributions). In practical terms:
**For a typical 20-question science test** where the control group averages 12/20 (60%) with a standard deviation of 4 points, the AR group would average about **14/20 (70%)** — roughly two additional correct answers
**This is roughly equivalent** to the difference between a student who studied for 30 minutes versus a student who studied for 60 minutes, based on typical education effect sizes
**However, the range is enormous** — in the best-case study, AR students scored nearly two standard deviations higher (equivalent to going from a C to an A), while in the worst-case study, AR students scored slightly *lower* than controls
**Important caveat:** These are immediate post-test effects. No study measured retention at 1 week, 1 month, or 1 year. The novelty of AR may inflate short-term scores without improving long-term learning.
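The 65% figure and the 14/20 example above can be reproduced with a few lines of Python. This sketch assumes both groups are normally distributed with equal variance, in which case the probability that a random AR student outscores a random control student is Φ(d/√2). The function names are illustrative.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def probability_of_superiority(d):
    """P(random AR score > random control score) under equal-variance normality."""
    return normal_cdf(d / math.sqrt(2))


def expected_treatment_mean(control_mean, control_sd, d):
    """Shift the control mean by d standard deviations."""
    return control_mean + d * control_sd


print(f"{probability_of_superiority(0.56):.2f}")       # 0.65
print(f"{expected_treatment_mean(12, 4, 0.56):.2f}")   # 14.24
```

The same functions applied to the extremes of the reported range (-0.2 and 1.8) show how different the practical picture is at either end.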
Limitations
### What the Authors Acknowledge
Only 7 of 87 articles provided sufficient data for meta-analysis — the quantitative findings are based on a small, possibly unrepresentative sample
Many studies did not use standardised tests, making cross-study comparison difficult
The review is limited to K-12 education; findings may not generalise to adult learners or professional training
Publication bias is likely — studies with null results are less likely to be published
The qualitative analysis of design features is subjective and not statistically tested
### What a Critical Reader Would Note
**No conflict of interest statement** — some studies may have been funded by AR hardware or software companies
**No assessment of study quality** — the authors did not use a risk-of-bias tool (e.g., Cochrane RoB) to evaluate individual studies. Some included studies may have poor methodology (no randomisation, no control for teacher effects, small samples)
**No moderator analysis** — the authors do not statistically test whether effect sizes differ by age group, subject matter, hardware type, or study quality. The qualitative insights are suggestive but not confirmed
**No control for novelty effects** — AR is still relatively new in classrooms. Students may perform better simply because the technology is exciting, not because it improves learning. This effect typically fades after 2–4 weeks of use
**No long-term follow-up** — all studies measured immediate post-test performance. There is no evidence that AR improves retention or transfer of learning
**Teacher training not controlled** — teachers in AR conditions may have received extra training or support, creating a confound (the "extra attention" effect rather than the AR effect)
**Hardware limitations** — many studies used early AR systems (2010–2013) with poor tracking, low-resolution displays, or heavy headsets. Modern AR (e.g., HoloLens 2, iPad Pro with LiDAR) may produce different results
**No cost-benefit analysis** — AR requires expensive hardware and software development. The review does not address whether the moderate effect size justifies the cost compared to cheaper interventions (e.g., better textbooks, teacher training, or hands-on labs)
Practical takeaways
For someone running their own n=1 experiment (or a small classroom experiment):
### What to Test
**Specific intervention:** Use a tablet-based AR app that overlays 3D models or annotations onto real-world objects relevant to what you're learning. For example:
- Learning anatomy: Point tablet at a skeleton model to see muscle layers overlaid
- Learning physics: Point tablet at a ramp to see velocity vectors and force diagrams
- Learning geography: Point tablet at a physical map to see elevation contours and population data
**Dose:** 10–15 minute AR sessions, 2–3 times per week, for at least 4 weeks (to move beyond novelty effects)
**Comparator:** Same content delivered via textbook diagrams, 2D videos, or physical models without AR
### Minimum Meaningful Duration
**At least 4 weeks** — shorter durations are dominated by novelty effects. The first 1–2 weeks of AR use will likely show inflated performance due to excitement, not learning
**Test at 3 time points:** Pre-test (before any AR), post-test (immediately after 4 weeks), and retention test (2–4 weeks after stopping AR). The retention test is the most important — does AR improve long-term memory?
### What to Measure
**Primary metric:** Score on a fixed knowledge test administered identically to both conditions (create your own or use existing curriculum tests). Aim for at least 20 questions covering both factual recall (e.g., "What is the function of the mitochondria?") and conceptual understanding (e.g., "Why does a heavier object fall at the same rate as a lighter object?")
**Secondary metrics:**
- Time to complete the test (AR may improve speed of recall)
- Self-reported engagement or interest (1–5 scale after each session)
- Number of errors during AR use (e.g., how many times did the student point at the wrong object?)
- Retention at 2 weeks and 4 weeks post-intervention
**Confound checks:**
- Prior knowledge (pre-test score)
- Tech comfort (self-rated 1–5)
- Time spent studying outside the experiment (daily log)
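One way to keep these measurements organised is a small per-student record that stores the pre-test, post-test, and retention scores alongside the confound checks. The sketch below is only a suggested layout; the class, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StudentRecord:
    student_id: str
    group: str          # "ar" or "control"
    pre: float          # pre-test score (prior knowledge)
    post: float         # immediate post-test score
    retention: float    # delayed retention-test score
    tech_comfort: int   # self-rated, 1-5

    @property
    def immediate_gain(self):
        return self.post - self.pre

    @property
    def retained_gain(self):
        return self.retention - self.pre


def mean_gains(records, group):
    """Return (mean immediate gain, mean retained gain) for one group."""
    rows = [r for r in records if r.group == group]
    return (
        mean(r.immediate_gain for r in rows),
        mean(r.retained_gain for r in rows),
    )


# Made-up scores out of 20, for illustration only.
records = [
    StudentRecord("s1", "ar", pre=8, post=15, retention=13, tech_comfort=4),
    StudentRecord("s2", "ar", pre=10, post=14, retention=12, tech_comfort=3),
    StudentRecord("s3", "control", pre=9, post=12, retention=11, tech_comfort=4),
    StudentRecord("s4", "control", pre=9, post=13, retention=11, tech_comfort=2),
]
print(mean_gains(records, "ar"))       # (5.5, 3.5)
print(mean_gains(records, "control"))  # (3.5, 2.0)
```

Comparing gains rather than raw post-test scores is a simple way to account for differences in prior knowledge between groups.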
### Key Confounds to Control For
1. **Novelty effect:** The first 1–2 weeks of AR use will likely show inflated scores. Extend the experiment to at least 4 weeks and compare Week 1–2 scores to Week 3–4 scores
2. **Teacher/experimenter bias:** If you're both the teacher and the experimenter, you may unconsciously teach the AR group better. Use a scripted lesson plan for both conditions, or have an assistant who does not know each student's condition administer and score the test
3. **Content difficulty:** Ensure the AR and control groups learn the exact same content. If AR covers more material or goes into greater depth, any improvement is due to content, not delivery method
4. **Time on task:** Measure and equalise the total time spent learning. If AR takes longer to set up or use, the AR group may benefit from more total study time, not from AR itself
5. **Hardware distractions:** AR tablets can be distracting (notifications, games). Use a dedicated device with all non-AR apps disabled, or use airplane mode
6. **Social learning:** AR on tablets can be used alone or in pairs. If the AR group works in pairs and the control group works alone, social interaction (not AR) may drive improvement. Keep group sizes consistent
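The novelty check in item 1 can be sketched as a comparison of the AR advantage (AR mean minus control mean) in weeks 1–2 against weeks 3–4. The function name and the weekly scores below are hypothetical, and the sketch assumes one mean quiz score per group per week over a four-week experiment.

```python
from statistics import mean


def ar_advantage_by_phase(ar_weekly, control_weekly):
    """Return (early, late) AR advantage from four weekly mean scores per group."""
    if len(ar_weekly) != 4 or len(control_weekly) != 4:
        raise ValueError("expected exactly four weekly means per group")
    diffs = [a - c for a, c in zip(ar_weekly, control_weekly)]
    early = mean(diffs[:2])   # weeks 1-2, likely inflated by novelty
    late = mean(diffs[2:])    # weeks 3-4, closer to the sustained effect
    return early, late


# Made-up weekly mean scores (percent correct), for illustration only.
early, late = ar_advantage_by_phase([70, 72, 68, 66], [60, 60, 61, 61])
print(early, late)  # 11 6
```

A late advantage that is much smaller than the early one would be consistent with a novelty effect, although with small groups the difference could also be noise.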
### What a Positive Result Would Look Like
**Immediate post-test:** AR group scores 10–20 percentage points higher than control group (e.g., 70% vs. 60% on a 20-question test)
**Retention test (2–4 weeks later):** AR group still scores 5–15 percentage points higher than control group. If the AR advantage disappears at retention,