Make It Stick: The Science of Successful Learning

Authors: Peter C. Brown, Henry L. Roediger III, Mark A. McDaniel
Publisher: Harvard University Press
Year: 2014
ISBN: 9780674729018

TL;DR

Common study habits like rereading, highlighting, and cramming create the illusion of mastery but produce rapid forgetting; instead, self-testing, spacing out practice, interleaving different topics, and embracing desirable difficulties produce learning that is more durable and transferable to new problems.

What they tested

This is not a single experiment but a synthesis of decades of cognitive psychology research. Drawing on hundreds of studies, the authors compare the effectiveness of several learning strategies against conventional study methods:

**Interventions tested:**

- **Retrieval practice (self-testing):** Actively recalling information from memory rather than re-reading it.

- **Spaced practice:** Distributing study sessions over time rather than massing them (cramming).

- **Interleaving:** Mixing practice of different topics or skills within a single session, rather than blocking (practicing one topic completely before moving to the next).

- **Desirable difficulties:** Introducing challenges during learning (e.g., generating answers before being told, varying practice conditions) that slow initial acquisition but improve long-term retention.

- **Elaboration and generation:** Explaining new material in your own words and trying to solve problems before being shown the solution.

**Comparators:** Common but ineffective strategies including rereading, highlighting, underlining, massed practice (cramming), and blocked practice (practicing one skill repeatedly before switching).

**Outcome measures:** Performance on delayed tests (days to months later), transfer tests (applying knowledge to novel problems), and retention of skills (e.g., surgical techniques, sports moves, foreign language vocabulary).

Who was studied

The book synthesises findings from hundreds of studies involving:

**College students** (thousands across multiple experiments, typically aged 18–25) in laboratory and classroom settings.

**Medical students** learning surgical skills and diagnostic reasoning.

**Air Force pilots** learning complex navigation and combat procedures.

**Children** (elementary through high school) learning math, science, and vocabulary.

**Older adults** (aged 60+) learning new skills and factual knowledge.

**Athletes** (baseball players, golfers) learning motor skills.

The total sample across all studies reviewed is in the tens of thousands, though individual experiments typically ranged from 30 to 200 participants.

How they measured it

The authors draw on studies that used a variety of objective performance measures:

**Delayed recall tests:** Percentage of material correctly recalled after intervals ranging from 1 day to 9 months. For example, one study tested medical students on surgical knot-tying skills 1 month after training.

**Transfer tests:** Ability to apply learned principles to problems never seen before. For example, students who learned physics concepts through interleaved practice were tested on novel problem types.

**Accuracy and speed:** For motor skills (e.g., baseball batting, surgical suturing), researchers measured error rates and completion times.

**Judgments of learning (JOLs):** Participants rated how well they thought they had learned material. These subjective ratings were compared against actual test performance to measure metacognitive accuracy.

**Retention curves:** Researchers plotted forgetting rates over time to compare how quickly knowledge decayed under different study conditions.

Methodology

### Study design

This is a **narrative synthesis and integration** of hundreds of individual experiments, most of which were **randomised controlled trials (RCTs)** conducted in laboratory settings, with some field experiments in classrooms and training programs. The book is not a formal meta-analysis (it does not pool effect sizes statistically), but it systematically reviews the converging evidence across multiple labs and populations.

### Key design features of the underlying studies

**Randomisation:** In the core experiments, participants were randomly assigned to different learning conditions. For example, in a typical retrieval practice study, one group repeatedly read a passage while another group read it once then took a recall test. Random assignment ensures that pre-existing differences between groups (e.g., IQ, prior knowledge) are evenly distributed.
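As a rough illustration of random assignment, the sketch below shuffles participants and deals them into two conditions. The function name, the condition labels, and the sample of 40 participants are assumptions made for this example, not details taken from any study in the book.

```python
import random


def assign_conditions(participant_ids, conditions=("reread", "retrieval"), seed=None):
    """Randomly assign participants to conditions with near-equal group sizes."""
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled participants round-robin, so group sizes differ by at most one.
    return {pid: conditions[i % len(conditions)] for i, pid in enumerate(shuffled)}


# Hypothetical sample of 40 participants, numbered 1 to 40.
assignment = assign_conditions(range(1, 41), seed=0)
group_sizes = {c: list(assignment.values()).count(c) for c in ("reread", "retrieval")}
print(group_sizes)  # {'reread': 20, 'retrieval': 20}
```

Because the order is shuffled before dealing, which participant lands in which group is left to chance, while the group sizes stay balanced.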

**Blinding:** Most studies were **single-blind**: participants did not know which condition was hypothesised to be superior. However, it is impossible to fully blind participants to whether they are being tested versus rereading. In some studies, the experimenter was blind to condition during scoring.

**Duration:** Training sessions typically lasted 20–60 minutes. The critical measure was **delayed testing**, with retention intervals typically ranging from 1 day to 9 months; the longest follow-ups in the reviewed studies were approximately 1 year.

**Control conditions:** The standard control was "massed practice" (cramming) or "blocked practice" (practicing one skill repeatedly). Some studies also compared against "study as usual" where participants used their preferred methods.

### What this design can and cannot prove

**Can prove:**

That retrieval practice, spacing, and interleaving produce **superior long-term retention** compared to rereading and cramming, because the RCT design controls for confounds like time spent studying and participant motivation.

That these effects are **reliable across many populations and materials** (words, facts, concepts, motor skills), because the findings replicate across dozens of independent labs.

**Cannot prove:**

That these strategies work equally well for **all individuals** — most studies report group averages, and individual variation is substantial.

That these strategies are **always superior in the short term** — in fact, they often feel harder and produce worse performance during initial learning, which is why students and teachers often reject them.

That the **exact protocols used are optimal**: the best spacing interval, number of retrieval attempts, and degree of interleaving likely vary by material and learner.

### Major methodological strengths

Converging evidence from **multiple labs** using different materials and populations.

**Objective performance measures** rather than self-report.

**Long-term follow-up** (weeks to months) rather than immediate testing.

### Major methodological weaknesses

Most studies were **short-term laboratory experiments** — fewer than 20% were conducted in real classrooms over a full semester.

**Publication bias** is possible: studies showing null effects of retrieval practice may not have been published.

**Lack of individual difference analysis** — few studies examined whether some learners benefit more or less from these strategies.

**No blinding of participants** to the learning condition (impossible to hide whether you are being tested or rereading).

Key findings

### Retrieval practice (self-testing)

In a classic study (Roediger & Karpicke, 2006), students who read a passage once then took three recall tests remembered **80% of the material after 1 week**, compared to **36%** for students who read the passage four times. That is a **44 percentage point advantage** — more than double the retention.

In a medical education study, surgical residents who practiced knot-tying through self-testing (trying to tie the knot from memory) performed **significantly better on a 1-month retention test** than residents who simply watched a demonstration and practiced with the instructions available (p < 0.01, effect size d = 0.89).

**Metacognitive illusion:** Students who reread rated their learning as higher (average 4.2/5) than those who self-tested (3.1/5), yet the self-testers scored **twice as high** on the delayed test. This mismatch between perceived and actual learning is a core finding.
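The summary reports gaps both in percentage points and as ratios ("twice as high"), and the two are easy to confuse. The sketch below separates them using the 80% and 36% figures quoted above; the function name is a hypothetical helper introduced for illustration.

```python
def compare_retention(treatment_pct, control_pct):
    """Return the gap in percentage points, the ratio, and the relative gain in percent."""
    point_gap = treatment_pct - control_pct   # absolute difference
    ratio = treatment_pct / control_pct       # how many times higher
    relative_gain_pct = (ratio - 1) * 100     # gain relative to the control group
    return point_gap, ratio, relative_gain_pct


point_gap, ratio, relative_gain = compare_retention(80, 36)
print(f"{point_gap} points, {ratio:.2f}x, {relative_gain:.0f}% relative gain")
# 44 points, 2.22x, 122% relative gain
```

A 44-point gap and a 122% relative gain describe the same result, which is why it helps to state which one a reported "improvement" refers to.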

### Spaced practice

A meta-analysis of 254 studies on spacing effects (Cepeda et al., 2006) found that spaced practice produced **effect sizes of d = 0.45 to 0.85** compared to massed practice, depending on the retention interval. For a retention interval of 1 month, the optimal spacing gap between study sessions was approximately **10–20% of the retention interval** (e.g., study 3–6 days apart for a 1-month test).

In a study of foreign language vocabulary, students who studied words in 3 sessions spaced 1 day apart remembered **65% after 1 week**, while those who studied in 3 sessions on the same day remembered only **28%** — a 37 percentage point difference.
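The 10–20% rule of thumb quoted above can be turned into a small calculator. This is a minimal sketch that applies only that rule; the function name is hypothetical, and treating one month as 30 days is an assumption.

```python
def suggested_gap_days(retention_interval_days, low=0.10, high=0.20):
    """Return the (shortest, longest) suggested gap between sessions, in days."""
    if retention_interval_days <= 0:
        raise ValueError("retention interval must be positive")
    return (round(retention_interval_days * low, 1),
            round(retention_interval_days * high, 1))


# Assumption: a "1-month" test is 30 days away.
shortest, longest = suggested_gap_days(30)
print(f"Study every {shortest} to {longest} days")
```

For a 30-day retention interval this gives the same 3 to 6 day range as the example in the text. The rule is a group-level average, so the result is best read as a starting point rather than a prescription.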

### Interleaving

In a study of college students learning to identify paintings by different artists (Kornell & Bjork, 2008), students who studied **interleaved** examples (mixing artists within a session) scored **65% correct** on a transfer test, compared to **45%** for students who studied **blocked** examples (all paintings by one artist, then the next). Despite this, 68% of students believed the blocked condition was more effective.

In a study of baseball players learning different types of pitches, interleaved batting practice produced **57% correct hits** in a game simulation, versus **25%** for blocked practice — a 32 percentage point advantage.

### Desirable difficulties

**Generation effect:** Students who tried to generate an answer before being told (e.g., "What is the capital of Peru?" before being shown "Lima") remembered **28% more** on a delayed test than students who simply read the fact (p < 0.01).

**Varying practice conditions:** Golfers who practiced putting from 10 different distances (3–15 feet) improved their putting accuracy by **22%** on a test from a novel distance, compared to **8%** for golfers who practiced only from 10 feet. The varied-practice group performed worse during training (fewer putts made) but dramatically better on the transfer test.

### Ineffective strategies

**Rereading:** Produces **negligible gains** beyond the first reading. In multiple studies, rereading a text 4 times versus 1 time improved delayed recall by only **5–10%**, while a single self-test after one reading improved recall by **40–50%**.

**Highlighting and underlining:** Does not improve performance on delayed tests compared to simply reading. One study found that students who highlighted remembered **no more** than students who did not highlight, and that highlighting actually **impaired** performance on questions requiring inference (p < 0.05).

**Cramming (massed practice):** Produces rapid forgetting. Students who crammed for an exam remembered **70%** the next day but only **20%** after 1 week, while spaced learners remembered **60%** after 1 week.

Effect magnitude

**In plain English:** Switching from rereading to self-testing roughly **doubles** the amount you remember after a week. Spacing your study sessions instead of cramming produces a **50–80% improvement** in long-term retention. Interleaving different topics rather than blocking them improves transfer to new problems by **30–50%**.

**Concrete example:** If you normally study for 3 hours the night before an exam (cramming) and remember 20% of the material a week later, switching to 3 one-hour sessions spaced 3 days apart would likely boost that to 50–60% retention — without any additional total study time.

**Cost of the effect:** These strategies feel harder. Students typically rate retrieval practice as **less effective** than rereading during the learning session, even though it produces dramatically better long-term results. This "effortful retrieval" is the mechanism — the difficulty signals that deep encoding is occurring.

Limitations

### What the authors acknowledge

**Initial performance suffers:** Retrieval practice, spacing, and interleaving all produce worse performance during training compared to massed or blocked practice. This makes them counterintuitive and hard to adopt.

**Optimal schedules are unknown:** The ideal spacing interval, number of retrieval attempts, and degree of interleaving likely vary by material, learner, and desired retention interval. There is no one-size-fits-all prescription.

**Individual differences exist:** Some learners may benefit more from these strategies than others, though the evidence does not yet identify who.

**Classroom implementation is challenging:** Teachers and students often resist these methods because they feel harder and produce worse short-term performance on quizzes.

### What a critical reader would note

**Most studies are short-term:** The longest follow-up in most experiments is 1–9 months. Whether these effects persist for years is less well-established.

**Laboratory vs. real-world gap:** Many studies used artificial materials (word lists, fictional facts) rather than authentic course content. Classroom studies exist but are fewer.

**Publication bias:** Studies showing null or negative effects of retrieval practice are less likely to be published. The true effect size may be smaller than reported.

**No blinding:** Participants know whether they are being tested or rereading, which could introduce demand characteristics or placebo effects.

**Self-report vs. objective measures:** Students' judgments of learning are often poorly calibrated, meaning they may abandon effective strategies because they feel less productive.

**Industry funding:** The book is published by a university press and the authors are academic researchers, so industry bias is minimal. However, the authors have a clear advocacy position for these strategies.

**Population limits:** Most studies used college students in Western countries. Whether these findings generalise to children, older adults, or non-Western populations is less certain, though the available evidence suggests they do.

Practical takeaways

For someone running their own n=1 experiment to improve learning:

### What to test

**Primary intervention:** Replace rereading with **retrieval practice** (self-testing). After reading a chapter or watching a lecture, close the book and try to recall the key points from memory. Then check your accuracy.

**Secondary intervention:** Implement **spaced practice** by scheduling review sessions at increasing intervals (1 day, 3 days, 1 week, 1 month after initial learning).

**Tertiary intervention:** Try **interleaving** by mixing practice of different topics within a single study session (e.g., alternate between algebra, geometry, and statistics problems rather than doing all algebra first).
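The spacing and interleaving interventions above can be sketched as a small study planner. The review offsets follow the intervals listed in the text, with one month treated as 30 days (an assumption). The function names, start date, and problem labels are hypothetical.

```python
from datetime import date, timedelta
from itertools import zip_longest

REVIEW_OFFSETS_DAYS = (1, 3, 7, 30)  # assumption: "1 month" is treated as 30 days


def review_dates(first_study, offsets=REVIEW_OFFSETS_DAYS):
    """Return review dates at increasing gaps after the first study session."""
    return [first_study + timedelta(days=d) for d in offsets]


def interleave(problems_by_topic):
    """Cycle through the topics in turn, taking one problem from each per round."""
    rounds = zip_longest(*problems_by_topic.values())
    # zip_longest pads shorter topics with None; drop those placeholders.
    return [p for rnd in rounds for p in rnd if p is not None]


schedule = review_dates(date(2024, 1, 1))
session = interleave({
    "algebra": ["A1", "A2", "A3"],
    "geometry": ["G1", "G2"],
    "statistics": ["S1", "S2", "S3"],
})
print([d.isoformat() for d in schedule])
# ['2024-01-02', '2024-01-04', '2024-01-08', '2024-01-31']
print(session)
# ['A1', 'G1', 'S1', 'A2', 'G2', 'S2', 'A3', 'S3']
```

Round-robin ordering is only one way to interleave; shuffling the combined problem set would also mix topics, at the cost of occasional same-topic runs.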

### Minimum meaningful duration

**For retrieval practice:** You will see measurable differences after **1 week** of daily self-testing versus rereading. The effect grows larger over longer periods (1–3 months).

**For spacing:** You need at least **3 study sessions** spread over **1–2 weeks** to see the spacing advantage over cramming.

**For interleaving:** A single session of interleaved practice can show effects on a **1-day delayed test**, but the advantage becomes clearer after **1 week**.

### What to measure

**Primary metric:** Performance on a **delayed recall test** (e.g., 1 week after the last study session). Because each condition covers different material, use tests of matched length and difficulty for both conditions.

**Secondary metric:** **Transfer test** — can you apply the knowledge to a problem you have never seen before? For example, if learning physics, can you solve a novel problem type?

**Process metric:** **Time spent studying** — track total study time to ensure any differences are not due to spending more or less time.

**Subjective metric:** Rate your **confidence** in your learning on a 1–10 scale immediately after studying, then compare it to your actual test score. This reveals the metacognitive illusion.
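One way to compare a confidence rating with a test score is to convert both to the same scale and subtract. The sketch below assumes a simple linear mapping from the 1–10 rating to a percentage. That mapping, the function name, and the log entries are illustrative assumptions, not data from the book.

```python
def calibration_gap(confidence_1_to_10, test_score_pct):
    """Predicted minus actual performance, in percentage points.

    Positive values indicate overconfidence; negative values indicate underconfidence.
    """
    if not 1 <= confidence_1_to_10 <= 10:
        raise ValueError("confidence must be between 1 and 10")
    predicted_pct = confidence_1_to_10 * 10  # assumption: rating 7 means "about 70%"
    return predicted_pct - test_score_pct


# Hypothetical log entries: (condition, confidence rating, delayed test score in %).
log = [("reread", 8, 45), ("self-test", 5, 70)]
gaps = {condition: calibration_gap(conf, score) for condition, conf, score in log}
print(gaps)  # {'reread': 35, 'self-test': -20}
```

In this made-up log, the rereading session shows a large positive gap (overconfidence), which is the pattern the metacognitive illusion predicts.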

### Key confounds to control for

**Total study time:** Keep it constant across conditions. If you normally study for 2 hours, spend 2 hours in both the rereading and self-testing conditions.

**Time of day:** Study at the same time of day for both conditions.

**Sleep:** Ensure you get similar sleep quality and quantity before the delayed test (sleep consolidates memory).

**Prior knowledge:** Use material you have never studied before, or counterbalance the order of conditions.

**Test format:** Use the same test format (multiple choice, short answer, or essay) for both conditions.

**Motivation:** Track your motivation and interest — if one condition feels more engaging, that could confound results.

### What a positive result would look like

**Retrieval practice:** You score **at least 30% higher** on a 1-week delayed test after self-testing compared to rereading, despite spending the same total study time and feeling less confident during learning.

**Spaced practice:** You score **at least 20% higher** on a 1-month delayed test after 3 spaced sessions compared to
