Effectiveness of L2 Instruction: A Research Synthesis and Quantitative Meta‐analysis
- Authors: John M. Norris, Lourdes Ortega
- Journal: Language Learning
- Year: 2000
- Citations: 2,382
TL;DR
Focused second‑language (L2) instruction produces large, durable gains compared with no instruction or natural exposure alone. Explicit teaching methods (e.g., rule explanations, corrective feedback) are roughly twice as effective as implicit methods (e.g., flooding input with target forms), and “Focus on Form” (drawing attention to grammar during communication) and “Focus on Forms” (isolated grammar lessons) work about equally well.
What they tested
The meta‑analysis compared the effectiveness of **L2 instruction** (any planned pedagogical intervention targeting specific linguistic features) against **no instruction** or **natural exposure only** (e.g., immersion without explicit teaching). The interventions were categorised along two dimensions, explicitness (explicit vs. implicit) and focus (Focus on Form vs. Focus on Forms), giving four labels:
**Explicit instruction:** Learners were told rules, given metalinguistic explanations, or received overt error correction. Examples: grammar‑translation lessons, rule‑presentation drills, explicit corrective feedback.
**Implicit instruction:** Learners were exposed to target forms without being told rules. Examples: input flooding (many examples of a structure), recasts (reformulating a learner’s erroneous utterance correctly without comment), or text enhancement (bolding or underlining target forms).
**Focus on Form (FonF):** Attention to linguistic form arises incidentally during communicative activities (e.g., a teacher briefly explains a grammar point during a conversation task).
**Focus on Forms (FonFs):** Instruction isolates linguistic forms for explicit practice (e.g., a dedicated lesson on past tense –ed endings, followed by drills).
Outcome measures included:
**Comprehension/recognition tests** (e.g., multiple‑choice grammar tests)
**Production tests** (e.g., written or spoken sentence completion, oral interviews)
**Metalinguistic judgment tests** (e.g., “Is this sentence correct? Why?”)
**Delayed post‑tests** (administered 2–12 weeks after instruction ended) to measure durability.
Who was studied
The meta‑analysis synthesised **49 unique sample studies** (published between 1980 and 1998) involving a total of **~2,300 learners** (exact total N not reported in the abstract, but individual studies ranged from ~15 to ~120 participants). Learners were:
Mostly university‑aged adults (18–25 years old)
Learning English as a second/foreign language (ESL/EFL) in classroom settings (USA, Japan, Spain, Canada, Netherlands, etc.)
A minority studied French, Spanish, German, or Japanese
Proficiency ranged from beginner to intermediate; no advanced learners were included
All were enrolled in formal language courses (not self‑study or immersion alone)
How they measured it
Effect sizes were calculated using **Cohen’s d** (standardised mean difference) for each study. The primary metric was the difference between instructed and uninstructed groups on post‑test scores, divided by the pooled standard deviation. For studies with multiple outcome measures, the authors extracted data separately for:
**Immediate post‑tests** (within 1 week of instruction)
**Delayed post‑tests** (2–12 weeks later)
**Type of outcome** (comprehension vs. production vs. metalinguistic knowledge)
They also coded each study for:
**Instructional explicitness** (explicit vs. implicit)
**Instructional focus** (FonF vs. FonFs)
**Duration of treatment** (ranged from 1 session to 16 weeks)
**Setting** (foreign language vs. second language context)
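As a rough illustration of the effect‑size calculation described in this section, the sketch below computes Cohen’s d from two groups of post‑test scores. The function name `cohens_d` and the scores are invented for illustration; they are not data from the meta‑analysis.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(treated, control):
    """Standardised mean difference: (mean_t - mean_c) / pooled SD."""
    n_t, n_c = len(treated), len(control)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n_t - 1) * stdev(treated) ** 2 + (n_c - 1) * stdev(control) ** 2
    ) / (n_t + n_c - 2)
    return (mean(treated) - mean(control)) / sqrt(pooled_var)


# Hypothetical post-test scores (percent correct), made up for this example.
instructed = [72, 65, 80, 70, 68, 75]
uninstructed = [60, 58, 66, 55, 63, 61]

d = cohens_d(instructed, uninstructed)
print(f"Cohen's d = {d:.2f}")
```

With small samples like these, d is a noisy estimate, which is one reason a synthesis pools many studies rather than relying on any single one.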
Methodology
**Design:** This is a **meta‑analysis** — a statistical synthesis of 49 independent experimental and quasi‑experimental studies. The authors followed systematic review protocols: they searched multiple databases (ERIC, LLBA, PsycINFO), hand‑searched 10 key journals, and contacted researchers for unpublished data. They included only studies that:
1. Had a control group (no instruction or natural exposure only)
2. Reported enough data to calculate effect sizes (means, SDs, t‑values, or F‑values)
3. Measured learning of a specific linguistic target (e.g., a grammatical structure, not general proficiency)
**Statistical approach:** They used a **random‑effects model** (which assumes true effects vary across studies) rather than a fixed‑effect model. This is appropriate because the studies differed in populations, settings, and outcome measures. They calculated:
Weighted mean effect sizes (d) with 95% confidence intervals
Homogeneity statistics (Q‑tests) to check whether effects varied more than expected by chance
Moderator analyses (ANOVA‑like comparisons) to test whether explicitness, focus type, or outcome measure predicted effect size
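The three quantities listed above can be sketched in a few lines. This example assumes the DerSimonian‑Laird estimator for the between‑study variance, which is one common way to fit a random‑effects model; the summary does not say which estimator the authors used. The function name and the per‑study values are hypothetical.

```python
from math import sqrt


def random_effects_summary(effects, variances):
    """Pool effect sizes with DerSimonian-Laird weights (an assumed estimator).

    Needs at least two studies; `variances` are the sampling variances of each d.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    fixed_mean = sum(wi * d for wi, d in zip(w, effects)) / sum(w)
    # Homogeneity statistic Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (d - fixed_mean) ** 2 for wi, d in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance, floored at zero
    w_re = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * d for wi, d in zip(w_re, effects)) / sum(w_re)
    se = sqrt(1.0 / sum(w_re))
    return {
        "d": pooled,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
        "Q": q,
        "tau2": tau2,
    }


# Hypothetical per-study effect sizes and sampling variances.
study_d = [0.4, 0.9, 1.3, 1.6]
study_var = [0.04, 0.05, 0.06, 0.08]

summary = random_effects_summary(study_d, study_var)
print(summary)
```

When Q is no larger than its degrees of freedom (k - 1), the between‑study variance is set to zero and the result reduces to the fixed‑effect estimate.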
**What this design can and cannot show:**
**Can show:** That, on average across many studies, instruction produces larger gains than no instruction. It can also show that explicit instruction tends to produce larger effects than implicit instruction, and that FonF and FonFs are similarly effective.
**Cannot show:** Causality beyond what each underlying study supports (it is a synthesis, not a new experiment). It cannot tell you which specific teaching technique works best for a given learner or linguistic feature, only general trends. It also cannot rule out publication bias (studies with null results may be missing), though the authors tested for this using a funnel plot and found no strong evidence of bias.
**Major methodological weaknesses:**
**Heterogeneity:** Even after grouping by explicitness, effect sizes varied widely (Q‑tests were significant), meaning that “explicit instruction” covers many different practices (e.g., rule presentation vs. error correction) that may work differently.
**Short durations:** Most studies lasted only 1–4 sessions; only 8 studies had treatments longer than 8 weeks. Durability findings rest on delayed post‑tests, which only 16 of the 49 studies included.
**Outcome measure bias:** Studies using metalinguistic judgment tests (e.g., “Is this correct?”) showed much larger effects than studies using free production tests (e.g., spontaneous speech). This suggests that instruction may boost explicit knowledge more than implicit, automatic use.
**Lack of replication:** Few studies directly replicated each other; the meta‑analysis had to combine studies with different targets (e.g., English articles, French verb tenses, Spanish clitics), which may not be comparable.
Key findings
**Primary outcome: Overall effectiveness of instruction**
Weighted mean effect size for instructed vs. uninstructed groups: **d = 1.02** (95% CI: 0.84–1.20, p < 0.001)
This is a **large effect** (Cohen’s convention: d = 0.8 is large). In plain terms, the average instructed learner scored about **1 standard deviation higher** than the average uninstructed learner on post‑tests.
**Explicit vs. implicit instruction**
Explicit instruction: **d = 1.13** (95% CI: 0.92–1.34, k = 30 studies)
Implicit instruction: **d = 0.54** (95% CI: 0.31–0.77, k = 19 studies)
The difference between explicit and implicit was statistically significant (Q‑between = 12.4, p < 0.001). Explicit instruction produced **roughly double the effect** of implicit instruction.
**Focus on Form vs. Focus on Forms**
FonF: **d = 1.00** (95% CI: 0.72–1.28, k = 18 studies)
FonFs: **d = 1.03** (95% CI: 0.80–1.26, k = 31 studies)
The difference was **not statistically significant** (Q‑between = 0.03, p = 0.86). Both approaches produced large, equivalent effects.
**Durability (delayed post‑tests)**
For studies that included delayed post‑tests (2–12 weeks later), the mean effect remained large: **d = 0.91** (95% CI: 0.68–1.14, k = 16 studies)
This suggests that gains from instruction are largely retained, at least over a few weeks.
**Outcome measure type**
Metalinguistic judgment tests: **d = 1.43** (largest effects)
Controlled production tests (e.g., sentence completion): **d = 1.02**
Free production tests (e.g., oral narratives): **d = 0.62**
The difference between metalinguistic and free production was significant (Q‑between = 8.9, p = 0.003). Instruction boosted explicit knowledge more than spontaneous use.
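The p‑values attached to the Q‑between statistics can be checked by hand. The sketch below assumes each Q‑between is referred to a chi‑square distribution with 1 degree of freedom (two groups per comparison), which is the usual convention but is not stated in this summary. The function name is hypothetical.

```python
from math import erfc, sqrt


def p_from_q_df1(q):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # For df = 1, P(chi2 > q) = P(|Z| > sqrt(q)) = erfc(sqrt(q / 2)).
    return erfc(sqrt(q / 2.0))


# Q-between values as reported in the findings above.
reported_q = {
    "explicit vs. implicit": 12.4,
    "FonF vs. FonFs": 0.03,
    "metalinguistic vs. free production": 8.9,
}

p_values = {label: p_from_q_df1(q) for label, q in reported_q.items()}
for label, p in p_values.items():
    print(f"{label}: p = {p:.4f}")
```

Under that assumption, the computed values agree with the reported p < 0.001, p = 0.86, and p = 0.003.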
**Duration of treatment**
Studies with 1–4 sessions: **d = 1.10**
Studies with 5–16 sessions: **d = 0.95**
The difference was **not significant** (p = 0.32). Even short bursts of instruction were effective.
Effect magnitude
**Overall:** The average instructed learner scored at the **84th percentile** of the uninstructed group (assuming normal distribution). That is, if you take 100 learners who receive no instruction, the average instructed learner would outperform 84 of them.
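**Explicit vs. implicit:** Relative to uninstructed controls, the average explicit‑instruction learner would outperform about **87%** of them (d = 1.13), against about 71% for the average implicit‑instruction learner (d = 0.54). Comparing the two conditions indirectly, the gap of roughly 0.59 SD would put the average explicit learner at about the 72nd percentile of implicit learners, assuming the control groups are comparable.
**Durability:** After 2–12 weeks without instruction, the effect was only slightly smaller (d = 0.91 vs. d = 1.02 immediately after), so about **89%** of the initial effect was retained (0.91 / 1.02). The delayed estimate comes only from the 16 studies that ran delayed post‑tests, so the comparison is approximate.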
**Outcome type:** On free production tests, the effect was still moderate (d = 0.62), meaning the average instructed learner outperformed about **73%** of uninstructed learners in spontaneous use — but this is notably smaller than the effect on metalinguistic tests (d = 1.43, outperforming 92% of controls).
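The percentile figures in this section come from converting d into the share of the control distribution that falls below the average treated learner (sometimes called Cohen’s U3). A minimal sketch, assuming both groups are normally distributed with equal variance; the function and dictionary names are hypothetical:

```python
from statistics import NormalDist


def percentile_of_controls(d):
    """Share of the control group scoring below the average treated learner."""
    return NormalDist().cdf(d)


# Effect sizes as reported in the key findings; each is relative to uninstructed controls.
reported_d = {
    "overall": 1.02,
    "explicit": 1.13,
    "implicit": 0.54,
    "free production": 0.62,
    "metalinguistic": 1.43,
}

percentiles = {name: percentile_of_controls(d) for name, d in reported_d.items()}
for name, share in percentiles.items():
    print(f"{name}: d = {reported_d[name]:.2f} -> {share:.1%} of controls outperformed")
```

Each conversion is relative to the comparison group used to compute that d, so the explicit and implicit figures both describe standing against uninstructed learners, not against each other.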
Limitations
**Acknowledged by authors:**
**Operationalisation problems:** “Explicit” and “implicit” instruction were defined inconsistently across studies. Some studies labelled recasts as “implicit” but others as “explicit.” This weakens the comparison.
**Lack of replication:** Few studies directly replicated each other, so the meta‑analysis combines apples and oranges (different languages, targets, learners).
**Publication bias:** Although funnel‑plot tests were non‑significant, the authors note that null‑result studies are less likely to be published, which could inflate effect sizes.
**Short‑term focus:** Most studies measured learning immediately after instruction; only 16 of 49 studies included delayed post‑tests, and none followed learners beyond 12 weeks.
**Learner characteristics:** No studies examined individual differences (e.g., aptitude, motivation, age). The findings may not generalise to children or older adults.
**Critical reader notes:**
**Ecological validity:** Most studies were conducted in university classrooms with researcher‑designed tests. Real‑world language use (e.g., ordering food, making friends) was rarely measured.
**Control group contamination:** In many quasi‑experimental studies, control groups were also receiving some instruction (just not the target form). The “no instruction” condition was often just “business‑as‑usual teaching,” which may have included incidental exposure to the target.
**Effect size inflation:** Studies using metalinguistic tests (d = 1.43) may overestimate real‑world gains. If you only measure what you taught (e.g., a rule you explained), you’re likely to find a large effect.
**No dose‑response analysis:** The meta‑analysis did not test whether more instruction (e.g., 10 sessions vs. 2 sessions) produced larger effects — only a binary short/long comparison, which was non‑significant.
Practical takeaways
For someone running their own n=1 experiment to improve their L2 learning:
### What to test
**Explicit rule learning + corrective feedback** (e.g., study a grammar rule for 10 minutes, then practice with a partner who corrects your errors) vs. **implicit exposure only** (e.g., read a text with many examples of the rule, but no explanation or correction).
Or compare **Focus on Form** (e.g., during a conversation, pause to look up a grammar point) vs. **Focus on Forms** (e.g., a dedicated 20‑minute grammar drill before speaking).
### Minimum meaningful duration
**At least 2 weeks** of daily practice (10–14 sessions, each 15–30 minutes). The meta‑analysis found that even 1–4 sessions produced large effects, but for durable learning, aim for 2+ weeks.
Include a **delayed post‑test** at 2–4 weeks after stopping instruction to measure retention.
### What to measure
**Primary metric:** Score on a **controlled production test** (e.g., write 10 sentences using the target structure). This is the most common outcome in the meta‑analysis and gives a d ~1.0.
**Secondary metric:** Score on a **free production test** (e.g., record a 3‑minute monologue on a topic that requires the target structure). This is harder but more ecologically valid (d ~0.6).
**Optional:** A **metalinguistic judgment test** (e.g., “Is this sentence correct? Why?”) to measure explicit knowledge (d ~1.4).
Measure **before** the intervention (baseline), **immediately after**, and **2–4 weeks later**.
### Key confounds to control for
**Time on task:** Ensure both conditions get the same total exposure time to the target language (e.g., 30 minutes per day). If explicit instruction takes 10 minutes of rule study + 20 minutes of practice, the implicit condition should also get 30 minutes of input.
**Order effects:** If you’re comparing two methods, use a **crossover design** (e.g., 2 weeks of Method A, then 2 weeks of Method B, with a 1‑week washout). Randomise which method comes first.
**Target difficulty:** Choose a single, well‑defined linguistic feature (e.g., English past tense –ed, French passé composé, Spanish preterite vs. imperfect). Don’t compare instruction on easy vs. hard structures.
**Learner fatigue:** Keep sessions short (15–30 minutes) to avoid burnout. Track your energy/motivation daily (1–10 scale) and check if it correlates with outcomes.
**Practice effects:** Use parallel test forms (different sentences but same structure) for pre‑, post‑, and delayed tests to avoid memorisation.
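One way to handle the order‑effects point is to let a random number generator pick which method comes first and then write out the schedule. The sketch below is only an illustration: the function name, the default method labels, and the week lengths are assumptions taken from the example figures in the list above.

```python
import random


def crossover_schedule(methods=("explicit", "implicit"), weeks_per_method=2,
                       washout_weeks=1, seed=None):
    """Return a week-by-week plan with the method order chosen at random."""
    rng = random.Random(seed)  # pass a seed only if you want a reproducible draw
    order = list(methods)
    rng.shuffle(order)
    plan = []
    for i, method in enumerate(order):
        plan += [method] * weeks_per_method
        if i < len(order) - 1:
            # Washout goes between methods, never after the last one.
            plan += ["washout"] * washout_weeks
    return plan


plan = crossover_schedule(seed=2000)
for week, activity in enumerate(plan, start=1):
    print(f"Week {week}: {activity}")
```

Drawing the order before you start, and writing it down, keeps you from unconsciously scheduling the method you prefer for the weeks when you expect to have more energy.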
### What a positive result would look like
**Immediate post‑test:** Your score improves by **at least 1 standard deviation**. With a single learner you have to assume the SD; if scores on this kind of test typically spread by about 30 percentage points, that means going from 40% to 70% correct on a controlled production test. This matches the meta‑analytic d = 1.02.
**Delayed post‑test:** Your score drops by **no more than about 10 percentage points** (e.g., from 70% to 60–65% correct). This would indicate durable learning (d ~0.9).
**Free production test:** A smaller but still noticeable improvement — e.g., from using the target structure correctly 20% of the time to 40% of the time in a monologue (d ~0.6).
**Comparison between methods:** If you test explicit vs. implicit, expect explicit to produce **roughly double the gain** (e.g., explicit: +30 percentage points; implicit: +15 percentage points). If you test FonF vs. FonFs, expect **similar gains** for both.
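To compare your own scores against these benchmarks, it helps to express the gain in SD units and to compute how much of it survives to the delayed test. The sketch below is a hypothetical helper, not something from the paper: `assumed_sd` is a value you must supply yourself (for example, the spread of scores across your parallel test forms), and the example scores are made up.

```python
def summarise_scores(pre, post, delayed, assumed_sd):
    """Gain in SD units and share of the gain still present at the delayed test."""
    gain = post - pre
    # Retention is undefined if there was no gain to retain.
    retained = (delayed - pre) / gain if gain > 0 else None
    return {
        "gain_points": gain,
        "gain_in_sd": gain / assumed_sd,
        "retained_share": retained,
    }


# Hypothetical scores in percent correct; assumed_sd = 30 is an assumption.
result = summarise_scores(pre=40, post=70, delayed=62, assumed_sd=30)
print(result)
```

Because the SD is assumed, the SD‑unit gain is only a rough yardstick for a single learner; the raw point gain and the retained share are the more trustworthy numbers.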
**Bottom line:** For your self‑experiment, the strongest evidence supports **explicit instruction with corrective feedback** — study the rule, practice using it, and get someone to correct your errors. Do this for at least 2 weeks, measure controlled production before and after, and check again 2–4 weeks later. If you see a 30‑point