
The Effectiveness of Second Language Pronunciation Instruction: A Meta-Analysis

Authors: Junkyu Lee, Juhyun Jang, Luke Plonsky
Journal: Applied Linguistics
Year: 2014
DOI: 10.1093/applin/amu040
Citations: 414

TL;DR

Pronunciation instruction produces a large overall effect on second language learners' pronunciation accuracy (Cohen's d = 0.89 within groups, d = 0.80 between groups). Longer interventions, those that include feedback, and those assessed with controlled outcome measures showed the largest gains. If you want to improve your accent in a new language, the evidence supports structured practice with corrective feedback over several weeks. Note that the studies compared instruction against no instruction, not against casual exposure.

What they tested

This is a meta-analysis, meaning the authors did not run a single experiment. Instead, they systematically collected and statistically combined the results of 86 separate experimental and quasi-experimental studies that tested the effectiveness of pronunciation instruction (PI) for second language learners.

**The intervention (what was tested):** Pronunciation instruction — any explicit teaching or training designed to improve how learners produce the sounds, stress, rhythm, and intonation of a second language. This included:

Explicit phonetic training (e.g., teaching tongue placement for specific sounds)

Perception-based training (e.g., listening discrimination tasks)

Production-based practice (e.g., repetition drills, reading aloud)

Computer-assisted pronunciation training (e.g., software with speech recognition)

Instruction that included corrective feedback (e.g., teacher or peer correction of mispronunciations)

Instruction that did not include feedback (e.g., simple exposure or self-study without correction)

**Comparators:** Each primary study compared a group that received pronunciation instruction against either:

A control group that received no instruction (between-group comparisons), or

A pre-test score from the same group before instruction began (within-group comparisons, i.e., pre-post designs)

**Outcome measures:** Pronunciation accuracy was measured in several ways:

**Controlled measures:** Reading aloud word lists, sentences, or passages — tasks where the learner knows exactly what to say

**Spontaneous measures:** Free speech tasks like describing a picture, telling a story, or engaging in conversation — tasks where the learner must generate language in real time

**Global ratings:** Holistic judgments by native or trained raters (e.g., "How accented does this speaker sound on a 1–9 scale?")

**Acoustic measures:** Physical properties of speech such as voice onset time, formant frequencies, or vowel duration measured by software

**Why this matters for a self-experimenter:** The meta-analysis tells you what kind of practice works best, for how long, and under what conditions — so you can design your own pronunciation training regimen based on evidence rather than guesswork.

Who was studied

The meta-analysis included 86 primary studies published between 1982 and 2013. The total sample across all studies was approximately 3,200 learners (the abstract does not report an exact N; the paper gives a median sample size of 32 participants per study).

**Population characteristics:**

**Age:** Mostly university-aged adults (18–30 years old), though a few studies included children or older adults

**Language backgrounds:** Learners of English as a second/foreign language (the vast majority), plus a smaller number of studies on learners of French, Spanish, Japanese, Korean, and Mandarin

**Proficiency levels:** Ranged from beginner to advanced, with most studies focusing on intermediate learners

**First languages:** Highly diverse — included speakers of Chinese, Japanese, Korean, Spanish, Arabic, German, French, and many others

**Setting:** Classroom-based instruction in university language programs (most common), plus some laboratory-based training studies and a few self-study computer programs

**Duration of instruction:** Ranged from a single 20-minute session to a full academic semester (approximately 16 weeks), with a median intervention length of about 4–6 weeks

**What this means for generalisability:** The findings are most applicable to motivated adult learners in structured educational settings. If you are a self-directed learner outside a classroom, the results are likely still relevant, but only a few of the included studies examined self-study, so you will need to adapt the methods to your own context.

How they measured it

The meta-analysis did not use a single instrument — it aggregated effect sizes from many different measurement tools used across the primary studies. The key metric was **Cohen's d**, a standardised measure of effect size that expresses the difference between groups (or pre-post) in standard deviation units.

**Effect size calculation:**

For between-group designs: d = (mean of instruction group − mean of control group) / pooled standard deviation

For within-group designs: d = (post-test mean − pre-test mean) / pre-test standard deviation (adjusted for correlation between pre and post)
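The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the example numbers are hypothetical, and the within-group version uses only the pre-test standard deviation because the summary does not specify how the pre-post correlation adjustment was applied.

```python
import math

def cohens_d_between(mean_treat, mean_ctrl, sd_treat, sd_ctrl, n_treat, n_ctrl):
    """Between-group d: mean difference divided by the pooled standard deviation."""
    pooled_var = (
        (n_treat - 1) * sd_treat ** 2 + (n_ctrl - 1) * sd_ctrl ** 2
    ) / (n_treat + n_ctrl - 2)
    return (mean_treat - mean_ctrl) / math.sqrt(pooled_var)

def cohens_d_within(mean_pre, mean_post, sd_pre):
    """Within-group d: pre-to-post gain divided by the pre-test standard deviation.

    The correlation adjustment mentioned in the text is omitted here because
    its exact form is not given in this summary.
    """
    return (mean_post - mean_pre) / sd_pre

# Hypothetical study: accent ratings on a 1-7 scale, 16 learners per group.
d_between = cohens_d_between(5.2, 4.0, 1.5, 1.5, 16, 16)
d_within = cohens_d_within(4.0, 5.2, 1.5)
print(round(d_between, 2), round(d_within, 2))  # 0.8 0.8
```

With equal group standard deviations the pooled standard deviation equals either group's, which is why both hypothetical calculations give the same value.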

**Moderator variables coded (to explain variance in effects):**

**Length of intervention:** Coded as total hours of instruction (e.g., < 1 hour, 1–5 hours, 5–10 hours, > 10 hours)

**Type of outcome measure:** Controlled vs. spontaneous (as described above)

**Presence of feedback:** Whether the instruction included explicit corrective feedback (yes/no)

**Setting:** Classroom vs. laboratory

**Target of instruction:** Segmental (individual sounds like /r/ vs. /l/) vs. suprasegmental (stress, rhythm, intonation)

**Learner proficiency:** Beginner, intermediate, advanced

**Study design quality:** Randomised vs. non-randomised, presence of control group, blinding of raters

**Why the measurement approach matters:** By using Cohen's d, the authors could combine results from studies that used completely different outcome measures (e.g., acoustic analysis of vowel formants vs. native-speaker ratings of accent). This is both a strength and a weakness — it allows synthesis across diverse studies, but it also means the "effect" is an abstract statistical construct rather than a concrete, real-world unit like "percentage of correctly produced sounds."

Methodology

**Study design:** This is a meta-analysis — a statistical synthesis of existing experimental and quasi-experimental studies. The authors followed standard meta-analytic procedures as outlined by Lipsey and Wilson (2001) and the PRISMA guidelines for systematic reviews.

**Search strategy:**

Searched 10 electronic databases (e.g., ERIC, Linguistics and Language Behavior Abstracts, PsycINFO, ProQuest Dissertations)

Hand-searched 15 relevant journals (e.g., Applied Linguistics, Studies in Second Language Acquisition, Language Learning)

Examined reference lists of previous reviews and retrieved articles

Contacted researchers in the field for unpublished studies

Inclusion criteria: (1) tested the effects of pronunciation instruction, (2) reported sufficient data to calculate an effect size, (3) published in English, (4) involved second/foreign language learners

This yielded 86 unique reports (journal articles, dissertations, book chapters) published between 1982 and 2013

**Coding procedure:**

Each study was coded by two independent coders for substantive features (e.g., intervention type, outcome measure, learner characteristics) and methodological features (e.g., design quality, sample size, attrition)

Inter-coder reliability was assessed using Cohen's kappa, with values ranging from 0.78 to 0.96 across different coding categories (acceptable to excellent)

Disagreements were resolved through discussion
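For readers unfamiliar with Cohen's kappa, the sketch below shows how it corrects raw agreement between two coders for the agreement expected by chance. The coder labels are made-up data for a single yes/no category (for example, "feedback present"), and the function name is hypothetical.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labelled the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal proportions, summed over labels.
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical codes for "feedback present?" across 10 studies.
coder_a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
coder_b = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 2))  # 0.78
```

In this made-up example the coders agree on 9 of 10 studies (90%), but kappa is lower (about 0.78) because more than half of that agreement would be expected by chance.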

**Statistical analysis:**

Effect sizes were calculated using Cohen's d with correction for small sample bias (Hedges' g)

Random-effects models were used (rather than fixed-effects) because the studies varied in their populations, interventions, and settings — meaning the true effect likely varies across studies

Heterogeneity was assessed using the Q-statistic and I² (percentage of variance attributable to real differences between studies rather than sampling error)

Moderator analyses were conducted using analogue-to-ANOVA (for categorical moderators) and meta-regression (for continuous moderators)

Publication bias was assessed using funnel plots and Egger's regression test
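The pooling steps listed above can be illustrated with a short sketch. It assumes the common DerSimonian-Laird estimator for the between-study variance; the summary says only that random-effects models were used, so the specific estimator is an assumption, as are the function names and the three example studies.

```python
import math

def hedges_g(d, n1, n2):
    """Apply the usual small-sample correction factor to Cohen's d."""
    df = n1 + n2 - 2
    return d * (1 - 3 / (4 * df - 1))

def random_effects(effects, variances):
    """DerSimonian-Laird pooling: returns (pooled, se, tau2, Q, I2 in percent)."""
    weights = [1 / v for v in variances]
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q: weighted squared deviations of each study from the fixed-effect mean.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c)  # between-study variance, floored at zero
    # Random-effects weights add tau2 to each study's sampling variance.
    re_weights = [1 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se = math.sqrt(1 / sum(re_weights))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, se, tau2, q, i2

# Hypothetical effect sizes and sampling variances for three studies.
pooled, se, tau2, q, i2 = random_effects([0.4, 0.8, 1.2], [0.04, 0.04, 0.04])
print(round(pooled, 2), round(tau2, 2), round(q, 1), round(i2))  # 0.8 0.12 8.0 75
print(round(hedges_g(0.8, 16, 16), 2))  # 0.78
```

Here I² of 75% means that three quarters of the observed variation between the hypothetical studies is attributed to real differences rather than sampling error.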

**What this design can prove:**

The overall average effect of pronunciation instruction across many different contexts

Which features of instruction (duration, feedback, outcome type) are associated with larger or smaller effects

The degree of variability in effects across studies

**What this design cannot prove:**

Causal claims about moderators: the moderator analyses are correlational, and the pool mixes experimental with quasi-experimental studies, so you cannot definitively say "feedback causes larger effects" (studies that provide feedback may also differ in other ways, such as having more motivated learners or better teachers)

Individual-level predictions — the results tell you about average effects across groups, not what will happen for any single person

The optimal dose for a specific learner — the meta-analysis can tell you that longer interventions produce larger effects on average, but it cannot tell you the exact number of hours you personally need

**Major methodological strengths:**

Comprehensive search including unpublished studies (reduces publication bias)

Double-coding with good reliability

Use of random-effects models (appropriate for heterogeneous data)

Examination of multiple moderators

**Major methodological weaknesses:**

Most primary studies were small (median N = 32) and many lacked random assignment

Few studies included delayed post-tests to measure long-term retention

The quality of pronunciation instruction varied enormously across studies — some used well-designed curricula, others used ad-hoc materials

Publication bias was detected (funnel plot asymmetry), meaning the true effect may be somewhat smaller than reported

The meta-analysis is now over a decade old (published 2014), and more recent studies may have different findings

Key findings

**Overall effect of pronunciation instruction:**

**Between-group comparisons** (instruction vs. no instruction): Overall weighted mean effect size d = 0.80 (95% CI: 0.68–0.92), p < 0.001

**Within-group comparisons** (pre-test vs. post-test): Overall weighted mean effect size d = 0.89 (95% CI: 0.78–1.00), p < 0.001

Both are considered "large" effects by conventional benchmarks (Cohen's d: 0.2 = small, 0.5 = medium, 0.8 = large)

**Moderator analyses (what made instruction more or less effective):**

**Length of intervention:** Longer interventions produced larger effects. Studies with > 10 hours of instruction showed d = 1.12, compared to d = 0.62 for studies with < 5 hours. This difference was statistically significant (Q = 8.42, p < 0.01)

**Presence of feedback:** Instruction that included explicit corrective feedback produced larger effects (d = 1.02) than instruction without feedback (d = 0.68). This difference was significant (Q = 6.91, p < 0.01)

**Type of outcome measure:** Controlled measures (reading aloud) showed larger effects (d = 1.04) than spontaneous measures (free speech) (d = 0.64). This difference was significant (Q = 12.34, p < 0.001)

**Target of instruction:** Suprasegmental instruction (stress, rhythm, intonation) showed slightly larger effects (d = 0.94) than segmental instruction (individual sounds) (d = 0.78), but this difference was not statistically significant (Q = 1.89, p = 0.17)

**Setting:** Laboratory studies showed larger effects (d = 1.08) than classroom studies (d = 0.74), but this may reflect tighter control and more intensive practice in lab settings rather than a true advantage of the setting itself

**Learner proficiency:** No significant difference between beginners (d = 0.84), intermediate (d = 0.88), and advanced learners (d = 0.76) — instruction helped at all levels

**Study design quality:** Randomised studies showed slightly smaller effects (d = 0.76) than non-randomised studies (d = 0.92), suggesting that better-controlled studies find more modest effects
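As a rough consistency check, the reported Q values can be converted to p-values. The sketch below assumes each test compares two subgroups, so Q is referred to a chi-square distribution with 1 degree of freedom; the summary does not state the degrees of freedom, so that is an assumption, and the function name is hypothetical.

```python
import math

def p_value_chi2_df1(q):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # For df = 1, P(X > q) equals erfc(sqrt(q / 2)).
    return math.erfc(math.sqrt(q / 2))

# Q values as reported in the moderator analyses above.
reported_q = {
    "length of intervention": 8.42,
    "presence of feedback": 6.91,
    "type of outcome measure": 12.34,
    "target of instruction": 1.89,
}
p_values = {name: p_value_chi2_df1(q) for name, q in reported_q.items()}
for name, p in p_values.items():
    print(f"{name}: p = {p:.4f}")
```

Under that assumption the computed p-values agree with the reported thresholds: below 0.01 for length and feedback, below 0.001 for outcome measure, and about 0.17 for target of instruction.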

**Heterogeneity:**

The Q-statistic was significant (p < 0.001) and I² was approximately 65%, indicating moderate-to-high heterogeneity — meaning the effects varied substantially across studies, and the moderator analyses only partially explained this variation

**Publication bias:**

Funnel plot analysis showed asymmetry, with a slight over-representation of large positive effects in published studies

The trim-and-fill method (which estimates what the effect would be if missing studies were included) suggested the true between-group effect might be d = 0.72 rather than d = 0.80 — still a medium-to-large effect

Effect magnitude

**In plain English:**

The average learner who receives pronunciation instruction scores about 0.8 standard deviations higher on pronunciation accuracy than the average learner who does not receive instruction

To put this in more concrete terms: if you imagine 100 learners who receive instruction and 100 who do not, the average instructed learner would outperform about 79% of the uninstructed learners

Alternatively, if pronunciation accuracy is measured on a 1–7 scale (where 7 is native-like), and the standard deviation is about 1.5 points, then an effect of d = 0.80 translates to roughly a 1.2-point improvement — the difference between "clearly foreign-accented but mostly intelligible" and "near-native with only occasional errors"

For the within-group comparison (d = 0.89), the average learner improves by about 0.89 standard deviations from pre-test to post-test, or roughly 1.3 points on the hypothetical 1–7 scale above (0.89 × 1.5), which is a substantial gain
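The conversions in this section can be reproduced with the standard normal distribution. The sketch assumes that scores in both groups are normally distributed with equal spread, and it reuses the hypothetical 1–7 scale with a standard deviation of 1.5; the function names are illustrative.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def percent_outperformed(d):
    """Share of the comparison group scoring below the average instructed learner."""
    return 100 * normal_cdf(d)

def raw_gain(d, sd):
    """Convert a standardised effect into raw scale points."""
    return d * sd

print(round(percent_outperformed(0.80)))  # 79
print(round(raw_gain(0.80, 1.5), 1))      # 1.2
print(round(raw_gain(0.89, 1.5), 1))      # 1.3
```

The same functions applied to the trim-and-fill estimate (d = 0.72) give about 76%, which shows how little the publication-bias adjustment changes the practical picture.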

**What the moderator effects mean in practical terms:**

**Longer vs. shorter instruction:** The effect for > 10 hours (d = 1.12) is about 1.8 times the effect for < 5 hours (d = 0.62). If you practice for 10+ hours spread over several weeks, you can expect, on average, nearly twice the improvement of practicing for less than 5 hours

**With vs. without feedback:** The difference between feedback (d = 1.02) and no feedback (d = 0.68) means that getting corrective feedback adds about 50% more improvement compared to practice alone

**Controlled vs. spontaneous measures:** The gap between controlled (d = 1.04) and spontaneous (d = 0.64) measures means that you will show much larger improvements when reading aloud than when speaking freely — and that real-world conversational gains will be more modest
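The ratios behind these comparisons are simple divisions of the reported effect sizes. The sketch below makes the arithmetic explicit; the dictionary labels and function name are illustrative, and the d values are the ones reported above.

```python
def effect_ratio(larger_d, smaller_d):
    """How many times larger one standardised effect is than another."""
    return larger_d / smaller_d

# (larger d, smaller d) pairs taken from the moderator analyses.
comparisons = {
    "length (> 10 h vs. < 5 h)": (1.12, 0.62),
    "feedback (with vs. without)": (1.02, 0.68),
    "measure (controlled vs. spontaneous)": (1.04, 0.64),
}
ratios = {label: effect_ratio(hi, lo) for label, (hi, lo) in comparisons.items()}
for label, ratio in ratios.items():
    print(f"{label}: about {ratio:.1f} times")
```

These ratios describe differences between groups of studies, so they indicate which conditions tend to go with larger gains rather than a guaranteed multiplier for any one learner.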

Limitations

**Acknowledged by the authors:**

The meta-analysis could not control for the quality of instruction across studies — some programs were well-designed, others were not

Few studies included delayed post-tests, so long-term retention of pronunciation gains is unclear

The search was limited to studies published in English, potentially missing relevant research in other languages

Publication bias was detected, suggesting the true effect may be somewhat smaller than reported

The number of studies for some moderator analyses was small (e.g., only 8 studies examined suprasegmental instruction specifically)

**Additional limitations a critical reader would note:**

**Age of the meta-analysis:** Published in 2014, the search likely ended in 2012–2013. Technology for pronunciation training (especially apps and AI-based feedback) has advanced considerably since then

**Lack of individual difference analysis:** The meta-analysis did not examine how factors like musical ability, working memory, or motivation might moderate effects — all of which are known to influence pronunciation learning

**Mostly English as a target language:** The findings may not generalise equally to learning pronunciation in languages with very different sound systems (e.g., tonal languages like Mandarin or Thai)

**Self-selection bias in primary studies:** Learners who volunteer for pronunciation studies may be more motivated than the average language learner

**Rater bias:** Many studies used native-speaker raters who were not blind to condition, potentially inflating effects

**No analysis of optimal feedback type:** The meta-analysis coded whether feedback was present or absent, but did not distinguish between different types of feedback (e.g., explicit correction vs. recasts vs. metalinguistic explanation)

**Small sample sizes in primary studies:** With a median N of 32, many individual studies were underpowered to detect moderate effects. Combined with publication bias, this can inflate the pooled effect size, because small studies tend to be published only when they happen to find large effects (the "winner's curse")

Practical takeaways

For someone running their own n=1 experiment to improve pronunciation in a second language:

### What to test (specific intervention and dose)

**Intervention:** Structured pronunciation practice with explicit corrective feedback. Use a combination of:

- **Per
