
Replication in Second Language Research: Narrative and Systematic Reviews and Recommendations for the Field

Authors: Emma Marsden, Kara Morgan‐Short, Sophie Thompson‐Lee, David Abugaber
Journal: Language Learning
Year: 2018
DOI: 10.1111/lang.12286
Citations: 269

TL;DR

Only about 1 in every 400 second language research articles is a replication study, and the average replication takes over 6 years to appear — meaning most findings in the field have never been independently verified, which matters for anyone trying to apply language learning techniques to their own practice.

What they tested

This is a systematic review of replication studies themselves — not a single experiment. The researchers examined every self-labeled replication study published in second language (L2) research across 26 journals, coding them for 136 different characteristics. They wanted to know:

How many replication studies exist relative to original studies

How long it takes for a replication to appear after the original

Whether replications tend to support or contradict original findings

What factors (like shared authorship or availability of materials) predict whether a replication will confirm the original results

How much change is introduced when researchers attempt a replication

The "intervention" here is the act of replication itself — the researchers were testing the health of the scientific literature, not a specific language learning technique.

Who was studied

The study examined **67 self-labeled replication articles** published across **26 different journals** in second language research. These articles were drawn from the entire published literature, with no date restrictions applied to the initial search (though the replication studies themselves were published across multiple decades). The researchers also examined the **original studies that were being replicated** — meaning the sample includes both the replication articles and their corresponding initial studies.

The population of interest is not human subjects but rather **published research articles** in the field of second language acquisition. However, the original studies being replicated involved a wide range of human participants — typically university-level language learners, though the systematic review does not aggregate participant demographics across studies.

How they measured it

The researchers developed a **136-item coding scheme** to characterize each replication study. Key variables included:

**Replication type** (direct/exact vs. conceptual/approximate vs. partial)

**Authorship overlap** (whether any authors of the replication also authored the original)

**Time lag** (years between original and replication publication)

**Citation count** of the original study before replication

**Availability of original materials** (whether the initial study's stimuli, instruments, or protocols were accessible)

**Number and nature of changes** made from the original to the replication

**Outcome** (whether findings supported, partially supported, or contradicted the original)

**Journal impact metrics** and citation rates for replication studies

They also calculated a **replication rate** by estimating the total number of L2 research articles published during the same period and comparing it to the number of replications.

Methodology

### Study Design

This is a **systematic review with meta-synthesis** — meaning the researchers systematically identified, screened, and coded all eligible studies according to pre-registered criteria, then synthesized the results narratively and quantitatively. They did not perform a formal meta-analysis (pooling effect sizes across studies) because the studies were too heterogeneous in design and outcome measures.

### Search and Screening

The researchers searched multiple databases (including ERIC, Linguistics and Language Behavior Abstracts, and Web of Science) using terms related to replication (e.g., "replication," "replicate," "reproduce"). They also hand-searched key journals. Inclusion criteria required that the article explicitly self-identified as a replication study in its title, abstract, or keywords. This yielded 67 articles from an initial pool of thousands.

### Coding and Reliability

Two coders independently coded each study using the 136-item scheme. Inter-rater reliability was assessed and reported as acceptable (Cohen's kappa values ranged from 0.70 to 1.00 across different coding categories). Disagreements were resolved through discussion.
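Cohen's kappa corrects the raw agreement between two coders for the agreement expected by chance. The following is a minimal sketch of that calculation, assuming two hypothetical coders who each assign one categorical label per study; the function name `cohens_kappa` and the labels are invented for illustration and are not data from the review.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labelled the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label proportions.
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical outcome labels for ten studies (illustrative only).
coder_a = ["supported", "supported", "partial", "contradicted", "supported",
           "partial", "contradicted", "supported", "partial", "supported"]
coder_b = ["supported", "supported", "partial", "contradicted", "partial",
           "partial", "contradicted", "supported", "supported", "supported"]

kappa = cohens_kappa(coder_a, coder_b)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the coders agree on 8 of 10 items, but kappa is lower than 0.80 because some of that agreement would be expected by chance alone.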

### What This Design Can and Cannot Prove

**What it can prove:**

The **prevalence** of replication studies in L2 research (how many exist relative to original studies)

The **typical characteristics** of those replications (time lag, authorship overlap, types of changes made)

**Correlational patterns** — e.g., whether studies with shared authorship are more likely to support original findings

**What it cannot prove:**

**Causality** — the design cannot tell us *why* replication rates are low, only that they are

**The true replication failure rate** — because the sample only includes *published* replications, it misses unpublished replication attempts (which are likely more negative)

**Generalizability to all L2 research** — the sample is limited to studies that explicitly labeled themselves as replications, which may miss replications that didn't use the term

### Major Methodological Strengths

Pre-registered protocol and coding scheme

Double-coding with reliability checks

Comprehensive search across multiple databases and journals

Open data and materials (available on IRIS repository)

### Major Methodological Weaknesses

**Publication bias** — only published studies were included; failed replications that were never submitted or were rejected are invisible

**Self-labeling bias** — studies that replicated prior work without using the term "replication" were excluded

**No effect size pooling** — the review cannot tell us the average magnitude of replication effects

**Limited to English-language journals** — may miss replication practices in other language research communities

Key findings

### Primary Outcomes

**Replication rate:**

Estimated **1 replication study per 400 articles** published in L2 research

This translates to approximately **0.25% of all published L2 research** being replication studies

**Time lag:**

Mean time between original study and replication: **6.64 years**

Range: from less than 1 year to over 20 years

**Citation threshold:**

Mean number of citations of the original study before a replication appeared: **117 citations**

This suggests that only highly cited studies tend to get replicated

**Authorship overlap:**

**55%** of replication studies shared at least one author with the original study

Studies with authorship overlap were **more likely to support the original findings** (odds ratio not reported, but the correlation was statistically significant)

**Material availability:**

Only **25%** of original studies had their materials publicly available

When materials were available, replications were **more likely to support the original findings**

**Types of replication:**

**Zero direct (exact) replications** were found in the entire sample

All 67 studies were **conceptual or approximate replications** — meaning they introduced changes to the original design

The average number of changes per replication was **not explicitly reported**, but the authors note that changes were "numerous and wide ranging"

**Outcome of replications:**

**~60%** of replications supported the original findings (fully or partially)

**~40%** contradicted or failed to support the original findings

However, because most replications introduced many changes, it is unclear whether failures to replicate reflect true null effects or methodological differences

### Secondary Outcomes

**Citation rates of replication studies:**

Mean annual citation rate for replication studies: **7.3 citations per year**

This is **higher than average citation rates** in linguistics and education (which typically average 1–3 citations per year)

**Journal distribution:**

Replication studies appeared in 26 different journals

No single journal published more than a small fraction of the total

**Geographic and linguistic distribution:**

Most replication studies focused on English as a second language

Limited representation of other target languages

Effect magnitude

The key "effect" here is not a treatment effect but a **prevalence estimate**:

**1 in 400** means that if you read 400 L2 research articles, only 1 will be a replication. To put this in perspective: if you read one article per day, you would encounter a replication study roughly once every 13 months.

**6.64 years** means that a finding published today would typically wait more than six years for a replication, by which time the original methods, participant populations, and language learning contexts may be outdated.

**117 citations** means that a study needs to be quite influential (roughly 18 citations per year, if the 117 citations accrue evenly over the 6.64-year mean lag) before anyone attempts to replicate it. Less-cited studies, which may still contain important findings, are essentially never replicated.

**55% authorship overlap** means that more than half of all replication attempts involve the same researchers who did the original work. This creates a potential conflict of interest: researchers may be motivated to confirm their own prior findings.

**0% direct replications** means that every single replication in the sample changed something from the original. This makes it impossible to know whether a failed replication is due to the original finding being false, or due to the changes introduced.
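The back-of-the-envelope figures in this section follow from simple arithmetic on the reported numbers. The sketch below reproduces them; the one-article-per-day reading pace and the assumption that citations accrue evenly over the mean lag are simplifications for illustration, not claims made by the paper.

```python
# Figures reported in the review.
ARTICLES_PER_REPLICATION = 400
MEAN_LAG_YEARS = 6.64
MEAN_CITATIONS_BEFORE_REPLICATION = 117

# Share of articles that are replications, as a percentage.
replication_share_pct = 100 / ARTICLES_PER_REPLICATION

# Assumption: one article read per day, with an average-length month.
DAYS_PER_MONTH = 365.25 / 12
months_between_replications = ARTICLES_PER_REPLICATION / DAYS_PER_MONTH

# Assumption: citations accrue evenly across the mean lag.
citations_per_year = MEAN_CITATIONS_BEFORE_REPLICATION / MEAN_LAG_YEARS

print(f"{replication_share_pct:.2f}% of articles are replications")
print(f"about {months_between_replications:.1f} months of daily reading per replication")
print(f"about {citations_per_year:.1f} citations per year before a replication appears")
```

Under these assumptions the script gives 0.25%, a little over 13 months, and between 17 and 18 citations per year.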

Limitations

### What the Authors Acknowledge

The sample is limited to studies that explicitly self-identified as replications, which likely underestimates the true number of replication-like studies

Publication bias means failed replications are underrepresented

The coding scheme could not capture all relevant dimensions of replication quality

The review cannot determine whether low replication rates are due to disincentives, lack of training, or other factors

### What a Critical Reader Would Note

**Sample size and scope:**

67 studies across decades of research is a very small sample — the findings may not be stable

The search was limited to English-language journals, potentially missing replication practices in other language research communities

The review does not distinguish between different subfields of L2 research (e.g., grammar vs. vocabulary vs. pronunciation), which may have different replication cultures

**Definitional issues:**

The requirement that studies self-label as replications is problematic — many researchers may conduct replication-like work without using the term

The distinction between "direct" and "conceptual" replication is itself contested; some argue that exact replication is impossible in human subjects research

**Missing data:**

The review does not report effect sizes for the original vs. replication studies, making it impossible to assess the magnitude of replication failures

No information is provided about statistical power of either original or replication studies

The review does not examine whether replication rates have changed over time (e.g., after the replication crisis in psychology)

**Causal inference:**

The finding that authorship overlap predicts replication success could reflect genuine expertise, but it could also reflect confirmation bias or p-hacking

The finding that material availability predicts replication success could reflect better original methodology, not just easier replication

**Practical relevance:**

The review focuses on academic replication practices, not on which language learning techniques actually work

For someone running a personal experiment, the key takeaway is about the reliability of published research, not about specific interventions

Practical takeaways

For someone running their own n=1 language learning experiment, this paper offers important lessons about how to interpret and apply published research findings:

### What to Test

**Don't just test one technique** — test it multiple times, in slightly different ways, to see if the effect is robust

**Prioritize techniques that have been replicated** — look for findings that have been confirmed by independent researchers (not just the original lab)

**Be skeptical of single studies** — any finding that has only been reported once, especially if it's dramatic or counterintuitive, should be treated as provisional

### Minimum Meaningful Duration

**Run each condition for at least 2–4 weeks** — many L2 studies use short interventions (single sessions), and the effects may not persist

**Repeat the experiment at least once** — the paper shows that even published replications often fail; your own n=1 results are even less reliable if tested only once

**Allow for washout periods** — if you're testing two different techniques, leave at least 1–2 weeks between conditions to avoid carryover effects
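The durations above can be laid out on a calendar before the experiment starts. This is a minimal sketch, assuming a hypothetical helper `build_schedule` with 3-week condition blocks, 2-week washouts, and two passes through the conditions; those defaults are picked from the ranges above and the start date is arbitrary.

```python
from datetime import date, timedelta


def build_schedule(start, conditions, block_weeks=3, washout_weeks=2, repeats=2):
    """Return (label, first_day, last_day) tuples for an n=1 study plan.

    Each condition gets a block of `block_weeks`, consecutive blocks are
    separated by a washout, and the sequence is run `repeats` times.
    """
    if not conditions:
        raise ValueError("at least one condition is required")
    sequence = list(conditions) * repeats
    schedule = []
    cursor = start
    for index, label in enumerate(sequence):
        block_end = cursor + timedelta(weeks=block_weeks) - timedelta(days=1)
        schedule.append((label, cursor, block_end))
        cursor = block_end + timedelta(days=1)
        if index < len(sequence) - 1:  # no washout after the final block
            washout_end = cursor + timedelta(weeks=washout_weeks) - timedelta(days=1)
            schedule.append(("washout", cursor, washout_end))
            cursor = washout_end + timedelta(days=1)
    return schedule


plan = build_schedule(date(2025, 1, 6), ["technique A", "technique B"])
for label, first_day, last_day in plan:
    print(f"{label:12} {first_day} to {last_day}")
```

The condition order here is fixed for simplicity; varying which technique comes first is a separate design choice. With these defaults, two techniques tested twice take about 18 weeks, which is worth knowing before committing to a plan.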

### What to Measure

**Primary outcome:** Some objective measure of language learning (e.g., vocabulary recall accuracy, grammar test scores, fluency in timed production)

**Secondary outcomes:** Subjective measures (e.g., confidence, enjoyment, perceived effort) — but treat these as exploratory, not confirmatory

**Process measures:** Time spent, number of repetitions, attention level — these help you understand *why* a technique worked or didn't

**Baseline measures:** Test your ability before starting, so you can measure change

### Key Confounds to Control For

**Practice effects** — if you always test technique A first and technique B second, any improvement could be due to practice, not the technique

**Time of day** — language learning ability varies with circadian rhythms; test at the same time each day

**Sleep quality** — sleep consolidates memory; poor sleep the night after learning can wipe out gains

**Prior exposure** — if you've already encountered some of the vocabulary or grammar, your results will be inflated

**Motivation and mood** — these fluctuate and affect learning; track them daily

**Expectation effects** — if you believe technique A is better, you may unconsciously work harder during that condition
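Practice and order effects in particular can be reduced by varying which technique comes first. One option, sketched below, is an ABBA-style plan: pick a random starting order, then reverse it in every other round. The function name `counterbalanced_order` and the seed value are assumptions made for this example.

```python
import random


def counterbalanced_order(conditions, n_rounds, seed=None):
    """Alternate a randomly chosen order with its reverse across rounds."""
    rng = random.Random(seed)  # a fixed seed makes the plan reproducible
    first = list(conditions)
    rng.shuffle(first)
    # Even-numbered rounds use the shuffled order, odd-numbered rounds its reverse,
    # so with two conditions each one goes first equally often.
    return [list(first) if i % 2 == 0 else first[::-1] for i in range(n_rounds)]


rounds = counterbalanced_order(["technique A", "technique B"], n_rounds=4, seed=7)
for number, order in enumerate(rounds, start=1):
    print(f"round {number}: {' then '.join(order)}")
```

With an even number of rounds, each technique is tested first exactly half the time, so a general practice effect cannot systematically favor one of them.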

### What a Positive Result Would Look Like

**Consistent improvement** across multiple measures (not just one)

**Replicable across time** — you get similar results when you repeat the experiment

**Larger than measurement noise** — the improvement should be bigger than your typical day-to-day variation (e.g., if your vocabulary recall varies by ±10% on a normal day, you need to see >10% improvement to be confident)

**Dose-response relationship** — more practice leads to more improvement (not just a one-time boost)
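The "larger than measurement noise" criterion can be made concrete by comparing the average gain against the spread of baseline scores. The sketch below uses the baseline standard deviation as the noise estimate; that choice, the `noise_multiplier` parameter, and the daily recall scores are all assumptions for illustration, not something the paper prescribes.

```python
from statistics import mean, stdev


def exceeds_noise(baseline_scores, treatment_scores, noise_multiplier=1.0):
    """Return (gain, noise, verdict): is the mean gain larger than baseline variation?"""
    if len(baseline_scores) < 2:
        raise ValueError("need at least two baseline scores to estimate noise")
    if not treatment_scores:
        raise ValueError("need at least one treatment score")
    noise = stdev(baseline_scores)  # day-to-day variation during baseline
    gain = mean(treatment_scores) - mean(baseline_scores)
    return gain, noise, gain > noise_multiplier * noise


# Hypothetical daily vocabulary recall scores, in percent correct.
baseline = [62, 70, 58, 66, 64]
treatment = [71, 75, 68, 74, 72]

gain, noise, is_meaningful = exceeds_noise(baseline, treatment)
print(f"gain = {gain:.1f} points, baseline noise = {noise:.1f} points")
print("larger than noise" if is_meaningful else "within noise")
```

Raising `noise_multiplier` (for example to 2.0) makes the check stricter, which may be sensible when only a few baseline days are available.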

### Specific Recommendations Based on This Paper

1. **Treat published findings as hypotheses, not facts.** The paper shows that most L2 findings have never been independently verified. When you read a study claiming that "spaced repetition improves vocabulary retention by 30%," treat that as a starting point for your own testing, not as a guaranteed result.

2. **Look for replication evidence.** Before investing weeks in a technique, search for whether the finding has been replicated. If the only study is from the original lab, be cautious. If multiple independent labs have confirmed it, you can be more confident.

3. **Run your own mini-replications.** Test a technique for 2–4 weeks, then test it again 2–4 weeks later with slight variations (different time of day, different word lists, different spacing intervals). If you get the same result both times, you have your own replication.

4. **Document everything.** The paper shows that material availability predicts replication success. Keep detailed notes on exactly what you did, when, and under what conditions. This will help you (or others) replicate your own findings.

5. **Be skeptical of results that confirm your expectations.** The paper found that replications sharing an author with the original study were more likely to support the original findings, which suggests that even well-intentioned researchers may be biased toward their own prior results. In your own experiments, apply extra scrutiny to results that match what you hoped to see.

6. **Expect failures.** The paper found that ~40% of published replications fail to confirm original findings. In your own n=1 experiments, expect that some techniques won't work for you, even if they worked in published studies. That's normal — it doesn't mean you're doing it wrong.

7. **Use multiple outcome measures.** The review needed a 136-item coding scheme to characterize each replication, a reminder that a single number rarely captures a study. In your own experiments, measure at least 2–3 different outcomes (e.g., recall accuracy, reaction time, and self-rated confidence) to get a fuller picture of whether a technique is working.

8. **Control for the Hawthorne effect.** Just paying attention to your learning can improve it. Run a "no intervention" baseline period (2 weeks of your normal routine) before testing any new technique, so you can separate the effect of the technique from the effect of paying attention.

9. **Consider the time lag.** The paper found an average 6.64-year gap between original and replication studies. If you're using a technique from a recent paper (published in the last 2–3 years), there may not be any replication evidence yet. Be extra cautious with very new findings.

10. **Share your results.** The paper advocates for open materials and data. If you run a well-documented n=1 experiment, consider sharing your protocol and results — even if they're negative. This contributes to the replication culture the paper is calling for.
