CONSORT 2010 statement: extension to randomised pilot and feasibility trials
- Authors
- Sandra Eldridge, Claire Chan, Michael J. Campbell, Christine Bond, Sally Hopewell, Lehana Thabane, Gillian Lancaster
- Journal
- BMJ
- Year
- 2016
- Citations
- 3,389
TL;DR
This paper provides a standardised checklist for reporting pilot and feasibility trials. It helps researchers (including self-experimenters) distinguish studies that test whether a trial *can be done* from studies that test whether an intervention *works*, which guards against premature conclusions from underpowered data.
What they tested
This is not an experimental study but a **reporting guideline**—a consensus document developed by 29 experts in clinical trials, methodology, and publishing. The "intervention" is a 26-item checklist (the CONSORT extension for pilot/feasibility trials) designed to improve how researchers write up preliminary studies. The "comparator" is the original CONSORT 2010 statement for full randomised trials. The "outcome" is a set of reporting standards that distinguish pilot/feasibility trials from definitive trials, with specific emphasis on:
**Primary objective:** Whether the study is assessing feasibility (e.g., recruitment rates, protocol adherence) rather than treatment efficacy
**Sample size justification:** Why no formal power calculation is needed for pilot trials
**Progression criteria:** Pre-specified rules for deciding whether to proceed to a full trial
**Outcome measures:** Feasibility outcomes (e.g., recruitment, retention, data completeness) versus clinical outcomes
Who was studied
No human participants were studied. The "sample" was:
**29 expert panellists** from the UK, Canada, US, Australia, and Europe
**Specialties included:** clinical trial methodology, biostatistics, medical journal editing (BMJ, Lancet, JAMA), and regulatory science
**Setting:** A two-day consensus meeting in London, UK, plus three rounds of Delphi survey (online questionnaire) involving 100+ additional researchers
**Papers reviewed:** The panel examined 83 published pilot/feasibility trials to identify common reporting deficiencies
How they measured it
No instruments or scales were used on human subjects. The "measurement" was a structured consensus process:
**Delphi survey (3 rounds):** Experts rated proposed checklist items on a 1–9 scale (1 = not important, 9 = essential). Items with median scores ≥7 were retained
**Consensus meeting:** Face-to-face discussion to resolve disagreements and finalise wording
**Pilot testing:** The draft checklist was tested against 83 published pilot/feasibility trials to check clarity and completeness
**Final checklist:** 26 items (plus 6 sub-items), each with a description and example
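The retention rule described above (keep an item when its median rating is 7 or higher) can be sketched in a few lines. The item names and ratings below are invented for illustration and are not data from the paper; `retained_items` is a hypothetical helper name.

```python
from statistics import median

# Hypothetical ratings on the 1-9 importance scale; item names and scores are
# invented for illustration and are not data from the paper.
ratings = {
    "progression criteria": [9, 8, 7, 8, 9, 6, 8],
    "formal power calculation": [3, 5, 4, 6, 2, 5, 4],
    "feasibility objectives": [7, 7, 8, 6, 9, 7, 5],
}

RETAIN_THRESHOLD = 7  # median >= 7 keeps the item, per the rule described above


def retained_items(ratings_by_item, threshold=RETAIN_THRESHOLD):
    """Return the items whose median rating meets the threshold."""
    return [
        item
        for item, scores in ratings_by_item.items()
        if median(scores) >= threshold
    ]


print(retained_items(ratings))
# → ['progression criteria', 'feasibility objectives']
```

Using the median rather than the mean means a single very low or very high rating cannot decide an item's fate on its own.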
Methodology
**Study design:** This is a **consensus guideline**—not a randomised trial, cohort study, or meta-analysis. The methodology is a modified Delphi process combined with a face-to-face consensus conference.
**How the consensus process worked:**
1. **Literature review:** The team identified 83 published pilot/feasibility trials and catalogued their reporting quality
2. **Delphi survey (Round 1):** 100+ experts received the draft checklist and rated each item's importance
3. **Delphi survey (Round 2):** Items that didn't reach consensus were re-rated after seeing group scores
4. **Delphi survey (Round 3):** Final refinement of borderline items
5. **Consensus meeting (2 days):** 29 experts met in London to finalise the checklist, resolve disagreements, and write explanatory text
6. **External review:** The final checklist was circulated to journal editors and trial registries for comment
**What this design can and cannot show:**
**Can show:** That a group of experts agreed on a set of reporting standards. Expert consensus methods are well established for developing reporting guidelines (STROBE, PRISMA, and the original CONSORT statement were also produced by expert consensus)
**Cannot show:** That using the checklist improves actual trial quality or patient outcomes. This is a *normative* document: it says what *should* be reported, not what *is* effective
**Cannot show:** That pilot/feasibility trials are inherently different from definitive trials in any biological or statistical sense. The distinction is methodological, not empirical
**Major methodological weaknesses:**
**Selection bias:** Experts were invited by the steering committee; no random sampling
**No blinding:** Panellists knew each other's identities and affiliations
**No formal testing:** The checklist was tested against 83 papers, but there was no randomised trial comparing reporting quality with vs. without the checklist
**No patient/public involvement:** The panel included no patient representatives or self-experimenters
Key findings
The final checklist contains 26 items organised into 6 sections. Here are the most important items for someone running a self-experiment:
**Title and abstract (Item 1):**
Must identify the study as a "pilot" or "feasibility" trial in the title
Abstract must state the objective is feasibility, not efficacy
**Introduction (Items 2a and 2b):**
Must describe the rationale for the pilot trial (e.g., "We don't know if we can recruit 50 people in 3 months")
Must state specific feasibility objectives (e.g., "To estimate recruitment rate, retention rate, and protocol adherence")
**Methods – Sample size (Item 7a):**
**No formal power calculation** is required. Instead, justify sample size based on precision of feasibility estimates (e.g., "A sample of 30 gives a 95% confidence interval of about ±18 percentage points around an estimated rate of 50%")
**Critical point:** Pilot trials are NOT designed to detect treatment effects. A "significant" p-value in a pilot trial should not be read as evidence that the intervention works
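The precision figure quoted for Item 7a can be reproduced with the normal-approximation (Wald) interval for a proportion. The sketch below is one way to do the arithmetic, not a method prescribed by the checklist; the function names are hypothetical, and for small samples or rates near 0% or 100% a Wilson or exact interval would usually be a better choice.

```python
import math
from statistics import NormalDist


def wald_half_width(p, n, confidence=0.95):
    """Half-width of a normal-approximation (Wald) interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(p * (1 - p) / n)


def n_for_half_width(p, half_width, confidence=0.95):
    """Smallest n whose Wald half-width is at most `half_width`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z**2 * p * (1 - p) / half_width**2)


print(round(wald_half_width(0.5, 30), 3))  # → 0.179, i.e. about ±18 points
print(n_for_half_width(0.5, 0.10))         # → 97 for a ±10 point interval
```

Running the calculation in both directions shows the trade-off: n = 30 gives roughly ±18 percentage points, while tightening the interval to ±10 points needs roughly three times as many observations.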
**Methods – Outcomes (Item 6a):**
Primary outcomes must be feasibility measures (e.g., recruitment rate, retention rate, data completeness, protocol adherence)
Clinical outcomes (e.g., blood pressure, mood scores) are secondary or exploratory only
**Results – Participant flow (Item 13a):**
Must report numbers screened, eligible, enrolled, randomised, and analysed
Must report reasons for non-participation and dropout
**Results – Outcomes (Item 17a):**
Report feasibility outcomes with precision (e.g., "Recruitment rate: 12 participants per month, 95% CI 8 to 16")
Do NOT report p-values for clinical outcomes
**Discussion – Interpretation (Items 22 and 22a):**
Must state whether results support proceeding to a full trial
Must discuss implications for trial design (e.g., "We need to extend recruitment by 2 months to achieve target sample size")
**Progression criteria (Item 6c):**
Pre-specified rules for deciding whether to proceed (e.g., "If recruitment <50% of target, redesign recruitment strategy")
These must be stated in the protocol, not invented post-hoc
Effect magnitude
This is not a study with effect sizes. The intended "effect" is a change in reporting behaviour, which the paper itself does not measure. Key numbers:
**83 papers** reviewed to identify common deficiencies
**100+ experts** participated in Delphi rounds
**29 experts** attended the consensus meeting
**26 items** in the final checklist
**6 sub-items** for additional detail
**0 p-values** reported (because the outcome is consensus, not hypothesis testing)
**Translation for self-experimenters:** If you follow this checklist when designing your own pilot experiment, you will:
Avoid the common mistake of claiming an intervention "works" based on 10 data points
Have a principled way to justify the sample size you need to estimate feasibility (not efficacy)
Have pre-specified rules for deciding whether to continue, modify, or abandon your experiment
Limitations
**What the authors acknowledge:**
The checklist is based on expert opinion, not empirical evidence
It may need updating as new evidence emerges
It does not cover all types of pilot trials (e.g., those using adaptive designs)
The checklist is for *reporting*, not *conducting*—a well-reported pilot trial can still be poorly designed
**What a critical reader would note:**
**No patient involvement:** The experts were all academics, journal editors, or statisticians. No one who actually runs self-experiments was at the table
**No testing of the checklist's impact:** The authors didn't randomise journals to use vs. not use the checklist and compare reporting quality
**Cultural bias:** All experts were from high-income countries (UK, Canada, US, Australia, Europe). Feasibility issues in low-resource settings may differ
**No consideration of n=1 designs:** The checklist assumes group-level trials. Single-subject experiments (like most self-experiments) have different feasibility criteria
**Publication bias:** The 83 papers reviewed were published—meaning they passed peer review. Unpublished pilot trials (which are more likely to have negative feasibility results) were not examined
Practical takeaways
For someone running their own n=1 experiment:
### What to test
**Pilot phase first:** Before committing to a 90-day experiment, run a 7–14 day pilot to test:
- Can you reliably measure your outcome (e.g., daily mood rating, blood glucose)?
- Can you adhere to the protocol (e.g., take supplement at same time daily)?
- Is your outcome stable enough from day to day that a full experiment of practical length could detect a meaningful change?
**Specific intervention:** Choose one variable (e.g., 200mg caffeine at 8am vs. placebo)
**Dose:** Use a dose you can reliably replicate (e.g., one capsule, not "a cup of coffee")
### Minimum meaningful duration
**Pilot phase:** 7–14 days (enough to estimate adherence and measurement reliability)
**Full experiment:** Depends on outcome variability. For mood (high day-to-day variability), you need at least 30–40 data points per condition. For sleep (moderate variability), 14–21 days per condition
**Progression rule:** If you miss >10% of measurements in the pilot (data completeness below 90%), redesign before starting the full experiment
### What to measure (specific metrics)
**Feasibility outcomes (primary):**
- Adherence rate: % of days you took the intervention as scheduled
- Data completeness: % of days you recorded the outcome
- Protocol violations: How many times you deviated from the plan (e.g., took caffeine after 2pm)
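The three feasibility metrics above can be computed directly from a daily log. This is a minimal sketch under assumed conventions: the `DayLog` record, its field names, and `feasibility_summary` are hypothetical, and the 14-day log is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DayLog:
    took_intervention: bool     # intervention taken as scheduled
    outcome: Optional[float]    # recorded outcome, None if missing
    violation: bool = False     # any deviation from the plan that day


def feasibility_summary(days):
    """Return adherence %, data completeness %, and protocol violation count."""
    n = len(days)
    if n == 0:
        raise ValueError("need at least one logged day")
    adherence = 100 * sum(d.took_intervention for d in days) / n
    completeness = 100 * sum(d.outcome is not None for d in days) / n
    violations = sum(d.violation for d in days)
    return {
        "adherence_pct": adherence,
        "completeness_pct": completeness,
        "violations": violations,
    }


# Invented 14-day pilot: 12 adherent days, 13 recorded outcomes, 1 violation.
pilot = [DayLog(True, 22.0) for _ in range(11)]
pilot += [
    DayLog(True, None, violation=True),
    DayLog(False, 35.0),
    DayLog(False, 40.0),
]
print(feasibility_summary(pilot))
```

For this invented log, adherence is about 85.7% (12 of 14 days) and data completeness about 92.9% (13 of 14 days), with one protocol violation.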
**Clinical outcomes (secondary/exploratory):**
- Your primary outcome (e.g., sleep onset latency in minutes)
- Your secondary outcomes (e.g., subjective sleep quality on 1–10 scale)
- Confounders (e.g., stress level, exercise, alcohol)
### Key confounds to control for
**Expectation bias:** You know you're in a pilot. This can inflate adherence and deflate self-reported outcomes. Use a blinded placebo if possible
**Novelty effect:** The first week of any new protocol often shows better adherence. Discard the first 3–5 days of pilot data
**Measurement reactivity:** Tracking a behaviour changes it. If you're measuring sleep, the act of wearing a sleep tracker may improve sleep. Account for this in your interpretation
**Life events:** A single bad day (illness, travel, work stress) can skew pilot data. Pre-specify exclusion criteria (e.g., "If I get sick, restart the pilot after recovery")
### What a positive result would look like
**Feasibility success:**
- Adherence ≥80% (e.g., took supplement on 12 of 14 days)
- Data completeness ≥90% (e.g., recorded sleep on 13 of 14 days)
- Protocol violations ≤2 (e.g., only 2 days where caffeine was taken after 2pm)
**Progression decision:**
- If all three criteria met: Proceed to full experiment
- If any criterion is missed but adherence is ≥60% and data completeness is ≥70%: Modify protocol (e.g., set phone reminders)
- If adherence <60% or data completeness <70%: Abandon this intervention or redesign completely
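The decision rules above can be written as a small function so that the thresholds are fixed before any data arrive. This is a sketch only: the name `progression_decision` and the returned labels are hypothetical, and it assumes that any result falling between the "proceed" and "abandon" thresholds means "modify".

```python
def progression_decision(adherence_pct, completeness_pct, violations):
    """Map pilot feasibility results to 'proceed', 'modify', or 'redesign'."""
    # Hard failures: abandon the intervention or redesign completely.
    if adherence_pct < 60 or completeness_pct < 70:
        return "redesign"
    # All three success criteria met: go on to the full experiment.
    if adherence_pct >= 80 and completeness_pct >= 90 and violations <= 2:
        return "proceed"
    # Assumption: everything in between calls for a protocol modification.
    return "modify"


print(progression_decision(85.7, 92.9, 1))  # → proceed
print(progression_decision(75.0, 95.0, 0))  # → modify
print(progression_decision(55.0, 95.0, 0))  # → redesign
```

Writing the rule down as code before the pilot starts is one way to keep the criteria pre-specified rather than adjusted after the fact.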
**What NOT to look for:** Do NOT check whether your sleep improved during the pilot. A 14-day pilot with 7 days of caffeine and 7 days of placebo has far too little statistical power to detect a realistic treatment effect. Any apparent improvement is more likely noise or an expectation effect than a real change
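A rough calculation shows why 7 days per condition says little about efficacy. The sketch below assumes independent days, equal variance in both conditions, and a standard two-sample t interval; real daily data are usually autocorrelated, which makes matters worse. The function name is hypothetical and the critical value is taken from standard t tables.

```python
import math

# Two-sided 95% critical value of Student's t with 12 degrees of freedom
# (7 + 7 - 2), taken from standard tables.
T_CRIT_DF12 = 2.179


def ci_half_width_in_sd(n_per_condition, t_crit):
    """Half-width of the 95% CI for a difference in means, in pooled-SD units."""
    return t_crit * math.sqrt(2 / n_per_condition)


print(round(ci_half_width_in_sd(7, T_CRIT_DF12), 2))  # → 1.16
```

Under these assumptions the interval around the observed difference spans more than one standard deviation in each direction, so only a very large effect would stand out from day-to-day noise.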
**Example progression rule for a self-experiment:**
> "I will run a 14-day pilot. If I record measurements on ≥13 days and take the supplement on ≥12 days, I will proceed to a 60-day crossover experiment. If I miss more than 1 day of measurements, I will redesign the measurement protocol (e.g., switch from paper diary to phone app) and restart the pilot. If I take the supplement on fewer than 12 days, I will add reminders and restart the pilot."
**Bottom line:** The CONSORT pilot extension teaches self-experimenters that the first experiment you run should answer "Can I do this?" not "Does this work?" Run a pilot first, set explicit progression criteria, and only move to a full experiment if the pilot shows you can reliably measure and adhere. This prevents the common mistake of spending 90 days on an experiment that was doomed from day one by poor measurement or low adherence.