Behaviour change techniques: the development and evaluation of a taxonomic method for reporting and describing behaviour change interventions (a suite of five studies involving consensus methods, randomised controlled trials and analysis of qualitative data)
- Authors
- Susan Michie, Caroline E Wood, Marie Johnston, Charles Abraham, Jill Francis, Wendy Hardeman
- Journal
- Health Technology Assessment
- Year
- 2015
- Citations
- 687
TL;DR
This 3-year project produced a standardised "menu" of 93 named behaviour change techniques (BCTs), such as goal-setting, self-monitoring, and social support, that anyone designing or analysing a behaviour change intervention can use as a shared vocabulary. Knowing which specific techniques an intervention contains is the first step toward figuring out what actually works.
---
What they tested
The project did not test whether any specific behaviour change intervention improves health outcomes. Instead, it asked a different question: **can we build a reliable, shared labelling system for the "active ingredients" inside behaviour change interventions?**
The five studies tested:
- Whether a list of 93 distinct BCTs could be agreed upon by international experts (Study 1)
- Whether that list could be organised into a useful hierarchical structure (Study 2)
- Whether people could be trained to use the taxonomy reliably (Study 3)
- Whether trained coders could identify BCTs consistently in published intervention descriptions (Study 4)
- Whether having access to the taxonomy helped people write clearer, more replicable intervention descriptions (Studies 5a, 5b, 5c)
Comparators in the training and description studies were untrained vs. trained users, and writers with vs. without access to BCTTv1.
Outcome measures included:
- Intercoder reliability (do two coders agree on which BCTs are present?)
- Validity (do coders agree with expert consensus?)
- Trainee confidence ratings
- Quality ratings of written intervention descriptions (clarity, ease of understanding, replicability)
---
Who was studied
A total of approximately 400 participants across five studies, with some overlap between studies:
- **Study 1:** 41 people (19 international behaviour change experts, 16 members of an International Advisory Board (IAB), 5 research team members, 1 lay person)
- **Study 2:** 36 experts (18 from Study 1 plus 18 additional experts experienced in designing interventions or conducting systematic reviews)
- **Study 3:** 161 trainee coders (systematic reviewers, researchers, practitioners, and policy-makers from 12 countries)
- **Study 4:** 40 trained coders drawn from Study 3
- **Studies 5a/5b/5c:** 190 participants (166 trainee intervention reporters, 12 smoking cessation practitioners with no BCT taxonomy experience, and 12 trained coders from Studies 3 and 4)
All participants were professionals or researchers engaged with behaviour change, not members of the general public. The sample skews heavily toward academic and clinical experts.
---
How they measured it
- **Intercoder reliability:** Prevalence- and Bias-Adjusted Kappa (PABAK), where PABAK > 0.60 is considered "good" and PABAK > 0.70 is considered "strong"
- **Validity:** Agreement between individual coders and a panel of experienced experts (defined as people with ≥15 years of experience coding interventions), also measured by PABAK
- **Trainee competence threshold:** PABAK > 0.60 agreement with expert consensus
- **Description quality:** Rated by trained coders on clarity, ease of understanding, and ease of replication (specific scales not detailed in the abstract)
- **Confidence:** Self-rated confidence before and after training
- **Test-retest reliability:** The same coders recoded the same materials 1 month apart
Coding materials in Study 4 were 40 published intervention descriptions sampled from BMC Public Health, Implementation Science, and BMC Health Services Research.
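To make the PABAK measure concrete, here is a minimal sketch of how it could be computed for one BCT coded as present or absent by two raters. For a two-category judgement, PABAK reduces to 2 × (observed proportion of agreement) − 1. The function name `pabak`, the `COMPETENCE_THRESHOLD` constant, and the example codings are hypothetical illustrations, not data or code from the paper.

```python
from typing import Sequence


def pabak(coder_a: Sequence[int], coder_b: Sequence[int]) -> float:
    """Prevalence- and bias-adjusted kappa for two binary codings.

    Each sequence holds 1 (BCT coded as present) or 0 (absent) for the
    same ordered set of intervention descriptions.
    """
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("codings must be non-empty and of equal length")
    agreements = sum(a == b for a, b in zip(coder_a, coder_b))
    observed_agreement = agreements / len(coder_a)
    # With two categories, chance agreement is fixed at 0.5,
    # so kappa simplifies to 2 * Po - 1.
    return 2 * observed_agreement - 1


# Hypothetical codings of one BCT across 10 intervention descriptions.
trainee = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
expert_consensus = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

COMPETENCE_THRESHOLD = 0.60  # threshold reported in the summary above

score = pabak(trainee, expert_consensus)  # 9 of 10 judgements agree
is_competent = score > COMPETENCE_THRESHOLD
```

Comparing two trainees with each other gives an intercoder reliability figure, while comparing a trainee with the expert consensus, as here, gives the validity figure.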
---
Methodology
This was a **suite of five methodological studies** using a mix of designs, not a single intervention trial. It is best understood as a measurement-science and tool-development project:
- **Study 1** used a **Delphi procedure**, a structured consensus method in which experts iteratively rate and refine items across multiple rounds until agreement is reached. This is appropriate for building consensus but cannot prove the resulting taxonomy is "correct," only that experts agreed on it.
- **Study 2** used **open-sort ("bottom-up") and closed-sort ("top-down") tasks** with 36 experts to create a hierarchical grouping of the 93 BCTs. Participants physically grouped BCT cards, and statistical clustering methods were applied.
- **Study 3** compared coding performance **before and after training** (within-person pre-post design) across two training formats: 1-day workshops and distance group tutorials. There was no randomisation to training format; participants self-selected or were assigned by availability.
- **Study 4** assessed reliability among 40 coders who had completed Study 3 training, coding 40 published intervention descriptions at two time points 1 month apart. This is an observational reliability study, not an RCT.
- **Studies 5a, 5b, 5c** used **RCT and within-person designs** to test whether access to BCTTv1 (with or without training) improved the quality of written intervention descriptions. Studies 5a and 5b were between-person comparisons; Study 5c was a within-person before/after design.
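The summary does not say which clustering method Study 2 applied to the open-sort data. As a rough illustration only, a common first step in analysing card sorts is to count how often each pair of items was placed in the same group. The sketch below shows that step; the function name `co_occurrence`, the shorthand BCT labels, and the three example sorts are hypothetical, not taken from the paper.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set


def co_occurrence(sorts: Iterable[List[Set[str]]]) -> Counter:
    """Count how many participants placed each pair of items together.

    `sorts` holds one entry per participant; each entry is that
    participant's list of groups, and each group is a set of item labels.
    """
    counts: Counter = Counter()
    for groups in sorts:
        for group in groups:
            # Sort labels so each pair is always stored in the same order.
            for pair in combinations(sorted(group), 2):
                counts[pair] += 1
    return counts


# Hypothetical open sorts from three participants (illustrative labels).
sorts = [
    [{"goal setting", "action planning"}, {"self-monitoring", "feedback"}],
    [{"goal setting", "action planning", "self-monitoring"}, {"feedback"}],
    [{"goal setting"}, {"action planning"}, {"self-monitoring", "feedback"}],
]

counts = co_occurrence(sorts)
# Proportion of participants who grouped each pair together.
similarity = {pair: n / len(sorts) for pair, n in counts.items()}
```

A similarity table of this kind could then be passed to whichever clustering procedure is chosen to derive groupings.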
**What this design can prove:** That the taxonomy can be used reliably by trained experts to label intervention content, and that training improves agreement with expert consensus.
**What this design cannot prove:** That BCTTv1 actually helps identify which BCTs *cause* behaviour change. The taxonomy is a labelling tool, not an effectiveness tool. It also cannot prove the taxonomy is complete, or that the 93 BCTs are the right level of granularity for all purposes.
**Major methodological weaknesses:**
- Participants were overwhelmingly professional researchers and practitioners, so generalisability to everyday practitioners or the public is untested
- The "expert consensus" that defines validity is itself a social construct; the study team served as the gold standard, which is circular
- Training format was not randomised in Study 3, making it hard to compare the two methods
- Study 5 results were mixed and partially contradictory, limiting conclusions about whether the taxonomy actually improves reporting in practice
---
Key findings
**Study 1 (taxonomy development):**
- Produced 93 distinct, non-overlapping BCTs with clear labels and definitions, organised into 16 groupings; this is BCTTv1
- Built by iterative Delphi consensus with 41 international experts across multiple rounds
**Study 2 (hierarchical structure):**
- Experts naturally grouped the 93 BCTs into an average of 15.1 groupings (SD 6.11, range 5–24) using a bottom-up sort
- A top-down theory-driven sort linked 59 of the 93 BCTs reliably to 12 of 14 theoretical domains (47 significant, 12 borderline)
- The two methods showed significant but only moderate overlap: chi-squared = 437.80, p < 0.001, but only 6 of 208 possible bottom-up × top-down pairings showed strong similarity
**Study 3 (training evaluation):**
- Both 1-day workshops and distance tutorials improved trainee agreement with expert consensus (both p < 0.05)
- Both training formats doubled the proportion of trainees achieving competence (PABAK > 0.60 with expert consensus), p < 0.05 for both
- 46% of workshop trainees and 78% of tutorial trainees reached competence after training
- Workshop trainees' confidence improved significantly (p < 0.001)
- Neither training format improved intercoder agreement between trainees (workshops p = 0.08; tutorials p = 0.57)
**Study 4 (reliability of trained coders):**
- Good intercoder reliability (PABAK > 0.60) was observed for 80 of 93 BCTs
- 64 of the 80 reliably coded BCTs (80%) achieved PABAK > 0.70
- Test-retest reliability within coders over 1 month was good (p < 0.001)
- Reliability was *worse* for frequently identified BCTs: 9 of 16 frequently used BCTs failed to reach PABAK > 0.70
- Good validity: trained coders agreed with expert consensus on 14 of 15 BCTs identified as present by experts
**Studies 5a/5b/5c (reporting descriptions):**
- Study 5a: providing BCTTv1 alone (no training) made no difference to description quality, reliability, or validity (all p > 0.05)
- Study 5b: descriptions written by untrained writers *without* BCTTv1