
A refined taxonomy of behaviour change techniques to help people change their physical activity and healthy eating behaviours: The CALO-RE taxonomy

Authors: Susan Michie, Stefanie Ashford, Falko F. Sniehotta, Stephan U Dombrowski, Alex Bishop, David French
Journal: Psychology and Health
Year: 2011
DOI: 10.1080/08870446.2010.540664
Citations: 1,825

TL;DR

This paper created a standardised, reliable list of 40 distinct behaviour change techniques (BCTs) for physical activity and healthy eating interventions. It gives researchers and practitioners a common language for describing exactly what an intervention did, so that specific techniques can be compared across studies instead of being hidden behind vague labels like "counselling" or "motivational support."

What they tested

The researchers did not test an intervention on participants. Instead, they tested a **classification system** — a taxonomy — for describing the active ingredients of behaviour change programmes. They took an existing 26-item taxonomy (Abraham & Michie, 2008) and refined it by:

- Applying it to real intervention descriptions from two large systematic reviews

- Identifying where the taxonomy was unclear, missing techniques, or had overlapping categories

- Iteratively revising labels and definitions, and adding new techniques

- Testing whether independent coders could reliably agree on which techniques were present in published intervention descriptions

The outcome was a 40-item taxonomy (the CALO-RE taxonomy) with demonstrated inter-rater reliability (kappa = 0.79, which is "substantial" agreement).

Who was studied

No human participants were studied. The "subjects" were **published intervention descriptions** drawn from two systematic reviews:

**Review 1:** Interventions targeting physical activity and healthy eating in obese adults with additional risk factors for morbidity (e.g., type 2 diabetes, cardiovascular disease)

**Review 2:** Interventions targeting self-efficacy to promote lifestyle and recreational physical activity

The final reliability test used **50 published intervention descriptions** randomly selected from these reviews. The coders were researchers at three UK universities (University College London, Coventry University, University of Aberdeen).

How they measured it

The primary measurement was **inter-rater reliability** — the degree to which two independent coders agreed on whether a given BCT was present or absent in an intervention description. This was quantified using:

**Cohen's kappa (κ):** A statistical measure that corrects for chance agreement. Values range from -1 (worse than chance) to +1 (perfect agreement). A kappa of 0.79 is considered "substantial" agreement (Landis & Koch, 1977).

**Percentage agreement:** The raw proportion of times coders agreed, also reported.

Each coder read the published description of an intervention and marked which of the 40 BCTs were present. They did not interact with participants or collect any behavioural data.
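The kappa calculation described above can be sketched in a few lines. This is a minimal illustration, assuming each coder's decisions are stored as a list of 1 (technique present) and 0 (technique absent); the function name `cohens_kappa` and the example codes are hypothetical and are not data from the paper.

```python
from typing import Sequence


def cohens_kappa(coder_a: Sequence[int], coder_b: Sequence[int]) -> float:
    """Cohen's kappa for two coders' binary codes (1 = present, 0 = absent)."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must rate the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items where both coders gave the same code.
    p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: probability that both say "present" plus probability
    # that both say "absent", based on each coder's own rate of "present".
    rate_a = sum(coder_a) / n
    rate_b = sum(coder_b) / n
    p_chance = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_chance == 1.0:
        raise ValueError("Kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical codes for 10 technique-by-paper decisions (illustration only).
coder_a = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
coder_b = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
raw_agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
kappa = cohens_kappa(coder_a, coder_b)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")
# prints: raw agreement = 0.90, kappa = 0.74
```

In this made-up example the coders agree on 9 of 10 decisions, yet kappa is lower because most techniques are absent from most descriptions, which makes agreement on "absent" easy to reach by chance.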

Methodology

**Study design:** Systematic review and taxonomy development study. This is not an experiment testing an intervention — it is a methodological paper that creates and validates a measurement tool.

**Process:** Research teams at the three centres worked independently on the two systematic reviews, both of which used the original 26-item Abraham & Michie (2008) taxonomy. They then collaborated in an iterative refinement cycle:

1. **Identify problems:** Each centre coded intervention descriptions and noted where the taxonomy was unclear, where techniques overlapped, or where techniques were missing.

2. **Revise taxonomy:** The teams met and, through consensus discussion, refined labels and definitions and added new techniques.

3. **Test reliability:** A new set of papers was coded by multiple raters, and kappa was calculated.

4. **Repeat:** This cycle was completed **four times**, with each iteration coding 1–2 papers, calculating kappas, and revising the taxonomy.

**Final reliability test:** 50 published intervention descriptions were coded independently by two raters. The final kappa was 0.79.

**What this design can prove:**

- That the taxonomy can be applied reliably by trained coders to published intervention descriptions

- That the taxonomy covers the range of techniques used in physical activity and healthy eating interventions (at least those published up to ~2010)

- That the taxonomy improves upon the original 26-item version in terms of clarity, comprehensiveness, and reliability

**What this design cannot prove:**

- Which BCTs are effective at changing behaviour (that requires experimental studies or meta-analyses)

- Whether the taxonomy captures all possible BCTs (new techniques may exist or emerge)

- Whether the taxonomy is useful for intervention designers or practitioners in the field (only tested with academic coders)

- Whether the taxonomy applies to behaviours other than physical activity and healthy eating

**Major methodological weaknesses:**

- The taxonomy was developed and tested by the same researchers who created it, so there is a risk of confirmation bias

- The reliability test used only two coders; more coders would give a more robust estimate

- The 50 papers used for the final reliability test came from the same two reviews used to develop the taxonomy, so the taxonomy may be overfitted to those specific interventions

- No test of whether the taxonomy improves actual intervention reporting or replication in practice

- The consensus process for adding/refining techniques was expert judgement, not data-driven (e.g., no formal Delphi process or empirical testing of which distinctions matter for outcomes)

Key findings

**Primary outcome: A refined 40-item taxonomy**

The original 26-item taxonomy was expanded to 40 items. Changes included:

**14 new techniques added**, including:

- Prompting generalisation of a target behaviour

- Prompting identification as a role model

- Prompting anticipated regret

- Fear arousal

- Prompting self-talk

- Use of follow-up prompts

- Facilitating social comparison

- Time management

- Stress management

- Emotional control training

- Motivational interviewing

- Communication skills training

- Stimulus control (environmental restructuring)

- Use of imagery

**Several techniques were split** to reduce overlap. For example:

- "Goal setting" was split into "Goal setting (behaviour)" and "Goal setting (outcome)"

- "Self-monitoring" was split into "Self-monitoring of behaviour" and "Self-monitoring of outcome(s) of behaviour"

- "Feedback" was split into "Feedback on performance of behaviour" and "Feedback on outcome(s) of behaviour"

**Labels and definitions were refined** for all 26 original techniques to improve clarity and reduce ambiguity

**Reliability:**

- Final inter-rater reliability: **kappa = 0.79** (substantial agreement)

- This is an improvement over the original taxonomy, which reported kappa values ranging from 0.60 to 0.79 across different techniques

**Secondary findings (from the two systematic reviews that motivated the taxonomy):**

- Interventions that included **self-monitoring** combined with at least one other technique from Control Theory (goal setting, feedback, review of goals) were more effective than those that did not

- Effective interventions tended to use **fewer techniques** (not more), suggesting that parsimony may be beneficial

- Interventions using more techniques associated with **Control Theory** (Carver & Scheier, 1998) achieved larger effect sizes

Effect magnitude

This is not applicable in the usual sense because no intervention effect was measured. However, the reliability improvement can be quantified:

- The original 26-item taxonomy had variable reliability across techniques, with some techniques showing only "fair" to "moderate" agreement (kappa 0.40–0.60)

- The refined 40-item taxonomy achieved kappa = 0.79 overall, which is "substantial". This means the coders' agreement was 79% of the way from chance-level agreement to perfect agreement. It does not mean they agreed 79% of the time: raw percentage agreement is always at least as high as kappa, and usually higher.

In practical terms: if you and a friend both read the same intervention description and used the CALO-RE taxonomy, your codes would match far more often than chance alone would predict, but not perfectly. That is good enough for research purposes, yet some techniques would still be coded differently.
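Kappa and raw percentage agreement are different quantities, and rearranging the definition of kappa shows how they relate. The worked example below is only a sketch: the chance-agreement value $p_e = 0.60$ is an assumed figure chosen for illustration and is not reported in the paper.

$$
\kappa = \frac{p_o - p_e}{1 - p_e}
\quad\Longrightarrow\quad
p_o = p_e + \kappa\,(1 - p_e)
$$

$$
p_o = 0.60 + 0.79 \times (1 - 0.60) = 0.60 + 0.316 = 0.916
$$

Here $p_o$ is the observed (raw) agreement and $p_e$ is the agreement expected by chance. Under this assumption, a kappa of 0.79 would correspond to the coders giving the same code on roughly 92% of decisions. A different $p_e$ would give a different raw agreement for the same kappa.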

Limitations

**Acknowledged by authors:**

- The taxonomy was developed using only physical activity and healthy eating interventions, so it may not generalise to other behaviours (e.g., smoking cessation, medication adherence, alcohol reduction)

- The taxonomy is based on published descriptions, which may not accurately reflect what was actually delivered in practice

- The reliability test used only two coders; more coders would strengthen the estimate

- The taxonomy does not specify how techniques should be combined or delivered (e.g., dose, frequency, mode)

**Critical reader observations:**

**Circularity:** The same research teams who developed the taxonomy also tested its reliability on the same types of interventions used to develop it. An independent replication is needed.

**No prospective validation:** The taxonomy has not been tested by having new intervention developers use it to write protocols, then checking whether those protocols are clearer or more replicable.

**Expert consensus bias:** The decisions about which techniques to add, split, or refine were made by expert discussion, not by empirical testing of whether these distinctions matter for outcomes.

**Publication date:** The taxonomy was published in 2011. Since then, behaviour change science has advanced considerably. A more comprehensive taxonomy (the BCT Taxonomy v1, with 93 techniques) was published by Michie et al. in 2013. The CALO-RE taxonomy is now superseded for most purposes.

**No user testing:** The taxonomy was designed for researchers coding published papers, not for practitioners or individuals designing their own behaviour change programmes. Its usability for non-experts is unknown.

**Cultural and contextual limits:** The interventions reviewed were predominantly from Western, high-income countries. Techniques that work in these contexts may not translate.

Practical takeaways

For someone running their own n=1 experiment, the CALO-RE taxonomy is useful as a **checklist** — it helps you be precise about what you are actually doing, rather than saying "I'll try to be more motivated." Here is how to use it:

### What to test

Pick **one or two specific BCTs** from the taxonomy to test at a time. Do not try everything at once. Good candidates for self-experimentation include:

**Self-monitoring of behaviour** (e.g., tracking every instance of exercise or every meal)

**Goal setting (behaviour)** (e.g., "I will walk for 20 minutes daily" rather than "I will get fitter")

**Feedback on performance** (e.g., reviewing your step count each week)

**Prompting review of behavioural goals** (e.g., weekly check-in on whether you met your goal and adjusting if needed)

**Action planning** (e.g., specifying exactly when, where, and how you will exercise)

**Barrier identification/problem solving** (e.g., identifying what stops you and making a plan for each barrier)

Avoid vague interventions like "I will try harder" or "I will be more motivated." The taxonomy forces you to be specific.

### Minimum meaningful duration

For testing a single BCT: **at least 3–4 weeks** to establish a habit and see if the technique has an effect beyond novelty

For comparing two BCTs (e.g., self-monitoring vs. goal setting alone): **at least 6–8 weeks** with a crossover design (2–3 weeks per condition, plus a washout period)

For testing a combination of techniques: **at least 8–12 weeks**
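A crossover comparison of two BCTs can be laid out as a dated schedule before starting. The sketch below assumes 3 weeks per condition and a 1-week washout; the function name `crossover_schedule`, the start date, and those default lengths are illustrative choices, not recommendations from the paper.

```python
from datetime import date, timedelta


def crossover_schedule(start, conditions, weeks_per_condition=3, washout_weeks=1):
    """Return (label, first_day, last_day) blocks with a washout between conditions."""
    blocks = []
    current = start
    for index, condition in enumerate(conditions):
        # Insert a washout block before every condition except the first.
        if index > 0 and washout_weeks > 0:
            end = current + timedelta(weeks=washout_weeks) - timedelta(days=1)
            blocks.append(("washout", current, end))
            current = end + timedelta(days=1)
        end = current + timedelta(weeks=weeks_per_condition) - timedelta(days=1)
        blocks.append((condition, current, end))
        current = end + timedelta(days=1)
    return blocks


# Hypothetical start date and technique pair.
schedule = crossover_schedule(
    date(2024, 1, 1),
    ["Self-monitoring of behaviour", "Goal setting (behaviour)"],
)
for label, first_day, last_day in schedule:
    print(f"{first_day} to {last_day}: {label}")

total_days = (schedule[-1][2] - schedule[0][1]).days + 1
print(f"total length: {total_days // 7} weeks")  # prints: total length: 7 weeks
```

With these assumed settings the experiment runs for 7 weeks, which falls inside the 6–8 week range suggested above.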

### What to measure (specific metrics)

**Behavioural frequency:** Count of exercise sessions per week, number of servings of vegetables per day, etc.

**Behavioural duration:** Minutes of exercise per session, total weekly active minutes

**Adherence to the technique itself:** Did you actually self-monitor every day? Did you review your goals weekly? (This is often overlooked — you need to check that you did the intervention)

**Confidence/self-efficacy:** Rate on a 1–10 scale how confident you are that you can maintain the behaviour

**Automaticity:** Use the Self-Report Habit Index (SRHI) or the shorter Self-Report Behavioural Automaticity Index (SRBAI) to measure how habitual the behaviour has become
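As an illustration of the automaticity metric, the SRBAI consists of four items and is commonly scored as the mean of the item ratings. The sketch below assumes a 1–7 response scale (published versions vary in scale length), and the function name `srbai_score` and the example ratings are hypothetical.

```python
def srbai_score(item_ratings, scale_min=1, scale_max=7):
    """Mean of the four SRBAI item ratings; higher means more automatic."""
    if len(item_ratings) != 4:
        raise ValueError("The SRBAI has four items")
    if any(not scale_min <= rating <= scale_max for rating in item_ratings):
        raise ValueError(f"Ratings must be between {scale_min} and {scale_max}")
    return sum(item_ratings) / len(item_ratings)


# Hypothetical ratings before and after four weeks of using a technique.
baseline_score = srbai_score([2, 2, 1, 3])
week_4_score = srbai_score([5, 4, 5, 6])
print(f"baseline = {baseline_score:.1f}, week 4 = {week_4_score:.1f}")
# prints: baseline = 2.0, week 4 = 5.0
```

Recording the four item ratings on the same day each week keeps the scores comparable over time.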

### Key confounds to control for

**Life events:** Stress, travel, illness, holidays — these will affect your behaviour regardless of the technique. Log them.

**Seasonal effects:** Physical activity naturally varies with weather and daylight. If testing in winter vs. summer, note this.

**Other interventions:** Are you also reading a self-help book, seeing a coach, or using an app? These are additional BCTs.

**Measurement reactivity:** Simply tracking a behaviour can change it (the Hawthorne effect). If you are testing self-monitoring, the act of measuring is part of the intervention.

**Novelty effect:** Any new technique may work initially just because it is new. Run the experiment long enough for novelty to wear off.

**Expectation bias:** You may try harder because you know you are testing something. True blinding is rarely possible in a self-experiment, so write down your outcome measures and success criteria before you start and do not change them once the experiment is under way.

### What a positive result would look like

**Behaviour change:** A clear increase in your target behaviour (e.g., from 2 to 4 exercise sessions per week) that is sustained for at least 2 weeks after the initial novelty period

**Consistency:** The behaviour becomes more regular (less day-to-day or week-to-week variation)

**Automaticity:** Your SRBAI score increases (e.g., from 2/7 to 5/7), meaning the behaviour feels more automatic and less effortful

**Replicability:** You can repeat the same technique in a different context (e.g., different time of year, different behaviour) and see a similar effect

**Example of a well-specified self-experiment using the CALO-RE taxonomy:**

*"I will test the technique 'Self-monitoring of behaviour' for increasing my daily step count. For 4 weeks, I will wear a pedometer and log my daily steps in a notebook each evening before bed. I will not change anything else about my routine. My outcome measures are: (1) average daily steps per week, (2) day-to-day variability in steps, and (3) adherence to logging (did I log every day?). A positive result would be an increase of at least 2,000 steps/day from my baseline (measured for 1 week before starting) and logging adherence of at least 90% of days."*
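The success criteria in this example can be checked mechanically once the data are logged. The sketch below assumes daily step counts are kept in a list, with `None` for any day that was not logged. All step counts are invented for illustration, and the helper names `summarise_phase` and `is_positive_result` are hypothetical.

```python
from statistics import mean, stdev


def summarise_phase(daily_steps):
    """Summarise one phase; daily_steps uses None for days that were not logged."""
    logged = [steps for steps in daily_steps if steps is not None]
    return {
        "mean_steps": mean(logged),                   # average daily steps
        "sd_steps": stdev(logged),                    # day-to-day variability
        "adherence": len(logged) / len(daily_steps),  # share of days logged
    }


def is_positive_result(baseline, intervention, min_increase=2000, min_adherence=0.90):
    """Apply the pre-specified criteria: step increase and logging adherence."""
    increase = intervention["mean_steps"] - baseline["mean_steps"]
    return increase >= min_increase and intervention["adherence"] >= min_adherence


# Invented data: 1 baseline week, then 4 intervention weeks with 2 unlogged days.
baseline_steps = [5200, 4800, 6100, 5500, 4900, 5300, 5000]
intervention_steps = [7500, 7800, 8100, 7600, 8200, 7900, 7700] * 4
intervention_steps[9] = None
intervention_steps[20] = None

baseline = summarise_phase(baseline_steps)
intervention = summarise_phase(intervention_steps)
print(f"baseline mean: {baseline['mean_steps']:.0f} steps/day")
print(f"intervention mean: {intervention['mean_steps']:.0f} steps/day")
print(f"logging adherence: {intervention['adherence']:.0%}")
print("positive result:", is_positive_result(baseline, intervention))
```

Fixing the thresholds as function defaults before any data are collected makes it harder to move the goalposts afterwards.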

**Important caveat:** The CALO-RE taxonomy tells you *what* techniques exist, but it does not tell you *how* to deliver them effectively, *how much* to do them, or *which combinations* work best. For that, you need to consult experimental studies or meta-analyses that have tested specific BCTs against each other. The taxonomy is a vocabulary, not a recipe book.
