This vignette walks through the four canonical Magenta Book stages for a worked example: a hypothetical GBP 50m skills programme aimed at increasing employment among long-term unemployed claimants. We move from theory of change to evaluation plan to power calculation to confidence rating, all in one R session.
The theory of change links inputs through to long-run impact.
mb_theory_of_change() captures the five canonical Magenta
Book levels plus assumptions and external factors.
toc <- mb_theory_of_change(
inputs = c("GBP 50m grant", "12 FTE programme team",
"Partnership with Jobcentre Plus"),
activities = c("Design training curriculum",
"Deliver workshops in 50 sites",
"Provide ongoing mentoring"),
outputs = c("500 workshops delivered",
"8000 attendees",
"5000 completed mentoring blocks"),
outcomes = c("Improved employability skills",
"Increased job-search confidence",
"Higher application rates"),
impact = "Higher 12-month employment among long-term unemployed",
assumptions = c(
"Workshops cause skills uplift (not just selection of motivated attendees)",
"Skills uplift translates into application behaviour",
"Local labour markets absorb the additional applicants"
),
external_factors = c(
"Macro labour market remains broadly stable",
"No competing employability programme launches in same areas"
),
name = "Skills uplift programme"
)
toc
#>
#> ── Theory of change: Skills uplift programme ───────────────────────────────────
#> Inputs: GBP 50m grant; 12 FTE programme team; Partnership with Jobcentre Plus
#> Activities: Design training curriculum; Deliver workshops in 50 sites; Provide
#> ongoing mentoring
#> Outputs: 500 workshops delivered; 8000 attendees; 5000 completed mentoring
#> blocks
#> Outcomes: Improved employability skills; Increased job-search confidence;
#> Higher application rates
#> Impact: Higher 12-month employment among long-term unemployed
#> Assumptions: Workshops cause skills uplift (not just selection of motivated
#> attendees); Skills uplift translates into application behaviour; Local labour
#> markets absorb the additional applicants
#> External factors: Macro labour market remains broadly stable; No competing
#> employability programme launches in same areas
#> Vintage: magentabook "0.1.0"
Pivoting to a logframe with indicators, means of verification, and risks:
mb_logframe(
toc,
indicators = list(
outputs = c("Workshops delivered", "Attendees per workshop"),
outcomes = c("Skills score (post)", "Application count"),
impact = "Employment rate at 12 months"
),
mov = list(
outputs = "Programme delivery log",
outcomes = c("Pre/post survey", "DWP admin data"),
impact = "Linked HMRC PAYE records"
),
risks = list(
outputs = "Attendance below planned levels",
outcomes = "Self-report bias in skills score",
impact = "Macro shock confounds the estimate"
)
)
#>
#> ── Logframe: Skills uplift programme ───────────────────────────────────────────
#> level
#> inputs inputs
#> activities activities
#> outputs outputs
#> outcomes outcomes
#> impact impact
#> description
#> inputs GBP 50m grant; 12 FTE programme team; Partnership with Jobcentre Plus
#> activities Design training curriculum; Deliver workshops in 50 sites; Provide ongoing mentoring
#> outputs 500 workshops delivered; 8000 attendees; 5000 completed mentoring blocks
#> outcomes Improved employability skills; Increased job-search confidence; Higher application rates
#> impact Higher 12-month employment among long-term unemployed
#> indicator
#> inputs <NA>
#> activities <NA>
#> outputs Workshops delivered; Attendees per workshop
#> outcomes Skills score (post); Application count
#> impact Employment rate at 12 months
#> mov risk
#> inputs <NA> <NA>
#> activities <NA> <NA>
#> outputs Programme delivery log Attendance below planned levels
#> outcomes Pre/post survey; DWP admin data Self-report bias in skills score
#> impact Linked HMRC PAYE records Macro shock confounds the estimate
The high-criticality assumptions belong in a separate register:
mb_assumptions(
level = c("activities", "outcomes", "impact"),
description = c(
"Workshops are well-attended",
"Skills uplift translates into job entry",
"Employment rise persists at 12 months"
),
evidence = c(
"Pilot attendance was 80%",
"Indirect: similar programmes show 0.3 SD effect",
"Limited evidence on longer-run persistence"
),
criticality = c("medium", "high", "high")
)
#>
#> ── Assumption register (3 items) ───────────────────────────────────────────────
#> level description
#> 1 activities Workshops are well-attended
#> 2 outcomes Skills uplift translates into job entry
#> 3 impact Employment rise persists at 12 months
#> evidence criticality
#> 1 Pilot attendance was 80% medium
#> 2 Indirect: similar programmes show 0.3 SD effect high
#> 3 Limited evidence on longer-run persistence high
Tag the evaluation questions by Magenta Book type:
qs <- mb_questions(
text = c(
"Did the programme cause higher 12-month employment",
"How large is the effect, and for whom",
"Was delivery faithful to the design",
"What was the cost per additional job"
),
type = c("impact", "impact", "process", "economic"),
priority = c("primary", "secondary", "secondary", "primary")
)
qs
#>
#> ── Evaluation questions (4 items) ──────────────────────────────────────────────
#> text type priority
#> 1 Did the programme cause higher 12-month employment impact primary
#> 2 How large is the effect, and for whom impact secondary
#> 3 Was delivery faithful to the design process secondary
#> 4 What was the cost per additional job economic primary
Pin down the counterfactual:
cf <- mb_counterfactual(
definition = "Eligible non-applicants matched on age, prior unemployment duration, and region",
source = "quasi-experimental",
credibility = "Moderate; selection on observables only, but rich admin covariates available"
)
cf
#>
#> ── Counterfactual ──────────────────────────────────────────────────────────────
#> Definition: Eligible non-applicants matched on age, prior unemployment
#> duration, and region
#> Source: quasi-experimental
#> Credibility: Moderate; selection on observables only, but rich admin covariates
#> available
Map stakeholders for governance:
mb_stakeholders(
name = c("HM Treasury", "DWP", "Local authorities", "What Works Centre"),
role = c("Funder", "Policy lead", "Delivery", "Synthesis"),
raci = c("A", "R", "C", "I"),
interest = c(5, 5, 4, 3),
influence = c(5, 5, 3, 2)
)
#>
#> ── Stakeholders (4 items) ──────────────────────────────────────────────────────
#> name role raci interest influence
#> 1 HM Treasury Funder A 5 5
#> 2 DWP Policy lead R 5 5
#> 3 Local authorities Delivery C 4 3
#> 4 What Works Centre Synthesis I 3 2
Bundle into a plan:
plan <- mb_evaluation_plan(
scope = "GBP 50m programme, 50 sites, 2026-2029",
questions = qs,
methods = c(
impact = "Difference-in-differences with matched comparison group",
process = "Mixed-methods implementation review",
economic = "Cost per job, with QALY-adjusted variant"
),
timing = c(baseline = "2026-Q1", midline = "2027-Q4", endline = "2029-Q2"),
governance = "Joint HMT / DWP steering group; peer review by What Works Centre",
budget = 1.5e6
)
plan
#>
#> ── Evaluation plan ─────────────────────────────────────────────────────────────
#> Scope: GBP 50m programme, 50 sites, 2026-2029
#> Questions: 4 (primary: 2)
#> Method (impact): Difference-in-differences with matched comparison group
#> Method (process): Mixed-methods implementation review
#> Method (economic): Cost per job, with QALY-adjusted variant
#> Timing: 2026-Q1; 2027-Q4; 2029-Q2
#> Governance: Joint HMT / DWP steering group; peer review by What Works Centre
#> Budget: "GBP 1.50m"
#> Vintage: magentabook "0.1.0"
The Magenta Book stresses that an evaluation is only worth running if it can detect effects of policy-relevant size. We size the study assuming a target detectable effect of 5 percentage points on the employment rate, baseline employment of 30 percent, and 80 percent power.
Naive (individual-level) sample size:
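A sketch in base R, using power.prop.test() from the bundled stats package with the figures stated above (30 percent baseline, 5-point uplift, 80 percent power):

```r
# Naive per-arm sample size, ignoring clustering: detect a rise from a
# 30% baseline to 35%, two-sided 5% test, 80% power.
power.prop.test(p1 = 0.30, p2 = 0.35, sig.level = 0.05, power = 0.80)
```

This comes out at roughly 1,400 participants per arm.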
But the programme is delivered in clusters (sites), so we need to inflate by the design effect. Jobcentre-level outcomes have an ICC around 0.04 (per the bundled DWP reference values):
mb_icc_reference("employment")
#> domain outcome unit_of_clustering icc_low icc_central icc_high
#> 8 employment job_entry jobcentre 0.02 0.04 0.08
#> 9 employment wage jobcentre 0.03 0.06 0.10
#> value_source
#> 8 central_estimate
#> 9 central_estimate
#> source
#> 8 DWP impact evaluations (synthesis across multiple programmes)
#> 9 DWP impact evaluations
#> notes
#> 8 Claimant-level outcomes within Jobcentre Plus offices; central value is researcher synthesis
#> 9 Claimant wage outcomes within Jobcentres; central value is researcher synthesis
mb_cluster_design(individuals_per_cluster = 50, icc = 0.04, n_clusters = 25)
#> $deff
#> [1] 2.96
#>
#> $n_total_per_arm
#> [1] 1250
#>
#> $n_effective_per_arm
#> [1] 422.2973
A design effect of 2.96 is a meaningful uplift: the clustered design needs roughly three times the naive N per arm. Alternatively, a stepped-wedge design could trade a larger total N for a staggered rollout that fits programme delivery:
mb_stepped_wedge(
steps = 5, clusters_per_step = 5,
individuals_per_cluster = 50, icc = 0.04
)
#> $deff_cluster
#> [1] 10.96
#>
#> $correction_factor
#> [1] 0.3
#>
#> $deff_sw
#> [1] 3.288
#>
#> $n_total
#> [1] 1250
What is the smallest effect we can detect with the planned design?
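A back-of-envelope answer in base R; this is a rough sketch rather than a magentabook function, and it assumes a two-sided 5 percent test, 80 percent power, equal arms, a normal approximation for the difference in proportions, and the cluster design above:

```r
# Minimum detectable effect on the employment rate under the planned
# design (25 clusters of 50 per arm, ICC 0.04), normal approximation,
# two-sided 5% test, 80% power.
p0    <- 0.30                        # baseline employment rate
deff  <- 1 + (50 - 1) * 0.04         # design effect = 2.96
n_eff <- (25 * 50) / deff            # effective n per arm, ~422
z     <- qnorm(0.975) + qnorm(0.80)
mdes  <- z * sqrt(2 * p0 * (1 - p0) / n_eff)
round(mdes, 3)                       # about 0.088, i.e. roughly 9 points
```

Under these assumptions the minimum detectable effect is around 9 percentage points, well above the 5-point target, which would argue for more clusters or larger cluster sizes before committing to the design.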
Once the evaluation has run, score it on the Maryland SMS:
sms <- mb_sms_rate(
level = 4,
study = "Smith et al. (2029) Skills uplift evaluation",
design = "Difference-in-differences with matched comparison",
notes = "Parallel trends supported by 4 pre-period observations; cluster-robust SEs"
)
sms
#>
#> ── Maryland SMS Level 4: Strong ────────────────────────────────────────────────
#> Study: Smith et al. (2029) Skills uplift evaluation
#> Design: Difference-in-differences with matched comparison
#> Notes: Parallel trends supported by 4 pre-period observations; cluster-robust
#> SEs
#> Description: Comparison between treatment and comparison units accounting for
#> unobservable differences
#> Causal inference: Strong if identifying assumptions hold
Record a structured confidence rating:
conf_main <- mb_confidence(
rating = "medium",
question = "Did the programme raise 12-month employment",
evidence_strength = "One Level 4 DiD (n = 12000); supportive Level 3 cohort study",
methodological_quality = "Adequate; parallel trends plausible; some attrition concerns",
generalisability = "Established across 50 sites in two regions",
rationale = "Effect direction consistent across two studies but limited replication outside the programme footprint"
)
conf_main
#>
#> ── Medium confidence ───────────────────────────────────────────────────────────
#> Question: Did the programme raise 12-month employment
#> Evidence strength: One Level 4 DiD (n = 12000); supportive Level 3 cohort study
#> Methodological quality: Adequate; parallel trends plausible; some attrition
#> concerns
#> Generalisability: Established across 50 sites in two regions
#> Rationale: Effect direction consistent across two studies but limited
#> replication outside the programme footprint
#> Decision implication: Indicative evidence; supports continued investment with
#> monitoring
conf_process <- mb_confidence(
rating = "high",
question = "Was the programme implemented faithfully",
evidence_strength = "Mixed-methods process evaluation; 50-site fidelity audit",
methodological_quality = "Strong; documented fidelity protocol with inter-rater reliability",
generalisability = "All sites covered",
rationale = "Comprehensive coverage; consistent fidelity scores"
)
mb_confidence_summary(conf_main, conf_process)
#>
#> ── Confidence summary (2 ratings) ──────────────────────────────────────────────
#> high: 1
#> medium: 1
#> low: 0
#>
#> ── Ratings ──
#>
#> question rating
#> 1 Did the programme raise 12-month employment medium
#> 2 Was the programme implemented faithfully high
#> rationale
#> 1 Effect direction consistent across two studies but limited replication outside the programme footprint
#> 2 Comprehensive coverage; consistent fidelity scores
A single mb_report object aggregates everything:
report <- mb_evaluation_report(
plan = plan,
toc = toc,
sms = sms,
confidence = list(conf_main, conf_process),
name = "Skills uplift evaluation"
)
report
#> ── Magenta Book evaluation report: Skills uplift evaluation ────────────────────
#> Theory of change: present
#> Plan: present
#> SMS ratings: 1
#> Confidence ratings: 2
#> Cost-effectiveness items: 0
#> Vintage: magentabook "0.1.0"
Export to LaTeX for a one-pager:
cat(mb_to_latex(report, caption = "Skills uplift evaluation summary"))
#> \begin{table}[h]
#> \centering
#> \begin{tabular}{ll}
#> \hline
#> Component & Value \\
#> \hline
#> Name & Skills uplift evaluation \\
#> Vintage & magentabook 0.1.0 \\
#> Has theory of change & yes \\
#> Has plan & yes \\
#> SMS ratings & 1 \\
#> Confidence ratings & 2 \\
#> Cost-effectiveness items & 0 \\
#> \hline
#> \end{tabular}\caption{Skills uplift evaluation summary}
#>
#> \end{table}
Word and Excel exports are available via mb_to_word() and mb_to_excel() (both require optional packages: officer + flextable, and openxlsx, respectively).
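A guarded call avoids a hard error when the optional packages are not installed; the output path argument here is an illustrative assumption, so check ?mb_to_word for the actual signature:

```r
# Export to Word only if the optional dependencies are installed.
# The file name and second argument are assumptions, not the documented API.
if (requireNamespace("officer", quietly = TRUE) &&
    requireNamespace("flextable", quietly = TRUE)) {
  mb_to_word(report, "skills-uplift-summary.docx")
}
```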
Every result object stamps the package vintage. Bundled rubric and
reference tables expose their source via
mb_data_versions():
mb_data_versions()
#> dataset
#> 1 sms_rubric
#> 2 confidence_rubric
#> 3 icc_reference
#> 4 question_taxonomy
#> source
#> 1 Sherman, Gottfredson, MacKenzie, Eck, Reuter & Bushway (1997). Preventing Crime: What Works, What Doesn't, What's Promising. Numeric levels 1-5 are the original Maryland Scientific Methods Scale.
#> 2 Synthesised from What Works Centre confidence-rating traditions: Education Endowment Foundation (5 padlocks), Early Intervention Foundation (Foundation Standards), College of Policing (1-5 scale), and the Justice Data Lab (red / amber / green). Three-level high / medium / low structure adopted to align with HM Treasury Magenta Book (2020) supplementary value-for-money guidance.
#> 3 Hedges & Hedberg (2007); Adams, Gulliford, Ukoumunne, Eldridge, Chinn & Campbell (2004); Campbell, Mollison & Grimshaw (2000); EEF / DfE / DWP / MHCLG / MoJ impact-evaluation reports.
#> 4 HM Treasury Magenta Book (2020) chapters on process, impact, and economic evaluation; supplementary Magenta Book guides on value for money and theory-based evaluation.
#> last_updated
#> 1 2026-04-27
#> 2 2026-04-27
#> 3 2026-04-27
#> 4 2026-04-27
#> notes
#> 1 Numeric levels 1-5 are direct from Sherman et al. (1997). Word labels (Weakest / Weak / Moderate / Strong / Strongest) follow What Works UK / Education Endowment Foundation convention. Design examples and typical-use columns are magentabook synthesis.
#> 2 Not a direct quotation from the Magenta Book. magentabook synthesis of cross-What-Works-Centre confidence-rating traditions. Three-level structure designed for Treasury / consultancy decision-grade reporting.
#> 3 Reference intra-class correlation coefficients across UK policy domains. Each row is tagged in the bundled CSV with value_source = 'table_quote' (direct extraction with table number) or 'central_estimate' (researcher synthesis within published range). Practitioners should compute domain-specific ICCs from baseline data wherever feasible.
#> 4 Magenta Book canonical evaluation question taxonomy with methods and chapter references. Sub-types (e.g. 'attribution', 'fidelity') are conventional categories used across HMG evaluation practice.