Designing a Magenta Book evaluation

This vignette walks through the four canonical Magenta Book stages using a worked example: a hypothetical GBP 50m skills programme aimed at increasing employment among long-term unemployed claimants. We move from theory of change to evaluation plan to power calculation to confidence rating, all in one R session.

Stage 1: theory of change

The theory of change links inputs through to long-run impact. mb_theory_of_change() captures the five canonical Magenta Book levels plus assumptions and external factors.

toc <- mb_theory_of_change(
  inputs     = c("GBP 50m grant", "12 FTE programme team",
                 "Partnership with Jobcentre Plus"),
  activities = c("Design training curriculum",
                 "Deliver workshops in 50 sites",
                 "Provide ongoing mentoring"),
  outputs    = c("500 workshops delivered",
                 "8000 attendees",
                 "5000 completed mentoring blocks"),
  outcomes   = c("Improved employability skills",
                 "Increased job-search confidence",
                 "Higher application rates"),
  impact     = "Higher 12-month employment among long-term unemployed",
  assumptions = c(
    "Workshops cause skills uplift (not just selection of motivated attendees)",
    "Skills uplift translates into application behaviour",
    "Local labour markets absorb the additional applicants"
  ),
  external_factors = c(
    "Macro labour market remains broadly stable",
    "No competing employability programme launches in same areas"
  ),
  name = "Skills uplift programme"
)
toc
#> 
#> ── Theory of change: Skills uplift programme ───────────────────────────────────
#> Inputs: GBP 50m grant; 12 FTE programme team; Partnership with Jobcentre Plus
#> Activities: Design training curriculum; Deliver workshops in 50 sites; Provide
#> ongoing mentoring
#> Outputs: 500 workshops delivered; 8000 attendees; 5000 completed mentoring
#> blocks
#> Outcomes: Improved employability skills; Increased job-search confidence;
#> Higher application rates
#> Impact: Higher 12-month employment among long-term unemployed
#> Assumptions: Workshops cause skills uplift (not just selection of motivated
#> attendees); Skills uplift translates into application behaviour; Local labour
#> markets absorb the additional applicants
#> External factors: Macro labour market remains broadly stable; No competing
#> employability programme launches in same areas
#> Vintage: magentabook "0.1.0"

The theory of change pivots into a logframe with indicators, means of verification (the mov argument), and risks:

mb_logframe(
  toc,
  indicators = list(
    outputs  = c("Workshops delivered", "Attendees per workshop"),
    outcomes = c("Skills score (post)", "Application count"),
    impact   = "Employment rate at 12 months"
  ),
  mov = list(
    outputs  = "Programme delivery log",
    outcomes = c("Pre/post survey", "DWP admin data"),
    impact   = "Linked HMRC PAYE records"
  ),
  risks = list(
    outputs  = "Attendance below planned levels",
    outcomes = "Self-report bias in skills score",
    impact   = "Macro shock confounds the estimate"
  )
)
#> 
#> ── Logframe: Skills uplift programme ───────────────────────────────────────────
#>                 level
#> inputs         inputs
#> activities activities
#> outputs       outputs
#> outcomes     outcomes
#> impact         impact
#>                                                                                         description
#> inputs                        GBP 50m grant; 12 FTE programme team; Partnership with Jobcentre Plus
#> activities     Design training curriculum; Deliver workshops in 50 sites; Provide ongoing mentoring
#> outputs                    500 workshops delivered; 8000 attendees; 5000 completed mentoring blocks
#> outcomes   Improved employability skills; Increased job-search confidence; Higher application rates
#> impact                                        Higher 12-month employment among long-term unemployed
#>                                              indicator
#> inputs                                            <NA>
#> activities                                        <NA>
#> outputs    Workshops delivered; Attendees per workshop
#> outcomes        Skills score (post); Application count
#> impact                    Employment rate at 12 months
#>                                        mov                               risk
#> inputs                                <NA>                               <NA>
#> activities                            <NA>                               <NA>
#> outputs             Programme delivery log    Attendance below planned levels
#> outcomes   Pre/post survey; DWP admin data   Self-report bias in skills score
#> impact            Linked HMRC PAYE records Macro shock confounds the estimate

The most critical assumptions belong in a separate register, each graded by criticality and the evidence behind it:

mb_assumptions(
  level = c("activities", "outcomes", "impact"),
  description = c(
    "Workshops are well-attended",
    "Skills uplift translates into job entry",
    "Employment rise persists at 12 months"
  ),
  evidence = c(
    "Pilot attendance was 80%",
    "Indirect: similar programmes show 0.3 SD effect",
    "Limited evidence on longer-run persistence"
  ),
  criticality = c("medium", "high", "high")
)
#> 
#> ── Assumption register (3 items) ───────────────────────────────────────────────
#>        level                             description
#> 1 activities             Workshops are well-attended
#> 2   outcomes Skills uplift translates into job entry
#> 3     impact   Employment rise persists at 12 months
#>                                          evidence criticality
#> 1                        Pilot attendance was 80%      medium
#> 2 Indirect: similar programmes show 0.3 SD effect        high
#> 3      Limited evidence on longer-run persistence        high

Stage 2: evaluation plan

Tag the evaluation questions by Magenta Book type:

qs <- mb_questions(
  text = c(
    "Did the programme cause higher 12-month employment",
    "How large is the effect, and for whom",
    "Was delivery faithful to the design",
    "What was the cost per additional job"
  ),
  type     = c("impact", "impact", "process", "economic"),
  priority = c("primary", "secondary", "secondary", "primary")
)
qs
#> 
#> ── Evaluation questions (4 items) ──────────────────────────────────────────────
#>                                                 text     type  priority
#> 1 Did the programme cause higher 12-month employment   impact   primary
#> 2              How large is the effect, and for whom   impact secondary
#> 3                Was delivery faithful to the design  process secondary
#> 4               What was the cost per additional job economic   primary

Pin down the counterfactual:

cf <- mb_counterfactual(
  definition  = "Eligible non-applicants matched on age, prior unemployment duration, and region",
  source      = "quasi-experimental",
  credibility = "Moderate; selection on observables only, but rich admin covariates available"
)
cf
#> 
#> ── Counterfactual ──────────────────────────────────────────────────────────────
#> Definition: Eligible non-applicants matched on age, prior unemployment
#> duration, and region
#> Source: quasi-experimental
#> Credibility: Moderate; selection on observables only, but rich admin covariates
#> available

Map stakeholders for governance:

mb_stakeholders(
  name = c("HM Treasury", "DWP", "Local authorities", "What Works Centre"),
  role = c("Funder", "Policy lead", "Delivery", "Synthesis"),
  raci = c("A", "R", "C", "I"),
  interest  = c(5, 5, 4, 3),
  influence = c(5, 5, 3, 2)
)
#> 
#> ── Stakeholders (4 items) ──────────────────────────────────────────────────────
#>                name        role raci interest influence
#> 1       HM Treasury      Funder    A        5         5
#> 2               DWP Policy lead    R        5         5
#> 3 Local authorities    Delivery    C        4         3
#> 4 What Works Centre   Synthesis    I        3         2
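
The interest and influence scores map onto the classic interest-influence grid. A base-graphics sketch, re-keying the same values rather than assuming the structure of the object mb_stakeholders() returns:

sh <- data.frame(
  name      = c("HM Treasury", "DWP", "Local authorities", "What Works Centre"),
  interest  = c(5, 5, 4, 3),
  influence = c(5, 5, 3, 2)
)
plot(sh$interest, sh$influence, xlim = c(0, 6), ylim = c(0, 6),
     pch = 19, xlab = "Interest", ylab = "Influence")
text(sh$interest, sh$influence, labels = sh$name, pos = 2, cex = 0.8)
# HM Treasury and DWP share (5, 5); offset their labels if plotting for real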

Bundle into a plan:

plan <- mb_evaluation_plan(
  scope = "GBP 50m programme, 50 sites, 2026-2029",
  questions = qs,
  methods = c(
    impact   = "Difference-in-differences with matched comparison group",
    process  = "Mixed-methods implementation review",
    economic = "Cost per job, with QALY-adjusted variant"
  ),
  timing = c(baseline = "2026-Q1", midline = "2027-Q4", endline = "2029-Q2"),
  governance = "Joint HMT / DWP steering group; peer review by What Works Centre",
  budget = 1.5e6
)
plan
#> 
#> ── Evaluation plan ─────────────────────────────────────────────────────────────
#> Scope: GBP 50m programme, 50 sites, 2026-2029
#> Questions: 4 (primary: 2)
#> Method (impact): Difference-in-differences with matched comparison group
#> Method (process): Mixed-methods implementation review
#> Method (economic): Cost per job, with QALY-adjusted variant
#> Timing: 2026-Q1; 2027-Q4; 2029-Q2
#> Governance: Joint HMT / DWP steering group; peer review by What Works Centre
#> Budget: "GBP 1.50m"
#> Vintage: magentabook "0.1.0"

Stage 3: power and sample size

The Magenta Book stresses that an evaluation is only worth running if it can detect effects of policy-relevant size. We size the study for a target detectable effect of 5 percentage points on the employment rate, baseline employment of 30 percent, 80 percent power, and a two-sided 5 percent significance level.

Naive (individual-level) sample size per arm:

mb_sample_size(
  type = "proportion", p1 = 0.30, p2 = 0.35,
  power = 0.8, alpha = 0.05
)
#> [1] 1376
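
This matches base R's power.prop.test(), which solves the same two-proportion problem outside the package (a cross-check, not part of the magentabook API):

power.prop.test(p1 = 0.30, p2 = 0.35, power = 0.8, sig.level = 0.05)
# n per group comes out at about 1376, agreeing with mb_sample_size()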

But the programme is delivered in clusters (sites), so the naive figure must be inflated by the design effect. Claimant job-entry outcomes within Jobcentre Plus offices have a central ICC of around 0.04 in the bundled DWP reference values:

mb_icc_reference("employment")
#>       domain   outcome unit_of_clustering icc_low icc_central icc_high
#> 8 employment job_entry          jobcentre    0.02        0.04     0.08
#> 9 employment      wage          jobcentre    0.03        0.06     0.10
#>       value_source
#> 8 central_estimate
#> 9 central_estimate
#>                                                          source
#> 8 DWP impact evaluations (synthesis across multiple programmes)
#> 9                                        DWP impact evaluations
#>                                                                                          notes
#> 8 Claimant-level outcomes within Jobcentre Plus offices; central value is researcher synthesis
#> 9              Claimant wage outcomes within Jobcentres; central value is researcher synthesis
mb_cluster_design(individuals_per_cluster = 50, icc = 0.04, n_clusters = 25)
#> $deff
#> [1] 2.96
#> 
#> $n_total_per_arm
#> [1] 1250
#> 
#> $n_effective_per_arm
#> [1] 422.2973
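
The design effect follows the standard Kish formula, deff = 1 + (m - 1) * ICC. A hand re-computation, inflating the naive per-arm requirement (a sketch; mb_cluster_design() may apply further adjustments):

m    <- 50    # individuals per cluster
icc  <- 0.04  # central Jobcentre ICC from mb_icc_reference()
deff <- 1 + (m - 1) * icc
deff                      # 2.96, matching mb_cluster_design()
ceiling(1376 * deff)      # about 4073 individuals per arm once clustering is priced in
ceiling(1376 * deff / m)  # about 82 clusters of 50 per arm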

A design effect of 2.96 is a meaningful uplift: the naive 1376 per arm inflates to about 4073 individuals (around 82 clusters of 50) per arm, while the planned 25 clusters deliver an effective sample of only about 422. Alternatively, a stepped-wedge design could trade a larger total N for a staggered rollout that fits programme delivery:

mb_stepped_wedge(
  steps = 5, clusters_per_step = 5,
  individuals_per_cluster = 50, icc = 0.04
)
#> $deff_cluster
#> [1] 10.96
#> 
#> $correction_factor
#> [1] 0.3
#> 
#> $deff_sw
#> [1] 3.288
#> 
#> $n_total
#> [1] 1250
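
As the output shows, deff_sw is the cluster design effect scaled by the correction factor. Comparing effective sample sizes under the two designs (a rough reading that ignores period effects and secular trends):

10.96 * 0.3   # 3.288, the reported deff_sw
1250 / 2.96   # about 422 effective under the parallel cluster design
1250 / 3.288  # about 380 effective under the stepped wedge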

What is the smallest effect a given design can detect? With, say, 600 individuals per group:

mb_mde(
  n_per_group = 600, type = "proportion",
  baseline = 0.30, power = 0.8
)
#> [1] 0.07641078
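
The figure can be approximated by hand with the usual normal-approximation formula for two proportions (mb_mde() may solve the problem exactly, so a small discrepancy is expected):

z_alpha <- qnorm(0.975)  # two-sided alpha = 0.05
z_beta  <- qnorm(0.80)   # 80 percent power
p0      <- 0.30          # baseline employment rate
(z_alpha + z_beta) * sqrt(2 * p0 * (1 - p0) / 600)
# about 0.074, close to the 0.0764 from mb_mde()

At roughly 7.6 percentage points, the MDE exceeds the 5-point target effect, which argues for more or larger clusters before fieldwork begins.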

Stage 4: rate the evidence

Once the evaluation has run, score it on the Maryland Scientific Methods Scale (SMS):

sms <- mb_sms_rate(
  level  = 4,
  study  = "Smith et al. (2029) Skills uplift evaluation",
  design = "Difference-in-differences with matched comparison",
  notes  = "Parallel trends supported by 4 pre-period observations; cluster-robust SEs"
)
sms
#> 
#> ── Maryland SMS Level 4: Strong ────────────────────────────────────────────────
#> Study: Smith et al. (2029) Skills uplift evaluation
#> Design: Difference-in-differences with matched comparison
#> Notes: Parallel trends supported by 4 pre-period observations; cluster-robust
#> SEs
#> Description: Comparison between treatment and comparison units accounting for
#> unobservable differences
#> Causal inference: Strong if identifying assumptions hold

Record a structured confidence rating:

conf_main <- mb_confidence(
  rating                 = "medium",
  question               = "Did the programme raise 12-month employment",
  evidence_strength      = "One Level 4 DiD (n = 12000); supportive Level 3 cohort study",
  methodological_quality = "Adequate; parallel trends plausible; some attrition concerns",
  generalisability       = "Established across 50 sites in two regions",
  rationale              = "Effect direction consistent across two studies but limited replication outside the programme footprint"
)
conf_main
#> 
#> ── Medium confidence ───────────────────────────────────────────────────────────
#> Question: Did the programme raise 12-month employment
#> Evidence strength: One Level 4 DiD (n = 12000); supportive Level 3 cohort study
#> Methodological quality: Adequate; parallel trends plausible; some attrition
#> concerns
#> Generalisability: Established across 50 sites in two regions
#> Rationale: Effect direction consistent across two studies but limited
#> replication outside the programme footprint
#> Decision implication: Indicative evidence; supports continued investment with
#> monitoring

conf_process <- mb_confidence(
  rating                 = "high",
  question               = "Was the programme implemented faithfully",
  evidence_strength      = "Mixed-methods process evaluation; 50-site fidelity audit",
  methodological_quality = "Strong; documented fidelity protocol with inter-rater reliability",
  generalisability       = "All sites covered",
  rationale              = "Comprehensive coverage; consistent fidelity scores"
)

mb_confidence_summary(conf_main, conf_process)
#> 
#> ── Confidence summary (2 ratings) ──────────────────────────────────────────────
#> high: 1
#> medium: 1
#> low: 0
#> 
#> ── Ratings ──
#> 
#>                                      question rating
#> 1 Did the programme raise 12-month employment medium
#> 2    Was the programme implemented faithfully   high
#>                                                                                                rationale
#> 1 Effect direction consistent across two studies but limited replication outside the programme footprint
#> 2                                                     Comprehensive coverage; consistent fidelity scores

Bringing it together

A single mb_report object aggregates everything:

report <- mb_evaluation_report(
  plan       = plan,
  toc        = toc,
  sms        = sms,
  confidence = list(conf_main, conf_process),
  name       = "Skills uplift evaluation"
)
report
#> ── Magenta Book evaluation report: Skills uplift evaluation ────────────────────
#> Theory of change: present
#> Plan: present
#> SMS ratings: 1
#> Confidence ratings: 2
#> Cost-effectiveness items: 0
#> Vintage: magentabook "0.1.0"

Export to LaTeX for a one-pager:

cat(mb_to_latex(report, caption = "Skills uplift evaluation summary"))
#> \begin{table}[h]
#> \centering
#> \begin{tabular}{ll}
#> \hline
#> Component & Value \\
#> \hline
#> Name & Skills uplift evaluation \\
#> Vintage & magentabook 0.1.0 \\
#> Has theory of change & yes \\
#> Has plan & yes \\
#> SMS ratings & 1 \\
#> Confidence ratings & 2 \\
#> Cost-effectiveness items & 0 \\
#> \hline
#> \end{tabular}\caption{Skills uplift evaluation summary}
#> 
#> \end{table}

Word and Excel exports are available via mb_to_word() and mb_to_excel(); the former requires the optional officer and flextable packages, the latter openxlsx.
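
A guarded call keeps scripts runnable when the optional dependencies are absent; note the path argument below is an assumption about the export signature, not documented behaviour:

if (requireNamespace("officer", quietly = TRUE) &&
    requireNamespace("flextable", quietly = TRUE)) {
  mb_to_word(report, path = "skills-uplift-summary.docx")  # 'path' is assumed
}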

Reproducibility

Every result object stamps the package vintage. Bundled rubric and reference tables expose their source via mb_data_versions():

mb_data_versions()
#>             dataset
#> 1        sms_rubric
#> 2 confidence_rubric
#> 3     icc_reference
#> 4 question_taxonomy
#>                                                                                                                                                                                                                                                                                                                                                                                         source
#> 1                                                                                                                                                                                          Sherman, Gottfredson, MacKenzie, Eck, Reuter & Bushway (1997). Preventing Crime: What Works, What Doesn't, What's Promising. Numeric levels 1-5 are the original Maryland Scientific Methods Scale.
#> 2 Synthesised from What Works Centre confidence-rating traditions: Education Endowment Foundation (5 padlocks), Early Intervention Foundation (Foundation Standards), College of Policing (1-5 scale), and the Justice Data Lab (red / amber / green). Three-level high / medium / low structure adopted to align with HM Treasury Magenta Book (2020) supplementary value-for-money guidance.
#> 3                                                                                                                                                                                                      Hedges & Hedberg (2007); Adams, Gulliford, Ukoumunne, Eldridge, Chinn & Campbell (2004); Campbell, Mollison & Grimshaw (2000); EEF / DfE / DWP / MHCLG / MoJ impact-evaluation reports.
#> 4                                                                                                                                                                                                                      HM Treasury Magenta Book (2020) chapters on process, impact, and economic evaluation; supplementary Magenta Book guides on value for money and theory-based evaluation.
#>   last_updated
#> 1   2026-04-27
#> 2   2026-04-27
#> 3   2026-04-27
#> 4   2026-04-27
#>                                                                                                                                                                                                                                                                                                                                                 notes
#> 1                                                                                         Numeric levels 1-5 are direct from Sherman et al. (1997). Word labels (Weakest / Weak / Moderate / Strong / Strongest) follow What Works UK / Education Endowment Foundation convention. Design examples and typical-use columns are magentabook synthesis.
#> 2                                                                                                                                    Not a direct quotation from the Magenta Book. magentabook synthesis of cross-What-Works-Centre confidence-rating traditions. Three-level structure designed for Treasury / consultancy decision-grade reporting.
#> 3 Reference intra-class correlation coefficients across UK policy domains. Each row is tagged in the bundled CSV with value_source = 'table_quote' (direct extraction with table number) or 'central_estimate' (researcher synthesis within published range). Practitioners should compute domain-specific ICCs from baseline data wherever feasible.
#> 4                                                                                                                                                Magenta Book canonical evaluation question taxonomy with methods and chapter references. Sub-types (e.g. 'attribution', 'fidelity') are conventional categories used across HMG evaluation practice.