The Unjournal · Pivotal Questions Initiative

Linear WELLBYs for Comparing Interventions

A briefing for workshop deliberation — concepts, evidence, and tradeoffs

⚠️ AI-Generated Content (March 2026)

This page was generated with AI assistance (Claude Code + ChatGPT deep research) and revised based on 75+ Hypothes.is workshop comments.

The content aims to be workshop-neutral, framing issues for deliberation rather than prescribing conclusions. Readers should verify quantitative claims against the original literature.

1. The decision problem

Organizations comparing interventions—especially in low- and middle-income countries (LMICs)—face a measurement problem: interventions change different things (mortality, morbidity, consumption, mental health, social cohesion). The WELLBY approach proposes translating these into a common unit based on subjective wellbeing, enabling "welfare impact per dollar" comparisons.[1]Frijters, Clark, Krekel & Layard (2020). "A Happy Choice: Wellbeing as the Goal of Government."

A focal question for this workshop: How reliably can we compare interventions by aggregating changes in reported wellbeing (the WELLBY approach), especially across different studies and contexts? This is one of several comparison frameworks; others include DALY/QALY-based approaches, capability approaches, and direct monetary valuation.

Pipeline: Intervention → Study design → Measured outcomes (LS / DALY / depression) → Translation layer (mapping, calibration) → Common currency (WELLBY / DALY / $) → Decision.

Workshop goals: (1) Clarity about which assumptions matter most for which comparisons, and what evidence would change views. (2) Share information and synthesize participant expertise. (3) Generate practical insights and actionable recommendations for funders working with current evidence.

2. Definitions and key concepts

WELLBY Definition

1 WELLBY = a one-point change in life satisfaction (0-10 scale) × 1 person × 1 year

Source: UK Green Book Wellbeing Guidance (HM Treasury, 2021/2024)

Standard Life Satisfaction Questions

OECD single-item: "Overall, how satisfied are you with your life as a whole these days?" (0 = "not at all satisfied" to 10 = "completely satisfied")[2]OECD Guidelines on Measuring Subjective Well-being (2013/2024).

Cantril ladder (Gallup): "Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you, and the bottom represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?"

These two framings—satisfaction vs. ladder position—are often used interchangeably, but may capture subtly different constructs.

Incremental vs. Level-Based Accounting

Incremental WELLBYs (what most intervention comparisons need):

$\Delta W(k) = \sum_{i} \sum_{t} \delta^t \left( LS_{it}^{(k)} - LS_{it}^{(0)} \right)$

Level-based WELLBYs (for mortality comparisons):

$W = \sum_{i} \sum_{t} \delta^t \, LS_{it}$
Notation key
  • $i$ = individual (summing across people)
  • $t$ = time period (summing across years)
  • $LS$ = Life Satisfaction score (0-10)
  • $\delta$ = discount factor for future years
  • $k$ = intervention; $0$ = counterfactual

In practice, RCTs estimate $LS^{(k)} - LS^{(0)}$ directly via experimental comparison.
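The incremental formula can be sketched directly in code. This is a toy illustration: the effect size (0.3 points), horizon (3 years), and discount factor (0.965, roughly a 3.5% annual rate) are assumptions chosen for the example, not estimates from any study.

```python
# Toy sketch of the incremental WELLBY formula: sum over years of the
# discounted difference between treated and counterfactual mean LS.
# All numbers are illustrative assumptions.

def incremental_wellbys(ls_treated, ls_control, delta=0.965):
    """Discounted sum of yearly LS differences for one recipient."""
    return sum(delta ** t * (lk - l0)
               for t, (lk, l0) in enumerate(zip(ls_treated, ls_control)))

# Mean LS of 5.3 under the intervention vs. 5.0 counterfactual, for 3 years:
dw = incremental_wellbys([5.3, 5.3, 5.3], [5.0, 5.0, 5.0])
```

Summing across recipients then just multiplies by the number of people, as in the $\sum_i$ of the formula above.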

Technical definitions (reporting function, instrument, latent distribution)

Instrument: The specific measurement tool—exact question wording, response format (0-10 vs 1-7), anchors, translation, survey mode.

Reporting function: $f_i(\cdot)$ mapping latent welfare to reported score: $LS_{it} = f_i(u_{it}) + \varepsilon_{it}$

Latent distribution: The unobserved underlying welfare distribution. Since we only see reported scores, conclusions can depend on assumptions about this hidden distribution.

3. Core assumptions

Using linear WELLBYs for cross-intervention comparison requires assumptions. A common overstatement is that "equal scores mean equal welfare"—but this is stronger than most applications need.

Cardinality (linearity)

Equal intervals on the scale imply equal welfare differences: a move from 3 to 4 yields the same welfare gain as a move from 7 to 8. If violated, summing scores can distort comparisons.

Unit-change comparability

A one-point change has approximately the same welfare meaning across people. This is weaker than requiring equal levels to mean equal welfare.

Temporal aggregation

Integrating wellbeing over time is meaningful. May fail if adaptation returns people to baseline, or if respondents reinterpret the scale over time (response shift).

Cross-domain capture

Life satisfaction incorporates welfare from many domains (health, income, relationships), not merely transient "mood."

Time structure and discounting

Timeline: measure LS and mental-health scales at baseline (t = 0), again at follow-up (t = 1), and at later waves (t > 1) to estimate persistence, decay, and response shift; apply discounting δ^t if aggregating over time.

4. The key critique: identification and transformations

Bond & Lang (2019) argue that with ordinal happiness data, comparing "average happiness" between groups is not identified without strong assumptions—monotonic transformations can reverse conclusions.[3]Bond, T.N. & Lang, K. (2019). "The Sad Truth about Happiness Scales." JPE, 127(4).

What "non-identified" means

A parameter is "identified" when data + assumptions pin down a unique value. Ordinal responses only tell us which interval a latent value falls into. Many different latent distributions and transformations can generate the same observed category counts, so rankings of means can change across equally admissible representations.

Why this matters for intervention comparison

Transformation Sensitivity Demo

(Interactive widget on the live page.) See how monotone transformations of the LS scale can change mean comparisons. Note: the demo uses toy data for illustration; the principle, that the same ordinal data under different cardinalizations can yield different rankings, is general.
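The same point can be reproduced in a few lines. The scores below are made up purely to illustrate the Bond & Lang concern: a group with polarized responses can rank below a homogeneous group on raw means, yet above it after a strictly increasing relabeling of the scale.

```python
# Toy illustration: a monotone relabeling of an ordinal 0-10 scale
# reverses a comparison of group means. Scores are made up.

group_a = [0, 10, 10]    # polarized responses
group_b = [7, 7, 7]      # homogeneous responses

def mean(xs):
    return sum(xs) / len(xs)

raw_a, raw_b = mean(group_a), mean(group_b)
# Raw means: A ≈ 6.67 < B = 7.0, so B looks better off.

# A strictly increasing transform, i.e. an equally admissible
# cardinalization of the same ordinal data:
f = lambda x: x ** 2
t_a = mean([f(x) for x in group_a])
t_b = mean([f(x) for x in group_b])
# Transformed means: A ≈ 66.7 > B = 49.0, so the ranking reverses.
```

Any convex (or concave) relabeling weights the ends of the scale differently, which is exactly why rankings of means are not identified from ordinal data alone.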

5. Scale-use heterogeneity: shifters vs. stretchers

A useful decomposition is the affine model:

$u_{it} = a_i + b_i \cdot LS_{it}$

Shifters (different $a_i$)

Different intercepts: "some people always report +2 higher." Levels are not comparable, but differences are: $\Delta u = b \cdot \Delta LS$.

Stretchers (different $b_i$)

Different slopes: "some people compress the scale." Both levels AND differences fail: $\Delta u_i = b_i \cdot \Delta LS_i$. Can reverse cross-population comparisons.

Benjamin et al. propose calibration questions to identify and adjust for scale-use heterogeneity—questions designed to have the same objective answer across respondents.[4]Benjamin et al. (2023). "Adjusting for Scale-Use Heterogeneity." NBER WP 31728.

Shifter vs. Stretcher Demo

(Interactive widget on the live page.) Compare two populations, A and B, with different scale use, and see why stretchers distort intervention comparisons.
Why fixed effects only remove shifts

Fixed effects absorb level differences (the $a_i$ terms). But if people have different $b_i$ (stretch factors), the implied welfare change per reported point differs.
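The fixed-effects point can be made concrete with the affine model above. The intercepts and slopes below are toy values, not estimates:

```python
# Affine reporting model u_i = a_i + b_i * LS_i, with toy parameters.
# Within-person differencing (a fixed effect) removes a_i but not b_i.

def welfare(a, b, ls):
    return a + b * ls

ls_before, ls_after = 5, 6   # both respondents report a one-point gain

# Pure shifters: same slope, different intercepts. The intercept cancels.
du_shift_1 = welfare(0, 1.0, ls_after) - welfare(0, 1.0, ls_before)  # 1.0
du_shift_2 = welfare(2, 1.0, ls_after) - welfare(2, 1.0, ls_before)  # 1.0

# Stretchers: same intercept, different slopes. The slope does not cancel.
du_str_1 = welfare(0, 1.0, ls_after) - welfare(0, 1.0, ls_before)    # 1.0
du_str_2 = welfare(0, 0.5, ls_after) - welfare(0, 0.5, ls_before)    # 0.5
```

The same reported one-point change corresponds to half the welfare gain for the respondent who compresses the scale, which is why stretcher heterogeneity can flip cross-population comparisons while shifter heterogeneity cannot.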

6. Neutral point and mortality

Two different "zeros" people reference: the bottom of the reporting scale (LS = 0), and the neutral point, the level of life satisfaction at which continued existence is neither better nor worse than nonexistence. The location of the neutral point is contested.

For incremental comparisons among the living, the neutral point often cancels out. But for mortality comparisons—"WELLBYs from life extension = life-years × average wellbeing"—the origin is load-bearing.

Neutral Point / Mortality Demo

(Interactive widget on the live page; the neutral point is adjustable and set to 2 here.)

Scenario: a mortality intervention prevents a death, yielding 40 additional life-years at average LS = 5.

If neutral = 0: Benefit = 40 × 5 = 200 WELLBYs

If neutral = 2: Benefit = 40 × (5 − 2) = 120 "above-neutral" WELLBYs

When does the neutral point matter? When comparing mortality-focused interventions to non-mortality wellbeing programs. For comparisons among living people only, it typically cancels.
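The worked example above, as a small helper (illustrative only; the neutral-point value is the contested input):

```python
# Level-based WELLBYs from averted mortality, measured above a chosen
# neutral point. No discounting, for simplicity.

def mortality_wellbys(life_years, avg_ls, neutral=0.0):
    return life_years * (avg_ls - neutral)

benefit_zero    = mortality_wellbys(40, 5, neutral=0)  # 200 WELLBYs
benefit_neutral = mortality_wellbys(40, 5, neutral=2)  # 120 WELLBYs
```

Shifting the neutral point from 0 to 2 cuts the estimated mortality benefit by 40% here, while leaving any incremental comparison among the living unchanged.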

7. Evidence and alternatives

Reliability: noisy but not useless

Single-item life evaluations have test-retest correlations around 0.5-0.7 over short windows.[5]Krueger & Schkade (2008). "The reliability of subjective well-being measures." Because noise inflates the observed variance of the outcome, effects expressed in standard-deviation units are attenuated, so small real effects may be undervalued.
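One standard back-of-envelope for this: under classical measurement error, the observed SD exceeds the true SD, so a standardized effect computed against the observed SD shrinks by the square root of the reliability. The numbers below are assumed for illustration:

```python
import math

# If reliability r = var(true) / var(observed), then
# sd(observed) = sd(true) / sqrt(r), so d_observed = d_true * sqrt(r).

def attenuated_d(true_d, reliability):
    return true_d * math.sqrt(reliability)

# A true 0.20 SD effect, measured with reliability 0.6:
d_obs = attenuated_d(0.20, 0.6)   # ≈ 0.155 SD as observed
```

At the 0.5-0.7 reliabilities cited above, observed standardized effects understate true ones by roughly 15-30% on this simple model.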

Predictive validity

Kaiser & Oswald show that single numeric feelings responses predict consequential outcomes (changing neighborhoods, jobs, partners)—relationships tend to be replicable and close to linear.[6]Kaiser & Oswald (2022). "The Scientific Value of Numerical Measures of Human Feelings." PNAS.

LMIC evidence

Haushofer & Shapiro's Kenya cash transfer RCT shows life satisfaction measures are responsive in LMIC experiments—~0.17 SD short-run, ~0.08 SD at 3 years.[7]Haushofer & Shapiro (2016/2018). Kenya GiveDirectly cash transfer studies.

Comparison with alternatives

| Metric | Strengths | Weaknesses |
| --- | --- | --- |
| WELLBY | Captures non-health welfare; direct self-report; low respondent burden | Scale-use and comparability assumptions; cross-study issues |
| DALY/QALY | Standardized; large evidence bases; direct mortality link | May miss non-health welfare; mental-health weights contentious |
| Calibrated WELLBY | Reduces scale-use bias ~30-50% | Complex; LMIC feasibility unclear; introduces new assumptions |

8. WELLBY Calculator

Incremental WELLBY Estimate

Enter treatment effect, duration, recipients, and cost to estimate total WELLBYs and cost-effectiveness.

Example output (default inputs): 1,000 total WELLBYs generated; cost per WELLBY: $100.

This calculator assumes constant effect size. Real applications should account for effect decay, discounting, and uncertainty.
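The calculator's arithmetic is simple enough to state directly. The inputs below are assumptions chosen to reproduce the 1,000 WELLBYs and $100-per-WELLBY figures shown; real inputs would come from study estimates:

```python
# Sketch of the calculator: constant effect size, no decay or discounting.

def wellby_summary(effect_ls, duration_years, recipients, total_cost):
    """Total WELLBYs and cost per WELLBY for a constant per-person effect."""
    total_wellbys = effect_ls * duration_years * recipients
    return total_wellbys, total_cost / total_wellbys

# Assumed inputs: 0.5-point LS gain, 2 years, 1,000 recipients, $100,000 cost.
total, cost_per = wellby_summary(effect_ls=0.5, duration_years=2,
                                 recipients=1_000, total_cost=100_000)
```

Adding effect decay or discounting would replace the `effect_ls * duration_years` product with a discounted sum over years, as in the incremental formula of Section 2.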

9. Workshop prompts

Neutral prompts for workshop deliberation:

1. For which classes of intervention comparisons (same setting/instrument vs. cross-study) does the linear WELLBY seem most defensible, and why?

2. Which assumptions are most likely to be materially violated in LMIC contexts: linearity, intertemporal comparability, interpersonal comparability, or scale-use heterogeneity?

3. When does the neutral point become decision-relevant? Which "zero" do you have in mind?

4. How should analysts treat "mapping" between depression scales and life satisfaction when LS isn't measured? What minimum evidence would make a mapping credible?

5. Which low-burden calibration approaches seem most promising for LMIC settings?

References

  1. Frijters, P., Clark, A.E., Krekel, C. & Layard, R. (2020). "A Happy Choice: Wellbeing as the Goal of Government." Behavioural Public Policy, 4(2).
  2. OECD (2013/2024). Guidelines on Measuring Subjective Well-being.
  3. Bond, T.N. & Lang, K. (2019). "The Sad Truth about Happiness Scales." Journal of Political Economy, 127(4).
  4. Benjamin, D.J. et al. (2023). "Adjusting for Scale-Use Heterogeneity." NBER WP 31728.
  5. Krueger, A.B. & Schkade, D.A. (2008). "The reliability of subjective well-being measures." Journal of Public Economics.
  6. Kaiser, C. & Oswald, A.J. (2022). "The Scientific Value of Numerical Measures of Human Feelings." PNAS, 119(42).
  7. Haushofer, J. & Shapiro, J. (2016). "The Short-term Impact of Unconditional Cash Transfers." QJE. See also 2018 long-term follow-up.
  8. HM Treasury (2021/2024). Wellbeing Guidance for Appraisal: Supplementary Green Book Guidance.
  9. Helliwell, J.F., et al. (2021). "The WELLBY." World Happiness Report 2021, Chapter 6.
  10. GiveWell (2023). "Our Assessment of Happier Lives Institute's Cost-Effectiveness Analysis of StrongMinds."

🎧 Audio Resources

Audio script for NotebookLM or text-to-speech generation:

📄 Linear WELLBY Audio Script

Use with NotebookLM to generate a podcast-style audio overview