The Unjournal · Pivotal Questions Initiative

Linear WELLBYs for Comparing Interventions

A briefing for workshop deliberation — concepts, evidence, and tradeoffs

⚠️ AI-Generated Content (March 2026)

This page was generated with AI assistance (Claude Code + ChatGPT deep research) and revised based on 75+ Hypothes.is workshop comments.

The content aims to be workshop-neutral, framing issues for deliberation rather than prescribing conclusions. Readers should verify quantitative claims against the original literature.



1. The decision problem

Organizations comparing interventions—especially in low- and middle-income countries (LMICs)—face a measurement problem: interventions change different things (mortality, morbidity, consumption, mental health, social cohesion). The WELLBY approach proposes translating these into a common unit based on subjective wellbeing, enabling "welfare impact per dollar" comparisons.[1]

A focal question for this workshop: How reliably can we compare interventions by aggregating changes in reported wellbeing (the WELLBY approach), especially across different studies and contexts? This is one of several comparison frameworks; others include DALY/QALY-based approaches, capability approaches, and direct monetary valuation. The workshop examines the linear WELLBY's reliability relative to these alternatives, not in isolation.

The measurement-to-decision pipeline illustrates why comparing interventions requires multiple translation steps. Each box represents a stage where methodological choices affect final conclusions:

Intervention → Study design → Measured outcomes (LS / DALY / depression) → Translation layer (mapping, calibration) → Common currency (WELLBY / DALY / $) → Decision
How to read this diagram
  • Intervention → Study design: The program being evaluated is studied through some research design (RCT, quasi-experiment, etc.)
  • Study design → Measured outcomes: Studies measure different things—some use life satisfaction (LS), others use DALYs or depression scales
  • Measured outcomes → Translation layer: Different metrics must be mapped or calibrated to enable comparison
  • Translation layer → Common currency: The goal is a single unit (WELLBYs, DALYs, or dollars) enabling "apples-to-apples" comparison
  • Common currency → Decision: Funders use the common currency to prioritize interventions

Each arrow involves assumptions that can introduce error or bias. The workshop focuses on where these assumptions are most likely to matter.

Workshop goals: (1) Clarity about which assumptions matter most for which comparisons, and what evidence would change views. (2) Share information and synthesize participant expertise. (3) Generate practical insights and actionable recommendations for funders working with current evidence.

2. Definitions and key concepts

WELLBY Definition

1 WELLBY = a one-point change in life satisfaction (0-10 scale) × 1 person × 1 year

Source: UK Green Book Wellbeing Guidance (HM Treasury, 2021/2024)

Origins, alternative definitions, and adoption

Original proposal: Frijters, Clark, Krekel & Layard (2020) introduced WELLBYs in Behavioural Public Policy as a unit for comparing wellbeing gains across policy domains.

Alternative definitions: Most usage defines the WELLBY using a 0-10 life satisfaction question, but some researchers use affect-based measures (experienced happiness). The choice matters: life satisfaction captures evaluative wellbeing; affect captures momentary experience.

Organizational adoption: The WELLBY appears in HM Treasury's supplementary Green Book guidance (2021/2024), the World Happiness Report (2021), and the Happier Lives Institute's cost-effectiveness analyses.[8][9][13]

Standard Life Satisfaction Questions

OECD single-item: "Overall, how satisfied are you with your life as a whole these days?" (0 = "not at all satisfied" to 10 = "completely satisfied")[2]

Cantril ladder (Gallup): "Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you, and the bottom represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?"

These two framings—satisfaction vs. ladder position—are often used interchangeably, but may capture subtly different constructs. Cross-study comparisons should note which instrument was used.

Incremental vs. Level-Based Accounting

Incremental WELLBYs (what most intervention comparisons need):

$\Delta W(k) = \sum_{i} \sum_{t} \delta^t \left( LS_{it}^{(k)} - LS_{it}^{(0)} \right)$

Level-based WELLBYs (for interventions that change mortality rates):

$W = \sum_{i} \sum_{t} \delta^t \, LS_{it}$
Notation key
  • $i$ = individual (summing across people)
  • $t$ = time period (summing across years)
  • $LS$ = Life Satisfaction score (0-10)
  • $\delta$ = discount factor for future years
  • $k$ = intervention; $0$ = counterfactual

In practice, RCTs estimate $LS^{(k)} - LS^{(0)}$ directly via experimental comparison.
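As a minimal sketch (all inputs hypothetical, not drawn from any cited study), the incremental formula can be implemented directly, assuming a constant treatment effect and a per-year discount factor δ = 1/(1 + r):

```python
# Minimal sketch of the incremental WELLBY formula; all inputs are hypothetical.

def incremental_wellbys(effect_ls, n_people, years, discount_rate=0.035):
    """Sum delta^t * (LS_k - LS_0) over people and years, assuming the RCT
    effect (on the 0-10 LS scale) stays constant for `years` years."""
    delta = 1.0 / (1.0 + discount_rate)                 # per-year discount factor
    discounted_years = sum(delta ** t for t in range(years))
    return effect_ls * n_people * discounted_years

# E.g. a 0.3-point LS gain for 1,000 recipients lasting 5 years:
print(round(incremental_wellbys(0.3, 1000, 5), 1))      # ~1401.9 WELLBYs at 3.5%
```

Real applications would replace the constant-effect assumption with an explicit persistence model and carry statistical uncertainty through the calculation.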

Technical definitions (reporting function, instrument, latent distribution)

Instrument: The specific measurement tool—exact question wording, response format (0-10 vs 1-7), anchors, translation, survey mode.

Reporting function: The internal process by which a person translates their true wellbeing ($u$) into a number on the survey scale. Formally: $LS_{it} = f_i(u_{it}) + \varepsilon_{it}$. Different people may have different reporting functions—one person's "7" might correspond to another's "5" for the same underlying welfare level. This is the core of scale-use heterogeneity.

Latent distribution: The unobserved underlying welfare distribution. Since we only see reported scores, conclusions can depend on assumptions about this hidden distribution.

3. Core assumptions

Using linear WELLBYs for cross-intervention comparison requires assumptions. A common overstatement is that "equal scores mean equal welfare"—but this is stronger than most applications need.

Cardinality (linearity)

Equal intervals on the scale imply equal welfare differences: moving from 3→4 yields the same welfare gain as moving from 7→8. If violated, summing may distort comparisons.[12][13]

Unit-change comparability

A one-point change has approximately the same welfare meaning across people. This is weaker than requiring equal levels to mean equal welfare.

Temporal aggregation

Integrating wellbeing over time is meaningful. May fail if adaptation returns people to baseline, or if respondents reinterpret the scale over time (response shift).

Cross-domain capture

Life satisfaction incorporates welfare from many domains (health, income, relationships), not merely transient "mood."

Time structure and discounting

The time structure shows how WELLBY estimation requires assumptions about effect persistence. Most studies measure outcomes at baseline and one or two follow-ups;[15] extrapolating beyond measured timepoints involves uncertainty:

Timeline: Baseline (t = 0) → Follow-up (t = 1) → Later periods (t > 1). LS and mental health scales are measured at baseline and follow-up(s); discounting δ^t applies if aggregating, and persistence, decay, and response shift are in question throughout.

Studies measure at baseline and follow-up(s). Extrapolating to later periods requires assumptions about persistence (does the effect last?) and discounting (are future wellbeing gains worth less than present ones?).
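A small sketch of why the persistence assumption is load-bearing: the same measured effect yields very different totals depending on an assumed annual decay rate (all parameters illustrative):

```python
def total_wellbys(effect_at_followup, years, annual_decay=0.0, discount_rate=0.035):
    """Per-person WELLBYs when the measured effect decays by `annual_decay`
    each year after follow-up. Decay rates here are illustrative guesses."""
    delta = 1.0 / (1.0 + discount_rate)
    total, effect = 0.0, effect_at_followup
    for t in range(years):
        total += (delta ** t) * effect
        effect *= (1.0 - annual_decay)     # exponential-decay persistence assumption
    return total

persistent = total_wellbys(0.5, years=10, annual_decay=0.0)   # effect holds fully
fading = total_wellbys(0.5, years=10, annual_decay=0.3)       # 30% fades per year
print(round(persistent, 2), round(fading, 2))                 # ~4.30 vs ~1.51
```

Under these invented numbers, the persistence assumption alone changes the total by nearly a factor of three—often larger than plausible measurement biases.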

Response Shift: a distinct threat

Even if baseline scale-use heterogeneity cancels (by randomization), treatment can change the meaning of the respondent's self-evaluation. This is called response shift: changes in internal standards, values, or conceptualization that alter how respondents answer over time (Sprangers & Schwartz, 1999, Social Science & Medicine).

For wellbeing interventions—especially psychosocial programs that may explicitly reframe cognition—response shift is not a rare edge case. If treatment changes the reporting function $f_i(\cdot)$, observed $\Delta LS$ mixes "true welfare change" with "scale change," potentially biasing the WELLBY estimate in either direction.

4. A key critique: identification and transformations

Bond & Lang (2019) argue that with ordinal happiness data, comparing "average happiness" between groups is not identified without strong assumptions—monotonic transformations can reverse conclusions.[3] While their critique focuses on cross-group comparisons (e.g., "are Germans happier than French?"), similar issues can arise when comparing intervention effects measured at different baseline levels or across different populations.

What "non-identified" means

A parameter is "identified" when data + assumptions pin down a unique value. Ordinal responses only tell us which interval a latent value falls into. Many different latent distributions and transformations can generate the same observed category counts, so rankings of means can change across equally admissible representations.

Why this matters for intervention comparison

Transformation Sensitivity Demo

See how monotone transformations of the LS scale can change mean comparisons.


Try it: In "effects" mode, move θ from 1.0 toward 2.0 and watch the ranking flip. Intervention B has a larger raw effect (2 vs 1 point), but A's effect occurs at higher LS levels. Under convex transformations (θ>1), gains at high levels are amplified—so A can dominate despite smaller raw gains.
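The demo's logic can be reproduced in a few lines. The intervention values below are illustrative, and g(x) = x^θ stands in for an arbitrary monotone transformation of the LS scale:

```python
# Sketch of the demo's logic: a monotone (convex) rescaling of the 0-10 LS
# scale can flip which intervention looks better. All values are illustrative.

def gain(before, after, theta):
    """Welfare gain under the transformation g(x) = x**theta (theta=1: raw scale)."""
    return after ** theta - before ** theta

# Intervention A: 7 -> 8 (1 raw point, at high LS levels)
# Intervention B: 2 -> 4 (2 raw points, at low LS levels)
raw_a, raw_b = gain(7, 8, 1.0), gain(2, 4, 1.0)
cvx_a, cvx_b = gain(7, 8, 2.0), gain(2, 4, 2.0)

print(raw_a < raw_b)   # True: B wins on the raw scale (2 > 1)
print(cvx_a > cvx_b)   # True: A wins under theta = 2 (15 > 12)
```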

5. Scale-use heterogeneity: shifters vs. stretchers

A useful decomposition is the affine model, which separates two types of scale-use differences across respondents:[11]

$u_{it} = a_i + b_i \cdot LS_{it}$

Shifters (different $a_i$)

Different intercepts: "some people always report +2 higher." Levels are not comparable, but differences are: $\Delta u = b \cdot \Delta LS$.

Stretchers (different $b_i$)

Different slopes: "some people compress the scale." Both levels AND differences fail: $\Delta u_i = b_i \cdot \Delta LS_i$. Can reverse cross-population comparisons.

Benjamin et al. propose calibration questions to identify and adjust for scale-use heterogeneity—questions designed to have the same objective answer across respondents.[4]

Shifter vs. Stretcher Demo

Compare two populations with different scale use. See why stretchers distort intervention comparisons.

Why fixed effects only remove shifts

Fixed effects absorb level differences (the $a_i$ terms). But if people have different $b_i$ (stretch factors), the implied welfare change per reported point differs.
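A minimal numeric sketch of the affine model (slopes invented for illustration): differencing removes the intercepts $a_i$, but population-specific slopes $b_i$ can still reverse a comparison of reported gains:

```python
def welfare_change(delta_ls, b):
    # Under u = a + b*LS, the intercept a cancels in any within-person
    # difference, so a reported change delta_ls implies b * delta_ls in
    # welfare units. The slopes b below are invented for illustration.
    return b * delta_ls

# Population A compresses the scale (b = 2.0 welfare units per reported point);
# Population B stretches it (b = 0.8).
du_a = welfare_change(delta_ls=0.5, b=2.0)   # smaller reported gain
du_b = welfare_change(delta_ls=1.0, b=0.8)   # larger reported gain

print(du_a > du_b)   # True: the ranking of reported gains reverses in welfare units
```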

6. Neutral point and mortality

Two different "zeros" people reference: the bottom of the response scale (0 on the 0-10 scale) and the neutral point—the LS level treated as welfare-equivalent to non-existence.

For incremental comparisons among the living, the neutral point often cancels out—that is, when comparing ΔLS between intervention and control groups, both are measured from the same implicit baseline, so the zero point subtracts from both sides and doesn't affect the difference.[14] But for mortality comparisons—"WELLBYs from life extension = life-years × average wellbeing"—the origin is load-bearing.

Neutral Point / Mortality Demo


Scenario: Mortality intervention prevents a death, yielding 40 additional life-years at average LS = 5.

If neutral = 0: Benefit = 40 × 5 = 200 WELLBYs

If neutral = 2: Benefit = 40 × (5 − 2) = 120 "above-neutral" WELLBYs

When does the neutral point matter? When comparing mortality-focused interventions to non-mortality wellbeing programs. For comparisons among living people only, it typically cancels (differences are unaffected by the choice of zero point).
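The demo's arithmetic, in code (illustrative values): the neutral point enters level-based mortality calculations but cancels in incremental comparisons:

```python
def mortality_wellbys(life_years, avg_ls, neutral):
    # Level-based accounting: life-years valued relative to the neutral point.
    return life_years * (avg_ls - neutral)

def wellby_difference(delta_ls, n_people, years):
    # Incremental accounting (discounting ignored for clarity): no `neutral`
    # argument is needed—the zero point cancels in treatment-control differences.
    return delta_ls * n_people * years

print([mortality_wellbys(40, 5, n) for n in (0, 2, 4)])   # [200, 120, 40]
print(wellby_difference(0.5, 100, 4))                     # 200.0, any neutral point
```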

7. Evidence and alternatives

Reliability: noisy but not useless

Single-item life evaluations have test-retest correlations around 0.5-0.7 over short windows.[5] The resulting noise inflates the variance of observed scores, attenuating standardized effect sizes and reducing statistical power—small real effects may go undetected or be undervalued.
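A quick simulation (synthetic data, no real survey inputs) illustrates how reliability shrinks standardized effect sizes: with reliability r, an observed Cohen's d is roughly the true d times √r:

```python
import random
from statistics import mean, variance

def simulated_d(d_true, reliability, n=100_000, seed=0):
    """Cohen's d on noisy scores; synthetic data, all parameters illustrative."""
    rng = random.Random(seed)
    noise_sd = (1.0 / reliability - 1.0) ** 0.5   # gives Var(obs) = 1/r when Var(true) = 1
    control = [rng.gauss(0.0, 1.0) + rng.gauss(0.0, noise_sd) for _ in range(n)]
    treated = [rng.gauss(d_true, 1.0) + rng.gauss(0.0, noise_sd) for _ in range(n)]
    pooled_sd = ((variance(control) + variance(treated)) / 2.0) ** 0.5
    return (mean(treated) - mean(control)) / pooled_sd

# With reliability 0.6, a true d of 0.30 appears as roughly 0.30 * sqrt(0.6) ≈ 0.23
print(round(simulated_d(0.30, reliability=0.6), 2))
```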

Predictive validity

Kaiser & Oswald show that single numeric feelings responses predict consequential outcomes (changing neighborhoods, jobs, partners)—relationships tend to be replicable and close to linear.[6]

LMIC evidence: GiveDirectly cash transfers

Kenya RCT (Haushofer & Shapiro)

  • Short-run (~9 months): Life satisfaction +0.17 SD, happiness +0.16 SD (WVS measures)
  • Long-run (~3 years): Life satisfaction +0.08 SD (statistically significant); psychological wellbeing index +0.16 SD

These findings show LS measures have detectable signal in LMIC RCTs, not just noise—and effects can persist.[7]

The measurement layer problem

Many LMIC mental health studies report depression scales or symptom indices, not standard 0-10 life satisfaction. GiveWell's assessment of StrongMinds explicitly highlights uncertainty in translating depression improvements into life satisfaction gains.[10]

Even if you accept WELLBY as the target unit, the measurement layer forces choices: use DALYs/QALYs (more standard in health evaluation) even if they miss non-health welfare, use life satisfaction directly but only where trials collect it, or use mapping models (depression → LS) but carry mapping uncertainty explicitly.
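One way to "carry mapping uncertainty explicitly" is Monte Carlo: draw the depression→LS conversion factor from an uncertainty distribution and propagate it through to cost per WELLBY. The lognormal mapping below is a placeholder assumption, not an estimate from GiveWell or HLI:

```python
import random

def cost_per_wellby_draws(depression_effect_sd, duration_years, cost_per_person,
                          n_draws=10_000, seed=42):
    """Monte Carlo over a hypothetical depression->LS mapping distribution."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Placeholder mapping: LS points gained per 1 SD depression improvement,
        # lognormal with median ~0.5 and wide spread (an assumption, not data).
        ls_per_sd = rng.lognormvariate(-0.7, 0.5)
        wellbys = depression_effect_sd * ls_per_sd * duration_years
        draws.append(cost_per_person / wellbys)
    return sorted(draws)

draws = cost_per_wellby_draws(0.6, 2.0, 150.0)
lo, mid, hi = draws[len(draws) // 20], draws[len(draws) // 2], draws[-len(draws) // 20]
print(f"cost/WELLBY: median ~${mid:.0f}, 90% interval ~${lo:.0f}-${hi:.0f}")
```

Reporting the interval rather than the point estimate makes the mapping's contribution to decision uncertainty visible to funders.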

Comparison with alternatives

| Metric | Strengths | Weaknesses |
| --- | --- | --- |
| WELLBY | Captures non-health welfare; direct self-report; low burden | Scale-use and comparability assumptions; cross-study issues |
| DALY/QALY | Standardized; large evidence bases; direct mortality link | May miss non-health welfare; mental health disability weights contentious |
| Calibrated WELLBY | Reduces scale-use bias ~30-50% | Complex; LMIC feasibility unclear; new assumptions |

8. WELLBY Calculator

Incremental WELLBY Estimate

Enter treatment effect, duration, recipients, and cost to estimate total WELLBYs and cost-effectiveness.


This calculator assumes constant effect size. Real applications should account for effect decay, discounting, and uncertainty.

9. Practical considerations

For funders comparing interventions

For researchers designing studies

Conditions for stronger inference

10. Open questions (research agenda)

High-value areas for future research that could meaningfully improve the reliability of WELLBY-based comparisons:

  1. Neutral point estimation: What is the actual neutral point on the 0-10 scale for different populations? How stable is it across contexts?
  2. Scale-use heterogeneity mapping: How do shifters vs. stretchers vary across LMIC populations, and can we predict which matters more?
  3. Cheap calibration methods: Can vignettes, anchoring questions, or other calibration approaches work in low-resource settings without excessive burden?
  4. WELLBY-DALY relationship: What's the mapping between WELLBYs and DALYs, and is it linear? How much does it vary by health condition?
  5. Demand effects and response shift: How do experimenter demand effects and response shift vary by intervention type?

11. Workshop prompts

Neutral prompts for workshop deliberation:

1. For which classes of intervention comparisons (same setting/instrument vs. cross-study) does the linear WELLBY seem most defensible, and why?

2. Which assumptions are most likely to be materially violated in LMIC contexts: linearity, intertemporal comparability, interpersonal comparability, or scale-use heterogeneity?

3. When does the neutral point become decision-relevant? Which "zero" do you have in mind?

4. How should analysts treat "mapping" between depression scales and life satisfaction when LS isn't measured? What minimum evidence would make a mapping credible?

5. Which low-burden calibration approaches seem most promising for LMIC settings?

6. Practical recommendations: What should funders do now, given current evidence and uncertainty? What specific guidance can we offer for making decisions while better evidence is developed?

References

  1. Frijters, P., Clark, A.E., Krekel, C. & Layard, R. (2020). "A Happy Choice: Wellbeing as the Goal of Government." Behavioural Public Policy, 4(2).
  2. OECD (2013/2024). Guidelines on Measuring Subjective Well-being.
  3. Bond, T.N. & Lang, K. (2019). "The Sad Truth about Happiness Scales." Journal of Political Economy, 127(4).
  4. Benjamin, D.J. et al. (2023). "Adjusting for Scale-Use Heterogeneity." NBER WP 31728.
  5. Krueger, A.B. & Schkade, D.A. (2008). "The reliability of subjective well-being measures." Journal of Public Economics.
  6. Kaiser, C. & Oswald, A.J. (2022). "The Scientific Value of Numerical Measures of Human Feelings." PNAS, 119(42).
  7. Haushofer, J. & Shapiro, J. (2016). "The Short-term Impact of Unconditional Cash Transfers." QJE. See also 2018 long-term follow-up.
  8. HM Treasury (2021/2024). Wellbeing Guidance for Appraisal: Supplementary Green Book Guidance.
  9. Helliwell, J.F., et al. (2021). "The WELLBY." World Happiness Report 2021, Chapter 6.
  10. GiveWell (2023). "Our Assessment of Happier Lives Institute's Cost-Effectiveness Analysis of StrongMinds."
  11. Benjamin, D.J., Heffetz, O., Kimball, M.S. & Szembrot, N. (2014). "Beyond Happiness and Satisfaction: Toward Well-Being Indices Based on Stated Preference." AER, 104(9). The shifters/stretchers framework is elaborated in Benjamin et al. (2023) NBER WP 31728.
  12. Plant, M. (2025). "A Happy Possibility About Happiness Scales: An Exploration of the Cardinality Assumption." Happier Lives Institute working paper.
  13. Happier Lives Institute (2023-2025). Cost-effectiveness analyses of mental health interventions using WELLBY methodology. See happierlivesinstitute.org.
  14. When comparing intervention effects for living people, the neutral point cancels algebraically: (LS_treatment − LS₀) − (LS_control − LS₀) = LS_treatment − LS_control.
  15. LMIC study examples: Haushofer & Shapiro (2016, 2018) measured SWB at 9 months and 3 years post cash transfers in Kenya; StrongMinds evaluations typically follow up at 3-6 months.

Related Analysis

For discussion of how to convert between DALYs/QALYs and WELLBYs:

DALY/QALY↔WELLBY Conversion →
