The Unjournal - Pivotal Questions Initiative

Linear WELLBY: Validity & Reliability

A technical analysis for comparing interventions in LMICs

⚠️ AI-Generated Content (March 2025)

This page was created through iterative prompting of Claude Code (Opus 4.5) and GPT-5.2 Pro, feeding in workshop discussion content and focal papers for our Pivotal Questions initiative.

See primary sources[*]Benjamin et al. 2023/2024, UK Green Book Wellbeing Guidance (HM Treasury 2021/2024), Bond & Lang 2019, Kaiser & Oswald 2022, Haushofer & Shapiro 2016/2018, World Happiness Report WELLBY chapter (Helliwell et al. 2021), OECD Guidelines on Measuring Subjective Well-being (2013/2024). and additional sources[†]Kaiser & Lepinteur 2025, Ferrer-i-Carbonell & Frijters 2004, King et al. 2004, Krueger & Schkade 2008, Liu & Netzer 2023, Fabian 2022, Plant 2025 (Grice-Schelling rational response theory)..

This content requires further human verification. Specific claims, citations, and numerical details should be checked against the original literature before relying on them.

Purpose: This page provides technical background for workshop discussions. Content draws on academic sources and UK government guidance; specific claims should be verified against cited literature.

1. Definition and Use-Case

A WELLBY (wellbeing-adjusted life year) is a "common currency" approach: summarize an intervention's welfare impact as the change in subjective wellbeing per person-year, so that interventions with different mechanisms and outcomes (health, income, education, mental health) can be compared on a single scale.[1]

Unit Definition

1 WELLBY = a one-point change in life satisfaction (0-10 scale) × 1 person × 1 year

Source: UK Green Book Wellbeing Guidance (HM Treasury, 2021/2024)

Origins, alternative definitions, and adoption

Original proposal: The WELLBY concept was introduced by Frijters, Clark, Krekel & Layard (2020) in Health Economics as a unit for comparing wellbeing gains across policy domains.

UK Government adoption: The UK Treasury Green Book Wellbeing Guidance (2021, updated 2024) formally incorporated WELLBYs into government cost-benefit analysis.

Alternative definitions: While most usage defines the WELLBY in terms of life satisfaction on a 0-10 scale (the Cantril ladder, a related 0-10 evaluative measure, is sometimes used), some researchers instead use affect-based measures (experienced happiness, positive/negative affect). The choice matters: life satisfaction captures evaluative wellbeing; affect captures momentary experience.

Organizational adoption:

  • Happier Lives Institute: Uses WELLBYs as primary metric for comparing charities
  • Founders Pledge: Incorporates WELLBYs alongside DALYs in mental health CEA
  • GiveWell: Has explored WELLBY-based analysis but has not fully adopted it
  • UK Government: Official guidance for policy appraisal

See also: Frijters et al. (2024) in Nature for a recent overview.

The standard question (OECD prototype): "Overall, how satisfied are you with your life as a whole these days?" with anchors 0 = "not at all satisfied" and 10 = "completely satisfied".[2]

Incremental vs. Level-Based WELLBYs

A crucial distinction: are we summing changes relative to a counterfactual, or levels of wellbeing? These have different requirements.

Incremental WELLBYs (what most intervention comparisons need):
ΔWELLBY(k) = Σ_i Σ_t δ^t (LS_it(k) − LS_it(0))
Notation key
  • i = individual (summing across all people in the program)
  • t = time period (summing across years of effect)
  • LS = Life Satisfaction score (0-10 scale)
  • δ = discount factor for future years (often 1 if not discounting; e.g., 0.97 for 3% annual discount)
  • k = intervention group
  • 0 = counterfactual/control group

In practice, RCT designs estimate LS(k) − LS(0) directly via experimental comparison rather than measuring the counterfactual literally.
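The incremental sum above can be sketched in code. This is a minimal illustration; the function name, the two-person trajectories, and the discount factor are made-up inputs, not data from any study.

```python
# Minimal sketch of the incremental WELLBY sum:
# delta^t * (LS_it(k) - LS_it(0)) over people i and years t.

def incremental_wellbys(ls_treated, ls_control, discount=1.0):
    """ls_treated / ls_control: one list of yearly 0-10 LS scores per person,
    matched in shape; discount: per-year discount factor delta."""
    total = 0.0
    for person_k, person_0 in zip(ls_treated, ls_control):
        for t, (ls_k, ls_0) in enumerate(zip(person_k, person_0)):
            total += (discount ** t) * (ls_k - ls_0)
    return total

# Two people, two years: effect of +0.5 LS in year 1 fading to +0.25 in year 2.
treated = [[5.5, 5.25], [6.5, 6.25]]
control = [[5.0, 5.0], [6.0, 6.0]]

print(incremental_wellbys(treated, control))        # 1.5 WELLBYs, undiscounted
print(incremental_wellbys(treated, control, 0.97))  # slightly less, discounted
```

In an RCT, the control trajectories would come from the experimental comparison group rather than a literal counterfactual for each individual.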

Level-based WELLBYs (for mortality comparisons):
W = Σ_i Σ_t δ^t LS_it

This second form requires a defined zero point (e.g., death = 0) and is not invariant to shifting the LS scale. The World Happiness Report makes "dead people score 0" an explicit assumption for level-based accounting.[3]

When does level-based accounting matter?

Level-based WELLBYs matter when:

  • Comparing interventions that change mortality rates (e.g., malaria prevention vs. psychotherapy)
  • Interventions that affect birth rates (in some accounting frameworks)
  • Any comparison where the number of person-years differs between scenarios

How dead people enter the sum: Dead people contribute 0 to the sum (they have no life satisfaction score). This means saving a life adds (years gained × average LS) to the WELLBY total—but this calculation requires knowing which point on the scale corresponds to "neutral" welfare.

For interventions that don't change population (same people living the same number of years), incremental and level-based approaches are mathematically equivalent—the constant drops out when comparing.

Key takeaway: If your comparisons are mainly incremental changes among living people, the neutral point often cancels out. If you compare interventions that affect mortality (or, in some accounting, birth rates), where the calculation involves life-years saved × average wellbeing, the origin becomes load-bearing.

2. When the Neutral Point Matters

There are two different "zeros" people reference:

For incremental comparisons (program A vs. program B among the living), adding a constant c to everyone's LS doesn't change ΔLS. The neutral point is mostly irrelevant.

But for mortality comparisons—e.g., "WELLBYs from life extension = (life-years gained) × (average wellbeing level)"—the origin matters because a shift changes the integral.

Example: Why Origin Matters for Mortality Comparisons

[Figure: a 0-10 life-satisfaction scale with a candidate neutral point marked at 2]

Scenario: An LMIC program prevents a death, yielding 40 additional life-years at average LS = 5.

If neutral = 0: Benefit = 40 × 5 = 200 WELLBYs

If neutral = 2: Benefit = 40 × (5 − 2) = 120 "above-neutral" WELLBYs

This is a ranking-relevant difference when comparing mortality-focused interventions (malaria prevention) to non-mortality wellbeing programs (psychotherapy).
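The arithmetic of the example above can be sketched directly; the helper function and the shifted-scale comparison at the end are illustrative, not part of any standard toolkit.

```python
# Sketch of the worked example above. The same saved life scores differently
# depending on the assumed neutral point, while an incremental comparison
# among the living is unchanged by shifting everyone's scale.

def life_saving_wellbys(years_gained, avg_ls, neutral=0.0):
    # WELLBYs from averted mortality: life-years gained times wellbeing
    # above the assumed neutral point.
    return years_gained * (avg_ls - neutral)

print(life_saving_wellbys(40, 5, neutral=0))   # 200 WELLBYs
print(life_saving_wellbys(40, 5, neutral=2))   # 120 "above-neutral" WELLBYs

# Incremental comparison: add any constant c to everyone's LS and the
# difference between two programs' effects is unchanged.
c = 2.0
delta_a = (6.0 + c) - (5.0 + c)   # program A's effect on a shifted scale
delta_b = (5.5 + c) - (5.0 + c)   # program B's effect on a shifted scale
print(delta_a - delta_b)           # 0.5, identical for any c
```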

UK Green Book guidance discusses an indifference point around LS ≈ 2 for some respondents, while adopting a working assumption of 1. Peasgood et al. (2018) provide tentative estimates but acknowledge high uncertainty.[4]

3. Assumptions for Validity

Using linear WELLBYs for cross-intervention comparison requires several assumptions. A common overstatement is that "equal scores mean equal welfare"—but this is stronger than most applications need.

Level Comparability vs. Unit-Change Comparability

The key distinction is between two claims:

Formally, suppose each person has a reporting function[rf]A reporting function f_i(·) maps a person's underlying ("true") welfare u to their reported life satisfaction score. It captures how individuals translate internal states into numerical responses—including factors like personality, cultural norms, and interpretation of scale anchors.: LS_it = f_i(u_it) + ε_it. If f_i differs across people, equal reported levels don't imply equal welfare. But differences can still be comparable if the heterogeneity is "only a shift."

Shifters vs. Stretchers

A useful decomposition is the affine model:

u_it = a_i + b_i × LS_it

Shifters only (a_i varies, b constant)

Different intercepts: "some people always report +2 higher." Levels are not comparable, but differences are: Δu = b × ΔLS. Less damaging for incremental comparisons.

Stretchers (b_i varies)

Different slopes: "some people compress the scale." Both levels AND differences fail: Δu_i = b_i × ΔLS_i. Can reverse cross-population comparisons.

This distinction is not merely theoretical. Recent work on scale-use heterogeneity explicitly models "shifters" and "stretchers" and proposes calibration-question approaches to adjust for them.[5]

When does scale-use heterogeneity break comparisons? It's most dangerous when it differs systematically across the units you're comparing in ways that don't cancel—e.g., comparing ΔLS across studies, countries, or populations with different distributions of "stretch factors."
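A toy simulation makes the stretcher danger concrete. All numbers here are assumed for illustration; under the affine model above, a reported difference is ΔLS = Δu / b, so populations with different stretch factors b report the same true welfare change differently.

```python
# Toy illustration (all numbers assumed). Under u = a + b * LS, a reported
# difference is delta_LS = delta_u / b. If the "stretch factor" b differs
# across populations, ranking programs by reported delta_LS can reverse
# the ranking by true welfare change delta_u.

def reported_delta_ls(delta_u, b):
    # what a population with slope b reports for a true welfare change delta_u
    return delta_u / b

# Program A runs in a population with b = 1; true welfare gain 1.0.
# Program B runs in a population with b = 2; true welfare gain 1.5 (larger).
dls_a = reported_delta_ls(1.0, b=1.0)
dls_b = reported_delta_ls(1.5, b=2.0)

print(dls_a, dls_b)   # 1.0 0.75: A looks better on reported LS,
                      # even though B's true welfare gain is larger.

# Shifters only (a_i varies, b common): delta_LS = delta_u / b regardless
# of a, so incremental comparisons survive intercept heterogeneity.
```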

When heterogeneity is less threatening

Within one RCT, same instrument[inst]An "instrument" here means the specific survey question used to measure life satisfaction—including exact wording, response scale (0-10 vs 1-7), anchor labels, and administration mode (face-to-face vs phone). The OECD prototype is one standardized instrument; different studies may use variations., random assignment: If treatment and control groups have the same distribution of reporting functions (by randomization), interpersonal noncomparability is less of a threat for estimating an average treatment effect. The main risks shift to measurement error noise and whether treatment changes scale-use itself (response shift).

Same population distribution: If E[b | Population A] ≈ E[b | Population B], then comparing interventions by ΔLS is valid for ranking even if intercepts differ.

When heterogeneity is most dangerous

Comparing across studies/countries: Different instruments, translations, norms, and populations. If the distribution of stretch factors bi differs, "1 point-year" is not the same welfare unit across the evidence base.

Ceiling/floor effects: Even with identical reporting functions, bounded scales can cause mechanical differences in responsiveness at high or low baselines.

The Four Core Assumptions (Expanded)

A Cardinality

U(LS=4) − U(LS=3) = U(LS=8) − U(LS=7)

Equal intervals on the scale imply equal welfare differences: moving from 3→4 equals the same welfare gain as 7→8. What breaks if it fails: Summing is invalid; interventions targeting different baselines incomparable. Mitigation: Test robustness to log transformation.
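One way to run the suggested log-transform robustness check (the baseline scores are illustrative):

```python
# Recompute gains under a log transform of the scale. If rankings flip
# (or ties break), conclusions depend on the cardinality assumption.
import math

moves = {"low-baseline program": (3, 4), "high-baseline program": (7, 8)}

for name, (before, after) in moves.items():
    raw = after - before
    logged = math.log(after) - math.log(before)
    print(f"{name}: raw delta = {raw}, log delta = {logged:.3f}")

# Raw scale: both programs gain 1 point (a tie). Log scale: the low-baseline
# gain (~0.288) exceeds the high-baseline gain (~0.134), so the cardinality
# assumption is doing real work in any ranking.
```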

B Unit-Change Comparability

ΔLS has ≈ same welfare meaning across people

What breaks: Cross-person aggregation fails. Mitigation: Within-person designs; calibration vignettes; ensure same instrument across groups.

C Temporal Aggregation

W = ∫LS(t)dt is meaningful

What breaks: Duration weighting is wrong. Why it might fail: Adaptation effects—people return to baseline. Mitigation: Long-term follow-up data.

D Cross-Domain Validity

LS captures welfare from all domains

What breaks: LS misses physical suffering or other channels. Why it matters for LMICs: DALY-WELLBY conversion is fraught if LS only captures mood.
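The adaptation concern behind the temporal-aggregation assumption (C above) can be quantified with a simple decay model; the 30% annual decay rate is an assumption chosen purely for illustration.

```python
# An effect that fades toward baseline produces far fewer WELLBYs than the
# same initial effect held constant over the full duration.

def wellbys_with_decay(initial_effect, years, decay=0.0):
    # yearly effect = initial_effect * (1 - decay)**t, summed over `years`
    return sum(initial_effect * (1 - decay) ** t for t in range(years))

print(wellbys_with_decay(1.0, 10))             # 10.0 with no adaptation
print(wellbys_with_decay(1.0, 10, decay=0.3))  # ~3.24 with fast adaptation
```

This is why long-term follow-up data matter: assuming a constant effect when adaptation occurs can overstate WELLBYs severalfold.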

Response Shift: A Distinct Threat

Even if baseline scale-use heterogeneity cancels (by randomization), treatment can change the meaning of the respondent's self-evaluation. In quality-of-life research, this is called response shift: changes in internal standards, values, or conceptualization that alter how respondents answer over time.[6]

For wellbeing interventions in LMICs—especially psychosocial programs that may explicitly reframe cognition—response shift is not a corner case. If treatment changes the reporting function fi(·), observed ΔLS mixes "true welfare change" with "scale change," biasing the WELLBY estimate in either direction.

4. Empirical Evidence

Reliability: Noisy But Not Useless

OECD guidance synthesizes evidence that single-item life evaluations have test-retest correlations around 0.5-0.7 over short windows (1 day to 2 weeks), with country-level means showing much higher stability.[7]

Krueger and Schkade (2008) found that life satisfaction measured two weeks apart had a correlation of ≈ 0.59. Two implications for WELLBY CEA: (1) under random assignment, this noise does not bias average treatment effects, since it is outcome-side measurement error; (2) it inflates standard errors and shrinks standardized effect sizes, raising the sample sizes needed to detect effects.
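A rough way to quantify the cost of this noise, assuming a classical measurement-error model in which reliability r = var(true score)/var(observed score); the effect size here is hypothetical.

```python
# Outcome noise does not bias a randomized treatment-effect estimate, but it
# shrinks the standardized effect by sqrt(r) and inflates the sample size
# needed for a given power by roughly 1/r.
import math

reliability = 0.59      # two-week test-retest correlation (Krueger & Schkade)
true_effect_sd = 0.20   # hypothetical effect in true-score SD units

observed_effect_sd = true_effect_sd * math.sqrt(reliability)
n_inflation = 1 / reliability

print(f"observed standardized effect ~ {observed_effect_sd:.3f}")  # ~0.154
print(f"required sample inflation ~ {n_inflation:.2f}x")           # ~1.69x
```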

LMIC Evidence: Life Satisfaction Responds to Interventions

GiveDirectly Cash Transfers (Kenya)

Short-run: Haushofer & Shapiro (2016) report that unconditional cash transfers improved psychological wellbeing, including ~0.17 SD in life satisfaction and ~0.16 SD in happiness (WVS measures).[8]

Long-run (~3 years): Life satisfaction increased by ~0.08 SD (statistically significant); overall psychological wellbeing index by ~0.16 SD.[9]

These findings show: (1) LS measures have detectable signal in LMIC RCTs, not just noise; (2) effects can persist, meaning "duration" in WELLBY accounting is empirically relevant. They do not solve cross-study comparability—but demonstrate that in at least one setting, SWB is responsive.

External Validity: Numbers Predict Behavior

A strong response to skepticism: even if the numbers seem arbitrary, do they behave like a measurement? Kaiser and Oswald show that single numeric feelings responses have strong predictive power—relationships to later "get-me-out-of-here" actions (changing neighborhoods, jobs, partners) tend to be replicable and close to linear in large longitudinal datasets.[10]

5. Critiques and Responses

The Hard Critique

The hardest critique concerns what the data cannot pin down. Bond and Lang (2019) argue that with ordinal response data, comparing "average happiness" between groups is generally not identified without strong assumptions—monotonic transformations can reverse results.[11]

Responses and Mitigations
  • Predictive validity: SWB predicts consequential outcomes systematically
  • Survey response times can help solve identification (Liu & Netzer, AER 2023)
  • Explicit adjustment methods for scale-use heterogeneity exist
  • OECD (2024) concludes data remain meaningful for policy despite critiques
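The Bond and Lang point can be made concrete with a toy example; the scores and the squaring transform are invented purely for illustration.

```python
# With ordinal data, a monotonic (order-preserving) relabeling of the scale
# can reverse which group has the higher mean.

group_a = [5, 5, 5]
group_b = [2, 6, 6]

def mean(xs):
    return sum(xs) / len(xs)

print(mean(group_a) > mean(group_b))   # True: A "happier" on the raw scale

def g(x):
    # a monotonic relabeling: preserves every individual ordering
    return x ** 2

print(mean([g(x) for x in group_a]) > mean([g(x) for x in group_b]))
# False: the group ranking flips under g, even though no individual
# comparison changed.
```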


6. Alternatives and Relative Reliability

We cannot stop at "WELLBY is unreliable." Decision-makers must choose something, and the relevant question is comparative. "WELLBY vs. DALY" should be framed as a portfolio comparison problem under measurement uncertainty, not winner-take-all.

DALYs and QALYs: Standardized But Narrower

DALY (Disability-Adjusted Life Year) = Years of Life Lost (YLL) + Years Lived with Disability (YLD).[12]
QALY (Quality-Adjusted Life Year) = years of life × health utility weight (0-1 scale, 1 = perfect health).[13]

Strengths and weaknesses relative to WELLBYs, by approach:

  • DALY / QALY — Strengths: strong institutional standardization; large evidence bases; direct mortality/morbidity linkage. Weaknesses: may miss welfare impacts not in health-state descriptions (empowerment, security, meaning); mental health weights can be contentious.
  • Linear WELLBY — Strengths: captures what DALYs miss (mental health, subjective experience, non-health welfare); low respondent burden. Weaknesses: all issues discussed above; cross-study comparability uncertain.
  • Multi-item SWB scales — Strengths: higher reliability (averaging item noise); broader construct coverage. Weaknesses: higher burden; translation/cognitive testing challenges; same conceptual issues.
  • Calibrated WELLBY — Strengths: reduces scale-use bias 30-50%[cal]Benjamin et al. (2023/2024) estimate that calibration methods using vignette anchoring and visual scales can reduce interpersonal comparability bias by 30-50%, depending on the population and calibration approach used.. Weaknesses: complex; requires calibration items; LMIC feasibility unclear; introduces new assumptions.

UK Green Book guidance explicitly suggests life satisfaction can reflect dimensions beyond standard QALY instruments (EQ-5D), making WELLBYs informative where QALYs are incomplete.[14]


The Measurement Layer Problem

A practical constraint: many LMIC mental health studies report depression scales or symptom indices, not standard 0-10 life satisfaction. GiveWell's assessment of StrongMinds explicitly highlights uncertainty in translating depression improvements into life satisfaction gains.[15]

Even if you accept the WELLBY as the target unit, the measurement layer forces choices about how to map symptom-scale effects (e.g., depression-score improvements) onto life-satisfaction points.

7. WELLBY Calculator

Incremental WELLBY Estimate

Enter the average treatment effect on life satisfaction, duration of effect, number of beneficiaries, and program cost to estimate total WELLBYs and cost-effectiveness.

Example with default inputs: 1,000 total WELLBYs generated; cost per WELLBY: $100.

This simple calculator assumes constant effect size over the duration. Real applications should account for effect decay, discounting, and uncertainty ranges.
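The calculator's logic can be sketched in a few lines; the function name is invented, and the inputs are illustrative defaults chosen to reproduce the example output above (1,000 WELLBYs at $100 each).

```python
# Constant effect over the duration, no discounting or effect decay.

def wellby_estimate(ate_ls_points, duration_years, n_beneficiaries, cost_usd):
    total_wellbys = ate_ls_points * duration_years * n_beneficiaries
    return total_wellbys, cost_usd / total_wellbys

total, cost_per = wellby_estimate(
    ate_ls_points=0.5,      # average treatment effect on the 0-10 LS scale
    duration_years=2,       # years the effect persists
    n_beneficiaries=1_000,
    cost_usd=100_000,
)
print(total, cost_per)      # 1000.0 100.0
```

A fuller version would apply a discount factor and a decay rate per year, as in the incremental formula of Section 1.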

8. Practical Recommendations

For Funders Comparing Interventions

For Researchers Designing Studies

Conditions for Stronger Inference

9. Open Questions (Research Agenda)

These questions represent high-value areas for future research that could meaningfully improve the reliability of WELLBY-based comparisons:

  1. Neutral point estimation: What is the actual neutral point on the 0-10 scale for different populations? How stable is it across contexts?
  2. Scale-use heterogeneity mapping: How do shifters vs. stretchers vary across LMIC populations, and can we predict which matters more in a given context?
  3. Cheap calibration methods: Can vignettes, anchoring questions, or other calibration approaches work in low-resource settings without excessive respondent burden?
  4. WELLBY-DALY relationship: What's the true mapping between WELLBYs and DALYs, and is it linear? How much does it vary by health condition?
  5. Demand effects and response shift: How do experimenter demand effects and response shift vary by intervention type? Are psychosocial interventions particularly vulnerable?
  6. Instrument standardization: What survey instruments and protocols minimize cross-study comparability problems while remaining feasible in diverse settings?

For funders and researchers: If you're working on any of these questions, we'd be interested to hear about it. These are central to our Pivotal Questions initiative.

References & Notes

  1. Frijters, P. & Krekel, C. (2021). A Handbook for Wellbeing Policy-Making. Oxford University Press. See also World Happiness Report (2021) WELLBY chapter.
  2. OECD (2013). Guidelines on Measuring Subjective Well-being. OECD Publishing. Question modules available at oecd.org.
  3. Helliwell, J.F., et al. (2021). "The WELLBY." World Happiness Report 2021, Chapter 6.
  4. HM Treasury (2021/2024). Wellbeing Guidance for Appraisal: Supplementary Green Book Guidance. See discussion of QALY-LS mapping and indifference points.
  5. Benjamin, D.J., et al. (2023). "Scale-Use Heterogeneity." Working paper. Also: King, G., et al. (2004). "Enhancing the Validity of Cross-Cultural Comparability of Survey Research." American Political Science Review.
  6. Sprangers, M.A. & Schwartz, C.E. (1999). "Integrating response shift into health-related quality of life research." Social Science & Medicine, 48(11), 1507-1515.
  7. OECD (2024). "How's Life? 2024" and "Measuring Well-being" technical notes. Also: Krueger, A.B. & Schkade, D.A. (2008). "The reliability of subjective well-being measures." Journal of Public Economics, 92(8-9), 1833-1845.
  8. Haushofer, J. & Shapiro, J. (2016). "The Short-term Impact of Unconditional Cash Transfers to the Poor." Quarterly Journal of Economics, 131(4), 1973-2042.
  9. Haushofer, J. & Shapiro, J. (2018). "The Long-Term Impact of Unconditional Cash Transfers." Working paper.
  10. Kaiser, C. & Oswald, A.J. (2022). "The scientific value of numerical measures of human feelings." PNAS, 119(42).
  11. Bond, T.N. & Lang, K. (2019). "The Sad Truth about Happiness Scales." Journal of Political Economy, 127(4), 1629-1640.
  12. WHO (2020). "Disability-Adjusted Life Years (DALYs)." Global Health Estimates methodology.
  13. NICE (UK). "Quality-adjusted life year (QALY)." NICE Glossary.
  14. HM Treasury wellbeing guidance notes that life satisfaction reflects dimensions beyond EQ-5D (the standard QALY instrument), including social relationships, meaning, and autonomy.
  15. GiveWell (2023). "Our Assessment of Happier Lives Institute's Cost-Effectiveness Analysis of StrongMinds." Available at givewell.org.