The Unjournal · Pivotal Questions Initiative

Linear WELLBYs for Comparing Interventions

Workshop: March 16, 2026 · 11am–4pm ET / 3pm–8pm UK

A briefing for workshop deliberation — concepts, evidence, and tradeoffs

✓ The workshop took place on March 16, 2026

We're continuing the discussion asynchronously and will be publicly sharing key materials soon. This site is evolving into a resource page.
Share your feedback → · Complete the beliefs elicitation · Join the discussion (Google Doc)

View archived version from workshop day

Annotate & Comment: Double-click any text to add a Hypothes.is annotation. No account needed to read; quick signup for a free account to post.
Source document: Full Deep Research Report — detailed notation, formal frameworks, and extensive citations
⚠️ AI-Generated Content (March 2026) — click to expand

This page was generated with AI assistance (Claude Code + ChatGPT deep research) and revised based on 75+ Hypothes.is workshop comments.

The content aims to be workshop-neutral, framing issues for deliberation rather than prescribing conclusions. Readers should verify quantitative claims against the original literature.

Audio Version (~25 min)

Listen to an audio narration of this page (British academic voice):

Download MP3 (8 MB) · Text Script

Generated with Microsoft Edge TTS (en-GB-RyanNeural)

Companion Page: DALY↔WELLBY Conversion — approaches for converting between DALYs and WELLBYs

1. The decision problem

Organizations comparing interventions—especially in low- and middle-income countries (LMICs)—face a measurement problem: interventions change different things (mortality, morbidity, consumption, mental health, social cohesion). The WELLBY approach proposes translating these into a common unit based on subjective wellbeing, enabling "welfare impact per dollar" comparisons.[1]Frijters, Clark, Krekel & Layard (2020). "A Happy Choice: Wellbeing as the Goal of Government."

A focal question for this workshop: How reliably can we compare interventions, especially in LMICs, by aggregating changes in reported wellbeing (the WELLBY approach), particularly when aggregating across different studies and contexts? This is one of several comparison frameworks; others include DALY/QALY-based approaches, capability approaches, and direct monetary valuation. The workshop examines the linear WELLBY's reliability relative to these alternatives, not in isolation.

🔬 The Focal Case: StrongMinds vs AMF (click to expand)

The debate over whether mental health interventions outperform anti-malaria bednets illustrates why WELLBY methodology matters. This comparison—central to effective altruism discourse—hinges on contested assumptions about measuring and aggregating wellbeing gains.

The Interventions

StrongMinds

Group interpersonal therapy (IPT-G) for depression, delivered primarily in Uganda and Zambia, with expansion to other African countries. Community health workers deliver 12-week group sessions to women with moderate-to-severe depression.

Against Malaria Foundation (AMF)

Distributes long-lasting insecticide-treated bednets (LLINs) in malaria-endemic regions. GiveWell's top-rated charity for over a decade, with extensive mortality reduction evidence.

Friendship Bench

Problem-solving therapy delivered by trained community health workers ("grandmother counselors") on park benches in Zimbabwe. The Chibanda et al. (2016) cluster RCT showed significant effects on depression symptoms.

GiveDirectly

Unconditional cash transfers (~$1,000) to poor households in Kenya and Uganda. Haushofer & Shapiro (2016) RCT measured life satisfaction directly, finding 0.16 SD improvement at 9 months.

What Data Was Collected?

| Intervention | Study Design | Primary Outcome | Follow-up |
|---|---|---|---|
| StrongMinds | Multiple RCTs; also quasi-experimental | PHQ-9 (depression scale); some LS | 3-6 months typical |
| AMF | Cochrane meta-analyses of bednet RCTs | Child mortality; malaria cases | 1-2 years |
| Friendship Bench | Cluster RCT (Chibanda 2016) | SSQ-14 (Shona symptom scale) | 6 months |
| GiveDirectly | RCTs (Haushofer-Shapiro) | Life satisfaction (0-10) | 9 months; 3-year follow-up |

The Translation Challenge

To compare these interventions, evaluators must translate different metrics into a common currency:

  • AMF → WELLBYs: Convert DALYs averted to life-years, then assign WELLBY values to those years. Requires assumptions about the wellbeing level of lives saved.
  • StrongMinds → WELLBYs: Convert depression scale changes (PHQ-9) to life satisfaction changes using cross-sectional correlations. Then extrapolate effect duration beyond measured follow-up.
  • GiveDirectly → WELLBYs: Direct measurement of LS changes, but requires duration assumptions (do effects persist?).

The Controversy: Happier Lives Institute estimated StrongMinds at ~40 WELLBYs per $1,000 (November 2024)—potentially more cost-effective than AMF. GiveWell's 2023 assessment disagreed, citing concerns about: (1) mapping depression scales to LS, (2) assumed effect duration, (3) demand effects in self-reported outcomes, and (4) publication bias. See HLI's response.

📚 Further reading: See Unjournal evaluations of mental health research for independent expert assessments of the underlying evidence.

The measurement-to-decision pipeline illustrates why comparing interventions requires multiple translation steps. Each box represents a stage where methodological choices affect final conclusions:

%%{init: {'theme': 'default', 'themeVariables': {'fontSize': '18px'}}}%%
flowchart LR
    A["Intervention<br/>(e.g., cash transfer,<br/>mental health program)"] --> B["Study design<br/>(RCT, quasi-experiment)"]
    B --> C["Measured outcomes<br/>LS / DALY / depression scale"]
    C --> D["Translation layer<br/>mapping, calibration,<br/>assumptions"]
    D --> E["Common currency<br/>WELLBY / DALY / $"]
    E --> F["Decision /<br/>deliberation"]
How to read this diagram (click to collapse)
  • Intervention → Study design: The program being evaluated is studied through some research design (RCT, quasi-experiment, etc.)
  • Study design → Measured outcomes: Studies measure different things—some use life satisfaction (LS), others use DALYs or depression scales
  • Measured outcomes → Translation layer: Different metrics must be mapped or calibrated to enable comparison
  • Translation layer → Common currency: The goal is a single unit (WELLBYs, DALYs, or dollars) enabling "apples-to-apples" comparison
  • Common currency → Decision: Funders use the common currency to prioritize interventions

Each arrow involves assumptions that can introduce error or bias. The workshop focuses on where these assumptions are most likely to matter.

Workshop goals: (1) Clarity about which assumptions matter most for which comparisons, and what evidence would change views. (2) Share information and synthesize participant expertise. (3) Generate practical insights and actionable recommendations for funders working with current evidence.

2. Definitions and key concepts

WELLBY Definition

1 WELLBY = a one-point change in life satisfaction (0-10 scale) × 1 person × 1 year

Standard LS question (ONS/UK): "Overall, how satisfied are you with your life nowadays? Please answer on a scale of 0 to 10, where 0 means 'not at all satisfied' and 10 means 'completely satisfied'."

Source: UK Green Book Wellbeing Guidance (HM Treasury, 2021/2024)

Origins, alternative definitions, and adoption

Original proposal: Frijters, Clark, Krekel & Layard (2020) introduced WELLBYs in Health Economics as a unit for comparing wellbeing gains across policy domains.

Alternative definitions: Most usage defines WELLBY using a 0-10 life evaluation (the life satisfaction question; some applications use the Cantril ladder), but some researchers use affect-based measures (experienced happiness). The choice matters: life satisfaction captures evaluative wellbeing; affect captures momentary experience.

Organizational adoption:

Standard Life Satisfaction Questions

OECD single-item: "Overall, how satisfied are you with your life as a whole these days?" (0 = "not at all satisfied" to 10 = "completely satisfied")[2]OECD Guidelines on Measuring Subjective Well-being (2013/2024). Question modules available at oecd.org.

Cantril ladder (Gallup): "Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you, and the bottom represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?"

These two framings—satisfaction vs. ladder position—are often used interchangeably, but may capture subtly different constructs. Cross-study comparisons should note which instrument (survey question format) was used.

Incremental vs. Level-Based Frameworks

Incremental WELLBYs[18]"Incremental" vs "level-based" is descriptive language used in this document. The literature typically just refers to "WELLBYs" with context clarifying whether the focus is on changes (differences) or absolute levels. Incremental WELLBYs are captured by observing differences between comparable treated and untreated populations—useful when mortality effects are negligible or balanced across groups.:

$\Delta W(k) = \sum_{i} \sum_{t} \delta^t \left( LS_{it}^{(k)} - LS_{it}^{(0)} \right)$

Where $LS_{it}$ is person $i$'s life satisfaction at time $t$ (on a 0-10 scale), $\delta$ is a discount factor for future periods, $k$ indexes the intervention, and $0$ is the counterfactual (no intervention). The sum runs over all individuals $i$ and time periods $t$. See notation key below for details.

Level-based WELLBYs (relevant for comparing interventions that affect mortality or, in some conceptual frameworks, birth rates):

$W = \sum_{i} \sum_{t} \delta^t \, LS_{it}$
Notation key
  • $i$ = individual (summing across people)
  • $t$ = time period (summing across years)
  • $LS$ = Life Satisfaction score (0-10)
  • $\delta$ = discount factor for future years
  • $k$ = intervention; $0$ = counterfactual

In practice, RCTs estimate $LS^{(k)} - LS^{(0)}$ directly via experimental comparison of treatment and control group outcomes.
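The incremental formula above can be sketched in code. This is an illustrative implementation with made-up LS values, not an estimate from any study:

```python
# Illustrative sketch of the incremental WELLBY formula:
# Delta_W = sum over people i and years t of delta**t * (LS_it_k - LS_it_0).

def incremental_wellbys(ls_treated, ls_counterfactual, delta=1.0):
    """Discounted sum of LS differences over people (rows) and years (columns)."""
    total = 0.0
    for person_k, person_0 in zip(ls_treated, ls_counterfactual):
        for t, (ls_k, ls_0) in enumerate(zip(person_k, person_0)):
            total += delta ** t * (ls_k - ls_0)
    return total

# Two people tracked for two years, each gaining 0.5 LS points per year:
incremental_wellbys([[6.5, 6.5], [7.5, 7.5]], [[6.0, 6.0], [7.0, 7.0]])  # 2.0 WELLBYs
```

With delta below 1, later years contribute less, so the same reported gains yield fewer WELLBYs.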

Technical definitions (reporting function, instrument, latent distribution)

Instrument: The specific measurement tool—exact question wording, response format (0-10 vs 1-7), anchors, translation, survey mode.

Reporting function: The internal process by which a person translates their true wellbeing ($u$) into a number on the survey scale. Formally: $LS_{it} = f_i(u_{it}) + \varepsilon_{it}$. Different people may have different reporting functions—one person's "7" might correspond to another's "5" for the same underlying welfare level. This is the core of scale-use heterogeneity.

Latent distribution: The unobserved underlying welfare distribution. Since we only see reported scores, conclusions can depend on assumptions about this hidden distribution.

3. Core assumptions

Using linear WELLBYs for cross-intervention comparison requires assumptions.[B]Benjamin, Cooper, Heffetz & Kimball (2024). "From Happiness Data to Economic Conclusions." Annual Review of Economics. Articulates four key assumptions for SWB-based welfare analysis. See also Frijters & Krekel (2021), A Handbook for Wellbeing Policy-Making, and UK Green Book supplementary guidance on wellbeing (HM Treasury, 2021/2024). The claim "equal scores mean equal welfare" is stronger than most applications need—what matters for intervention comparison is whether changes in scores reflect comparable welfare changes.

Note: The Benjamin et al. framework offers limited direct focus on comparing interventions in LMICs using RCT data—the central context for our workshop. There may be a research gap here, though HLI has been doing significant applied work in this space—more discussion is needed on how their research connects to these conceptual frameworks. (Tessa Peasgood kindly shared this paper with workshop participants.)

Cardinality (linearity) ⚠ Key contested assumption

Equal intervals on the scale imply equal welfare differences: a move from 3→4 yields the same welfare gain as a move from 7→8. If violated, summing may distort comparisons.[12]Plant (2024) explores conditions under which treating LS as cardinal is defensible. See also HLI's cost-effectiveness methodology.

Many practitioners consider this the most consequential assumption—and the least obviously defensible. See the Bond & Lang critique for why scale transformations matter.

Personal note (David Reinstein): Of the four assumptions, this is the one I find least plausible and most important. If the scale is not cardinal, then summing and averaging life satisfaction scores is not meaningful, and intervention comparisons built on such sums may be unreliable.

Unit-change comparability

A one-point change has approximately the same welfare meaning across people. This is weaker than requiring equal levels to mean equal welfare.[23]Strictly, interpersonal comparability of levels (LS_A = 7 implies U_A = U_B when LS_B = 7) is not necessary for intervention comparison. If Person A experiences higher welfare at all reported scores but the differences between scores are comparable across individuals, then within-person changes can still be meaningfully compared. In other words, we need comparability of scale intervals, not of scale positions. This weaker condition is what "unit-change comparability" captures.

Why this is sufficient: For comparing intervention effects, we don't need equal levels to mean equal welfare—only that changes are comparable. If Person A is always 2 points happier than B at any objective welfare level, that doesn't bias intervention comparisons as long as a 1-point gain means the same for both.

Trade-off implication: Would you trade a 1-point LS gain lasting 2 years for a 2-point gain lasting 1 year? If a one-point change has the same welfare meaning everywhere on the scale, these should be equivalent (2 WELLBYs each). If not—if gains at higher levels feel "less valuable"—the assumption is violated.

Temporal aggregation

Integrating wellbeing over time is meaningful. May fail if adaptation returns people to baseline, or if respondents reinterpret the scale over time (response shift).

Cross-domain capture (contested)

Life satisfaction aims to incorporate welfare from many domains (health, income, relationships). However, substantial evidence suggests LS does not capture everything people care about. People often knowingly choose options that yield lower life satisfaction for the sake of other valued outcomes.[21]Benjamin, Heffetz, Kimball & Rees-Jones (2012, 2014) show people choose options yielding lower LS for other things they value. Benjamin et al. (2026) "What Do People Want?" finds LS itself is relatively low on the list of things people want—and social desirability concerns lead people to understate how much they care about mental health. Neither DALYs nor life satisfaction may be adequate summary measures; multi-dimensional approaches capturing both physical disability and mental health—plus careful tradeoff elicitation—may be needed.

Time structure and discounting

Why this matters for intervention comparison: WELLBY calculations aggregate wellbeing gains over time. If Intervention A produces a 0.5-point LS boost for 2 years while Intervention B produces a 0.3-point boost for 5 years, B yields more total WELLBYs—but only if the effects actually persist as assumed. The time dimension is often the largest source of uncertainty.

Most studies measure outcomes at baseline and one or two follow-ups;[15]LMIC examples: Haushofer & Shapiro (2016, 2018) measure SWB at 9 months and 3 years after Kenya cash transfers; StrongMinds studies typically follow up at 3-6 months post-treatment; GiveDirectly studies (Kenya, Uganda) measure at 1-3 years; BRAC graduation programs measure at 2+ years. Most LMIC wellbeing RCTs have 1-2 follow-up points with limited long-term tracking. extrapolating beyond measured timepoints requires assumptions about persistence (does the effect last?) and discounting (are future wellbeing gains worth less than present ones?).
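The duration point above can be made concrete with a small sketch. Effect sizes, durations, and the discount rate here are illustrative assumptions, not estimates from the literature:

```python
# Intervention A: +0.5 LS points for 2 years; Intervention B: +0.3 points
# for 5 years (per recipient). B wins on total WELLBYs, but only if the
# effect actually persists as assumed.

def wellbys_per_person(effect_points, duration_years, discount_rate=0.0):
    """Discounted WELLBYs for a constant effect lasting a whole number of years."""
    return sum(effect_points / (1 + discount_rate) ** t
               for t in range(duration_years))

a = wellbys_per_person(0.5, 2)  # 1.0 WELLBY undiscounted
b = wellbys_per_person(0.3, 5)  # 1.5 WELLBYs undiscounted
```

Raising the discount rate or truncating B's assumed persistence shrinks its advantage, which is why the time dimension is often the largest source of uncertainty.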

Response Shift: a distinct threat

Even if baseline scale-use heterogeneity cancels out[20]"Cancels out" means: in a randomized trial, individual differences in scale use (shifters) are equally distributed between treatment and control groups, so when we compare average changes, these individual differences subtract out. This is similar to how fixed effects absorb level differences. (by randomization), treatment can change the meaning of the respondent's self-evaluation. This is called response shift: changes in internal standards, values, or conceptualization that alter how respondents answer over time.[rs]Sprangers & Schwartz (1999). "Integrating response shift into health-related quality of life research." Social Science & Medicine.

For wellbeing interventions—especially psychosocial programs that may explicitly reframe cognition—response shift is a genuine concern, not merely a theoretical possibility. If treatment changes the reporting function $f_i(\cdot)$, observed $\Delta LS$ mixes "true welfare change" with "scale change," potentially biasing the WELLBY estimate in either direction.

4. A key critique: identification and transformations

Bond & Lang (2019) argue that with ordinal happiness data, comparing average wellbeing between groups (e.g., "are Germans happier than Americans?") is not identified without strong assumptions—monotonic transformations can reverse conclusions.[3]Bond, T.N. & Lang, K. (2019). "The Sad Truth about Happiness Scales." JPE, 127(4). However, our focus is on intervention effects, where randomization mitigates some (but not all) of these concerns—see box below.

Response: Kaiser & Lepinteur (2023) argue that Bond & Lang's critique, while mathematically correct, relies on transformations that are empirically implausible. In practice, the transformations needed to reverse conclusions often require welfare to be decreasing in reported happiness at certain ranges—which contradicts the basic validity of self-reports. When restricting to plausible monotonic transformations, most happiness findings remain robust. See: Kaiser & Lepinteur, "The Reliability of Ordinal Happiness Data" (JOHS, 2023).

Our focus is comparing specific interventions (e.g., "does StrongMinds produce more wellbeing per dollar than cash transfers?"), not ranking countries by average happiness. Bond & Lang's core examples involve comparing average happiness between groups (e.g., "are Germans happier than French?")—a context where the identification problem is most severe. For within-study intervention comparisons—randomized treatment vs. control, especially in LMIC RCTs—randomization ensures similar baseline scale-use distributions, so we are comparing changes within comparable groups rather than absolute levels across different populations. The identification problem resurfaces when comparing effect sizes across studies conducted in different populations (e.g., a depression intervention in Uganda vs. a cash transfer in Kenya), or when treatment and control groups differ systematically at baseline.

What "non-identified" means (technical detail)

A parameter is identified when data + assumptions pin down a unique value. With ordinal responses (someone reports "7"), we only know their underlying welfare falls in some range corresponding to "7"—we don't know the exact welfare level.

Concrete example: Suppose Person A reports "7" and Person B reports "6". We know A reported higher, but not how much higher in welfare terms. Consider two possible mappings from reported scores to actual welfare:

  • Mapping 1 (linear): "6" = 60 utils, "7" = 70 utils → A is 10 utils better off
  • Mapping 2 (concave): "6" = 50 utils, "7" = 52 utils → A is only 2 utils better off

Both mappings are consistent with the observed data (A reported higher than B). Without additional structure (like assuming the scale is linear), we cannot determine which mapping is correct. This is why average welfare comparisons are "non-identified"—the data alone cannot tell us the magnitude of the difference.

Why this matters for intervention comparison

Transformation Sensitivity Demo

What this shows: Bond & Lang's critique hinges on the fact that we only observe ordinal responses (1, 2, 3...), not the underlying welfare. Any monotonic transformation $g(x)$ that preserves order is equally consistent with the data—but different transformations can reverse which group has higher mean welfare.

What is a monotonic transformation? A function that preserves ordering—if A>B before, then g(A)>g(B) after. Examples: squaring (x²), square root (√x), or log(x). These change the spacing between values without changing which is larger.

Below, $g(x) = x^θ$ transforms raw LS scores. When θ=1 (linear), LS is used directly. When θ>1 (convex), high scores are stretched more than low scores. When θ<1 (concave), the opposite.


Try it: In "effects" mode, move θ from 1.0 toward 2.0 and watch the ranking flip. Intervention B has a larger raw effect (2 vs 1 point), but A's effect occurs at higher LS levels. Under convex transformations (θ>1), gains at high levels are amplified—so A can dominate despite smaller raw gains.
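The flip described above can be reproduced numerically. The LS values below are illustrative choices (A moves at the high end of the scale, B at the low end), not data from any study:

```python
# A monotonic transform g(x) = x**theta can reverse which intervention
# shows the larger mean welfare gain, even though it preserves ordering.

def transformed_gain(pre, post, theta):
    return post ** theta - pre ** theta

raw_a = transformed_gain(8, 9, theta=1)  # 1 point
raw_b = transformed_gain(2, 4, theta=1)  # 2 points: B looks better on raw scores
convex_a = transformed_gain(8, 9, theta=2)  # 81 - 64 = 17
convex_b = transformed_gain(2, 4, theta=2)  # 16 - 4 = 12: the ranking flips
```

Because both mappings are consistent with the ordinal data, nothing in the responses themselves tells us which theta is "correct".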

5. Scale-use heterogeneity: shifters vs. stretchers

When comparing different interventions using WELLBY estimates, a key concern is that different people (or populations) may use the life satisfaction scale differently. A useful framework is the affine model, which separates two types of scale-use differences:[11]The "shifters vs. stretchers" framework derives from Benjamin et al. (2012, 2014, 2023). See also Oswald (2008) and Kaiser & Oswald (2022) on scale-use heterogeneity. This matters especially when comparing intervention effects measured in different study populations—for example, a mental health program in Uganda vs. a cash transfer in Kenya.

$u_{it} = a_i + b_i \cdot LS_{it}$

Here $u_{it}$ is person $i$'s true welfare at time $t$, and $LS_{it}$ is their reported life satisfaction. The parameters $a_i$ (shift) and $b_i$ (stretch) capture individual scale-use patterns.

Shifters (different $a_i$)

Different intercepts: "some people always report +2 higher." Levels are not comparable, but differences are: $\Delta u = b \cdot \Delta LS$.

Stretchers (different $b_i$)

Different slopes: "some people compress the scale." Both levels AND differences are non-comparable: if person A has $b_A=0.5$ and person B has $b_B=1.5$, identical reported changes (ΔLS=2) correspond to different welfare changes ($\Delta u_A=1$ vs $\Delta u_B=3$).

Benjamin et al. propose calibration questions to identify and adjust for scale-use heterogeneity—questions designed to have the same objective answer across respondents.[4]Benjamin et al. (2023). "Adjusting for Scale-Use Heterogeneity." NBER WP 31728.

Shifter vs. Stretcher Demo

Compare two populations with different scale use. See why stretchers distort intervention comparisons.

Population A
Population B
Why fixed effects only remove shifts

Fixed effects absorb level differences (the $a_i$ terms). But if people have different $b_i$ (stretch factors), the implied welfare change per reported point differs.
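A minimal sketch of this point, using the b values from the stretcher example above (everything else is illustrative):

```python
# Affine model u = a + b * LS: the shift a cancels when we take differences,
# (a + b*LS2) - (a + b*LS1) = b * delta_LS, but the stretch b does not.

def welfare_change(delta_ls, a, b):
    return b * delta_ls  # a has already cancelled out

delta_u_A = welfare_change(2, a=3.0, b=0.5)  # 1.0
delta_u_B = welfare_change(2, a=0.0, b=1.5)  # 3.0: same reported change, 3x the welfare gain
```

So fixed effects (or randomization) neutralize shifters, but heterogeneous stretch factors still distort cross-group comparisons of reported changes.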

6. Neutral point and mortality

When comparing interventions that affect mortality (e.g., bednets) against those that affect only wellbeing among the living (e.g., mental health programs), the choice of "zero point" becomes critical. Two different "zeros" are relevant (see HLI's "The Elephant in the Bednet" for a comprehensive treatment):

For incremental comparisons among the living, the neutral point often cancels out—that is, when comparing ΔLS between intervention and control groups, both are measured from the same implicit baseline, so the zero point subtracts from both sides and doesn't affect the difference.[14]Mathematically: (LS_treatment − LS_0) − (LS_control − LS_0) = LS_treatment − LS_control. The neutral point LS_0 cancels. But for mortality comparisons—"WELLBYs from life extension = life-years × average wellbeing"—the origin is load-bearing.

Neutral Point / Mortality Demo


Scenario: Mortality intervention prevents a death, yielding 40 additional life-years at average LS = 5.

If neutral = 0: Benefit = 40 × 5 = 200 WELLBYs

If neutral = 2: Benefit = 40 × (5 − 2) = 120 "above-neutral" WELLBYs

When does the neutral point matter? When comparing interventions that affect mortality to those that don't. For comparisons among living people only, the neutral point typically cancels out: since both treatment and control are measured from the same baseline, subtracting that baseline from both sides leaves the difference unchanged.[14]
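The demo's arithmetic, as a sketch (the 40 life-years and LS = 5 are the scenario's illustrative inputs):

```python
# WELLBYs from life extension = life-years x (average LS - neutral point).
# The choice of neutral point directly scales the benefit of preventing a death.

def life_extension_wellbys(life_years, avg_ls, neutral_point):
    return life_years * (avg_ls - neutral_point)

life_extension_wellbys(40, 5, neutral_point=0)  # 200 WELLBYs
life_extension_wellbys(40, 5, neutral_point=2)  # 120 WELLBYs
```

A shift of the neutral point from 0 to 2 cuts the estimated benefit by 40 percent here, while leaving incremental comparisons among the living untouched.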

Empirical neutral point estimates (0.6 to 6.0 range) — click to expand

Recent work has attempted to estimate the neutral point empirically. The estimates vary widely (0.6 to 6.0 on a 0-10 scale) depending on the elicitation method and sample. HLI is advising on the UK Green Book wellbeing guidance and has compiled the following estimates:[*]Table compiled by Samuel Dupret (HLI), shared March 2026. Note that different questions elicit very different values—asking about "life no longer worth living" yields lower estimates than asking about "minimally acceptable" levels.

| Source | Value | Method | Sample |
|---|---|---|---|
| Samuelsson et al. (2023), HLI pilot | 1.26 | Asked when life is no longer worth living, on 0-10 LS scale | N=79, UK |
| Samuelsson et al. (2023), HLI pilot | 5.30 | Asked where the balance between satisfied/dissatisfied falls, on 0-10 LS | N=128, UK |
| Peasgood et al. (unpublished) | 2.00 | Time trade-offs (QALY method rather than wellbeing scale) | N=75, UK |
| IDinsight Beneficiary Survey (2019) | 0.56 | "At what point on the ladder is it worse than dying?" | N=70, Ghana & Kenya |
| Moss (Rethink Priorities, unpublished) | 2.49 | Asked level at which one prefers being alive to dead (converted 0-100 → 0-10) | N=35, likely UK |
| Moss (Rethink Priorities, unpublished) | 6.05 | Asked minimally acceptable level to live an extra year | N=101, likely UK |
| Jamison et al. (forthcoming) | 2.39 | Policy comparison: saving people from dying (0-100 → 0-10) | N=1800 (Brazil, China, UK) |
| Jamison et al. (forthcoming) | 2.54 | Policy comparison: saving people from non-existence (0-100 → 0-10) | N=1800 (Brazil, China, UK) |

Key observations:

  • Question framing matters enormously: "Life no longer worth living" (1.26) vs "minimally acceptable" (6.05) yields 5× difference
  • The only LMIC estimate (IDinsight, Ghana/Kenya) is the lowest at 0.56—though small sample and different question framing
  • Jamison et al. provides the largest cross-country sample (N=1800 across Brazil, China, UK) with estimates around 2.4-2.5
  • HM Treasury (UK Green Book) currently uses 1.0 (based on suicide rates); Frijters & Krekel recommend 2.0

7. Evidence and alternatives

Reliability: noisy but not useless

Single-item life evaluations have test-retest correlations around 0.5-0.7 over short windows.[5]Krueger & Schkade (2008). "The reliability of subjective well-being measures." This means measurement error attenuates estimated effects—small real effects may be undervalued. For relative comparisons: if measurement error is similar across interventions and contexts, attenuation affects absolute magnitudes but may preserve relative rankings. However, if noise levels differ (e.g., different survey modes, cultural contexts, or outcome domains), relative comparisons become less reliable.
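A sketch of why this matters for effect sizes. Under the classical measurement-error model, noise leaves the raw mean difference unbiased but inflates the outcome SD, so standardized (SD-unit) effects shrink by the square root of the reliability; the numbers below are illustrative:

```python
import math

# Attenuation of a standardized effect under classical measurement error:
# observed_d = true_d * sqrt(reliability), where reliability is the share
# of observed variance that is true-score variance.

def attenuated_d(true_d, reliability):
    return true_d * math.sqrt(reliability)

attenuated_d(0.20, 0.6)  # ~0.15: a true 0.20 SD effect reads as ~0.15 SD
```

If reliability differs across surveys or contexts, the attenuation differs too, which is one way relative rankings can be distorted.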

Predictive validity

What this means: "Predictive validity" asks whether LS scores relate to real-world behaviors and outcomes in expected ways. If LS contains meaningful information about welfare, lower scores should predict people making costly changes (moving, quitting jobs, leaving relationships) to improve their situation.

Kaiser & Oswald show that single numeric feelings responses do predict these consequential outcomes—relationships tend to be replicable and approximately linear.[6]Kaiser & Oswald (2022). "The Scientific Value of Numerical Measures of Human Feelings." PNAS. See also Plant (2024) for additional evidence that LS measures predict consequential outcomes and respond to changes (like income transfers) we expect to matter. This is notable because a single "made-up feelings integer" turns out to have more predictive power for subsequent costly actions (e.g., moving house, leaving a partner, quitting a job, visiting a hospital) than a collection of standard socioeconomic variables including income, employment, and education.[27]Kaiser & Oswald (2022), Table 2: a single feelings integer outperforms combined socioeconomic predictors for "get-me-out-of-here" actions across UK, German, and Australian panel data. For example, moving from LS=3 to LS=7 is associated with roughly halving the probability of leaving one's neighbourhood in the subsequent period. This suggests LS captures something meaningful, though it doesn't directly prove magnitudes are comparable across people.

Note: Predictive validity is necessary but not sufficient for using WELLBYs. A measure could predict outcomes yet still be noisy, inconsistent across groups, or respond to treatments in ways that don't reflect welfare gains. It establishes that LS isn't pure noise—not that it's ready for precise cross-intervention comparison.

Response times as identification: Liu & Netzer (2023) propose using survey response times to help solve identification problems in ordered response models.[28]Liu, S. & Netzer, N. (2023). "Happy Times: Measuring Happiness Using Response Times." AER, 113(12), 3289-3322. The intuition: if two people both report "7" but one answers instantly while another deliberates, the fast responder is likely further from the boundary between "7" and adjacent categories—their latent happiness is more clearly in the "7" range. This "chronometric effect" provides distributional information that the response alone cannot. The counterintuitive insight is that how quickly someone answers a happiness question contains information about the distribution of their underlying wellbeing—not just their point estimate. Their evidence is mixed: conventional distributional assumptions are rejected in some cases but broadly supported overall.

LMIC evidence: GiveDirectly cash transfers

Kenya RCT (Haushofer & Shapiro)

  • Short-run (~9 months): Life satisfaction +0.17 SD, happiness +0.16 SD (WVS measures)
  • Long-run (~3 years): Life satisfaction +0.08 SD (statistically significant); psychological wellbeing index +0.16 SD

These findings show LS measures have detectable signal in LMIC RCTs, not just noise—and effects can persist.[7]Haushofer & Shapiro (2016). "The Short-term Impact of Unconditional Cash Transfers." QJE. Also 2018 long-term follow-up working paper.

More evidence: See HLI's comprehensive GiveDirectly summary for additional studies and meta-analytic estimates of cash transfer effects on wellbeing.

The outcome translation problem

Many LMIC mental health studies report depression scales or symptom indices, not standard 0-10 life satisfaction. GiveWell's assessment of StrongMinds explicitly highlights uncertainty in translating depression improvements into life satisfaction gains.[gw]GiveWell (2023). "Our Assessment of Happier Lives Institute's Cost-Effectiveness Analysis of StrongMinds." Notes substantial uncertainty in depression-to-LS mapping.

Even if you accept WELLBY as the target unit, outcome translation forces choices:

  • Use DALYs/QALYs (more standard in health evaluation), even if they miss non-health welfare
  • Use life satisfaction directly, but only where trials collect it
  • Use mapping models—statistical relationships estimated from observational data that predict life satisfaction from other measures (e.g., "a 1-SD reduction in PHQ-9 depression score corresponds to ~0.4 points higher LS")—and carry the mapping uncertainty explicitly
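The mapping-model route can be sketched as follows. The coefficient is the text's illustrative ~0.4 LS points per 1-SD PHQ-9 reduction; every other input is a hypothetical assumption, and real mappings carry substantial uncertainty:

```python
# Hypothetical mapping-model sketch: depression-scale effect -> WELLBYs
# via an assumed LS-per-SD coefficient. All inputs are assumptions.

def depression_effect_to_wellbys(effect_sd, ls_per_sd, duration_years, recipients):
    return effect_sd * ls_per_sd * duration_years * recipients

# 0.5 SD PHQ-9 effect x 0.4 LS/SD x 1.5 years x 1,000 recipients:
depression_effect_to_wellbys(0.5, 0.4, 1.5, 1000)  # ~300 WELLBYs
```

Note that the final estimate scales one-for-one with the mapping coefficient, so uncertainty in that coefficient propagates directly into the WELLBY total.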

Comparison with alternatives

| Metric | Strengths | Weaknesses |
|---|---|---|
| WELLBY | Captures non-health welfare; direct self-report; low burden | Scale-use and comparability assumptions; cross-study issues |
| DALY[17]DALYs (Disability-Adjusted Life Years) = Years of Life Lost + Years Lived with Disability. Disability weights come from Global Burden of Disease population surveys where respondents rate hypothetical health states—not from the affected individual's own experience. | Standardized globally (GBD); large evidence base; direct mortality link; institutional acceptance | May miss non-health welfare; mental health disability weights derived from hypothetical ratings rather than patient experience; whether a year with disability is truly equivalent to the corresponding fraction of a year of life lost remains contested[29]The DALY framework assumes that a disability weight of, say, 0.5 means living one year in that state is equivalent to losing half a year of life. This equivalence is built into the metric by construction, not empirically verified. For mental health conditions, where subjective experience may diverge from external ratings, this assumption is particularly uncertain. |
| QALY[30]QALYs (Quality-Adjusted Life Years) = years of life × health utility weight (0-1). Unlike DALYs, QALYs often derive weights from patient assessments via instruments like EQ-5D, making them somewhat closer to self-report—though still anchored to health-state descriptions rather than overall life satisfaction. | Patient-derived weights (via EQ-5D etc.); widely used in health technology assessment (e.g., NICE) | Narrower than WELLBYs (health states only; may miss empowerment, meaning, security); EQ-5D may not capture mental health conditions well |
| Scale-adjusted SWB[22]Scale-adjusted SWB: subjective wellbeing measures adjusted for scale-use heterogeneity using techniques from Benjamin et al. (2023)—e.g., vignette anchoring or statistical correction for "shifters" and "stretchers." Aims to make scores more comparable across individuals and groups. | May reduce scale-use bias (magnitude context-dependent) | Complex; LMIC feasibility unclear; new assumptions |

8. WELLBY Calculator

Incremental WELLBY Estimate

Enter treatment effect, duration, recipients, and cost to estimate total WELLBYs and cost-effectiveness.

[Interactive calculator — example output: 1,000 total WELLBYs; cost per WELLBY: $100]

This calculator assumes constant effect size. Real applications should account for effect decay, discounting, and uncertainty.
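The calculator's arithmetic can be sketched in a few lines of Python. The function name and the optional decay/discount parameters are illustrative additions (one possible way to handle the caveats above), not part of the calculator itself; with the defaults of zero it reproduces the constant-effect calculation.

```python
def wellby_estimate(effect_ls_points, duration_years, n_recipients, total_cost,
                    annual_decay=0.0, discount_rate=0.0):
    """Incremental WELLBYs = LS-point effect x years x recipients.

    With annual_decay and discount_rate left at zero this matches the
    constant-effect calculation; nonzero values illustrate one way to
    model effect decay (geometric) and discounting (exponential).
    """
    total_wellbys = 0.0
    for t in range(int(duration_years)):
        effect_t = effect_ls_points * (1 - annual_decay) ** t   # effect in year t
        total_wellbys += effect_t * n_recipients / (1 + discount_rate) ** t
    return total_wellbys, total_cost / total_wellbys

# Constant effect: 0.5 LS points x 2 years x 1,000 recipients, $100,000 cost
total, cost_per = wellby_estimate(0.5, 2, 1_000, 100_000)
# -> 1,000 total WELLBYs at $100 per WELLBY
```

Adding, say, `annual_decay=0.3` shrinks the second-year effect to 0.35 LS points and the total to 850 WELLBYs, showing how quickly the constant-effect assumption can move the bottom line.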

9. Considerations for discussion

The following are practices and considerations found in the literature—not workshop conclusions or endorsements. A key goal of the workshop is to assess which of these, if any, represent genuine improvements over current practice, and what their practical limitations are.

What are funders currently doing?

Some funders and evaluators (e.g., HLI, Founders Pledge) report using a range of such approaches in their cost-effectiveness analyses.

Whether these practices adequately address the measurement concerns discussed above is itself a question for discussion.

What methodological improvements have been proposed?

Researchers have proposed various approaches to strengthen WELLBY-based comparisons, each involving trade-offs between feasibility and rigor.

Under what conditions might WELLBY comparisons be more reliable?

Several conditions have been suggested in the literature as potentially supporting stronger inference, though each carries its own caveats.

10. Open questions (research agenda)

The following questions have been identified by researchers in this area (including workshop participants and authors cited above) as areas where additional evidence could meaningfully improve the reliability of WELLBY-based comparisons:

  1. Neutral point estimation: What is the actual neutral point on the 0-10 scale for different populations? How stable is it across contexts?[16]
  2. Scale-use heterogeneity mapping: How do shifters vs. stretchers vary across LMIC populations, and can we predict which matters more?
  3. Cheap calibration methods: Can vignettes, anchoring questions, or other calibration approaches work in low-resource settings without excessive burden?
  4. WELLBY-DALY relationship: What's the mapping between WELLBYs and DALYs, and is it linear? How much does it vary by health condition?
  5. Demand effects and response shift: How do experimenter demand effects and response shift vary by intervention type?

11. Workshop prompts

Neutral prompts for workshop deliberation:

1. For which classes of intervention comparisons (same setting/instrument vs. cross-study) does the linear WELLBY seem most defensible, and why?

2. Which assumptions are most likely to be materially violated in LMIC contexts: linearity, intertemporal comparability, interpersonal comparability, or scale-use heterogeneity?

3. When does the neutral point become decision-relevant? Which "zero" do you have in mind?

4. How should analysts treat "mapping" between depression scales and life satisfaction when LS isn't measured? What minimum evidence would make a mapping credible?

5. Which low-burden calibration approaches seem most promising for LMIC settings?

6. How should the choice between DALYs, QALYs, and WELLBYs depend on the type of intervention being evaluated? Are there cases where one metric is clearly more appropriate?

7. Given current evidence and uncertainty, what would you want to see change in how funders use WELLBY estimates? What would make you more (or less) confident in their use for cross-intervention comparison?

References

  1. Frijters, P., Clark, A.E., Krekel, C. & Layard, R. (2020). "A Happy Choice: Wellbeing as the Goal of Government." Behavioural Public Policy, 4(2).
  2. OECD (2013; 2025 update). Guidelines on Measuring Subjective Well-being. The 2025 update incorporates a decade of measurement experience and concludes that SWB data are meaningful for policy despite critiques, while recommending improved harmonization.
  3. Bond, T.N. & Lang, K. (2019). "The Sad Truth about Happiness Scales." Journal of Political Economy, 127(4).
  4. Benjamin, D.J. et al. (2023). "Adjusting for Scale-Use Heterogeneity." NBER WP 31728.
  5. Krueger, A.B. & Schkade, D.A. (2008). "The reliability of subjective well-being measures." Journal of Public Economics.
  6. Kaiser, C. & Oswald, A.J. (2022). "The Scientific Value of Numerical Measures of Human Feelings." PNAS, 119(42).
  7. Haushofer, J. & Shapiro, J. (2016). "The Short-term Impact of Unconditional Cash Transfers." QJE. See also 2018 long-term follow-up.
  8. HM Treasury (2021/2024). Wellbeing Guidance for Appraisal: Supplementary Green Book Guidance.
  9. Frijters, P., Clark, A.E., Krekel, C. & Layard, R. (2020). "A Happy Choice: Wellbeing as the Goal of Government." Behavioural Public Policy, 4(2). Introduces the WELLBY framework.
  10. GiveWell (2023). "Our Assessment of Happier Lives Institute's Cost-Effectiveness Analysis of StrongMinds."
  11. Benjamin, D.J., Heffetz, O., Kimball, M.S. & Szembrot, N. (2014). "Beyond Happiness and Satisfaction: Toward Well-Being Indices Based on Stated Preference." AER, 104(9). The shifters/stretchers framework is elaborated in Benjamin et al. (2023) NBER WP 31728.
  12. Plant, M. (2024). "A Happy Possibility About Happiness (And Other Subjective) Scales: An Investigation and Tentative Defence of the Cardinality Thesis." Wellbeing Research Centre working paper.
  13. McGuire, J. et al. (2024). "The Wellbeing Cost-Effectiveness of StrongMinds and Friendship Bench." Happier Lives Institute Report. Note: HLI's scope is global, not limited to LMICs—they evaluate interventions worldwide. See also their GiveDirectly analysis, response to GiveWell, and other CEAs.
  14. When comparing intervention effects for living people, the neutral point cancels algebraically: (LS_treatment − LS₀) − (LS_control − LS₀) = LS_treatment − LS_control.
  15. LMIC study examples: Haushofer & Shapiro (2016, 2018) measured SWB at 9 months and 3 years post cash transfers in Kenya; StrongMinds evaluations typically follow up at 3-6 months.
  16. Limited LMIC-specific research exists on neutral point estimation. Peasgood et al. (2018) provide estimates for UK populations; whether these generalize to LMIC contexts is unclear.
  17. DALYs (Disability-Adjusted Life Years) = Years of Life Lost (YLL) + Years Lived with Disability (YLD). Disability weights are derived from Global Burden of Disease (GBD) population surveys where respondents rate hypothetical health states—not from the affected individual's own experience. WELLBYs, by contrast, ask each individual to rate their own life satisfaction directly.
  18. "Incremental" vs "level-based" is descriptive language used in this document to distinguish two common uses of WELLBYs. The academic literature typically just refers to "WELLBYs" with context clarifying whether the focus is on intervention effects (changes) or absolute welfare levels (for mortality calculations).
  19. Within-person designs reduce scale-use heterogeneity across individuals but may introduce demand effects—participants may feel motivated to report improvement to please researchers, especially for psychosocial interventions with non-blinded delivery.
  20. "Cancels out" in randomization context: if scale-use differences (shifters and stretchers) are randomly distributed between treatment and control groups, within-study comparisons remain valid—the average effect estimate is unbiased. The problem arises when (a) comparing across studies with different scale-use distributions, or (b) treatment itself changes the reporting function (response shift, discussed above), which randomization cannot address.
  21. Benjamin, D.J., Heffetz, O., Kimball, M.S. & Rees-Jones, A. (2012). "What Do You Think Would Make You Happier?" JEP; Benjamin et al. (2014) "Beyond Happiness and Satisfaction" AER. Benjamin, Cooper, Heffetz, Kimball & Kundu (2026). "What Do People Want?" (working paper) finds life satisfaction itself ranks relatively low among things people want. Importantly, social desirability concerns lead people to understate how much they care about mental health.
  22. Scale-adjusted SWB: Subjective wellbeing scores adjusted for scale-use heterogeneity—different people use the 0-10 scale differently ("shifters" use consistently higher/lower values; "stretchers" use more/less of the scale's range). Methods include vignette anchoring and statistical corrections. See Benjamin et al. (2023) "Adjusting for Scale-Use Heterogeneity" NBER WP 31728.
  23. Interpersonal comparability of levels (LS_A = 7 implies U_A = U_B when LS_B = 7) is not necessary for intervention comparison if we can instead assume that differences are equivalent across people. If Person A experiences higher welfare at all reported scores but the differences between scores are comparable, then within-person changes can still be meaningfully compared across individuals. This weaker condition—comparability of scale intervals rather than scale positions—is what "unit-change comparability" captures.
  24. Kaiser & Oswald (2022), Table 2: a single "made-up feelings integer" outperforms combined socioeconomic predictors (income, employment, education, children, homeownership) for predicting subsequent "get-me-out-of-here" actions—moving house, leaving a partner, quitting a job, visiting a hospital—across UK (BHPS), German (SOEP), and Australian (HILDA) panel data.
  25. Liu, S. & Netzer, N. (2023). "Happy Times: Measuring Happiness Using Response Times." American Economic Review, 113(12), 3289-3322. Response times provide information about the distribution of latent wellbeing through a "chronometric effect": people who are far from the boundary between adjacent response categories answer faster.
  26. The DALY framework assumes that a disability weight of, say, 0.5 means living one year in that state is equivalent to losing half a year of life. This equivalence is built into the metric by construction, not empirically verified. For mental health conditions, where subjective experience may diverge from external hypothetical ratings, this assumption is particularly uncertain.
  27. QALYs (Quality-Adjusted Life Years) = years of life x health utility weight (0-1 scale). Unlike DALYs, QALYs often derive weights from patient self-assessments via instruments like EQ-5D, making them somewhat closer to self-report. However, they remain anchored to health-state descriptions rather than overall life satisfaction, and may not capture mental health conditions well. See NICE technology appraisal guidance for institutional usage.

Related Analysis

For discussion of how to convert between DALYs/QALYs and WELLBYs:

DALY/QALY↔WELLBY Conversion →

Note: AI-generated draft requiring verification