⚠️ AI-Generated Content (March 2026)
This page was generated with AI assistance (Claude Code + ChatGPT deep research) and revised based on 75+ Hypothes.is workshop comments.
The content aims to be workshop-neutral, framing issues for deliberation rather than prescribing conclusions. Readers should verify quantitative claims against the original literature.
1. The decision problem
Organizations comparing interventions—especially in low- and middle-income countries (LMICs)—face a measurement problem: interventions change different things (mortality, morbidity, consumption, mental health, social cohesion). The WELLBY approach proposes translating these into a common unit based on subjective wellbeing, enabling "welfare impact per dollar" comparisons.[1]Frijters, Clark, Krekel & Layard (2020). "A Happy Choice: Wellbeing as the Goal of Government."
A focal question for this workshop: How reliably can we compare interventions, especially in LMICs and across different studies and contexts, by aggregating changes in reported wellbeing (the WELLBY approach)? This is one of several comparison frameworks; others include DALY/QALY-based approaches, capability approaches, and direct monetary valuation. The workshop examines the linear WELLBY's reliability relative to these alternatives, not in isolation.
🔬 The Focal Case: StrongMinds vs AMF
The debate over whether mental health interventions outperform anti-malaria bednets illustrates why WELLBY methodology matters. This comparison—central to effective altruism discourse—hinges on contested assumptions about measuring and aggregating wellbeing gains.
The Interventions
- StrongMinds: Group interpersonal therapy (IPT-G) for depression, primarily in Uganda, Zambia, and other African countries. Community health workers deliver 12-week group sessions to women with moderate-to-severe depression.
- Against Malaria Foundation (AMF): Distributes long-lasting insecticide-treated bednets (LLINs) in malaria-endemic regions. GiveWell's top-rated charity for over a decade, with extensive mortality reduction evidence.
- Friendship Bench: Problem-solving therapy delivered by trained community health workers ("grandmother counselors") on park benches in Zimbabwe. The Chibanda et al. (2016) cluster RCT showed significant effects on depression symptoms.
- GiveDirectly: Unconditional cash transfers (~$1,000) to poor households in Kenya and Uganda. Haushofer & Shapiro (2016) RCT measured life satisfaction directly, finding a 0.16 SD improvement at 9 months.
The Translation Challenge
To compare these interventions, evaluators must translate different metrics into a common currency:
- AMF → WELLBYs: Convert DALYs averted to life-years, then assign WELLBY values to those years. Requires assumptions about the wellbeing level of lives saved.
- StrongMinds → WELLBYs: Convert depression scale changes (PHQ-9) to life satisfaction changes using cross-sectional correlations. Then extrapolate effect duration beyond measured follow-up.
- GiveDirectly → WELLBYs: Direct measurement of LS changes, but requires duration assumptions (do effects persist?).
📚 Further reading: See Unjournal evaluations of mental health research for independent expert assessments of the underlying evidence.
The measurement-to-decision pipeline illustrates why comparing interventions requires multiple translation steps. Each box represents a stage where methodological choices affect final conclusions:
```mermaid
flowchart LR
    A["Intervention (e.g., cash transfer, mental health program)"] --> B["Study design (RCT, quasi-experiment)"]
    B --> C["Measured outcomes: LS / DALY / depression scale"]
    C --> D["Translation layer: mapping, calibration, assumptions"]
    D --> E["Common currency: WELLBY / DALY / $"]
    E --> F["Decision / deliberation"]
```
How to read this diagram
- Intervention → Study design: The program being evaluated is studied through some research design (RCT, quasi-experiment, etc.)
- Study design → Measured outcomes: Studies measure different things—some use life satisfaction (LS), others use DALYs or depression scales
- Measured outcomes → Translation layer: Different metrics must be mapped or calibrated to enable comparison
- Translation layer → Common currency: The goal is a single unit (WELLBYs, DALYs, or dollars) enabling "apples-to-apples" comparison
- Common currency → Decision: Funders use the common currency to prioritize interventions
Each arrow involves assumptions that can introduce error or bias. The workshop focuses on where these assumptions are most likely to matter.
Workshop goals: (1) Clarity about which assumptions matter most for which comparisons, and what evidence would change views. (2) Share information and synthesize participant expertise. (3) Generate practical insights and actionable recommendations for funders working with current evidence.
2. Definitions and key concepts
WELLBY Definition
1 WELLBY = a one-point change in life satisfaction (0-10 scale) × 1 person × 1 year
Standard LS question (ONS/UK): "Overall, how satisfied are you with your life nowadays? Please answer on a scale of 0 to 10, where 0 means 'not at all satisfied' and 10 means 'completely satisfied'."
Source: UK Green Book Wellbeing Guidance (HM Treasury, 2021/2024)
Origins, alternative definitions, and adoption
Original proposal: Frijters, Clark, Krekel & Layard (2020) introduced WELLBYs in Health Economics as a unit for comparing wellbeing gains across policy domains.
Alternative definitions: Most usage defines the WELLBY using a 0-10 life-evaluation measure (a life satisfaction question or the Cantril ladder), but some researchers use affect-based measures (experienced happiness). The choice matters: life satisfaction captures evaluative wellbeing; affect captures momentary experience.
Organizational adoption:
- Happier Lives Institute (Plant et al.): Primary metric for charity comparison; extensive cost-effectiveness analyses[13]HLI cost-effectiveness analyses; see also Plant (2024) on cardinality.
- Founders Pledge: Uses WELLBYs alongside DALYs for mental health CEA
- GiveWell: Has explored WELLBY analysis but with significant reservations
- UK Government: Official guidance for policy appraisal (the "Green Book")
Standard Life Satisfaction Questions
OECD single-item: "Overall, how satisfied are you with your life as a whole these days?" (0 = "not at all satisfied" to 10 = "completely satisfied")[2]OECD Guidelines on Measuring Subjective Well-being (2013/2024). Question modules available at oecd.org.
Cantril ladder (Gallup): "Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you, and the bottom represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?"
These two framings—satisfaction vs. ladder position—are often used interchangeably, but may capture subtly different constructs. Cross-study comparisons should note which instrument (survey question format) was used.
Incremental vs. Level-Based Frameworks

Incremental WELLBYs (changes relative to the counterfactual):

$$\text{WELLBY}^{(k)} = \sum_{i} \sum_{t} \delta^{t} \left( LS_{it}^{(k)} - LS_{it}^{(0)} \right)$$

Where $LS_{it}$ is person $i$'s life satisfaction at time $t$ (on a 0-10 scale), $\delta$ is a discount factor for future periods, $k$ indexes the intervention, and $0$ is the counterfactual (no intervention). The sum runs over all individuals $i$ and time periods $t$. See notation key below for details.
Level-based WELLBYs (relevant for comparing interventions that affect mortality or, in some conceptual frameworks, birth rates):

$$\text{WELLBY}^{(k)} = \sum_{i} \sum_{t} \delta^{t} \, LS_{it}^{(k)}$$

Because each life-year lived contributes its full LS level to the sum, interventions that add or extend lives add terms, and the choice of zero/neutral point becomes load-bearing (see Section 6).
Notation key
- $i$ = individual (summing across people)
- $t$ = time period (summing across years)
- $LS$ = Life Satisfaction score (0-10)
- $\delta$ = discount factor for future years
- $k$ = intervention; $0$ = counterfactual
In practice, RCTs estimate $LS^{(k)} - LS^{(0)}$ directly via experimental comparison of treatment and control group outcomes.
Technical definitions (reporting function, instrument, latent distribution)
Instrument: The specific measurement tool—exact question wording, response format (0-10 vs 1-7), anchors, translation, survey mode.
Reporting function: The internal process by which a person translates their true wellbeing ($u$) into a number on the survey scale. Formally: $LS_{it} = f_i(u_{it}) + \varepsilon_{it}$. Different people may have different reporting functions—one person's "7" might correspond to another's "5" for the same underlying welfare level. This is the core of scale-use heterogeneity.
Latent distribution: The unobserved underlying welfare distribution. Since we only see reported scores, conclusions can depend on assumptions about this hidden distribution.
3. Core assumptions
Using linear WELLBYs for cross-intervention comparison requires assumptions.[B]Benjamin, Cooper, Heffetz & Kimball (2024). "From Happiness Data to Economic Conclusions." Annual Review of Economics. Articulates four key assumptions for SWB-based welfare analysis. See also Frijters & Krekel (2021), A Handbook for Wellbeing Policy-Making, and UK Green Book supplementary guidance on wellbeing (HM Treasury, 2021/2024). The claim "equal scores mean equal welfare" is stronger than most applications need—what matters for intervention comparison is whether changes in scores reflect comparable welfare changes.
Note: The Benjamin et al. framework offers limited direct focus on comparing interventions in LMICs using RCT data—the central context for our workshop. There may be a research gap here, though HLI has been doing significant applied work in this space—more discussion is needed on how their research connects to these conceptual frameworks. (Tessa Peasgood kindly shared this paper with workshop participants.)
Cardinality (linearity) ⚠ Key contested assumption
Equal intervals on the scale imply equal welfare differences: moving from 3→4 yields the same welfare gain as moving from 7→8. If violated, summing may distort comparisons.[12]Plant (2024) explores conditions under which treating LS as cardinal is defensible. See also HLI's cost-effectiveness methodology.
Many practitioners consider this the most consequential assumption—and the least obviously defensible. See the Bond & Lang critique for why scale transformations matter.
Personal note (David Reinstein): Of the four assumptions, this is the one I find least plausible and most important. If the scale is not cardinal, then summing and averaging life satisfaction scores is not meaningful, and intervention comparisons built on such sums may be unreliable.
Unit-change comparability
A one-point change has approximately the same welfare meaning across people. This is weaker than requiring equal levels to mean equal welfare.[23]Strictly, interpersonal comparability of levels (LS_A = 7 implies U_A = U_B when LS_B = 7) is not necessary for intervention comparison. If Person A experiences higher welfare at all reported scores but the differences between scores are comparable across individuals, then within-person changes can still be meaningfully compared. In other words, we need comparability of scale intervals, not of scale positions. This weaker condition is what "unit-change comparability" captures.
Why this is sufficient: For comparing intervention effects, we don't need equal levels to mean equal welfare—only that changes are comparable. If Person A is always 2 points happier than B at any objective welfare level, that doesn't bias intervention comparisons as long as a 1-point gain means the same for both.
Trade-off implication: Would you trade a gain from LS=7 to LS=8 lasting 2 years for a gain from LS=7 to LS=9 lasting 1 year? If a one-point change has the same welfare meaning at every level, these should be equivalent (2 WELLBYs each). If instead gains at higher levels feel "less valuable", the assumption is violated.
Temporal aggregation
Integrating wellbeing over time is meaningful. May fail if adaptation returns people to baseline, or if respondents reinterpret the scale over time (response shift).
Cross-domain capture (contested)
Life satisfaction aims to incorporate welfare from many domains (health, income, relationships). However, substantial evidence suggests LS does not capture everything people care about. People often knowingly choose options that yield lower life satisfaction for the sake of other valued outcomes.[21]Benjamin, Heffetz, Kimball & Rees-Jones (2012, 2014) show people choose options yielding lower LS for other things they value. Benjamin et al. (2026) "What Do People Want?" finds LS itself is relatively low on the list of things people want—and social desirability concerns lead people to understate how much they care about mental health. Neither DALYs nor life satisfaction may be adequate summary measures; multi-dimensional approaches capturing both physical disability and mental health—plus careful tradeoff elicitation—may be needed.
Time structure and discounting
Why this matters for intervention comparison: WELLBY calculations aggregate wellbeing gains over time. If Intervention A produces a 0.5-point LS boost for 2 years while Intervention B produces a 0.3-point boost for 5 years, B yields more total WELLBYs—but only if the effects actually persist as assumed. The time dimension is often the largest source of uncertainty.
Most studies measure outcomes at baseline and one or two follow-ups;[15]LMIC examples: Haushofer & Shapiro (2016, 2018) measure SWB at 9 months and 3 years after Kenya cash transfers; StrongMinds studies typically follow up at 3-6 months post-treatment; GiveDirectly studies (Kenya, Uganda) measure at 1-3 years; BRAC graduation programs measure at 2+ years. Most LMIC wellbeing RCTs have 1-2 follow-up points with limited long-term tracking. extrapolating beyond measured timepoints requires assumptions about persistence (does the effect last?) and discounting (are future wellbeing gains worth less than present ones?).
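The stakes of the persistence assumption can be made concrete with a small sketch. The effect sizes are the hypothetical ones from the paragraph above; the 30% annual decay rate is an illustrative assumption, not an empirical estimate:

```python
def total_wellbys(initial_effect, years, decay=0.0, discount=0.0):
    """Sum a per-person LS gain over `years`, with optional geometric
    decay of the effect and discounting of future wellbeing."""
    total = 0.0
    for t in range(years):
        effect_t = initial_effect * (1 - decay) ** t
        total += effect_t / (1 + discount) ** t
    return total

# No decay: B's longer duration wins (1.0 vs 1.5 WELLBYs per person).
a_flat = total_wellbys(0.5, 2)
b_flat = total_wellbys(0.3, 5)

# With 30% annual decay, the ranking flips (≈0.85 vs ≈0.83).
a_decay = total_wellbys(0.5, 2, decay=0.3)
b_decay = total_wellbys(0.3, 5, decay=0.3)
```

The same comparison can flip under plausible persistence assumptions, which is why duration extrapolation is often the largest source of uncertainty.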
Response Shift: a distinct threat
Even if baseline scale-use heterogeneity cancels out[20]"Cancels out" means: in a randomized trial, individual differences in scale use (shifters) are equally distributed between treatment and control groups, so when we compare average changes, these individual differences subtract out. This is similar to how fixed effects absorb level differences. (by randomization), treatment can change the meaning of the respondent's self-evaluation. This is called response shift: changes in internal standards, values, or conceptualization that alter how respondents answer over time.[rs]Sprangers & Schwartz (1999). "Integrating response shift into health-related quality of life research." Social Science & Medicine.
For wellbeing interventions—especially psychosocial programs that may explicitly reframe cognition—response shift is a genuine concern, not merely a theoretical possibility. If treatment changes the reporting function $f_i(\cdot)$, observed $\Delta LS$ mixes "true welfare change" with "scale change," potentially biasing the WELLBY estimate in either direction.
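A toy simulation can illustrate the concern (all numbers are assumed for illustration): if treatment raises reported scores partly through a changed reporting function, the observed ΔLS overstates the true welfare gain.

```python
import random

random.seed(0)
n = 10_000
true_gain = 0.5      # genuine welfare improvement from treatment
report_shift = 0.3   # treatment-induced change in the reporting function

control = [random.gauss(5.0, 1.0) for _ in range(n)]
# Treated reports reflect both the real gain and the scale change.
treated = [random.gauss(5.0, 1.0) + true_gain + report_shift for _ in range(n)]

# Randomization recovers the reported-score effect (~0.8), but that
# estimate mixes the 0.5 welfare gain with the 0.3 scale change.
observed_effect = sum(treated) / n - sum(control) / n
```

Note that response shift could equally run in the other direction (a negative `report_shift`), understating the true effect; randomization alone cannot separate the two components.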
4. A key critique: identification and transformations
Bond & Lang (2019) argue that with ordinal happiness data, comparing average wellbeing between groups (e.g., "are Germans happier than Americans?") is not identified without strong assumptions—monotonic transformations can reverse conclusions.[3]Bond, T.N. & Lang, K. (2019). "The Sad Truth about Happiness Scales." JPE, 127(4). However, our focus is on intervention effects, where randomization mitigates some (but not all) of these concerns—see box below.
Our focus is comparing specific interventions (e.g., "does StrongMinds produce more wellbeing per dollar than cash transfers?"), not ranking countries by average happiness. Bond & Lang's core examples involve comparing average happiness between groups (e.g., "are Germans happier than French?")—a context where the identification problem is most severe. For within-study intervention comparisons—randomized treatment vs. control, especially in LMIC RCTs—randomization ensures similar baseline scale-use distributions, so we are comparing changes within comparable groups rather than absolute levels across different populations. The identification problem resurfaces when comparing effect sizes across studies conducted in different populations (e.g., a depression intervention in Uganda vs. a cash transfer in Kenya), or when treatment and control groups differ systematically at baseline.
What "non-identified" means (technical detail)
A parameter is identified when data + assumptions pin down a unique value. With ordinal responses (someone reports "7"), we only know their underlying welfare falls in some range corresponding to "7"—we don't know the exact welfare level.
Concrete example: Suppose Person A reports "7" and Person B reports "6". We know A reported higher, but not how much higher in welfare terms. Consider two possible mappings from reported scores to actual welfare:
- Mapping 1 (linear): "6" = 60 utils, "7" = 70 utils → A is 10 utils better off
- Mapping 2 (concave): "6" = 50 utils, "7" = 52 utils → A is only 2 utils better off
Both mappings are consistent with the observed data (A reported higher than B). Without additional structure (like assuming the scale is linear), we cannot determine which mapping is correct. This is why average welfare comparisons are "non-identified"—the data alone cannot tell us the magnitude of the difference.
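A few lines of code make the point concrete (the group scores are invented for illustration): a monotonic transformation consistent with every individual ordering can still reverse which group has the higher mean.

```python
def mean(xs):
    return sum(xs) / len(xs)

group_a = [5.0, 5.0]   # mean reported score 5.00
group_b = [1.0, 8.9]   # mean reported score 4.95 -> A ahead on raw means

raw_gap = mean(group_a) - mean(group_b)   # positive: A "wins"

# Squaring preserves every individual ordering (it is monotonic on 0-10)...
sq_gap = mean([x**2 for x in group_a]) - mean([x**2 for x in group_b])
# ...but the transformed means are 25.0 vs 40.1: now B "wins".
```

Both scorings are equally consistent with the ordinal data, so the mean comparison is not identified without an assumption about the scale.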
Why this matters for intervention comparison
- Cross-study synthesis: Comparing treatment effects measured on different scales can inherit the same vulnerability
- Magnitude-sensitive cost-effectiveness: Even if both Intervention A and B show positive effects, cost-effectiveness depends on how much better. If A costs twice as much but has a 2.5× larger effect, it's more cost-effective—but only if that magnitude comparison is valid.
Transformation Sensitivity Demo
What this shows: Bond & Lang's critique hinges on the fact that we only observe ordinal responses (1, 2, 3...), not the underlying welfare. Any monotonic transformation $g(x)$ that preserves order is equally consistent with the data—but different transformations can reverse which group has higher mean welfare.
What is a monotonic transformation? A function that preserves ordering—if A>B before, then g(A)>g(B) after. Examples: squaring (x²), square root (√x), or log(x). These change the spacing between values without changing which is larger.
Below, $g(x) = x^θ$ transforms raw LS scores. When θ=1 (linear), LS is used directly. When θ>1 (convex), high scores are stretched more than low scores. When θ<1 (concave), the opposite.
Try it: In "effects" mode, move θ from 1.0 toward 2.0 and watch the ranking flip. Intervention B has a larger raw effect (2 vs 1 point), but A's effect occurs at higher LS levels. Under convex transformations (θ>1), gains at high levels are amplified—so A can dominate despite smaller raw gains.
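The demo's logic can be sketched as follows (the LS values for interventions A and B are illustrative assumptions matching the description above):

```python
def effect(before, after, theta):
    """Treatment effect after applying the monotonic transform g(x) = x**theta."""
    return after**theta - before**theta

# Intervention A: 1-point gain at high LS (8 -> 9).
# Intervention B: 2-point gain at low LS (2 -> 4).
a_raw, b_raw = effect(8, 9, 1.0), effect(2, 4, 1.0)   # 1 vs 2: B ranks first
a_cvx, b_cvx = effect(8, 9, 2.0), effect(2, 4, 2.0)   # 17 vs 12: A ranks first
```

Under the linear scale B's larger raw gain dominates; under the convex transform (θ=2) A's gain at high levels is amplified enough to flip the ranking.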
5. Scale-use heterogeneity: shifters vs. stretchers
When comparing different interventions using WELLBY estimates, a key concern is that different people (or populations) may use the life satisfaction scale differently. A useful framework is the affine model, which separates two types of scale-use differences:[11]The "shifters vs. stretchers" framework derives from Benjamin et al. (2012, 2014, 2023). See also Oswald (2008) and Kaiser & Oswald (2022) on scale-use heterogeneity. This matters especially when comparing intervention effects measured in different study populations—for example, a mental health program in Uganda vs. a cash transfer in Kenya.
$$u_{it} = a_i + b_i \, LS_{it}$$

Here $u_{it}$ is person $i$'s true welfare at time $t$, and $LS_{it}$ is their reported life satisfaction. The parameters $a_i$ (shift) and $b_i$ (stretch) capture individual scale-use patterns.
Shifters (different $a_i$)
Different intercepts: "some people always report +2 higher." Levels are not comparable, but differences are: $\Delta u = b \cdot \Delta LS$.
Stretchers (different $b_i$)
Different slopes: "some people compress the scale." Both levels AND differences are non-comparable: if person A has $b_A=0.5$ and person B has $b_B=1.5$, identical reported changes (ΔLS=2) correspond to different welfare changes ($\Delta u_A=1$ vs $\Delta u_B=3$).
Benjamin et al. propose calibration questions to identify and adjust for scale-use heterogeneity—questions designed to have the same objective answer across respondents.[4]Benjamin et al. (2023). "Adjusting for Scale-Use Heterogeneity." NBER WP 31728.
Shifter vs. Stretcher Demo
Compare two populations with different scale use. See why stretchers distort intervention comparisons.
Why fixed effects only remove shifts
Fixed effects absorb level differences (the $a_i$ terms). But if people have different $b_i$ (stretch factors), the implied welfare change per reported point differs.
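A minimal numeric sketch, using the affine model's notation with assumed parameter values, shows why demeaning handles $a_i$ but not $b_i$:

```python
# Two people with identical reported changes but different scale use
# under the affine model u = a_i + b_i * LS.
people = {
    "A": {"a": 2.0, "b": 0.5, "ls": [4.0, 6.0]},
    "B": {"a": 0.0, "b": 1.5, "ls": [4.0, 6.0]},
}

for p in people.values():
    delta_ls = p["ls"][1] - p["ls"][0]   # differencing wipes out a_i entirely
    p["delta_u"] = p["b"] * delta_ls     # but welfare change still scales with b_i

# Identical reported changes (ΔLS = 2) imply different welfare changes:
# person A gains 1.0, person B gains 3.0.
```

Fixed effects (or differencing, as here) absorb the shift term exactly, while the stretch term passes straight through into the estimated welfare change.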
6. Neutral point and mortality
When comparing interventions that affect mortality (e.g., bednets) against those that affect only wellbeing among the living (e.g., mental health programs), the choice of "zero point" becomes critical. Two different "zeros" are relevant (see HLI's "The Elephant in the Bednet" for a comprehensive treatment):
- Neutral wellbeing: the point where life is neither good nor bad
- Death as zero: treating "dead people score 0" for life-year calculations
For incremental comparisons among the living, the neutral point often cancels out—that is, when comparing ΔLS between intervention and control groups, both are measured from the same implicit baseline, so the zero point subtracts from both sides and doesn't affect the difference.[14]Mathematically: (LS_treatment − LS_0) − (LS_control − LS_0) = LS_treatment − LS_control. The neutral point LS_0 cancels. But for mortality comparisons—"WELLBYs from life extension = life-years × average wellbeing"—the origin is load-bearing.
Neutral Point / Mortality Demo
Scenario: Mortality intervention prevents a death, yielding 40 additional life-years at average LS = 5.
If neutral = 0: Benefit = 40 × 5 = 200 WELLBYs
If neutral = 2: Benefit = 40 × (5 − 2) = 120 "above-neutral" WELLBYs
When does the neutral point matter? When comparing interventions that affect mortality to those that don't. For comparisons among living people only, the neutral point typically cancels out: since both treatment and control are measured from the same baseline, subtracting that baseline from both sides leaves the difference unchanged.[14]
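The demo's arithmetic can be written as a one-line function (values taken from the scenario above):

```python
def mortality_wellbys(life_years, avg_ls, neutral_point):
    """WELLBYs credited to averting a death: life-years gained
    times LS measured above the neutral point."""
    return life_years * (avg_ls - neutral_point)

benefit_neutral_0 = mortality_wellbys(40, 5, 0)   # 200 WELLBYs
benefit_neutral_2 = mortality_wellbys(40, 5, 2)   # 120 WELLBYs
```

Shifting the neutral point from 0 to 2 cuts the credited benefit by 40%, which is why the origin is load-bearing whenever mortality enters the comparison.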
Empirical neutral point estimates (0.6 to 6.0 range)
Recent work has attempted to estimate the neutral point empirically. The estimates vary widely (0.6 to 6.0 on a 0-10 scale) depending on the elicitation method and sample. HLI is advising the UK Green Book guidelines and has compiled the following estimates:[*]Table compiled by Samuel Dupret (HLI), shared March 2026. Note that different questions elicit very different values—asking about "life no longer worth living" yields lower estimates than asking about "minimally acceptable" levels.
| Source | Value | Method | Sample |
|---|---|---|---|
| Samuelsson et al. (2023) - HLI pilot | 1.26 | Asked when life is no longer worth living on 0-10 LS scale | N=79, UK |
| Samuelsson et al. (2023) - HLI pilot | 5.30 | Asked where balance between satisfied/dissatisfied on 0-10 LS | N=128, UK |
| Peasgood et al. (unpublished) | 2.00 | Time trade-offs (QALY method rather than wellbeing scale) | N=75, UK |
| IDinsight Beneficiary Survey (2019) | 0.56 | "At what point on the ladder is it worse than dying?" | N=70, Ghana & Kenya |
| Moss (Rethink Priorities, unpublished) | 2.49 | Asked level preferring alive to dead (converted 0-100 → 0-10) | N=35, likely UK |
| Moss (Rethink Priorities, unpublished) | 6.05 | Asked minimally acceptable level to live an extra year | N=101, likely UK |
| Jamison et al. (forthcoming) | 2.39 | Policy comparison: saving people from dying (0-100 → 0-10) | N=1800 (Brazil, China, UK) |
| Jamison et al. (forthcoming) | 2.54 | Policy comparison: saving people from non-existence (0-100 → 0-10) | N=1800 (Brazil, China, UK) |
Key observations:
- Question framing matters enormously: "Life no longer worth living" (1.26) vs "minimally acceptable" (6.05) yields 5× difference
- The only LMIC estimate (IDinsight, Ghana/Kenya) is the lowest at 0.56—though small sample and different question framing
- Jamison et al. provides the largest cross-country sample (N=1800 across Brazil, China, UK) with estimates around 2.4-2.5
- HM Treasury (UK Green Book) currently uses 1.0 (based on suicide rates); Frijters & Krekel recommend 2.0
7. Evidence and alternatives
Reliability: noisy but not useless
Single-item life evaluations have test-retest correlations around 0.5-0.7 over short windows.[5]Krueger & Schkade (2008). "The reliability of subjective well-being measures." This means measurement error attenuates estimated effects—small real effects may be undervalued. For relative comparisons: if measurement error is similar across interventions and contexts, attenuation affects absolute magnitudes but may preserve relative rankings. However, if noise levels differ (e.g., different survey modes, cultural contexts, or outcome domains), relative comparisons become less reliable.
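The attenuation logic follows the standard errors-in-variables result (the reliabilities and effect sizes below are illustrative assumptions): observed standardized effects shrink by roughly the square root of the outcome's reliability, so equal noise preserves rankings while unequal noise can flip them.

```python
import math

def attenuated_effect(true_effect_sd, reliability):
    """Observed standardized effect when the outcome's test-retest
    reliability (share of observed variance that is signal) is given."""
    return true_effect_sd * math.sqrt(reliability)

# Same reliability for both interventions: ranking preserved.
a_obs = attenuated_effect(0.30, 0.6)   # ~0.23
b_obs = attenuated_effect(0.20, 0.6)   # ~0.15

# Different reliabilities (e.g., different survey modes): ranking flips.
a_noisy = attenuated_effect(0.30, 0.3)   # ~0.16
b_clean = attenuated_effect(0.20, 0.9)   # ~0.19
```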
Predictive validity
What this means: "Predictive validity" asks whether LS scores relate to real-world behaviors and outcomes in expected ways. If LS contains meaningful information about welfare, lower scores should predict people making costly changes (moving, quitting jobs, leaving relationships) to improve their situation.
Kaiser & Oswald show that single numeric feelings responses do predict these consequential outcomes—relationships tend to be replicable and approximately linear.[6]Kaiser & Oswald (2022). "The Scientific Value of Numerical Measures of Human Feelings." PNAS. See also Plant (2024) for additional evidence that LS measures predict consequential outcomes and respond to changes (like income transfers) we expect to matter. This is notable because a single "made-up feelings integer" turns out to have more predictive power for subsequent costly actions (e.g., moving house, leaving a partner, quitting a job, visiting a hospital) than a collection of standard socioeconomic variables including income, employment, and education.[27]Kaiser & Oswald (2022), Table 2: a single feelings integer outperforms combined socioeconomic predictors for "get-me-out-of-here" actions across UK, German, and Australian panel data. For example, moving from LS=3 to LS=7 is associated with roughly halving the probability of leaving one's neighbourhood in the subsequent period. This suggests LS captures something meaningful, though it doesn't directly prove magnitudes are comparable across people.
Note: Predictive validity is necessary but not sufficient for using WELLBYs. A measure could predict outcomes yet still be noisy, inconsistent across groups, or respond to treatments in ways that don't reflect welfare gains. It establishes that LS isn't pure noise—not that it's ready for precise cross-intervention comparison.
Response times as identification: Liu & Netzer (2023) propose using survey response times to help solve identification problems in ordered response models.[28]Liu, S. & Netzer, N. (2023). "Happy Times: Measuring Happiness Using Response Times." AER, 113(12), 3289-3322. The intuition: if two people both report "7" but one answers instantly while another deliberates, the fast responder is likely further from the boundary between "7" and adjacent categories—their latent happiness is more clearly in the "7" range. This "chronometric effect" provides distributional information that the response alone cannot. The counterintuitive insight is that how quickly someone answers a happiness question contains information about the distribution of their underlying wellbeing—not just their point estimate. Their evidence is mixed: conventional distributional assumptions are rejected in some cases but broadly supported overall.
LMIC evidence: GiveDirectly cash transfers
Kenya RCT (Haushofer & Shapiro)
- Short-run (~9 months): Life satisfaction +0.17 SD, happiness +0.16 SD (WVS measures)
- Long-run (~3 years): Life satisfaction +0.08 SD (statistically significant); psychological wellbeing index +0.16 SD
These findings show LS measures have detectable signal in LMIC RCTs, not just noise—and effects can persist.[7]Haushofer & Shapiro (2016). "The Short-term Impact of Unconditional Cash Transfers." QJE. Also 2018 long-term follow-up working paper.
More evidence: See HLI's comprehensive GiveDirectly summary for additional studies and meta-analytic estimates of cash transfer effects on wellbeing.
The outcome translation problem
Many LMIC mental health studies report depression scales or symptom indices, not standard 0-10 life satisfaction. GiveWell's assessment of StrongMinds explicitly highlights uncertainty in translating depression improvements into life satisfaction gains.[gw]GiveWell (2023). "Our Assessment of Happier Lives Institute's Cost-Effectiveness Analysis of StrongMinds." Notes substantial uncertainty in depression-to-LS mapping.
Even if you accept the WELLBY as the target unit, outcome translation forces a choice:
- Use DALYs/QALYs (more standard in health evaluation), even if they miss non-health welfare
- Use life satisfaction directly, but only where trials collect it
- Use mapping models, statistical relationships estimated from observational data that predict life satisfaction from other measures (e.g., "a 1-SD reduction in PHQ-9 depression score corresponds to ~0.4 points higher LS"), carrying mapping uncertainty explicitly
Comparison with alternatives
8. WELLBY Calculator
Incremental WELLBY Estimate
Enter treatment effect, duration, recipients, and cost to estimate total WELLBYs and cost-effectiveness.
This calculator assumes constant effect size. Real applications should account for effect decay, discounting, and uncertainty.
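A sketch of the calculator's logic (the function name and all parameter values are hypothetical), extended with the decay and discounting the caveat mentions:

```python
def wellby_cea(effect_points, duration_years, recipients, cost_usd,
               annual_decay=0.0, discount_rate=0.0):
    """Total WELLBYs (per-person LS gain summed over whole years, with
    optional effect decay and discounting) and cost per WELLBY."""
    per_person = sum(
        effect_points * (1 - annual_decay) ** t / (1 + discount_rate) ** t
        for t in range(int(duration_years))
    )
    total = per_person * recipients
    return total, cost_usd / total

# Hypothetical program: 0.4-point gain, 3 years, 1,000 recipients, $50,000.
# Constant effect: 0.4 * 3 * 1000 = 1,200 WELLBYs, ~$42 per WELLBY.
total, cost_per = wellby_cea(0.4, 3, 1000, 50_000)
```

Setting `annual_decay` or `discount_rate` above zero shows how quickly the headline cost-effectiveness figure moves with assumptions the underlying trials rarely pin down.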
9. Considerations for discussion
The following are practices and considerations found in the literature—not workshop conclusions or endorsements. A key goal of the workshop is to assess which of these, if any, represent genuine improvements over current practice, and what their practical limitations are.
What are funders currently doing?
Some funders and evaluators (e.g., HLI, Founders Pledge) report using approaches such as:
- Sensitivity analyses across different neutral points and scale-comparability assumptions
- Reporting how intervention rankings change under alternative assumptions
- Combining WELLBY estimates with other evidence types (e.g., objective indicators, expert judgment)
Whether these practices adequately address the measurement concerns discussed above is itself a question for discussion.
What methodological improvements have been proposed?
Researchers have proposed various approaches to strengthen WELLBY-based comparisons. Each involves trade-offs between feasibility and rigor:
- Standardized life satisfaction questions (e.g., the OECD recommended module, updated 2025) for cross-study comparability
- Long-term follow-up to capture duration effects and potential hedonic adaptation
- Calibration questions or vignettes when comparing across populations (Benjamin et al., 2023)
- Pre-registration of analysis plans including functional form choices
Under what conditions might WELLBY comparisons be more reliable?
Several conditions have been suggested in the literature as potentially supporting stronger inference, though each carries its own caveats:
- Large effect sizes that overwhelm measurement error
- Within-person designs where each person serves as their own control[19]Caveat: Within-person designs can introduce demand effects—participants may feel motivated to report improvement to please researchers. This is a particular concern for psychosocial interventions with non-blinded delivery.
- Multiple wellbeing measures showing the same directional effect
- Triangulation with behavioral or objective outcomes
10. Open questions (research agenda)
The following questions have been identified by researchers in this area (including workshop participants and authors cited above) as areas where additional evidence could meaningfully improve the reliability of WELLBY-based comparisons:
- Neutral point estimation: What is the actual neutral point on the 0-10 scale for different populations? How stable is it across contexts?[16]Limited LMIC-specific research exists on neutral point estimation. Peasgood et al. (2018) provide estimates for UK populations; whether these generalize to LMIC contexts is unclear.
- Scale-use heterogeneity mapping: How do shifters vs. stretchers vary across LMIC populations, and can we predict which matters more?
- Cheap calibration methods: Can vignettes, anchoring questions, or other calibration approaches work in low-resource settings without excessive burden?
- WELLBY-DALY relationship: What's the mapping between WELLBYs and DALYs, and is it linear? How much does it vary by health condition?
- Demand effects and response shift: How do experimenter demand effects and response shift vary by intervention type?
11. Workshop prompts
Neutral prompts for workshop deliberation:
1. For which classes of intervention comparisons (same setting/instrument vs. cross-study) does the linear WELLBY seem most defensible, and why?
2. Which assumptions are most likely to be materially violated in LMIC contexts: linearity, intertemporal comparability, interpersonal comparability, or scale-use heterogeneity?
3. When does the neutral point become decision-relevant? Which "zero" do you have in mind?
4. How should analysts treat "mapping" between depression scales and life satisfaction when LS isn't measured? What minimum evidence would make a mapping credible?
5. Which low-burden calibration approaches seem most promising for LMIC settings?
6. How should the choice between DALYs, QALYs, and WELLBYs depend on the type of intervention being evaluated? Are there cases where one metric is clearly more appropriate?
7. Given current evidence and uncertainty, what would you want to see change in how funders use WELLBY estimates? What would make you more (or less) confident in their use for cross-intervention comparison?
References
- OECD (2013; 2025 update). Guidelines on Measuring Subjective Well-being. The 2025 update incorporates a decade of measurement experience and concludes that SWB data are meaningful for policy despite critiques, while recommending improved harmonization.
- Bond, T.N. & Lang, K. (2019). "The Sad Truth about Happiness Scales." Journal of Political Economy, 127(4).
- Benjamin, D.J. et al. (2023). "Adjusting for Scale-Use Heterogeneity." NBER WP 31728.
- Krueger, A.B. & Schkade, D.A. (2008). "The reliability of subjective well-being measures." Journal of Public Economics.
- Kaiser, C. & Oswald, A.J. (2022). "The Scientific Value of Numerical Measures of Human Feelings." PNAS, 119(42).
- Haushofer, J. & Shapiro, J. (2016). "The Short-term Impact of Unconditional Cash Transfers." QJE. See also 2018 long-term follow-up.
- HM Treasury (2021/2024). Wellbeing Guidance for Appraisal: Supplementary Green Book Guidance.
- Frijters, P., Clark, A.E., Krekel, C. & Layard, R. (2020). "A Happy Choice: Wellbeing as the Goal of Government." Behavioural Public Policy, 4(2). Introduces the WELLBY framework.
- GiveWell (2023). "Our Assessment of Happier Lives Institute's Cost-Effectiveness Analysis of StrongMinds."
- Benjamin, D.J., Heffetz, O., Kimball, M.S. & Szembrot, N. (2014). "Beyond Happiness and Satisfaction: Toward Well-Being Indices Based on Stated Preference." AER, 104(9). The shifters/stretchers framework is elaborated in Benjamin et al. (2023) NBER WP 31728.
- Plant, M. (2024). "A Happy Possibility About Happiness (And Other Subjective) Scales: An Investigation and Tentative Defence of the Cardinality Thesis." Wellbeing Research Centre working paper.
- McGuire, J. et al. (2024). "The Wellbeing Cost-Effectiveness of StrongMinds and Friendship Bench." Happier Lives Institute Report. Note: HLI's scope is global, not limited to LMICs—they evaluate interventions worldwide. See also their GiveDirectly analysis, response to GiveWell, and other CEAs.
- When comparing intervention effects for living people, the neutral point cancels algebraically: (LS_treatment − LS₀) − (LS_control − LS₀) = LS_treatment − LS_control.
- LMIC study examples: Haushofer & Shapiro (2016, 2018) measured SWB at 9 months and 3 years post cash transfers in Kenya; StrongMinds evaluations typically follow up at 3-6 months.
- Limited LMIC-specific research exists on neutral point estimation. Peasgood et al. (2018) provide estimates for UK populations; whether these generalize to LMIC contexts is unclear.
- DALYs (Disability-Adjusted Life Years) = Years of Life Lost (YLL) + Years Lived with Disability (YLD). Disability weights are derived from Global Burden of Disease (GBD) population surveys where respondents rate hypothetical health states—not from the affected individual's own experience. WELLBYs, by contrast, ask each individual to rate their own life satisfaction directly.
- "Incremental" vs "level-based" is descriptive language used in this document to distinguish two common uses of WELLBYs. The academic literature typically just refers to "WELLBYs" with context clarifying whether the focus is on intervention effects (changes) or absolute welfare levels (for mortality calculations).
- Within-person designs reduce scale-use heterogeneity across individuals but may introduce demand effects—participants may feel motivated to report improvement to please researchers, especially for psychosocial interventions with non-blinded delivery.
- "Cancels out" in randomization context: if scale-use differences (shifters and stretchers) are randomly distributed between treatment and control groups, within-study comparisons remain valid—the average effect estimate is unbiased. The problem arises when (a) comparing across studies with different scale-use distributions, or (b) treatment itself changes the reporting function (response shift, discussed above), which randomization cannot address.
- Benjamin, D.J., Heffetz, O., Kimball, M.S. & Rees-Jones, A. (2012). "What Do You Think Would Make You Happier?" JEP; Benjamin et al. (2014) "Beyond Happiness and Satisfaction" AER. Benjamin, Cooper, Heffetz, Kimball & Kundu (2026). "What Do People Want?" (working paper) finds life satisfaction itself ranks relatively low among things people want. Importantly, social desirability concerns lead people to understate how much they care about mental health.
- Scale-adjusted SWB: Subjective wellbeing scores adjusted for scale-use heterogeneity—different people use the 0-10 scale differently ("shifters" use consistently higher/lower values; "stretchers" use more/less of the scale's range). Methods include vignette anchoring and statistical corrections. See Benjamin et al. (2023) "Adjusting for Scale-Use Heterogeneity" NBER WP 31728.
- Interpersonal comparability of levels (LS_A = 7 implies U_A = U_B when LS_B = 7) is not necessary for intervention comparison if we can instead assume that differences are equivalent across people. If Person A experiences higher welfare at all reported scores but the differences between scores are comparable, then within-person changes can still be meaningfully compared across individuals. This weaker condition—comparability of scale intervals rather than scale positions—is what "unit-change comparability" captures.
- Kaiser & Oswald (2022), Table 2: a single "made-up feelings integer" outperforms combined socioeconomic predictors (income, employment, education, children, homeownership) for predicting subsequent "get-me-out-of-here" actions—moving house, leaving a partner, quitting a job, visiting a hospital—across UK (BHPS), German (SOEP), and Australian (HILDA) panel data.
- Liu, S. & Netzer, N. (2023). "Happy Times: Measuring Happiness Using Response Times." American Economic Review, 113(12), 3289-3322. Response times provide information about the distribution of latent wellbeing through a "chronometric effect": people who are far from the boundary between adjacent response categories answer faster.
- The DALY framework assumes that a disability weight of, say, 0.5 means living one year in that state is equivalent to losing half a year of life. This equivalence is built into the metric by construction, not empirically verified. For mental health conditions, where subjective experience may diverge from external hypothetical ratings, this assumption is particularly uncertain.
- QALYs (Quality-Adjusted Life Years) = years of life x health utility weight (0-1 scale). Unlike DALYs, QALYs often derive weights from patient self-assessments via instruments like EQ-5D, making them somewhat closer to self-report. However, they remain anchored to health-state descriptions rather than overall life satisfaction, and may not capture mental health conditions well. See NICE technology appraisal guidance for institutional usage.
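The DALY, QALY, and WELLBY definitions in the notes above can be contrasted with a minimal worked example (all values hypothetical, chosen only to show how each metric is assembled):

```python
# Worked arithmetic (hypothetical values) contrasting the three metrics.

# DALY: burden = years of life lost (YLL) + years lived with disability (YLD),
# where YLD weights come from GBD-style surveys of hypothetical health states.
yll = 0.0                      # no premature death in this example
yld = 2.0 * 0.40               # 2 years with a condition, disability weight 0.40
dalys_averted = yll + yld      # DALYs averted if the condition is fully treated

# QALY: years of life x health utility weight (0-1, e.g. from EQ-5D self-report).
qalys_gained = 2.0 * (0.75 - 0.50)   # utility rises from 0.50 to 0.75 for 2 years

# WELLBY: 1 life-satisfaction point (0-10 scale) x 1 person x 1 year.
wellbys_gained = 2.0 * 1.5           # LS rises by 1.5 points for 2 years

print(dalys_averted, qalys_gained, wellbys_gained)  # 0.8 0.5 3.0
```

The point of the contrast is only that the three numbers are built from different primitives (externally rated disability weights, health-state utilities, and self-reported life satisfaction), so converting between them requires the kind of mapping evidence discussed in the open questions above.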
Related Analysis
For discussion of how to convert between DALYs/QALYs and WELLBYs:
DALY/QALY ↔ WELLBY Conversion → (Note: AI-generated draft requiring verification)