Presenter: Matt Lerner (Founders Pledge) · Discussant: Caspar Kaiser (U of Warwick)
Lerner presents the practitioner perspective on PQ1 (WELLBY reliability for funding decisions). Kaiser discusses key barriers to WELLBY adoption: comparability, linearity, the neutral-point problem, and whether WELLBYs capture the right concepts. He previews Benjamin et al. on scale-use heterogeneity, which is presented in detail after the break.
Focal Question (WELL_01)
What combination of (a) subjective wellbeing survey data, (b) income and health-outcome data, (c) metrics based on this data (e.g., linear or logarithmic WELLBYs, standard deviations, scale-use adjustments), and (d) possible conversions between different measures would be "best" for making funding choices between interventions which may impact mental health, physical health, and/or consumption?
Overview
This open discussion segment addresses the core reliability question: given what we know about scale-use heterogeneity[1] and measurement challenges, is the linear WELLBY[2] measure reliable enough for comparing interventions across mental health, physical health, and consumption domains?
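The cardinality assumption behind the linear WELLBY can be made concrete in a short sketch (all effect sizes and group sizes here are hypothetical, not drawn from any intervention in the discussion): a WELLBY is one point of life satisfaction on the 0–10 scale sustained for one person-year, so the linear metric sums Δ-satisfaction × person-years, whereas a log-style alternative weights the same one-point gain differently depending on where on the scale it occurs.

```python
# Illustrative only: hypothetical effect sizes, not real intervention data.
import math

def linear_wellbys(baseline: float, endline: float, years: float, n_people: int) -> float:
    """Linear WELLBY: one point on the 0-10 scale for one person-year.
    Assumes equal intervals, so a 3->4 move counts the same as 7->8."""
    return (endline - baseline) * years * n_people

def log_wellbys(baseline: float, endline: float, years: float, n_people: int) -> float:
    """One possible non-linear alternative (log with a +1 offset to handle 0):
    gains near the bottom of the scale count for more than gains near the top."""
    return (math.log(endline + 1) - math.log(baseline + 1)) * years * n_people

# Two hypothetical interventions, each moving 100 people one point for two years:
low_base = linear_wellbys(3, 4, years=2, n_people=100)   # gains at the low end
high_base = linear_wellbys(7, 8, years=2, n_people=100)  # gains at the high end

print(low_base == high_base)                                   # True: linear metric ties them
print(log_wellbys(3, 4, 2, 100) > log_wellbys(7, 8, 2, 100))   # True: log metric prefers the low end
```

Under the linear metric the two interventions tie exactly; under the log transform the low-baseline gain dominates. That reversal is why the choice of curvature, not just the quality of the survey data, can reorder funding priorities.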
Discussion Prompts
- What are the strongest arguments for and against using linear WELLBYs?
- In which contexts does the linear WELLBY perform well vs. poorly?[3]
- What calibration or adjustment approaches seem most promising?[4]
- How much precision do we lose by using simple approaches?
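One of the calibration options raised in the prompts, vignette anchoring, can be sketched in a few lines (the scores and reference value are hypothetical): each respondent also rates the life of a fixed hypothetical person, and their self-rating is re-expressed relative to that common anchor. Real implementations use ordered-response models estimated across several vignettes; this linear version only illustrates the shift correction.

```python
# Toy anchoring-vignette adjustment; all scores and the reference value are hypothetical.
def anchored_score(self_rating: float, vignette_rating: float, vignette_ref: float = 5.0) -> float:
    """Re-express a self-rating relative to how the respondent scored a fixed vignette.
    Respondents who rate the common vignette high have their own ratings deflated."""
    return self_rating - (vignette_rating - vignette_ref)

# Two respondents with the same raw self-rating but different scale use:
a = anchored_score(self_rating=7, vignette_rating=6)  # rates everything high -> adjusted 6.0
b = anchored_score(self_rating=7, vignette_rating=4)  # rates everything low  -> adjusted 8.0
print(a, b)
```

Note that a single-vignette shift correction like this cannot detect a respondent who stretches the scale rather than shifting it; designs with multiple vignettes are needed to estimate both parameters, at the cost of extra respondent burden.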
Relevant Pivotal Questions
This discussion directly addresses several of our Pivotal Questions:
- WELL_01: What combination of SWB data, metrics, and conversions would be "best" for funding choices?
- WELL_01a: How reliable is the linear WELLBY for cross-intervention comparison?
- WELL_04: How much does scale-use heterogeneity bias WELLBY comparisons?
- WELL_07: What calibration approaches are most promising for LMIC contexts?
Institutional Context
- GiveWell (2023): Conducted an analysis of StrongMinds using WELLBYs.[5]
- IDinsight (2025): GiveWell-funded research on beneficiary preference trade-offs.[6]
📄 Background: Linear WELLBY Analysis
This document maps the key issues we'll discuss: cardinality assumptions, Bond & Lang's identification critique, scale-use heterogeneity (shifters vs. stretchers), and what calibration methods can and can't fix.
View Analysis → (AI-assisted draft, Mar 2026; annotate errors directly)
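The shifter-vs-stretcher distinction the background document raises can be illustrated with a toy reporting model (the parameters are hypothetical, not estimates from any study): each respondent maps the same latent wellbeing onto the 0–10 scale via reported = shift + stretch × latent, where a "shifter" moves the whole scale up or down and a "stretcher" spreads responses more or less widely.

```python
# Toy reporting-function model; parameters are illustrative, not estimated.
def reported_score(latent: float, shift: float = 0.0, stretch: float = 1.0) -> float:
    """Map latent wellbeing onto the 0-10 scale; clip at the endpoints."""
    return min(10.0, max(0.0, shift + stretch * latent))

latent = 6.0  # identical underlying wellbeing for all three respondents
neutral   = reported_score(latent)                # 6.0
shifter   = reported_score(latent, shift=1.0)     # 7.0: uses the whole scale one point higher
stretcher = reported_score(latent, stretch=1.3)   # ~7.8: spreads responses more widely

# A naive cross-group comparison reads the gap between `neutral` and
# `stretcher` as a welfare difference, though latent wellbeing is identical.
print(neutral, shifter, stretcher)
```

This also suggests why study design matters for the prompts above: within-person differencing removes a fixed shift (it cancels when subtracting a respondent's own baseline) but not a stretch, which continues to scale measured changes.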
📚 Related Evidence: Unjournal Evaluations
These independently evaluated papers bear on WELLBY reliability questions—particularly effect duration, cross-metric comparison, and scale-up concerns:
- Long-Run Effects of Psychotherapy — Effect durability is often the largest uncertainty in WELLBY calculations
- Cash Transfers vs Psychotherapy (Liberia) — Direct cross-metric comparison in LMIC context
- StrongMinds & Friendship Bench — Evaluators assess HLI's WELLBY methodology
Notes
1. Scale-use heterogeneity: different individuals interpret and use the 0–10 life satisfaction scale differently. Benjamin et al. (2023) show this can substantially bias cross-group comparisons; the magnitude depends on context and comparison groups.
2. A "linear" WELLBY assumes equal intervals: moving from 3→4 equals the same welfare gain as 7→8. This cardinality assumption enables summing across people, but may not hold at the scale extremes.
3. Possible factors to consider: population similarity, effect size relative to measurement noise, within-person vs. cross-sectional designs. What does the evidence say?
4. Options include: vignette anchoring, calibration questions, multi-item scales, experience sampling. Trade-offs involve respondent burden vs. precision. See Benjamin et al. for an empirical comparison.
5. GiveWell's StrongMinds analysis explored valuing mental health benefits via WELLBYs rather than income-equivalents. They concluded, per their moral weights page, that "subjective well-being deserves more study," but did not adopt WELLBYs as their primary metric.
6. IDinsight's research uses stated-preference surveys to understand how beneficiaries weigh different outcomes (income, health, life satisfaction). This provides an alternative approach to comparing welfare across domains.