Presenter: Matt Lerner (Founders Pledge) · Discussant: Caspar Kaiser (U of Warwick)
Lerner presents the practitioner perspective on PQ1 (WELLBY reliability for funding decisions). Kaiser discusses key barriers to WELLBY adoption: comparability, linearity, the neutral-point problem, and whether WELLBYs capture the right concepts. He previews Benjamin et al.'s work on scale-use heterogeneity, presented in detail after the break.
Focal Question (WELL_01)
What combination of (a) subjective wellbeing survey data, (b) income and health-outcome data, (c) metrics based on these data (e.g., linear or logarithmic WELLBYs, standard deviations, scale-use adjustments), and (d) possible conversions between different measures would be "best" for making funding choices among interventions that may impact mental health, physical health, and/or consumption?
Overview
This open discussion segment addresses the core reliability question: given what we know about scale-use heterogeneity[1] and measurement challenges, is the linear WELLBY[2] measure reliable enough for comparing interventions across mental health, physical health, and consumption domains?
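The cardinality assumption behind the linear WELLBY can be made concrete with a toy comparison (all numbers invented for illustration, not from the session materials): two interventions that a linear reading of the 0–10 scale ranks one way can flip once a concave utility transform is applied, because gains at the bottom of the scale then count for more.

```python
import math

# Hypothetical illustration: Intervention A moves 100 people from 7 to 8
# (top of the scale); Intervention B moves 90 people from 3 to 4 (bottom).
A = [(7, 8)] * 100
B = [(3, 4)] * 90

def wellbys(moves, u=lambda s: s):
    """Sum of per-person welfare gains for one year, under utility function u."""
    return sum(u(after) - u(before) for before, after in moves)

# Linear reading: every one-point move is worth the same.
linear_A, linear_B = wellbys(A), wellbys(B)    # 100 vs. 90 -> A ranks first

# One concave alternative (an assumed functional form, for illustration only):
log_u = lambda s: math.log(1 + s)
concave_A = wellbys(A, log_u)                  # ~11.8
concave_B = wellbys(B, log_u)                  # ~20.1 -> B ranks first
```

The flip is the point at issue: if equal intervals do not hold at the scale extremes, a funding ranking built on linear WELLBYs is sensitive to an assumption the data alone cannot verify.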
Discussion Prompts
- What are the strongest arguments for and against using linear WELLBYs?
- In which contexts does the linear WELLBY perform well vs. poorly?[3]
- What calibration or adjustment approaches seem most promising?[4]
- How much precision do we lose by using simple approaches?
Relevant Pivotal Questions
This discussion directly addresses several of our Pivotal Questions:
- WELL_01: What combination of SWB data, metrics, and conversions would be "best" for funding choices?
- WELL_01a: How reliable is the linear WELLBY for cross-intervention comparison?
- WELL_04: How much does scale-use heterogeneity bias WELLBY comparisons?
- WELL_07: What calibration approaches are most promising for LMIC contexts?
Institutional Context
- GiveWell (2023): Conducted an analysis of StrongMinds using WELLBYs.[5]
- IDinsight (2025): GiveWell-funded research on beneficiary preference trade-offs.[6]
Background: Linear WELLBY Analysis
This document maps the key issues we'll discuss: cardinality assumptions, Bond & Lang's identification critique, scale-use heterogeneity (shifters vs. stretchers), and what calibration methods can and can't fix.
View Analysis – AI-assisted draft (Mar 2026); annotate errors directly.
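The shifters-vs-stretchers distinction mentioned above can be sketched with a toy reporting model (an assumed functional form, not from the background document): respondent reports are an affine transform of true wellbeing, where the additive term is a "shifter" and the multiplicative term a "stretcher". The sketch also shows why within-person designs help with one problem but not the other.

```python
# Toy scale-use model (assumed for illustration): reported score
# r = a + b * true, clipped to the 0-10 scale; a is a "shifter",
# b is a "stretcher".
def report(true, a, b):
    return max(0.0, min(10.0, a + b * true))

# Two respondents with IDENTICAL true wellbeing (5.0) but different scale use:
neutral = dict(a=0.0, b=1.0)
shifted = dict(a=1.0, b=0.7)   # shifts the scale up and compresses it

# Cross-sectional comparison shows a spurious 0.5-point gap:
gap = report(5.0, **neutral) - report(5.0, **shifted)   # 5.0 - 4.5 = 0.5

# Within-person change after a true +1.0 improvement:
d_neutral = report(6.0, **neutral) - report(5.0, **neutral)   # 1.0
d_shifted = report(6.0, **shifted) - report(5.0, **shifted)   # 0.7
# Differencing cancels the shifter (a drops out of the change)
# but not the stretcher (the change is still scaled by b).
```

This is one reason longitudinal designs mitigate level differences in scale use while leaving slope differences, and hence some calibration burden, in place.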
Notes
1. Scale-use heterogeneity: different individuals interpret and use the 0–10 life satisfaction scale differently. Benjamin et al. (2023) estimate this can bias cross-group comparisons by 30–50%.
2. "Linear" WELLBY assumes equal intervals: moving from 3→4 equals the same welfare gain as 7→8. This cardinality assumption enables summing across people, but may not hold at scale extremes.
3. Linear WELLBY tends to work better when: (1) comparing similar populations, (2) effect sizes are large relative to measurement noise, and (3) within-person longitudinal designs are used.
4. Options include vignette anchoring, calibration questions, multi-item scales, and experience sampling. Trade-offs involve respondent burden vs. precision. See Benjamin et al. for an empirical comparison.
5. GiveWell's StrongMinds analysis explored valuing mental health benefits via WELLBYs rather than income-equivalents. They concluded SWB "deserves more study" but did not adopt WELLBYs as their primary metric. Per their moral weights page: "we believe that subjective well-being deserves more study."
6. IDinsight's research uses stated preference surveys to understand how beneficiaries weigh different outcomes (income, health, life satisfaction). This provides an alternative approach to comparing welfare across domains.