Presenter: Matt Lerner (Founders Pledge) · Discussant: Caspar Kaiser (U of Warwick)
Lerner presents the practitioner perspective on PQ1 (WELLBY reliability for funding decisions). Kaiser discusses key barriers to WELLBY adoption: comparability, linearity, the neutral-point problem, and whether WELLBYs capture the right concepts. He previews Benjamin et al.'s work on scale-use heterogeneity, presented in detail after the break.
Focal Question (WELL_01)
What combination of (a) subjective wellbeing survey data, (b) income and health-outcome data, (c) metrics based on these data (e.g., linear or logarithmic WELLBYs, standard deviations, scale-use adjustments), and (d) possible conversions between different measures would be "best" for making funding choices among interventions that may impact mental health, physical health, and/or consumption?
Overview
This open discussion segment addresses the core reliability question: given what we know about scale-use heterogeneity[1] and measurement challenges, is the linear WELLBY[2] measure reliable enough for comparing interventions across mental health, physical health, and consumption domains?
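The cardinality assumption behind the linear WELLBY can be made concrete with a toy comparison (all numbers invented for illustration, not from the session materials): two interventions that a linear reading of the 0–10 scale ranks one way can flip once a concave utility transform is applied, because gains at the bottom of the scale then count for more.

```python
import math

# Hypothetical illustration: Intervention A moves 100 people from 7 to 8
# (top of the scale); Intervention B moves 90 people from 3 to 4 (bottom).
A = [(7, 8)] * 100
B = [(3, 4)] * 90

def wellbys(moves, u=lambda s: s):
    """Sum of per-person welfare gains for one year, under utility function u."""
    return sum(u(after) - u(before) for before, after in moves)

# Linear reading: every one-point move is worth the same.
linear_A, linear_B = wellbys(A), wellbys(B)    # 100 vs. 90 -> A ranks first

# One concave alternative (an assumed functional form, for illustration only):
log_u = lambda s: math.log(1 + s)
concave_A = wellbys(A, log_u)                  # ~11.8
concave_B = wellbys(B, log_u)                  # ~20.1 -> B ranks first
```

The flip is the point at issue: if equal intervals do not hold at the scale extremes, a funding ranking built on linear WELLBYs is sensitive to an assumption the data alone cannot verify.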
Discussion Prompts
- What are the strongest arguments for and against using linear WELLBYs?
- In which contexts does the linear WELLBY perform well vs. poorly?[3]
- What calibration or adjustment approaches seem most promising?[4]
- How much precision do we lose by using simple approaches?
Relevant Pivotal Questions
This discussion directly addresses several of our Pivotal Questions:
- WELL_01: What combination of SWB data, metrics, and conversions would be "best" for funding choices?
- WELL_01a: How reliable is the linear WELLBY for cross-intervention comparison?
- WELL_04: How much does scale-use heterogeneity bias WELLBY comparisons?
- WELL_07: What calibration approaches are most promising for LMIC contexts?
Institutional Context
- GiveWell (2023): Conducted an analysis of StrongMinds using WELLBYs.[5]
- IDinsight (2025): GiveWell-funded research on beneficiary preference trade-offs.[6]
Background: Linear WELLBY Analysis
This document maps the key issues we'll discuss: cardinality assumptions, Bond & Lang's identification critique, scale-use heterogeneity (shifters vs. stretchers), and what calibration methods can and can't fix.
View Analysis – AI-assisted draft (Mar 2026); annotate errors directly.
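The shifters-vs-stretchers distinction mentioned above can be sketched with a toy reporting model (an assumed functional form, not from the background document): respondent reports are an affine transform of true wellbeing, where the additive term is a "shifter" and the multiplicative term a "stretcher". The sketch also shows why within-person designs help with one problem but not the other.

```python
# Toy scale-use model (assumed for illustration): reported score
# r = a + b * true, clipped to the 0-10 scale; a is a "shifter",
# b is a "stretcher".
def report(true, a, b):
    return max(0.0, min(10.0, a + b * true))

# Two respondents with IDENTICAL true wellbeing (5.0) but different scale use:
neutral = dict(a=0.0, b=1.0)
shifted = dict(a=1.0, b=0.7)   # shifts the scale up and compresses it

# Cross-sectional comparison shows a spurious 0.5-point gap:
gap = report(5.0, **neutral) - report(5.0, **shifted)   # 5.0 - 4.5 = 0.5

# Within-person change after a true +1.0 improvement:
d_neutral = report(6.0, **neutral) - report(5.0, **neutral)   # 1.0
d_shifted = report(6.0, **shifted) - report(5.0, **shifted)   # 0.7
# Differencing cancels the shifter (a drops out of the change)
# but not the stretcher (the change is still scaled by b).
```

This is one reason longitudinal designs mitigate level differences in scale use while leaving slope differences, and hence some calibration burden, in place.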
Notes
1. Scale-use heterogeneity: different individuals interpret and use the 0–10 life satisfaction scale differently. Benjamin et al. (2023) estimate this can bias cross-group comparisons by 30–50%.
2. "Linear" WELLBY assumes equal intervals: moving from 3→4 equals the same welfare gain as 7→8. This cardinality assumption enables summing across people, but may not hold at scale extremes.
3. Linear WELLBY tends to work better when: (1) comparing similar populations, (2) effect sizes are large relative to measurement noise, and (3) within-person longitudinal designs are used.
4. Options include vignette anchoring, calibration questions, multi-item scales, and experience sampling. Trade-offs involve respondent burden vs. precision. See Benjamin et al. for an empirical comparison.
5. GiveWell's StrongMinds analysis explored valuing mental health benefits via WELLBYs rather than income-equivalents. They concluded SWB "deserves more study" but did not adopt WELLBYs as their primary metric. Per their moral weights page: "we believe that subjective well-being deserves more study."
6. IDinsight's research uses stated preference surveys to understand how beneficiaries weigh different outcomes (income, health, life satisfaction). This provides an alternative approach to comparing welfare across domains.