The Unjournal · Pivotal Questions Initiative

WELLBY Reliability Discussion

Is the linear WELLBY reliable enough for cross-intervention comparison?

💬 Annotate this page — select any text to comment via Hypothes.is (free account to post; anyone can read)
Segment 2 · 30 minutes (11:40 AM–12:10 PM ET)

Presenter: Matt Lerner (Founders Pledge) · Discussant: Caspar Kaiser (U of Warwick)

Lerner presents the practitioner perspective on PQ1 (WELLBY reliability for funding decisions). Kaiser discusses key barriers to WELLBY adoption: comparability, linearity, the neutral-point problem, and whether WELLBYs capture the right concepts. He also previews Benjamin et al. on scale-use heterogeneity, which is presented in detail after the break.

Focal Question (WELL_01)

What combination of (a) subjective wellbeing survey data, (b) income and health-outcome data, (c) metrics based on this data (e.g., linear or logarithmic WELLBYs, standard deviations, scale-use adjustments), and (d) possible conversions between different measures would be "best" for making funding choices between interventions which may impact mental health, physical health, and/or consumption?

Overview

This open discussion segment addresses the core reliability question: given what we know about scale-use heterogeneity[1] and measurement challenges, is the linear WELLBY[2] measure reliable enough for comparing interventions across mental health, physical health, and consumption domains?
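To make the cardinality assumption concrete, here is a minimal illustrative sketch (not an Unjournal or Founders Pledge implementation; all numbers are hypothetical). One WELLBY is conventionally a one-point change on the 0–10 life satisfaction scale sustained for one year; the linear assumption is what licenses summing these across people.

```python
# Illustrative only: linear WELLBY arithmetic under the equal-intervals assumption.
# Function name and cohort data are hypothetical, not from any cited source.

def wellbys(ls_before: float, ls_after: float, years: float) -> float:
    """Linear WELLBYs: every 1-point gain counts equally, wherever on the scale it occurs."""
    return (ls_after - ls_before) * years

# Under linearity, a 3->4 move and a 7->8 move contribute identically:
low_end = wellbys(3, 4, years=1.0)
high_end = wellbys(7, 8, years=1.0)
assert low_end == high_end  # both 1.0 WELLBY

# Aggregation across beneficiaries is then a simple sum:
cohort = [(4.2, 5.1, 2.0), (6.0, 6.4, 1.5)]  # (before, after, years), invented
total = sum(wellbys(b, a, y) for b, a, y in cohort)
```

The reliability question is whether that final `sum` is meaningful when the equal-intervals assumption fails at the scale extremes or differs across respondents.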

Discussion Prompts

Relevant Pivotal Questions

This discussion directly addresses several of our Pivotal Questions:

Institutional Context

Collaborative Notes

Open in new tab →

Questions & Comments

Add questions and comments directly to the collaborative notes above.

📄 Background: Linear WELLBY Analysis

This document maps the key issues we'll discuss: cardinality assumptions, Bond & Lang's identification critique, scale-use heterogeneity (shifters vs. stretchers), and what calibration methods can and can't fix.
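The shifter/stretcher distinction can be shown with a toy numerical sketch. This is a hypothetical illustration of the general idea, not Benjamin et al.'s model; the reporting function and all parameters are invented. Suppose respondent i reports r = a + b·w, where w is latent wellbeing on a common scale: a "shifter" has a nonzero offset a, a "stretcher" has a slope b ≠ 1.

```python
# Hypothetical scale-use model: observed response = a + b * latent wellbeing,
# clamped to the 0-10 scale. All parameter values below are invented.

def reported(w: float, a: float, b: float) -> float:
    """Observed 0-10 response given latent wellbeing w and scale-use parameters (a, b)."""
    return min(10.0, max(0.0, a + b * w))

# Two groups with IDENTICAL latent wellbeing but different scale use:
latent = [4.0, 5.0, 6.0]
group_shift = [reported(w, a=1.0, b=1.0) for w in latent]    # shifters: +1 offset
group_stretch = [reported(w, a=0.0, b=1.2) for w in latent]  # stretchers: 1.2x slope

mean_shift = sum(group_shift) / len(group_shift)
mean_stretch = sum(group_stretch) / len(group_stretch)
# Raw group means happen to coincide here (both 6.0), but a treatment raising
# latent wellbeing by 1 point registers differently in reported units:
gain_shift = reported(5.0, 1.0, 1.0) - reported(4.0, 1.0, 1.0)    # 1.0
gain_stretch = reported(5.0, 0.0, 1.2) - reported(4.0, 0.0, 1.2)  # ~1.2
```

The same latent effect yields different measured WELLBY gains across the two groups, which is why cross-group comparisons need calibration (e.g., vignette anchoring) rather than raw scale differences.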

View Analysis →

AI-assisted draft (Mar 2026) — annotate errors directly.

📚 Related Evidence: Unjournal Evaluations

These independently evaluated papers bear on WELLBY reliability questions—particularly effect duration, cross-metric comparison, and scale-up concerns:

Notes

  1. Scale-use heterogeneity: different individuals interpret and use the 0–10 life satisfaction scale differently. Benjamin et al. (2023) show this can substantially bias cross-group comparisons; the magnitude depends on context and comparison groups.
  2. "Linear" WELLBY assumes equal intervals: moving from 3→4 equals the same welfare gain as 7→8. This cardinality assumption enables summing across people, but may not hold at scale extremes.
  3. Possible factors to consider: population similarity, effect size relative to measurement noise, within-person vs cross-sectional designs. What does the evidence say?
  4. Options include: vignette anchoring, calibration questions, multi-item scales, experience sampling. Trade-offs involve respondent burden vs. precision. See Benjamin et al. for empirical comparison.
  5. GiveWell's StrongMinds analysis explored valuing mental health benefits via WELLBYs rather than income-equivalents. They concluded SWB "deserves more study" but didn't adopt WELLBYs as their primary metric.
  6. IDinsight's research uses stated preference surveys to understand how beneficiaries weigh different outcomes (income, health, life satisfaction). This provides an alternative approach to comparing welfare across domains.