⚠️ AI-Generated Content (March 2026)
This page was generated with AI assistance (Claude Code + ChatGPT deep research) and revised based on 75+ Hypothes.is workshop comments.
The content aims to be workshop-neutral, framing issues for deliberation rather than prescribing conclusions. Readers should verify quantitative claims against the original literature.
1. The decision problem
Organizations comparing interventions—especially in low- and middle-income countries (LMICs)—face a measurement problem: interventions change different things (mortality, morbidity, consumption, mental health, social cohesion). The WELLBY approach proposes translating these into a common unit based on subjective wellbeing, enabling "welfare impact per dollar" comparisons.[1]Frijters, Clark, Krekel & Layard (2020). "A Happy Choice: Wellbeing as the Goal of Government."
A focal question for this workshop: how reliably can we compare interventions by aggregating changes in reported wellbeing (the WELLBY approach), especially across different studies and contexts in LMICs? This is one of several comparison frameworks; others include DALY/QALY-based approaches, capability approaches, and direct monetary valuation. The workshop examines the linear WELLBY's reliability relative to these alternatives, not in isolation.
🔬 The Focal Case: StrongMinds vs AMF
The debate over whether mental health interventions outperform anti-malaria bednets illustrates why WELLBY methodology matters. This comparison—central to effective altruism discourse—hinges on contested assumptions about measuring and aggregating wellbeing gains.
The Interventions
- StrongMinds: Group interpersonal therapy (IPT-G) for depression, primarily in Uganda, Zambia, and other African countries. Community health workers deliver 12-week group sessions to women with moderate-to-severe depression.
- Against Malaria Foundation (AMF): Distributes long-lasting insecticide-treated bednets (LLINs) in malaria-endemic regions. GiveWell's top-rated charity for over a decade, with extensive mortality reduction evidence.
- Friendship Bench: Problem-solving therapy delivered by trained community health workers ("grandmother counselors") on park benches in Zimbabwe. The Chibanda et al. (2016) cluster RCT showed significant effects on depression symptoms.
- GiveDirectly: Unconditional cash transfers (~$1,000) to poor households in Kenya and Uganda. The Haushofer & Shapiro (2016) RCT measured life satisfaction directly, finding a 0.16 SD improvement at 9 months.
The Translation Challenge
To compare these interventions, evaluators must translate different metrics into a common currency:
- AMF → WELLBYs: Convert DALYs averted to life-years, then assign WELLBY values to those years. Requires assumptions about the wellbeing level of lives saved.
- StrongMinds → WELLBYs: Convert depression scale changes (PHQ-9) to life satisfaction changes using cross-sectional correlations. Then extrapolate effect duration beyond measured follow-up.
- GiveDirectly → WELLBYs: Direct measurement of LS changes, but requires duration assumptions (do effects persist?).
📚 Further reading: See Unjournal evaluations of mental health research for independent expert assessments of the underlying evidence.
The measurement-to-decision pipeline illustrates why comparing interventions requires multiple translation steps. Each box represents a stage where methodological choices affect final conclusions:
Intervention → Study design → Measured outcomes (LS / DALY / depression) → Translation layer (mapping, calibration) → Common currency (WELLBY / DALY / $) → Decision

How to read this diagram
- Intervention → Study design: The program being evaluated is studied through some research design (RCT, quasi-experiment, etc.)
- Study design → Measured outcomes: Studies measure different things—some use life satisfaction (LS), others use DALYs or depression scales
- Measured outcomes → Translation layer: Different metrics must be mapped or calibrated to enable comparison
- Translation layer → Common currency: The goal is a single unit (WELLBYs, DALYs, or dollars) enabling "apples-to-apples" comparison
- Common currency → Decision: Funders use the common currency to prioritize interventions
Each arrow involves assumptions that can introduce error or bias. The workshop focuses on where these assumptions are most likely to matter.
Workshop goals: (1) Clarity about which assumptions matter most for which comparisons, and what evidence would change views. (2) Share information and synthesize participant expertise. (3) Generate practical insights and actionable recommendations for funders working with current evidence.
2. Definitions and key concepts
WELLBY Definition
1 WELLBY = a one-point change in life satisfaction (0-10 scale) × 1 person × 1 year
Source: UK Green Book Wellbeing Guidance (HM Treasury, 2021/2024)
Origins, alternative definitions, and adoption
Original proposal: Frijters, Clark, Krekel & Layard (2020) introduced WELLBYs in Health Economics as a unit for comparing wellbeing gains across policy domains.
Alternative definitions: Most usage defines WELLBY using life satisfaction (Cantril ladder, 0-10), but some researchers use affect-based measures (experienced happiness). The choice matters: life satisfaction captures evaluative wellbeing; affect captures momentary experience.
Organizational adoption:
- Happier Lives Institute (Plant et al.): Primary metric for charity comparison; extensive cost-effectiveness analyses[13]HLI cost-effectiveness analyses; see also Plant (2024) on cardinality.
- Founders Pledge: Uses WELLBYs alongside DALYs for mental health CEA
- GiveWell: Has explored WELLBY analysis but with significant reservations
- UK Government: Official guidance for policy appraisal
Standard Life Satisfaction Questions
OECD single-item: "Overall, how satisfied are you with your life as a whole these days?" (0 = "not at all satisfied" to 10 = "completely satisfied")[2]OECD Guidelines on Measuring Subjective Well-being (2013/2024). Question modules available at oecd.org.
Cantril ladder (Gallup): "Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you, and the bottom represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?"
These two framings—satisfaction vs. ladder position—are often used interchangeably, but may capture subtly different constructs. Cross-study comparisons should note which instrument (survey question format) was used.
Incremental vs. Level-Based Accounting

Incremental WELLBYs (relevant for comparing intervention effects among the living):

$$\Delta W^{(k)} = \sum_i \sum_t \delta^t \left( LS_{it}^{(k)} - LS_{it}^{(0)} \right)$$

Where: $i$ = individuals, $t$ = time periods, $LS$ = life satisfaction (0-10), $\delta$ = discount factor, $k$ = intervention, $0$ = counterfactual. See notation key below for details.

Level-based WELLBYs (relevant for comparing interventions that affect mortality or, in some accounting, birth rates):

$$W = \sum_i \sum_t \delta^t \, LS_{it}$$
Notation key
- $i$ = individual (summing across people)
- $t$ = time period (summing across years)
- $LS$ = Life Satisfaction score (0-10)
- $\delta$ = discount factor for future years
- $k$ = intervention; $0$ = counterfactual
In practice, RCTs estimate $LS^{(k)} - LS^{(0)}$ directly via experimental comparison of treatment and control group outcomes.
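The incremental formula above can be sketched in a few lines of Python. This is a minimal illustration, not a production calculator; the LS trajectories in the example are hypothetical.

```python
# Incremental WELLBYs: discounted sum over follow-up years of the
# treatment-control difference in mean life satisfaction (0-10 scale).
def incremental_wellbys(ls_treated, ls_control, delta=1.0):
    """ls_treated, ls_control: per-year mean LS for a cohort (one entry
    per follow-up year); delta: annual discount factor (1.0 = none)."""
    return sum(
        (delta ** t) * (lt - lc)
        for t, (lt, lc) in enumerate(zip(ls_treated, ls_control))
    )

# Hypothetical example: a 0.5-point LS gain that fades to 0.25 in year
# two, with no discounting.
effect = incremental_wellbys([6.5, 6.25], [6.0, 6.0])  # 0.75 WELLBYs/person
```

In a real analysis the per-year trajectories would come from RCT follow-up waves, with the extrapolation and discounting choices discussed in section 3.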
Technical definitions (reporting function, instrument, latent distribution)
Instrument: The specific measurement tool—exact question wording, response format (0-10 vs 1-7), anchors, translation, survey mode.
Reporting function: The internal process by which a person translates their true wellbeing ($u$) into a number on the survey scale. Formally: $LS_{it} = f_i(u_{it}) + \varepsilon_{it}$. Different people may have different reporting functions—one person's "7" might correspond to another's "5" for the same underlying welfare level. This is the core of scale-use heterogeneity.
Latent distribution: The unobserved underlying welfare distribution. Since we only see reported scores, conclusions can depend on assumptions about this hidden distribution.
3. Core assumptions
Using linear WELLBYs for cross-intervention comparison requires assumptions.[B]Benjamin, Cooper, Heffetz & Kimball (2024). "From Happiness Data to Economic Conclusions." Annual Review of Economics. Articulates four key assumptions for SWB-based welfare analysis. The claim "equal scores mean equal welfare" is stronger than most applications need—what matters for intervention comparison is whether changes in scores reflect comparable welfare changes.
Cardinality (linearity) ⚠ Key contested assumption
Equal intervals on the scale imply equal welfare differences: moving from 3→4 equals the same welfare gain as 7→8. If violated, summing may distort comparisons.[12]Plant (2024) explores conditions under which treating LS as cardinal is defensible. See also HLI's cost-effectiveness methodology.
Many practitioners consider this the most consequential assumption—and the least obviously defensible. See the Bond & Lang critique for why scale transformations matter.
Unit-change comparability
A one-point change has approximately the same welfare meaning across people. This is weaker than requiring equal levels to mean equal welfare.
Why this is sufficient: For comparing intervention effects, we don't need equal levels to mean equal welfare—only that changes are comparable. If Person A is always 2 points happier than B at any objective welfare level, that doesn't bias intervention comparisons as long as a 1-point gain means the same for both.
Trade-off implication: Would you trade a one-point gain (say 6→7) lasting 2 years for a two-point gain (7→9) lasting 1 year? Under linearity both are worth 2 WELLBYs, so they should be equivalent. If not—if gains at higher levels feel "less valuable"—the assumption is violated.
Temporal aggregation
Integrating wellbeing over time is meaningful. May fail if adaptation returns people to baseline, or if respondents reinterpret the scale over time (response shift).
Cross-domain capture (contested)
Life satisfaction aims to incorporate welfare from many domains (health, income, relationships). However, the key question is whether our measures capture what actually matters for individuals' prudential value—this could be life satisfaction, happiness, transient affect, or something else entirely. Life satisfaction may only capture some aspects of what matters, and more work is needed at the intersection of social science and philosophy to clarify this.
Time structure and discounting
Why this matters for intervention comparison: WELLBY calculations aggregate wellbeing gains over time. If Intervention A produces a 0.5-point LS boost for 2 years while Intervention B produces a 0.3-point boost for 5 years, B yields more total WELLBYs—but only if the effects actually persist as assumed. The time dimension is often the largest source of uncertainty.
Most studies measure outcomes at baseline and one or two follow-ups;[15]LMIC examples: Haushofer & Shapiro (2016, 2018) measure SWB at 9 months and 3 years after Kenya cash transfers; StrongMinds studies typically follow up at 3-6 months post-treatment; GiveDirectly studies (Kenya, Uganda) measure at 1-3 years; BRAC graduation programs measure at 2+ years. Most LMIC wellbeing RCTs have 1-2 follow-up points with limited long-term tracking. extrapolating beyond measured timepoints requires assumptions about persistence (does the effect last?) and discounting (are future wellbeing gains worth less than present ones?).
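The duration comparison in the text (0.5 points for 2 years vs. 0.3 points for 5 years) can be made concrete, including how discounting can flip the ranking. A sketch, with the discount factor of 0.6 chosen purely for illustration:

```python
# Per-person WELLBYs from a constant LS effect lasting `years`,
# discounted at annual factor `delta` (1.0 = no discounting).
def total_wellbys(effect, years, delta=1.0):
    return sum(effect * delta ** t for t in range(years))

a = total_wellbys(0.5, 2)   # 0.5 x 2 years = 1.0 WELLBY per person
b = total_wellbys(0.3, 5)   # 0.3 x 5 years ~ 1.5 WELLBYs per person
# B dominates undiscounted. With heavy discounting (or if B's effect
# decays faster than assumed), the ranking flips:
a_disc = total_wellbys(0.5, 2, delta=0.6)
b_disc = total_wellbys(0.3, 5, delta=0.6)  # now smaller than a_disc
```

The point is not the specific discount rate but that extrapolated duration, not the measured effect size, often drives the final ranking.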
Response Shift: a distinct threat
Even if baseline scale-use heterogeneity cancels out[20]"Cancels out" means: in a randomized trial, individual differences in scale use (shifters) are equally distributed between treatment and control groups, so when we compare average changes, these individual differences subtract out. This is similar to how fixed effects absorb level differences. (by randomization), treatment can change the meaning of the respondent's self-evaluation. This is called response shift: changes in internal standards, values, or conceptualization that alter how respondents answer over time.[rs]Sprangers & Schwartz (1999). "Integrating response shift into health-related quality of life research." Social Science & Medicine.
For wellbeing interventions—especially psychosocial programs that may explicitly reframe cognition—response shift is a genuine concern, not merely a theoretical possibility. If treatment changes the reporting function $f_i(\cdot)$, observed $\Delta LS$ mixes "true welfare change" with "scale change," potentially biasing the WELLBY estimate in either direction.
4. A key critique: identification and transformations
Bond & Lang (2019) argue that with ordinal happiness data, comparing average wellbeing between groups (e.g., "are Germans happier than Americans?") is not identified without strong assumptions—monotonic transformations can reverse conclusions.[3]Bond, T.N. & Lang, K. (2019). "The Sad Truth about Happiness Scales." JPE, 127(4). However, our focus is on intervention effects, where randomization mitigates some (but not all) of these concerns—see box below.
Bond & Lang's core examples involve comparing average happiness between groups (e.g., "are Germans happier than the French?"). For within-study intervention comparisons—randomized treatment vs. control, especially in LMIC RCTs—the situation is less dire: randomization ensures similar baseline scale-use distributions, so we compare changes within comparable groups rather than absolute levels across different populations. The identification problem resurfaces when comparing effect sizes across studies conducted in different populations, or when treatment and control groups differ systematically at baseline.
What "non-identified" means (technical detail)
A parameter is identified when data + assumptions pin down a unique value. With ordinal responses (someone reports "7"), we only know their underlying welfare falls in some range corresponding to "7"—we don't know the exact welfare level.
Concrete example: Suppose Person A reports "7" and Person B reports "6". We know A reported higher, but not how much higher in welfare terms. Consider two possible mappings from reported scores to actual welfare:
- Mapping 1 (linear): "6" = 60 utils, "7" = 70 utils → A is 10 utils better off
- Mapping 2 (concave): "6" = 50 utils, "7" = 52 utils → A is only 2 utils better off
Both mappings are consistent with the observed data (A reported higher than B). Without additional structure (like assuming the scale is linear), we cannot determine which mapping is correct. This is why average welfare comparisons are "non-identified"—the data alone cannot tell us the magnitude of the difference.
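The two mappings from the example can be written out directly, using the util values given above (which are of course illustrative, since utils are unobservable):

```python
# Two score-to-welfare mappings, both consistent with the same ordinal
# data ("A reported 7, B reported 6, so A > B"):
linear  = {6: 60, 7: 70}   # equal spacing between scale points
concave = {6: 50, 7: 52}   # compressed spacing at the top

gap_linear  = linear[7]  - linear[6]    # 10 utils
gap_concave = concave[7] - concave[6]   # 2 utils
# Order is preserved either way; the magnitude of the gap is not identified.
```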
Why this matters for intervention comparison
- Cross-study synthesis: Comparing treatment effects measured on different scales can inherit the same vulnerability
- Magnitude-sensitive cost-effectiveness: Even if both Intervention A and B show positive effects, cost-effectiveness depends on how much better. If A costs twice as much but has a 2.5× larger effect, it's more cost-effective—but only if that magnitude comparison is valid.
Transformation Sensitivity Demo
What this shows: Bond & Lang's critique hinges on the fact that we only observe ordinal responses (1, 2, 3...), not the underlying welfare. Any monotonic transformation $g(x)$ that preserves order is equally consistent with the data—but different transformations can reverse which group has higher mean welfare.
What is a monotonic transformation? A function that preserves ordering—if A>B before, then g(A)>g(B) after. Examples: squaring (x²), square root (√x), or log(x). These change the spacing between values without changing which is larger.
Below, $g(x) = x^θ$ transforms raw LS scores. When θ=1 (linear), LS is used directly. When θ>1 (convex), high scores are stretched more than low scores. When θ<1 (concave), the opposite.
Try it: In "effects" mode, move θ from 1.0 toward 2.0 and watch the ranking flip. Intervention B has a larger raw effect (2 vs 1 point), but A's effect occurs at higher LS levels. Under convex transformations (θ>1), gains at high levels are amplified—so A can dominate despite smaller raw gains.
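For readers without access to the interactive demo, the same flip can be reproduced in a few lines. The group means are hypothetical but chosen to match the demo's setup (a small gain at high LS vs. a larger gain at low LS):

```python
# Effect size after the monotonic transformation g(x) = x ** theta.
def transformed_effect(ls_control, ls_treated, theta):
    return ls_treated ** theta - ls_control ** theta

# Intervention A: 1-point raw gain at a high LS level (8 -> 9).
# Intervention B: 2-point raw gain at a low LS level (3 -> 5).
raw_a = transformed_effect(8, 9, theta=1)      # 1
raw_b = transformed_effect(3, 5, theta=1)      # 2
convex_a = transformed_effect(8, 9, theta=2)   # 81 - 64 = 17
convex_b = transformed_effect(3, 5, theta=2)   # 25 - 9  = 16
# B wins on the raw scale; A wins under theta = 2 -- same ordinal data.
```

Both transformations preserve every individual ordering, yet they disagree about which intervention produced the larger welfare gain.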
5. Scale-use heterogeneity: shifters vs. stretchers
A useful framework is the affine model, which posits that individuals differ in how they use the scale. This model separates two types of scale-use differences:[11]The "shifters vs. stretchers" framework derives from Benjamin et al. (2012, 2014, 2023). See also Oswald (2008) and Kaiser & Oswald (2022) on scale-use heterogeneity.
$$u_{it} = a_i + b_i \, LS_{it}$$

Here $u_{it}$ is person $i$'s true welfare at time $t$, and $LS_{it}$ is their reported life satisfaction (so this is an affine inverse of the reporting function $f_i$ defined above). The parameters $a_i$ (shift) and $b_i$ (stretch) capture individual scale-use patterns.
Shifters (different $a_i$)
Different intercepts: "some people always report +2 higher." Levels are not comparable, but differences are: $\Delta u = b \cdot \Delta LS$.
Stretchers (different $b_i$)
Different slopes: "some people compress the scale." Both levels AND differences are non-comparable: if person A has $b_A=0.5$ and person B has $b_B=1.5$, identical reported changes (ΔLS=2) correspond to different welfare changes ($\Delta u_A=1$ vs $\Delta u_B=3$).
Benjamin et al. propose calibration questions to identify and adjust for scale-use heterogeneity—questions designed to have the same objective answer across respondents.[4]Benjamin et al. (2023). "Adjusting for Scale-Use Heterogeneity." NBER WP 31728.
Shifter vs. Stretcher Demo
Compare two populations with different scale use. See why stretchers distort intervention comparisons.
Why fixed effects only remove shifts
Fixed effects absorb level differences (the $a_i$ terms). But if people have different $b_i$ (stretch factors), the implied welfare change per reported point differs.
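The stretcher arithmetic from the text ($b_A = 0.5$, $b_B = 1.5$, identical reported 2-point gains) can be sketched directly:

```python
# Affine scale-use model as used in the text: the welfare change implied
# by a reported change is person-specific, delta_u = b_i * delta_LS.
def welfare_change(delta_ls, b):
    return b * delta_ls

# Identical reported 2-point gains imply different welfare gains:
du_a = welfare_change(2, b=0.5)  # 1.0
du_b = welfare_change(2, b=1.5)  # 3.0
# Demeaning (fixed effects) removes the additive shifts a_i, but no
# transformation of reported scores alone can recover the unknown b_i.
```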
6. Neutral point and mortality
Two different "zeros" people reference:
- Neutral wellbeing: the point where life is neither good nor bad
- Death as zero: treating "dead people score 0" for life-year calculations
For incremental comparisons among the living, the neutral point often cancels out—that is, when comparing ΔLS between intervention and control groups, both are measured from the same implicit baseline, so the zero point subtracts from both sides and doesn't affect the difference.[14]Mathematically: (LS_treatment − LS_0) − (LS_control − LS_0) = LS_treatment − LS_control. The neutral point LS_0 cancels. But for mortality comparisons—"WELLBYs from life extension = life-years × average wellbeing"—the origin is load-bearing.
Neutral Point / Mortality Demo
Scenario: Mortality intervention prevents a death, yielding 40 additional life-years at average LS = 5.
If neutral = 0: Benefit = 40 × 5 = 200 WELLBYs
If neutral = 2: Benefit = 40 × (5 − 2) = 120 "above-neutral" WELLBYs
When does the neutral point matter? When comparing interventions that affect mortality to those that don't. For comparisons among living people only, the neutral point typically cancels out: since both treatment and control are measured from the same baseline, subtracting that baseline from both sides leaves the difference unchanged.[14]
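The mortality scenario above reduces to one line of arithmetic; a sketch, using the numbers from the demo:

```python
# WELLBYs from averting a death: life-years gained times average LS
# above the assumed neutral point.
def mortality_wellbys(life_years, avg_ls, neutral=0.0):
    return life_years * (avg_ls - neutral)

mortality_wellbys(40, 5, neutral=0.0)  # 40 x 5 = 200 WELLBYs
mortality_wellbys(40, 5, neutral=2.0)  # 40 x 3 = 120 WELLBYs
```

Moving the neutral point from 0 to 2 cuts the estimated benefit of the mortality intervention by 40% while leaving incremental comparisons among the living untouched, which is why the choice is load-bearing only for mortality-affecting interventions.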
Empirical neutral point estimates
Recent work has attempted to estimate the neutral point empirically. The estimates vary widely (0.6 to 6.0 on a 0-10 scale) depending on the elicitation method and sample. HLI has advised on the UK Green Book guidance and compiled the following estimates:[*]Table compiled by Samuel Dupret (HLI), shared March 2026. Note that different questions elicit very different values—asking about "life no longer worth living" yields lower estimates than asking about "minimally acceptable" levels.
| Source | Value | Method | Sample |
|---|---|---|---|
| Samuelsson et al. (2023) - HLI pilot | 1.26 | Asked when life is no longer worth living on 0-10 LS scale | N=79, UK |
| Samuelsson et al. (2023) - HLI pilot | 5.30 | Asked where balance between satisfied/dissatisfied on 0-10 LS | N=128, UK |
| Peasgood et al. (unpublished) | 2.00 | Time trade-offs (QALY method rather than wellbeing scale) | N=75, UK |
| IDinsight Beneficiary Survey (2019) | 0.56 | "At what point on the ladder is it worse than dying?" | N=70, Ghana & Kenya |
| Moss (Rethink Priorities, unpublished) | 2.49 | Asked level preferring alive to dead (converted 0-100 → 0-10) | N=35, likely UK |
| Moss (Rethink Priorities, unpublished) | 6.05 | Asked minimally acceptable level to live an extra year | N=101, likely UK |
| Jamison et al. (forthcoming) | 2.39 | Policy comparison: saving people from dying (0-100 → 0-10) | N=1800 (Brazil, China, UK) |
| Jamison et al. (forthcoming) | 2.54 | Policy comparison: saving people from non-existence (0-100 → 0-10) | N=1800 (Brazil, China, UK) |
Key observations:
- Question framing matters enormously: "Life no longer worth living" (1.26) vs "minimally acceptable" (6.05) yields 5× difference
- The only LMIC estimate (IDinsight, Ghana/Kenya) is the lowest at 0.56—though small sample and different question framing
- Jamison et al. provides the largest cross-country sample (N=1800 across Brazil, China, UK) with estimates around 2.4-2.5
- HM Treasury (UK Green Book) currently uses 1.0 (based on suicide rates); Frijters & Krekel recommend 2.0
7. Evidence and alternatives
Reliability: noisy but not useless
Single-item life evaluations have test-retest correlations around 0.5-0.7 over short windows.[5]Krueger & Schkade (2008). "The reliability of subjective well-being measures." This means measurement error attenuates estimated effects—small real effects may be undervalued. For relative comparisons: if measurement error is similar across interventions and contexts, attenuation affects absolute magnitudes but may preserve relative rankings. However, if noise levels differ (e.g., different survey modes, cultural contexts, or outcome domains), relative comparisons become less reliable.
Predictive validity
What this means: "Predictive validity" asks whether LS scores relate to real-world behaviors and outcomes in expected ways. If LS contains meaningful information about welfare, lower scores should predict people making costly changes (moving, quitting jobs, leaving relationships) to improve their situation.
Kaiser & Oswald show that single numeric feelings responses do predict these consequential outcomes—relationships tend to be replicable and approximately linear.[6]Kaiser & Oswald (2022). "The Scientific Value of Numerical Measures of Human Feelings." PNAS. This suggests LS captures something meaningful, though it doesn't directly prove magnitudes are comparable across people.
LMIC evidence: GiveDirectly cash transfers
Kenya RCT (Haushofer & Shapiro)
- Short-run (~9 months): Life satisfaction +0.17 SD, happiness +0.16 SD (WVS measures)
- Long-run (~3 years): Life satisfaction +0.08 SD (statistically significant); psychological wellbeing index +0.16 SD
These findings show LS measures have detectable signal in LMIC RCTs, not just noise—and effects can persist.[7]Haushofer & Shapiro (2016). "The Short-term Impact of Unconditional Cash Transfers." QJE. Also 2018 long-term follow-up working paper.
The outcome translation problem
Many LMIC mental health studies report depression scales or symptom indices, not standard 0-10 life satisfaction. GiveWell's assessment of StrongMinds explicitly highlights uncertainty in translating depression improvements into life satisfaction gains.[gw]GiveWell (2023). "Our Assessment of Happier Lives Institute's Cost-Effectiveness Analysis of StrongMinds." Notes substantial uncertainty in depression-to-LS mapping.
Even if you accept the WELLBY as the target unit, outcome translation forces a choice:
- use DALYs/QALYs (more standard in health evaluation), even though they may miss non-health welfare;
- use life satisfaction directly, but only where trials collect it; or
- use mapping models—statistical relationships estimated from observational data that predict life satisfaction from other measures (e.g., "a 1-SD reduction in PHQ-9 depression score corresponds to ~0.4 points higher LS")—and carry the mapping uncertainty explicitly.
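A mapping model of this kind is just a conversion coefficient with uncertainty. The sketch below uses the ~0.4 figure quoted in the text as the central value; the 0.2-0.6 bounds are hypothetical placeholders, since real mappings would come with estimated confidence intervals.

```python
# Mapping-model sketch: translate a depression-scale improvement
# (in PHQ-9 standard deviations) into a life-satisfaction change.
# coef = 0.4 echoes the illustrative figure in the text; the bounds
# are hypothetical and would be estimated in practice.
def phq9_to_ls(effect_sd, coef=0.4, coef_low=0.2, coef_high=0.6):
    """Return (central, low, high) LS-point estimates."""
    return (effect_sd * coef, effect_sd * coef_low, effect_sd * coef_high)

central, low, high = phq9_to_ls(1.0)  # (0.4, 0.2, 0.6)
```

Carrying the bounds through the rest of the cost-effectiveness calculation, rather than using only the point estimate, is what "carry the mapping uncertainty explicitly" means in practice.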
Comparison with alternatives
| Metric | Strengths | Weaknesses |
|---|---|---|
| WELLBY | Captures non-health welfare; direct self-report; low burden | Scale-use, comparability assumptions; cross-study issues |
| DALY/QALY[17]Key measurement difference: DALYs/QALYs use external disability weights derived from population surveys rating hypothetical health states—not individual self-reports of the person's own wellbeing. WELLBYs directly ask each person their life satisfaction. | Standardized; large evidence bases; direct mortality link | May miss non-health welfare; mental health disability weights contentious |
| Calibrated WELLBY | May reduce scale-use bias (magnitude context-dependent) | Complex; LMIC feasibility unclear; new assumptions |
8. WELLBY Calculator
Incremental WELLBY Estimate
Enter treatment effect, duration, recipients, and cost to estimate total WELLBYs and cost-effectiveness.
This calculator assumes constant effect size. Real applications should account for effect decay, discounting, and uncertainty.
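The calculator's logic, under the same constant-effect assumption, can be sketched as follows. All inputs in the example are hypothetical:

```python
# Total WELLBYs under a constant per-person effect, plus cost per WELLBY.
# Assumes the effect is identical for all recipients and all years.
def wellby_cost_effectiveness(effect, years, recipients, cost, delta=1.0):
    total = recipients * sum(effect * delta ** t for t in range(years))
    return total, cost / total

# Hypothetical inputs: 0.5-point LS gain lasting 2 years,
# 1,000 recipients, $50,000 program cost, no discounting.
total, cost_per_wellby = wellby_cost_effectiveness(0.5, 2, 1000, 50_000)
# total = 1000.0 WELLBYs; cost_per_wellby = 50.0 $/WELLBY
```

A real application would replace the constant effect with a decay schedule and propagate uncertainty in each input, as the note above cautions.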
9. Considerations for discussion
These are common practices and considerations mentioned in the literature—not workshop conclusions. One goal of the workshop is to assess which of these (if any) represent genuine best practices.
Approaches funders currently use
- Sensitivity analyses across different neutral points and scale-comparability assumptions
- Reporting how rankings change under different assumptions
- Combining WELLBY estimates with other evidence types
Approaches researchers consider
- Standardized life satisfaction questions (OECD prototype) for cross-study comparison
- Long-term follow-up to capture duration effects and potential adaptation
- Calibration questions/vignettes when comparing across populations
- Pre-registration of analysis plans including functional form choices
Conditions that may support stronger inference
- Large effect sizes that overwhelm measurement error
- Within-person designs where each person serves as their own control[19]Caveat: Within-person designs can introduce demand effects—participants may feel motivated to report improvement to please researchers. This is a particular concern for psychosocial interventions.
- Multiple wellbeing measures showing the same directional effect
- Triangulation with behavioral or objective outcomes
10. Open questions (research agenda)
High-value areas for future research that could meaningfully improve the reliability of WELLBY-based comparisons:
- Neutral point estimation: What is the actual neutral point on the 0-10 scale for different populations? How stable is it across contexts?[16]Limited LMIC-specific research exists on neutral point estimation. Peasgood et al. (2018) provide estimates for UK populations; whether these generalize to LMIC contexts is unclear.
- Scale-use heterogeneity mapping: How do shifters vs. stretchers vary across LMIC populations, and can we predict which matters more?
- Cheap calibration methods: Can vignettes, anchoring questions, or other calibration approaches work in low-resource settings without excessive burden?
- WELLBY-DALY relationship: What's the mapping between WELLBYs and DALYs, and is it linear? How much does it vary by health condition?
- Demand effects and response shift: How do experimenter demand effects and response shift vary by intervention type?
11. Workshop prompts
Neutral prompts for workshop deliberation:
1. For which classes of intervention comparisons (same setting/instrument vs. cross-study) does the linear WELLBY seem most defensible, and why?
2. Which assumptions are most likely to be materially violated in LMIC contexts: linearity, intertemporal comparability, interpersonal comparability, or scale-use heterogeneity?
3. When does the neutral point become decision-relevant? Which "zero" do you have in mind?
4. How should analysts treat "mapping" between depression scales and life satisfaction when LS isn't measured? What minimum evidence would make a mapping credible?
5. Which low-burden calibration approaches seem most promising for LMIC settings?
6. Practical recommendations: What should funders do now, given current evidence and uncertainty? What specific guidance can we offer for making decisions while better evidence is developed?
References
- Frijters, P., Clark, A.E., Krekel, C. & Layard, R. (2020). "A Happy Choice: Wellbeing as the Goal of Government." Health Economics, 29(12).
- OECD (2013/2024). Guidelines on Measuring Subjective Well-being.
- Bond, T.N. & Lang, K. (2019). "The Sad Truth about Happiness Scales." Journal of Political Economy, 127(4).
- Benjamin, D.J. et al. (2023). "Adjusting for Scale-Use Heterogeneity." NBER WP 31728.
- Krueger, A.B. & Schkade, D.A. (2008). "The reliability of subjective well-being measures." Journal of Public Economics.
- Kaiser, C. & Oswald, A.J. (2022). "The Scientific Value of Numerical Measures of Human Feelings." PNAS, 119(42).
- Haushofer, J. & Shapiro, J. (2016). "The Short-term Impact of Unconditional Cash Transfers." QJE. See also 2018 long-term follow-up.
- HM Treasury (2021/2024). Wellbeing Guidance for Appraisal: Supplementary Green Book Guidance.
- Helliwell, J.F., et al. (2021). "The WELLBY." World Happiness Report 2021, Chapter 6.
- GiveWell (2023). "Our Assessment of Happier Lives Institute's Cost-Effectiveness Analysis of StrongMinds."
- Benjamin, D.J., Heffetz, O., Kimball, M.S. & Szembrot, N. (2014). "Beyond Happiness and Satisfaction: Toward Well-Being Indices Based on Stated Preference." AER, 104(9). The shifters/stretchers framework is elaborated in Benjamin et al. (2023) NBER WP 31728.
- Plant, M. (2024). "A Happy Possibility About Happiness (And Other Subjective) Scales: An Investigation and Tentative Defence of the Cardinality Thesis." Wellbeing Research Centre working paper.
- Happier Lives Institute (2023-2025). Cost-effectiveness analyses of mental health interventions using WELLBY methodology. See happierlivesinstitute.org.
Notes
- DALY/QALY vs WELLBY measurement: DALYs use disability weights derived from population surveys where respondents compare hypothetical health states (GBD studies). QALYs often use patient assessments via instruments like EQ-5D. WELLBYs ask each individual to rate their own life satisfaction directly.
- "Incremental" vs "level-based" is descriptive language used in this document to distinguish two common uses of WELLBYs. The academic literature typically just refers to "WELLBYs" with context clarifying whether the focus is on intervention effects (changes) or absolute welfare levels (for mortality calculations).
- "Cancels out" in randomization context: if scale-use differences (shifters and stretchers) are randomly distributed between treatment and control groups, within-study comparisons remain valid—the average effect estimate is unbiased. The problem arises when (a) comparing across studies with different scale-use distributions, or (b) treatment itself changes the reporting function (response shift, discussed above), which randomization cannot address.
Related Analysis
For discussion of how to convert between DALYs/QALYs and WELLBYs:
DALY/QALY ↔ WELLBY Conversion (note: AI-generated draft requiring verification)