The Unjournal · Pivotal Questions Initiative

About This Workshop

Why we're bringing researchers, evaluators, and funders together to discuss how we measure and compare wellbeing across interventions.

💬 Annotate this page — select any text to comment via Hypothes.is (free account to post; anyone can read)

The problem

Organizations ranging from Effective Altruism-aligned funders like Founders Pledge, GiveWell, and Coefficient Giving to government agencies and development NGOs compare interventions across very different domains—physical health, mental health, poverty alleviation—to decide where resources can do the most good. To make these comparisons, they need a common unit of measurement.

Two measures feature prominently in these analyses.[9] The DALY (disability-adjusted life year)[1] comes from health economics and captures years of healthy life lost to disease or disability. The WELLBY (wellbeing-adjusted life year)[2] is based on self-reported life satisfaction, typically measured on a 0–10 scale. Each has strengths and limitations. How they relate to each other, and whether either reliably captures what matters for human welfare, directly affects which interventions get prioritized.

This is part of The Unjournal's Pivotal Questions initiative: working with impact-focused organizations to identify their highest-value research questions, connect them to evidence, and commission expert evaluations that can inform real decisions.

What sparked this workshop

This workshop emerged from converging work streams. First, we collaborated with Founders Pledge to identify their highest-value research questions—Pivotal Questions where credible evidence could most shift their funding decisions. WELLBY reliability and DALY-WELLBY interconvertibility ranked among the most decision-relevant.

Second, our evaluation of StrongMinds—a mental health intervention whose cost-effectiveness depends heavily on WELLBY measurement—highlighted the practical stakes: how you interpret changes in self-reported life satisfaction can swing an intervention from "highly effective" to "uncertain."[3] This sharpened the need for clarity on WELLBY validity.

Third, we commissioned an evaluation of Benjamin, Cooper, Heffetz, Kimball & Zhou's paper "Adjusting for Scale-Use Heterogeneity in Self-Reported Well-Being," which asks whether people use wellbeing scales in comparable ways. If differences in scale use and preference heterogeneity[4] mean that changes in life satisfaction (not just absolute levels) aren't comparable across individuals, that poses a challenge for the WELLBY as a tool for comparing interventions. The paper develops methods using calibration questions[5] and vignette exercises[6] to detect and adjust for scale-use heterogeneity. The evaluators' verdict was encouraging but nuanced: differences in scale use may not be as severe as some feared, but more work is needed—particularly on whether the calibration methods generalize to low-income settings and whether scale-use heterogeneity differs systematically between treatment and control groups.
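
For intuition only, here is a much-simplified linear rescaling in the spirit of vignette-based calibration. This is not Benjamin et al.'s actual estimator, and the reference values and ratings are invented; it only illustrates how rating shared hypothetical scenarios can reveal, and partially correct for, individual differences in scale use:

```python
# Illustrative sketch only (NOT Benjamin et al.'s estimator).
# Each respondent rates the same two fixed vignettes; differences in those
# ratings reveal how they personally anchor the 0-10 scale.

def rescale(self_report, low_vignette, high_vignette,
            ref_low=3.0, ref_high=8.0):
    """Map a respondent's self-report onto a common scale, assuming their
    scale use is a linear transform of a shared latent scale.

    low_vignette / high_vignette: this respondent's ratings of two fixed
    hypothetical scenarios; ref_low / ref_high: the (assumed) scores those
    scenarios carry on the common reference scale."""
    slope = (ref_high - ref_low) / (high_vignette - low_vignette)
    return ref_low + slope * (self_report - low_vignette)

# Two respondents both report "7", but use the scale differently:
a = rescale(7, low_vignette=2, high_vignette=9)  # spreads the scale widely
b = rescale(7, low_vignette=4, high_vignette=8)  # compresses the scale
print(round(a, 2), round(b, 2))  # same raw "7", different calibrated scores
```

The same raw answer maps to different calibrated values once each person's vignette ratings are taken into account; the open questions are whether such corrections generalize and whether they differ across treatment arms.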

Together, these considerations led us to propose this workshop to Founders Pledge, who agreed it would be valuable—bringing together researchers, evaluators, and funders to make progress on questions that directly affect funding priorities.

What we want to achieve

This workshop brings together authors of several papers in this area, Unjournal evaluators, funders who use these measures in their work, and researchers with relevant expertise. We're organizing the discussion around four key questions:

1. Is the linear WELLBY reliable enough?

Can we treat a 1-point improvement in life satisfaction as meaning the same thing for different people and starting points?[7] Does improving one person's wellbeing from 1→3 equal improving two people's wellbeing from 1→2? Does a move from 3→4 mean the same as a move from 7→8? Where is the "neutral point"[8] on the scale? And does any of this matter in practice—is the linear WELLBY likely to yield recommendations similar to those of other methods when comparing interventions?
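
Under the linear assumption, these comparisons reduce to simple arithmetic. A minimal sketch of what linearity commits us to (all numbers are just the hypotheticals from the questions above):

```python
# Under the linear WELLBY assumption, one WELLBY = a 1-point gain in
# life satisfaction (0-10 scale) for one person for one year.
def wellbys(before, after, years=1.0):
    """WELLBYs gained by one person moving from `before` to `after`."""
    return (after - before) * years

# Linearity forces these two scenarios to count as exactly equal:
one_person = wellbys(1, 3)                   # one person, 1 -> 3
two_people = wellbys(1, 2) + wellbys(1, 2)   # two people, 1 -> 2 each
assert one_person == two_people

# It likewise equates a move at the bottom with one near the top:
assert wellbys(3, 4) == wellbys(7, 8)
```

If scale intervals are not truly equal, or if the location of the neutral point matters for the comparison at hand, these equalities need not hold in welfare terms, which is precisely what this session probes.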

2. How should we convert between DALYs/QALYs and WELLBYs?

Current approaches are rough. A one-standard-deviation change in subjective wellbeing is often treated as equivalent to roughly one standard deviation in a health measure (DALYs or QALYs), but is this defensible? How sensitive are funding decisions to the conversion factor used?
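
To make the SD-equivalence heuristic concrete, here is a sketch with invented numbers showing where the conversion assumption enters, and how directly the bottom line scales with it:

```python
# Hypothetical numbers throughout; the point is where the conversion
# assumption enters, not the values themselves.

effect_ls_points = 0.6   # intervention effect on life satisfaction (scale points)
sd_ls = 2.0              # SD of life satisfaction in the study population
sd_dalys = 0.15          # SD of the DALY-relevant health measure (assumed)

effect_in_sd = effect_ls_points / sd_ls        # standardized effect in SD units
# The "1 SD in wellbeing ~ 1 SD in health" assumption:
implied_daly_effect = effect_in_sd * sd_dalys  # implied DALYs averted

# Sensitivity: halve or double the SD-equivalence factor and the implied
# effect (and hence cost-effectiveness) halves or doubles with it.
for factor in (0.5, 1.0, 2.0):
    print(f"factor {factor}: {effect_in_sd * sd_dalys * factor:.4f} DALYs")
```

Because the conversion factor multiplies straight through to the final estimate, any comparison across a WELLBY-measured and a DALY-measured intervention is only as credible as that factor.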

3. Could methodological adjustments improve things?

Benjamin et al. provide evidence suggesting that calibration questions and vignette exercises may reduce bias from scale-use differences and preference heterogeneity. Should funders encourage these methods in future RCTs? Adding such instruments comes at a cost—increased survey length, respondent burden, and comprehension challenges—so the benefits must be weighed against these costs. Are there other refinements—such as multi-item scales—that could help?

4. What should funders do now?

When comparing interventions across domains—where one might be measured in WELLBYs and another in DALYs—what's the defensible approach? (Note: either type of intervention could in principle be measured using either approach.) What conversion factors and uncertainty ranges should impact-focused organizations use today—and what would change their minds?
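
One defensible approach is to carry the conversion uncertainty through explicitly rather than commit to a point estimate. A minimal Monte Carlo sketch (every number is hypothetical, including the 2–10 range assumed for WELLBYs per DALY averted):

```python
import math
import random

random.seed(0)

# All numbers made up; the point is how uncertainty in the WELLBY<->DALY
# conversion factor propagates into a cross-domain funding comparison.
wellbys_per_dollar = 0.010   # intervention A, effect measured in WELLBYs
dalys_per_dollar = 0.002     # intervention B, effect measured in DALYs

# Treat the conversion factor (WELLBYs per DALY averted) as uncertain:
# log-uniform over a deliberately wide, hypothetical 2-10 range.
draws = sorted(
    wellbys_per_dollar / 10 ** random.uniform(math.log10(2), math.log10(10))
    for _ in range(10_000)
)

lo, mid, hi = draws[500], draws[5_000], draws[9_500]  # ~90% interval + median
share_a_wins = sum(d > dalys_per_dollar for d in draws) / len(draws)
print(f"A in DALY-equivalents per $: {lo:.4f}-{hi:.4f} (median {mid:.4f})")
print(f"A beats B in {share_a_wins:.0%} of draws")
```

Reporting an interval and a "probability A beats B" rather than a single ranking makes explicit how much the decision hinges on the contested conversion factor, and what evidence would change it.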

How the workshop is structured

Note: The exact structure is still being determined based on participant feedback. We will announce the precise agenda before the workshop date, so you can drop in for just the segments that interest you.

The workshop is fully online, with approximately 3.5 hours of live sessions scheduled in segments so you can join only the parts you're interested in. We also support asynchronous participation—you can submit beliefs and comments before or after the live event, and we'll integrate these into the discussion.

Proposed segments — see also the Live Sessions page for the interactive workshop structure.

Stakeholder Problem Statement & Pivotal Questions (~25 min): Representatives from Coefficient Giving and Founders Pledge (~10 min each) explain why this matters for their work—how they currently weigh WELLBYs vs DALYs in cost-effectiveness analyses and what uncertainties they face. Then we introduce the key Pivotal Questions (~5 min) and invite initial belief estimates from participants who have bandwidth.
Paper Presentation: Benjamin et al. (~25 min): The research team presents their findings on scale-use heterogeneity in self-reported wellbeing—how people use satisfaction scales differently, and what calibration methods can do about it.
Evaluator Responses & Discussion (~25 min): Our independent evaluators share their assessment of the paper's methodology and findings, followed by author responses and open discussion.
WELLBY Reliability Discussion (~25 min): Focused discussion on whether the linear WELLBY is reliable enough for comparing interventions. Covers cardinality assumptions, neutral points, and measurement challenges.
DALY/QALY↔WELLBY Conversion (~25 min): How should we translate between health measures (DALYs, QALYs) and subjective wellbeing (WELLBYs)? Examines current approaches and what's missing.
Beliefs Elicitation (~15 min): A guided exercise where participants state their probabilities on key operationalized questions (WELL_01, DALY_01, etc.), capturing expert views before and after discussion.
Practitioner Panel & Open Discussion (~30 min): Representatives from Founders Pledge, Coefficient Giving, and other organizations discuss how they currently handle WELLBY-DALY comparisons, what they'd need to change their approach, and concrete recommendations. Aim: actionable takeaways that organizations can apply immediately.

We plan to record the workshop and make it publicly available by default, with an AI-queryable transcript so researchers and funders can easily search the discussion. Participants can opt out of recording for specific segments if needed, and we will ask for final approval before posting anything.

Outputs: We hope to produce a practitioner-focused summary document, belief elicitation results with confidence intervals, and structured notes. We will share outputs with interested organizations who couldn't attend live, so the discussion can inform decisions beyond those in the room.

Pivotal Questions & Beliefs

As part of this project, we've developed specific, operationalized questions (codes WELL_01–07 on WELLBY reliability, DALY_01–05 on interconvertibility) designed so that experts and stakeholders can state their beliefs quantitatively—and so that answers can directly inform funding decisions.[10] We want to elicit beliefs before, during, and after reviewing the evidence, to see how expert and stakeholder views evolve.

See key questions and share your beliefs →

Confirmed Participants

Participant Affiliation Role
Dan Benjamin UCLA / NBER Presenter (Benjamin et al.)
Miles Kimball CU Boulder Presenter (Benjamin et al.)
Julian Jamison University of Exeter Presenter (DALY-WELLBY conversion)
Caspar Kaiser University of Warwick Discussant (WELLBY barriers)
Peter Hickman Coefficient Giving Stakeholder presenter, Practitioner panel
Matt Lerner Founders Pledge Stakeholder presenter, Practitioner panel
Christian Krekel London School of Economics Participant
Daniel Rogger World Bank Group Participant
Anthony Lepinteur University of Luxembourg Participant
Loren Fryxell City St George's, University of London Participant
Anirudh Tagat Participant
Zhuoran Du University of New South Wales Participant
Valentin KlotzbĂĽcher University of Basel Organizer (Unjournal)
David Reinstein The Unjournal Organizer
Yaniv Reingewertz JDC Myers Brookdale Institute Participant
Francesco Ramponi Harvard University Participant (UJ team)
Anthony Rowett Independent Participant
Sarah Reynolds Independent Participant
Alberto Prati Async contributor

Interested in joining? A few spots remain for active participants. Fill out the interest form →

Notes

  1. Developed for the Global Burden of Disease Study (Murray & Lopez, 1996). One DALY represents one year of healthy life lost—either through premature death (YLL) or living with disability (YLD). Unlike WELLBYs, DALYs are based on expert-derived disability weights rather than self-reported wellbeing—weights are constructed through surveys of health professionals rating hypothetical health states.
  2. Introduced by Frijters et al. (2020). One WELLBY equals a 1-point increase on the 0–10 life satisfaction scale, for one person, for one year. This assumes scale points are meaningful and comparable—the "linear WELLBY" assumption.
  3. StrongMinds delivers group interpersonal therapy (IPT) for depression in sub-Saharan Africa. Cost-effectiveness estimates vary by an order of magnitude depending on how WELLBYs are valued relative to DALYs. See our evaluation summary for details.
  4. Scale-use heterogeneity: different people interpret and use the 0–10 scale differently. One person's "7" might correspond to another person's "5" in terms of actual experienced wellbeing. Preference heterogeneity means people may value the same life circumstances differently. Both create noise and potential bias in cross-person comparisons.
  5. Questions asking respondents to rate well-defined scenarios (e.g., "How satisfied would you be if you won $1,000?"). By observing how people rate the same reference points, researchers can estimate and adjust for individual differences in scale use.
  6. Respondents rate hypothetical people's life satisfaction based on descriptions. E.g., "Sarah has a stable job and good health. Rate her life satisfaction." This reveals how individuals anchor the scale, enabling cross-person calibration.
  7. This is the "cardinality" assumption: that scale intervals are equal (3→4 = 7→8). Without it, summing WELLBYs across people is invalid. Evidence is mixed—some studies suggest diminishing sensitivity at higher levels.
  8. The neutral point is the life satisfaction level representing neither positive nor negative welfare—essentially the boundary between "life worth living" and "suffering." Estimates range from 2–5 on the 0–10 scale. Peasgood et al. (2018) tentatively estimate ~2.
  9. Other measures include QALYs (quality-adjusted life years), income-equivalent measures, and multi-dimensional poverty indices. QALYs are similar to DALYs but measure health gained rather than lost.
  10. For full specifications, see the canonical formulations on Coda. Three of these questions are also posted on our Metaculus forecasting page.