The Unjournal · Pivotal Questions Initiative

About This Workshop

Monday, March 16, 2026 · 11am–4pm ET / 3pm–8pm UK

Why we're bringing researchers, evaluators, and funders together to discuss how we measure and compare wellbeing across interventions.

✓ The workshop took place on March 16, 2026

We're continuing the discussion asynchronously and sharing key materials here. This site is evolving into a resource page.
Read the survey results & synthesis → · Complete the beliefs elicitation · Join the discussion (Google Doc)

View archived version from workshop day

💬 Annotate this page — select any text to comment via Hypothes.is (free account to post; anyone can read)

The problem

Organizations ranging from Effective Altruism-aligned funders like Founders Pledge, GiveWell, and Coefficient Giving to government agencies and development NGOs compare interventions across very different domains—physical health, mental health, poverty alleviation—to decide where resources can do the most good. To make these comparisons, they need a common unit of measurement.

Two measures feature prominently in these analyses.[9] The DALY (disability-adjusted life year)[1] comes from health economics and captures years of healthy life lost to disease or disability. The WELLBY (wellbeing-adjusted life year)[2] is based on self-reported life satisfaction, typically measured on a 0–10 scale. Each has strengths and limitations—and how they relate to each other directly affects which interventions get prioritized. For instance, different WELLBY-to-DALY conversion factors can change whether a mental health program in sub-Saharan Africa ranks above or below malaria prevention in cost-effectiveness analyses.

Why DALYs, not QALYs? (unfold)

This workshop focuses on DALYs as the health-economic comparator because they dominate in the global health and LMIC intervention contexts where impact-focused funders operate:

QALYs remain important in high-income clinical settings (NICE cost-effectiveness thresholds, FDA value assessments), but when comparing a mental health intervention in Uganda to malaria bednets in Kenya, DALYs are the practical common currency. The methodological questions about WELLBY-to-health-metric conversion apply similarly to both DALYs and QALYs, so insights from this workshop should transfer.

This is part of The Unjournal's Pivotal Questions initiative: working with impact-focused organizations to identify their highest-value research questions, connect them to evidence, and commission expert evaluations and stakeholder input that can inform real decisions.

What sparked this workshop

This workshop emerged from converging work streams. First, we collaborated with Founders Pledge to identify their highest-value research questions—Pivotal Questions where credible evidence could most shift their funding decisions. WELLBY reliability and DALY-WELLBY interconvertibility ranked among the most decision-relevant.

Second, we commissioned an evaluation of HLI's cost-effectiveness analysis of StrongMinds and Friendship Bench—mental health interventions whose estimated impact depends heavily on WELLBY measurement. This highlighted the practical stakes: how you interpret self-reported life satisfaction changes can swing an intervention from "highly effective" to "uncertain."[3] This sharpened the need for clarity on WELLBY validity.

Third, we commissioned an evaluation of Benjamin, Cooper, Heffetz, Kimball & Zhou's paper "Adjusting for Scale-Use Heterogeneity in Self-Reported Well-Being," which addresses whether people use wellbeing scales in comparable ways. If differences in scale use and preference heterogeneity[4] mean that changes in life satisfaction (not just absolute levels) aren't comparable across individuals, that poses a challenge for the WELLBY as a tool for comparing interventions. The paper develops methods using calibration questions[5] and vignette exercises[6] to detect and adjust for scale-use heterogeneity. The evaluators' verdict was encouraging but nuanced: differences in scale use may not be as severe as some feared, but key questions remain—particularly whether calibration methods tested in high-income samples work in LMIC contexts, and whether treatment and control groups use scales differently (which could bias effect estimates even when calibration is applied).

Together, these considerations led us to propose this workshop to Founders Pledge, who agreed it would be valuable—bringing together researchers, evaluators, and funders to make progress on questions that directly affect funding priorities.

🔬 The Focal Case: StrongMinds vs AMF

The practical question motivating this workshop: How should funders compare mental health interventions (StrongMinds, Friendship Bench) measured in WELLBYs against bednet distribution (AMF) measured in DALYs? HLI estimates ~40 WELLBYs per $1,000 for StrongMinds; GiveWell's analysis puts it at 5–80% as cost-effective as AMF—a 16× range driven by methodological choices.

Read the detailed focal case analysis →

What we want to achieve

This workshop aims to bridge the gap between researchers and practitioners on wellbeing measurement—enabling better communication, identifying points of consensus and key uncertainties, and ultimately driving better funding decisions.[11] We're organizing the discussion around four key questions:

1. Is the linear WELLBY reliable enough?

Can we treat a 1-point improvement in life satisfaction as meaning the same thing for different people and starting points?[7] Does improving one person's wellbeing from 1→3 equal improving two people's wellbeing from 1→2? Does a move from 3→4 mean the same as 7→8? Where is the "neutral point"[8] on the scale? And does any of this matter—is the linear WELLBY likely to yield recommendations similar to those of other methods when comparing interventions?
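A toy sketch of why these questions have teeth. The numbers and the square-root transform below are purely illustrative (not drawn from any study): if the 0–10 scale is linear in underlying wellbeing, the two hypothetical interventions tie; under a concave reading of the scale (diminishing sensitivity at higher levels), their ranking separates.

```python
# Illustrative only: how the cardinality assumption affects intervention rankings.

def linear_wellbys(moves):
    # Linear WELLBY: every 1-point gain counts equally, regardless of start point.
    return sum(after - before for before, after in moves)

def concave_wellbys(moves, transform=lambda x: x ** 0.5):
    # Alternative: diminishing sensitivity at higher scale levels
    # (square root is an arbitrary illustrative choice).
    return sum(transform(after) - transform(before) for before, after in moves)

# Intervention A: lifts one person from 1 to 3 on the 0-10 scale.
# Intervention B: lifts two people from 6 to 7 each.
a = [(1, 3)]
b = [(6, 7), (6, 7)]

print(linear_wellbys(a), linear_wellbys(b))    # tie: 2 WELLBYs each
print(concave_wellbys(a), concave_wellbys(b))  # A gains more under the concave view
```

Under the linear WELLBY, A and B are interchangeable; under the concave transform, gains low on the scale count for more, so A dominates. Which reading is right is exactly what question 1 asks.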

2. How should we convert between DALYs/QALYs and WELLBYs?

Current approaches are rough. A 1 SD change in WELLBY is often treated as equivalent to ~1 SD in DALYs (or QALYs), but is this defensible? How sensitive are funding decisions to the conversion factor used?
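A minimal sensitivity check on this point. The ~40 WELLBYs per $1,000 figure echoes the HLI estimate quoted in the focal case above; the 9 DALYs per $1,000 comparator and the 4:1–10:1 conversion range are hypothetical placeholders bracketing the ~7:1 anchor discussed in the pre-reads.

```python
# Illustrative sensitivity of a cross-domain comparison to the
# WELLBY-to-DALY conversion factor. All inputs are stand-in numbers.

wellbys_per_1000 = 40.0  # mental health program, measured in WELLBYs per $1,000
dalys_per_1000 = 9.0     # health program, measured in DALYs averted per $1,000 (hypothetical)

for wellbys_per_daly in (4, 7, 10):  # bracketing the ~7:1 anchor
    daly_equiv = wellbys_per_1000 / wellbys_per_daly
    better = "WELLBY program" if daly_equiv > dalys_per_1000 else "DALY program"
    print(f"{wellbys_per_daly}:1 -> {daly_equiv:.1f} DALY-equivalents per $1,000; {better} ranks higher")
```

With these inputs the ranking flips between a 4:1 and a 7:1 conversion factor, which is the sense in which funding decisions can be "sensitive to the conversion factor used."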

3. Could methodological adjustments improve things?

Benjamin et al. provide evidence suggesting that calibration questions and vignette exercises may reduce bias from scale-use differences and preference heterogeneity. Should funders encourage these methods in future RCTs? Adding such instruments comes at a cost—increased survey length, respondent burden, and comprehension challenges—so the benefits must be weighed. Are there other refinements—such as multi-item scales—that could help?

4. What should funders do now?

When comparing interventions across domains—where one might be measured in WELLBYs and another in DALYs—what's the defensible approach? (Note: either type of intervention could in principle be measured using either approach.) What conversion factors and uncertainty ranges should impact-focused organizations use today?

How the workshop is structured

Note: The exact structure is still being determined based on participant feedback. We will announce the precise agenda before the workshop date, so you can drop in for just the segments that interest you.

The workshop is fully online, with approximately 3.5 hours of live sessions scheduled in segments so you can join only the parts you're interested in. We also support asynchronous participation—you can submit beliefs and comments before or after the live event, and we'll integrate these into the discussion.

Proposed segments (unfold) — see also the Live Sessions page for the interactive workshop structure.

Stakeholder Problem Statement & Pivotal Questions (~25 min): Representatives from Coefficient Giving and Founders Pledge (~10 min each) explain why this matters for their work—how they currently weigh WELLBYs vs DALYs in cost-effectiveness analyses and what uncertainties they face. Then we introduce the key Pivotal Questions (~5 min) and invite initial belief estimates from participants who have bandwidth.
Research Presentation: Benjamin et al. (~25 min): The research team presents their findings on scale-use heterogeneity in self-reported wellbeing—how people use satisfaction scales differently, and what calibration methods can do about it.
Evaluator Responses & Discussion (~25 min): Our independent evaluators share their assessment of the paper's methodology and findings, followed by author responses and open discussion.
WELLBY Reliability Discussion (~25 min): Focused discussion on whether the linear WELLBY is reliable enough for comparing interventions. Covers cardinality assumptions, neutral points, and measurement challenges.
DALY/QALY↔WELLBY Conversion (~25 min): Samuel Dupret (HLI) presents HLI's approach to DALY-WELLBY conversion and moral weights; Julian Jamison (University of Exeter) leads discussion. Examines current approaches and what empirical evidence would most reduce uncertainty.
Beliefs Elicitation (~15 min): A guided exercise where participants state their probabilities on key operationalized questions (WELL_01, DALY_01, etc.), capturing expert and stakeholder views before and after discussion.
Practitioner Panel & Open Discussion (~30 min): Representatives from Founders Pledge, Coefficient Giving, and other organizations discuss how they currently handle WELLBY-DALY comparisons, what they'd need to change their approach, and concrete recommendations. Aim: actionable takeaways that organizations can apply immediately.

We plan to record the workshop and make it publicly available by default, with an AI-queryable transcript so researchers and funders can easily search the discussion. Participants can opt out of recording for specific segments if needed, and we will ask for final approval before posting anything.

Outputs: We hope to produce a practitioner-focused summary document, belief elicitation results with confidence intervals, and structured notes. We will share outputs with interested organizations that couldn't attend live, so the discussion can inform decisions beyond those in the room.

Pivotal Questions & Beliefs

See key questions and share your beliefs →

As part of this project, we've developed specific, operationalized questions (codes WELL_01–07 on WELLBY reliability, DALY_01–05 on interconvertibility) designed so that experts and stakeholders can state their beliefs quantitatively—and so that answers can directly inform funding decisions.[10] We want to elicit beliefs before, during, and after reviewing the evidence, to see how expert and stakeholder views evolve.

Pre-Read Resources: Framing the Discussion

To help participants get on the same page before the workshop, we've prepared two analysis documents that map the key issues, assumptions, and evidence. These aren't meant to resolve debates—they're meant to structure them: "Here are the important issues; let's discuss them one by one."

📄 Linear WELLBY Analysis

When is the "linear WELLBY" defensible for comparing interventions? Covers the core assumptions (cardinality, comparability, neutral point), the Bond & Lang identification critique, scale-use heterogeneity (shifters vs. stretchers), and what calibration methods can and can't fix.

Read the analysis →

📄 DALY↔WELLBY Conversion

How should we translate between DALYs/QALYs and WELLBYs? Reviews empirical anchors (UK Green Book ~7:1, empirical estimates 5–10), conceptual issues (health vs. subjective wellbeing), and practical implications for CEA decisions.

Read the analysis →

Note: These documents were AI-assisted drafts (March 2026) that integrate deep research and annotation feedback. They're shared for workshop discussion—please annotate errors or concerns directly on the pages.

📚 Full recommended readings list → — includes Benjamin et al. (2023), Plant (2024), Bond & Lang (2019), GiveWell/HLI analyses, and papers by workshop participants
📝 Workshop transcript summary → — edited summary with direct quotes from all sessions

🎬 Workshop highlights reel — a short overview of key moments. See the full transcript summary for more.

Confirmed Participants (RSVPs)

See confirmed participants — 27 live + 2 async (show)

Participant Affiliation Role
Dan Benjamin UCLA / NBER Presenter (Benjamin et al.)
Miles Kimball CU Boulder Presenter (Benjamin et al.)
Julian Jamison University of Exeter Presenter (DALY-WELLBY conversion)
Caspar Kaiser University of Warwick Discussant (WELLBY barriers)
Peter Hickman Coefficient Giving Stakeholder presenter, Practitioner panel
Matt Lerner Founders Pledge Stakeholder presenter, Practitioner panel
Christian Krekel London School of Economics Participant
Daniel Rogger World Bank Group Participant
Anthony Lepinteur University of Luxembourg Participant
Loren Fryxell City St George's, University of London Participant
Anirudh Tagat Participant
Zhuoran Du University of New South Wales Participant
Valentin Klotzbücher University of Basel Organizer (Unjournal)
David Reinstein The Unjournal Organizer
Yaniv Reingewertz University of Haifa / Myers-JDC-Brookdale Participant
Francesco Ramponi Harvard University Participant (UJ team)
Anthony Rowett Independent Participant
Sarah Reynolds Independent Participant
Ori Heffetz Cornell University Presenter (beliefs elicitation, SWB measurement)
Kristen Cooper Gordon College Participant (Benjamin et al. co-author)
Mo Putera ARMoR Participant (CEA practitioner)
Michael Plant Happier Lives Institute / Oxford Discussant (all segments)
Samuel Dupret Happier Lives Institute Presenter: DALY-WELLBY conversion
Joel McGuire Happier Lives Institute Moral weights / DALY-WELLBY
Kevin Lang Boston University Participant (co-author, Bond & Lang)
Steven Laufer Participant
Johanna Salu University of Warwick Participant (PhD student)
Alberto Prati Async contributor
Huw Evans Kaya Guides Async contributor (WELLBY practitioner)

Interested in joining? A few spots remain for active participants. Fill out the interest form →

Other Pivotal Questions Workshops

🥩 Cultivated Meat (Apr 2026) 🥗 Plant-Based Alternatives (May 2026) All Workshops →

Notes

  1. Developed for the Global Burden of Disease Study (Murray & Lopez, 1996). One DALY represents one year of healthy life lost—either through premature death (YLL) or living with disability (YLD). The DALY framework separates two steps: (1) Valuation—disability weights (0–1 scale, where 1 = death-equivalent) come from population surveys where respondents compare paired hypothetical health states ("which person is healthier?"), not from patients rating their own conditions; (2) Measurement—these standardized weights are applied to epidemiological prevalence data (disease registries, hospital records, household surveys like DHS/MICS, vital statistics, cause-of-death registration) to estimate population-level burden. This "top-down" approach—using standardized weights applied to surveillance data—makes DALYs tractable for comparing interventions across countries with variable health infrastructure. See detailed definitions →
  2. Introduced by Frijters et al. (2020). One WELLBY equals a 1-point increase on the 0–10 life satisfaction scale, for one person, for one year. This assumes scale points are meaningful and comparable—the "linear WELLBY" assumption.
  3. StrongMinds delivers group interpersonal therapy (IPT) for depression in sub-Saharan Africa. Cost-effectiveness estimates are highly sensitive to assumptions about WELLBY measurement validity and WELLBY-to-DALY conversion factors. See our evaluation summary for details.
  4. Scale-use heterogeneity: different people interpret and use the 0–10 scale differently. One person's "7" might correspond to another person's "5" in terms of actual experienced wellbeing. Preference heterogeneity means people may value the same life circumstances differently. Both create noise and potential bias in cross-person comparisons.
  5. Calibration questions are survey items with objectively correct answers—the same for all respondents. Benjamin et al.'s innovation uses visual calibration: e.g., "How dark is this circle?" on a 0–100 scale. Because the true answer is identical for everyone, differences in responses reveal individual scale-use tendencies. This avoids interpretation issues that arise with text-based scenarios.
  6. Vignette calibration in Benjamin et al. does not use traditional "anchoring vignettes" where you rate hypothetical people (e.g., "Rate Sarah's life satisfaction"). They critique that approach—respondents may lack empathy or project their own circumstances. Instead, their vignettes ask respondents to imagine specific situations in their own life and rate a dimension of well-being for themselves (e.g., "your ability to remember things" given a hypothetical constraint). These methods are part of an "ideal approach" that also requires collecting self-reported wellbeing across multiple dimensions plus stated-preference surveys.
  7. This is the "cardinality" assumption: that scale intervals are equal (3→4 = 7→8). Without it, summing WELLBYs across people is invalid. Evidence is mixed—some studies suggest diminishing sensitivity at higher levels.
  8. The neutral point is the life satisfaction level representing neither positive nor negative welfare—the boundary between "life worth living" and "suffering." Why it matters: If we weight gains to people below vs. above this threshold differently (e.g., prioritizing those in negative welfare states), the neutral point location directly affects which interventions look most cost-effective. The point also affects DALY-WELLBY conversion, since DALYs implicitly assume death-equivalent = 0. Estimates range from 2-5 on the 0-10 scale; Peasgood et al. (2018) tentatively estimate ~2.
  9. QALYs (quality-adjusted life years) measure health gains: one QALY = one year in perfect health. QALYs have a two-stage process: (1) patients describe their health state via instruments like EQ-5D (rating mobility, self-care, usual activities, pain, anxiety/depression); (2) these descriptions are mapped to utility values via scoring tariffs derived from general population valuation exercises (time trade-off, standard gamble, discrete choice). QALYs dominate in clinical cost-effectiveness analysis (e.g., NICE uses £20,000–30,000/QALY thresholds). DALYs measure health losses using a different architecture: disability weights come from population surveys comparing paired hypothetical states (not patient self-assessments), then applied top-down to epidemiological data. Key differences: (a) Perspective—QALYs evaluate specific treatments for specific patients; DALYs compare population-level burden across countries; (b) Whose preferences—DALY weights reflect general public assessments, while QALY values can reflect patient or public perspectives (these systematically diverge due to hedonic adaptation—patients often rate conditions less severely than the public imagines); (c) Data requirements—QALYs need patient-reported outcomes; DALYs can be estimated from surveillance data, making them tractable in LMICs. Other measures include income-equivalent metrics and multi-dimensional poverty indices. See detailed definitions →
  10. For full specifications, see the canonical formulations on Coda. Several of these questions are also posted on our Metaculus forecasting page.
  11. Theory of change: Help researchers understand practitioners' highest-value questions; help practitioners understand relevant research; enable collaboration by agreeing on terminology and identifying cruxes; state beliefs openly with calibrated uncertainty; drive better decisions on LMIC intervention measurement and funding.

Interested in future async participation? We'll be continuing this discussion after the workshop. Let us know your interest to receive updates on follow-up opportunities.

About this page

Some content on these workshop pages was drafted with AI assistance based on our own source materials, notes, and research. We've reviewed it carefully, but errors or unclear passages may remain. If you spot anything incorrect or confusing, please let us know—ideally via a Hypothes.is annotation on the relevant passage, or by emailing contact@unjournal.org.