The Unjournal · Pivotal Questions Initiative

Wellbeing Pivotal Questions

State your beliefs on specific, operationalized questions about WELLBY reliability and DALY–WELLBY interconvertibility.

💬 Annotate this page — select any text to comment via Hypothes.is (free account to post; anyone can read)
Privacy Notice

Your responses will be used in our research synthesis and may be shared in aggregated or anonymized form. We may quote specific responses with attribution unless you request otherwise. If you prefer your responses remain anonymous, please indicate this in the "other thoughts" field at the bottom of the form.

These are some of the key operationalized questions from our Wellbeing Pivotal Questions project.[1] We want to elicit expert and stakeholder beliefs—before, during, and after reviewing the evidence and key arguments—to see how views evolve and where consensus exists. (All questions are optional.)

📋 Full question specifications: For more detail, context, and the complete set of operationalized questions, see the canonical Wellbeing PQ formulations on Coda →

You don't need to be a specialist to contribute. We want your honest assessment and reasoning, whether you feel highly confident or very uncertain. Your input helps us understand the range of views in the field.

🔮 Related forecasting: Some of these questions may be posted to The Unjournal's Metaculus forecasting page for crowd prediction. If you forecast on Metaculus, please share your username below so we can link your contributions.

How to respond

Shared Definitions

Suppose Founders Pledge is considering whether to donate $100,000, either:
  • to StrongMinds (to treat depression in women in low-income settings through group interpersonal psychotherapy)
  • or to extend a seasonal malaria chemoprevention campaign.
Suppose they have substantial evidence on the impact of each intervention, drawn from RCTs that combine typical self-reported wellbeing surveys with objective income and health outcomes. They also have the opportunity to fund the collection of more data in future studies.

They want to allocate the funds to the intervention that leads to greater "social wellbeing or welfare" in expectation.

For the current context, we define a WELLBY (Wellbeing-Year) as one point of self-reported life satisfaction measured on a 0-to-10 Likert scale for one individual for one year (following Frijters et al., 2020; Frijters and Krekel, 2021).

We follow the definition from Frijters et al., 2024, based on a life satisfaction scale (acknowledging that WELLBY has been defined differently in other contexts).
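As a quick illustrative sketch (hypothetical numbers, not from any study), the WELLBY unit scales linearly in life-satisfaction points, people affected, and years:

```python
# Illustrative sketch of the WELLBY unit (hypothetical numbers):
# one point of 0-10 life satisfaction, for one person, for one year.

def wellbys(ls_gain_points: float, n_people: int, years: float) -> float:
    """WELLBYs = LS gain (points) x people affected x duration (years)."""
    return ls_gain_points * n_people * years

# An intervention raising LS by 0.5 points for 200 people over 2 years:
print(wellbys(0.5, 200, 2))  # 0.5 * 200 * 2 = 200.0 WELLBYs
```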

"Best" = leads to the decisions that yield the highest "true welfare" on average, in the particular relevant domain (e.g., in comparing mental health interventions in Africa), perhaps taking into account the cost of doing the measurements.

More precisely: the "best" measures and aggregations would be those that, if we collected and made decisions based on them, would yield policy and funding choices with the highest overall wellbeing or welfare in expectation. Consider reliability, practicality, cost, comparability, and other real-world considerations.

The "best" mappings would be those that, if used to make conversions between WELLBYs, DALYs, etc., would be likely to lead to the better/best decisions in most relevant situations.

When we ask for a probability, we're asking for your best calibrated subjective probability—your honest credence given everything you know.

One way to think about this: Imagine an ideal research team with unlimited resources, time, and data—perhaps even a kind of omniscience where they could perfectly understand the welfare and psychological states of everyone affected. What probability would you assign that this idealized team would ultimately conclude the statement is true?

Note: We avoid anchoring to "0% = impossible" and "100% = certain" because perfect certainty is rarely justified. If you believe something is extremely unlikely but not literally impossible, you might say 2-5%; if nearly certain but not absolutely, perhaps 95-98%.

1. WELLBY Reliability and Value

How reliable is the linear WELLBY measure for comparing interventions?

PQ1a · WELLBY Usefulness · WELL_01/07

How reliable is the linear WELLBY measure [...] relative to other available measures in the 'wellbeing space'? How much insight is lost by using linear WELLBY and when will it steer us wrong?

Adapted from WELL_07: "How reliable is the WELLBY measure of well-being/mental health (as defined above) relative to other available measures in the 'wellbeing space' (including other transformations of the 0-10 life satisfaction scale)?"

The WELLBY is used by several major funders[2] (Happier Lives Institute, Founders Pledge) to compare interventions across domains. The reliability of this approach matters for resource allocation decisions.

PQ1b · Best Measure · WELL_02/03

Given the available collected data [...], how should [funders] measure the impact on wellbeing? [...] What measures of well-being should charities, NGOs, and RCTs collect for impact analysis?

Even if the WELLBY is "good enough," there might be better options—multi-item scales, log-transformed life satisfaction, or standardized composites. Switching measures has costs, so the improvement needs to be meaningful.

Adapted from WELL_02: "Given the available collected data from surveys and intervention trials, how should Founders' Pledge measure the impact on wellbeing in the context of mental health interventions? [...] Consider reliability, insight, and practicability."

And WELL_03: "What measures of well-being [...] should charities, NGOs, and RCTs collect for impact analysis, particularly in contexts that may involve less tangible well-being outcomes (such as mental health interventions)? This could also include stated-preference and calibration surveys."

  • Candidates include: multi-item life satisfaction scales (e.g. SWLS), experience sampling, the WB-Pro, WEMWBS, log-transformed 0-10 LS, or domain-specific instruments.
  • Diener et al. (2018) found that single-item life satisfaction measures have moderately high reliability (~0.70 correlation with multi-item scales) with little validity loss.
  • WELL_03 also asks: "How should these [measures] be used?"—considering not just what to collect but how to combine and interpret the data.
WELL_01a · Cost Ratio Extension

If you propose a measure other than linear WELLBY in your answer above, how much more would it cost to achieve the same welfare improvement using linear WELLBY instead?

Consider the welfare-improvement from allocating $100,000 among a large set of charities/interventions given the information provided by the "best measure" you propose. How much more would it cost to achieve the same outcome using the linear WELLBY? (E.g., 1.1 = 10% more, 1.5 = 50% more, 3 = 3x as much.) If you think WELLBY is optimal, skip this question. This is inherently speculative—rough estimates based on your intuition are welcome.

WELL_04 · Single vs Combined Measures

In contexts where interventions impact mental health, physical health, AND consumption: is it better to use a single WELLBY measure, or measure each dimension separately and then convert/combine?

WELL_07 · What Is Lost?

How much insight is lost by using WELLBY relative to other available measures in the "wellbeing space"? When will it steer us wrong?

WELL_08 · Life Satisfaction vs Experience

Would it be better to base the metric on life satisfaction or instantaneous experience measures (e.g., happiness, affect balance)?

WELL_09 · Cantril Ladder Conversion

If we must rely on the Cantril ladder measure, how would we best convert it into a welfare metric for comparing interventions?

2. Conversions Between Measures

How should we convert between WELLBYs and DALYs/QALYs?

PQ2 · DALY/QALY–WELLBY Conversion · DALY_01/03/05

If some programs are measured in WELLBYs and others in DALYs/QALYs, what is the best numerical conversion or mapping between them—and what method or approach should we use?

From DALY_01: "If the impact of one program is measured in WELLBYs [...] and another in DALYs, what is the best numerical conversion or mapping between them?" Also from DALY_03: "What method or 'mapping structure' should we use?" (Note: QALYs may be more relevant than DALYs for this conversion—see context.)

"Best" here means: the mapping that, if used for funding decisions, would lead to the highest expected welfare. Getting this conversion wrong means systematically over- or under-investing in mental health versus physical health interventions.

  • DALY vs QALY: DALYs measure health burden (years lost to disease/disability); QALYs measure health gained. For conversion purposes, QALYs are often more directly comparable—the canonical questions note "replace DALY with QALY" may be appropriate.
  • Some organizations (including HLI and Founders Pledge) currently treat SDs on different mental health instruments as interconvertible with WELLBY SDs on a roughly 1:1 basis.
  • The conversion between DALYs/QALYs and WELLBYs depends on the "neutral point"[6] on the LS scale—the point below which life has negative value. This is currently unknown; one small study (Peasgood et al. 2018) suggested LS ≈ 2, but this is tentative.
  • The relationship may also be non-linear—e.g., a WELLBY gained at very low wellbeing could be worth more than one gained at high wellbeing.
  • Approaches include: SD-equivalence (current practice), regression-based approaches (linking LS data to DALY weights in the same populations), time-tradeoff surveys, or maintaining separate analyses and comparing rankings.
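To make the neutral-point dependence concrete, here is a minimal sketch (a hypothetical conversion, not an endorsed method) of how the welfare sign of a given LS score flips with the assumed neutral point:

```python
# Minimal sketch (hypothetical, not an endorsed conversion): welfare per
# person-year implied by a life-satisfaction score, relative to an assumed
# "neutral point" where welfare is zero.

def welfare_per_year(ls: float, neutral_point: float) -> float:
    return ls - neutral_point

# The same LS = 3 can imply negative or positive welfare:
print(welfare_per_year(3, 5))  # -2: negative welfare if neutral = 5
print(welfare_per_year(3, 2))  # 1: positive welfare if neutral = 2
```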

An 80% credible interval represents the range you believe has an 80% probability of containing the true value. There should be roughly a 10% chance the true value is below your lower bound, and a 10% chance it's above your upper bound.

This is more informative than a single "confidence" percentage because it captures both your best guess and how uncertain you are. For help calibrating your uncertainty estimates, try the Clearer Thinking calibration tool.
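For instance, if your uncertainty can be represented by draws from a distribution, the 80% interval is simply the 10th and 90th percentiles. A rough sketch using NumPy (assumed available; the distribution and numbers are purely illustrative):

```python
import numpy as np

# Rough sketch: an 80% credible interval from samples representing your
# uncertainty (illustrative distribution, not a real elicitation).
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.5, scale=0.5, size=10_000)

# Leave ~10% of probability in each tail.
lower, upper = np.quantile(samples, [0.10, 0.90])
print(round(lower, 2), round(upper, 2))  # roughly 0.86 and 2.14
```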

DALY_02 · Founders Pledge Specific

Which mapping between WELLBYs and DALYs should Founders Pledge specifically use for comparisons like the focal example (StrongMinds vs malaria)?

This asks about the best mapping for their particular use case, rather than a general-purpose conversion.

DALY_05 · Loss from SD-SD Approach

What is the loss from the "1 SD change in WELLBY ≈ 1 SD change in DALY" approach currently used by some funders, relative to the best feasible approach?

Where will this approach be particularly incorrect? Consider different intervention types, populations, or contexts.

3. Predictions and Policy

Forecasting questions about expert consensus, research uptake, and measurement impact.

PQ3b · Metaculus-style · Expert Consensus

If The Unjournal were to survey development economists and research-informed practitioners (before end of 2027), what share would agree that "the linear WELLBY (as defined above) is a reasonably useful measure in this context, and switching to a different measure is unlikely to add much value"?

(Note: This is a hypothetical scenario for discussion. We are not currently planning to conduct such a survey, though we would like to if feasible.)

PQ3a · Metaculus-style · Research Uptake

By 2030, will more than 50% of GiveWell's top charities include a WELLBY-based cost-effectiveness analysis alongside or instead of DALY-based analysis?

This illustrative forecasting question gauges whether the WELLBY will gain institutional traction. (Note: This is a discussion question for the workshop, not from the canonical PQ table.)

PQ3c · Metaculus-style · Calibration Impact

If calibration questions[4] and/or vignettes[5] (as in Benjamin et al.) were added to the major wellbeing surveys used in global health RCTs, would the resulting adjustments meaningfully change the cost-effectiveness ranking of the top 5 interventions recommended by Founders Pledge?

  • "Meaningful change" = at least one intervention currently in the top 5 moves out of the top 5, OR the #1 ranked intervention changes.
  • This assumes future RCTs incorporate these methods and Founders Pledge updates their CEA accordingly.
  • Note: This question is somewhat speculative—it asks about counterfactual methodology adoption and its downstream effects.

About You

Your responses are stored securely and will be used to inform the synthesis report.

Questions adapted from the canonical Wellbeing PQ formulations (codes: WELL_01–09, DALY_01–05). Last updated: February 2026.

Notes

  1. Pivotal Questions: research questions where credible evidence could most shift funding decisions. We identify these through stakeholder collaboration, prioritizing by expected value of information.
  2. Happier Lives Institute uses WELLBYs as their primary metric; Founders Pledge incorporates them alongside DALYs. GiveWell has explored WELLBY-based analysis but hasn't fully adopted it.
  3. Scale-use heterogeneity: different people use the 0-10 scale differently—what one person calls "7" might correspond to another's "5." This creates bias when comparing across individuals or groups.
  4. Calibration questions have objectively correct answers (e.g., "1+1=?") that reveal how respondents use scales. If someone rates 2 as "very certain," we know they compress the scale.
  5. Vignettes describe hypothetical people ("John has X, Y, Z characteristics"). By having respondents rate these standardized scenarios, researchers can compare individual scale use.
  6. The neutral point is the life satisfaction level where welfare equals zero—below this, welfare is negative. If neutral=5, then LS=3 represents negative welfare; if neutral=2, LS=3 is positive.