
Probability of Success Calibration

PhaseFolio derives stage-transition probabilities from observed clinical outcomes rather than expert opinion, following Thomas et al. (2021) and Wong et al. (2019). The benchmark matrix is three-dimensional (11 indications × 8 modalities × 3 biomarker strategies = 264 cells). Evidence-based multipliers are applied through a log-odds (logit) transformation, which keeps results bounded and reflects diminishing returns at high baselines. A multiplier is allowed to score the engine only if a held-out cohort containing both approvals and failures can validate it; signals that cannot be validated are demoted to non-scored risk flags. Engine 2.6.0 (shipped 2026-05-28) adds a drug-specific clinical signal layer. Within it, biomarker quality scores the engine for oncology solid tumor at Phase II/III, while Phase 1 objective response rate is extracted and surfaced only as a flag, because Phase 0 cohort validation found that it double-counts biomarker quality at 50% cohort coverage.

01

Three-dimensional benchmark matrix

11 indications × 8 modalities × 3 biomarker strategies = 264 cells.

Rather than applying a single set of industry-average transition rates, PhaseFolio stratifies PoS by three independent classification axes that are known to materially affect clinical outcomes: therapeutic area, drug modality, and biomarker strategy.

Therapeutic Area

11 indications from oncology (solid and hematologic) through cardiovascular, neurology, metabolic, and rare disease. In Table 1, oncology solid tumor has one of the lowest overall LoA values (2.4%) and rare disease one of the highest (9.4%).

Thomas et al. 2021

Drug Modality

8 modalities including small molecule, monoclonal antibody, bispecific, ADC, cell therapy, gene therapy, and peptide. Modality affects safety profiles and regulatory pathways.

Citeline 2024

Biomarker Strategy

3 levels: none, enrichment (biomarker-selected population), and companion diagnostic (test required for prescribing). Biomarker use is associated with Phase II and III success rates up to 4× higher.

Parker et al. 2021

The full matrix contains 11 × 8 × 3 = 264 unique indication–modality–biomarker combinations, each specifying five stage-transition probabilities (Preclinical → Phase I → Phase II → Phase III → NDA/BLA → Approval). Values are derived from meta-analysis of 12,728+ clinical-stage transitions [BIO/QLS/Informa 2021] with modality-specific adjustments from Citeline 2024 pipeline data.

Therapeutic Area | Preclinical | Phase I | Phase II | Phase III | NDA/BLA | Overall LoA
Oncology (Solid) | 5.0% | 40.0% | 24.0% | 55.0% | 90.0% | 2.4%
Oncology (Hematologic) | 7.0% | 72.0% | 42.0% | 63.0% | 90.0% | 12.0%
Rare Disease | 8.0% | 56.0% | 38.0% | 64.0% | 93.0% | 9.4%
Neurology | 4.0% | 46.0% | 20.0% | 47.0% | 88.0% | 1.5%
Immunology | 6.0% | 49.0% | 30.0% | 58.0% | 91.0% | 4.6%
Infectious Disease | 7.0% | 52.0% | 36.0% | 62.0% | 92.0% | 7.4%
Cardiovascular | 5.0% | 48.0% | 28.0% | 55.0% | 90.0% | 3.3%
Metabolic | 6.0% | 50.0% | 32.0% | 58.0% | 91.0% | 5.1%
Respiratory | 5.0% | 47.0% | 26.0% | 54.0% | 90.0% | 2.9%
Dermatology | 6.0% | 50.0% | 34.0% | 60.0% | 91.0% | 5.6%
Ophthalmology | 6.0% | 50.0% | 30.0% | 58.0% | 91.0% | 4.7%

Table 1. Baseline stage-transition probabilities by therapeutic area (small molecule, no biomarker strategy). Overall Likelihood of Approval (LoA) is the product of all five transition rates. Source: BIO/QLS/Informa 2021; Wong et al. 2019.

02

Multiplier adjustments

Eight scored multipliers, each applied where the source measured them.

Several evidence-based factors are known to shift clinical success probabilities relative to the population base rate. PhaseFolio applies these multipliers via a log-odds (logit) transformation — the mathematically correct method when a multiplier is a true odds ratio. The cited sources, however, report effect sizes in different forms: relative success ratios (Minikel 2024), phase success-rate comparisons (Parker 2021), relative approval rates (Mullard 2016), and a cumulative pipeline advantage (Tufts NEWDIGS 2023).

The engine currently treats all of these through the OR-style logit path as a deliberately conservative approximation — this under-credits favorable modifiers at higher baselines, never saturates to 1.0, and avoids stacked-modifier overshoot. Per-modifier estimand declarations (_source_estimand and _applied_as) are recorded in the machine-readable source and explained in the model card. Each multiplier is applied only to the clinical phases where the underlying evidence was measured.

Modifier | Multiplier | Source estimand | Applied as | Stages | Source
Genetic Validation | 2.6× | RR | OR (logit) | II, III | Minikel et al. 2024, Nature
Companion Diagnostic | 2.0× | RR | OR (logit) | II, III | Parker et al. 2021
Orphan Designation | 1.5× | RR | OR (logit) | II, III | Mullard 2016, Nat. Rev. Drug Disc.
Biomarker Enrichment | 1.5× | RR | OR (logit) | II, III | Parker et al. 2021; BIO 2021
First-in-Class | 0.85× | RR | OR (logit) | II, III | BIO/QLS 2021
CAR-T / TCR Therapy | 1.73× / stage | RR (cum. 3×) | OR (logit) | I, II | Tufts NEWDIGS 2023
Gene Therapy (Orphan) | 1.41× / stage | RR (cum. 2×) | OR (logit) | I, II | Tufts NEWDIGS 2023
Biomarker Quality — Genomic Validated† | 1.35× | RR | OR (logit) | II, III | Schwaederle et al. 2016
Biomarker Quality — Protein Only† | 0.85× | RR | OR (logit) | II, III | Schwaederle et al. 2016

Table 2. Scored multipliers, with the source estimand (what the literature reports) separated from the applied path (how the engine treats it). Favorable multipliers (>1) increase PoS; unfavorable (<1) decrease it. CAR-T and gene-therapy-orphan per-stage values are sqrt(cumulative) splits of the source's whole-pipeline advantage. † Biomarker Quality is the drug-specific multiplier added in engine 2.6.0 (2026-05-28) and applies only to oncology solid tumor; see Section 6 for bucket definitions, cohort validation, and the literature-anchor vs cohort-derived disclosure.

A critical design decision: preclinical PoS is never adjusted by any multiplier. Preclinical attrition is dominated by toxicology, pharmacokinetics, and formulation failures [Zhou et al. 2025] — factors orthogonal to the clinical efficacy signals that these multipliers capture. Similarly, NDA/BLA approval rates reflect regulatory filing quality rather than drug-specific clinical attributes, and are therefore held constant.
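The stage gating described in this section can be sketched as follows. This is a minimal illustration, not the engine's implementation: the names `MULTIPLIERS`, `FROZEN_STAGES`, `apply_or`, and `adjust_stages` are hypothetical, the factors and stage sets are copied from Table 2, and the baseline is the rare disease row of Table 1.

```python
import math

# Factor and applicable stages, copied from Table 2 (subset).
# The per-stage CAR-T value is the square root of the cumulative 3x advantage.
MULTIPLIERS = {
    "genetic_validation": (2.6, {"phase_2", "phase_3"}),
    "orphan_designation": (1.5, {"phase_2", "phase_3"}),
    "first_in_class": (0.85, {"phase_2", "phase_3"}),
    "car_t_tcr": (math.sqrt(3.0), {"phase_1", "phase_2"}),
}

# Stages that no multiplier may adjust (the design decision stated above).
FROZEN_STAGES = {"preclinical", "nda_bla"}


def apply_or(pos: float, factor: float) -> float:
    """Apply a factor as an odds ratio to a probability in (0, 1)."""
    odds = pos / (1.0 - pos)
    adjusted = odds * factor
    return adjusted / (1.0 + adjusted)


def adjust_stages(baseline: dict, active: list) -> dict:
    """Apply each active multiplier only at the stages where it is allowed."""
    result = dict(baseline)
    for name in active:
        factor, stages = MULTIPLIERS[name]
        for stage in stages - FROZEN_STAGES:
            result[stage] = apply_or(result[stage], factor)
    return result


# Rare disease, small molecule, no biomarker strategy (Table 1).
rare_disease = {
    "preclinical": 0.08,
    "phase_1": 0.56,
    "phase_2": 0.38,
    "phase_3": 0.64,
    "nda_bla": 0.93,
}

adjusted = adjust_stages(rare_disease, ["genetic_validation", "orphan_designation"])
for stage, value in adjusted.items():
    print(f"{stage}: {rare_disease[stage]:.1%} -> {value:.1%}")
```

Only Phase II and Phase III move in this run; preclinical, Phase I, and NDA/BLA pass through unchanged because neither active factor lists them.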

03

Logistic transformation method

Why naive multiplication is wrong, and what log-odds space buys you.

Applying odds ratios to bounded probabilities requires care. Naive multiplication (PoS × OR, capped at 1.0) produces mathematically unsound results: a drug with 50% base PoS and a 2.6× genetic-validation multiplier would yield 130%, capped to 100% — falsely claiming certainty. With multiple favorable multipliers stacking, this problem cascades rapidly.

PhaseFolio instead applies multipliers in log-odds (logit) space — the standard biostatistical transformation for adjusting bounded probabilities by a multiplicative factor. The three-step transformation:

Equation 1 — Logistic odds-ratio adjustment
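Step 1.  $\mathrm{odds} = \dfrac{\mathrm{PoS}_{\mathrm{base}}}{1 - \mathrm{PoS}_{\mathrm{base}}}$
Step 2.  $\mathrm{odds}_{\mathrm{adj}} = \mathrm{odds} \times \mathrm{OR}$
Step 3.  $\mathrm{PoS}_{\mathrm{adj}} = \dfrac{\mathrm{odds}_{\mathrm{adj}}}{1 + \mathrm{odds}_{\mathrm{adj}}}$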
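
The same three steps can be written compactly in log-odds form. This is a standard identity rather than anything specific to the engine; the index i and the count k of stacked multipliers are notation introduced here for illustration.

```latex
\operatorname{logit}(p) = \ln\frac{p}{1-p},
\qquad
\operatorname{logit}\!\left(\mathrm{PoS}_{\mathrm{adj}}\right)
  = \operatorname{logit}\!\left(\mathrm{PoS}_{\mathrm{base}}\right)
  + \sum_{i=1}^{k} \ln \mathrm{OR}_i
```

Each multiplier contributes an additive term in log-odds, and the inverse logit maps any finite sum back into (0, 1).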

This approach has three desirable mathematical properties:

  1. Bounded output. The result is always in (0, 1) — it can never reach 0% or 100%, regardless of how many multipliers are stacked.
  2. Diminishing returns. A 2.6× OR applied to a 24% base PoS yields 45.1% (+21.1pp). Applied to a 70% base, it yields 85.8% (+15.8pp). The higher the base, the harder it is to push higher — matching clinical reality.
  3. Composability. Multiple ORs applied sequentially produce the same result regardless of order, because multiplying odds (equivalently, adding log-odds) is commutative.
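
The three properties above can be checked numerically with a short sketch. The function names `apply_or` and `apply_all` are hypothetical, and the 0.85× factor in the ordering check is borrowed from Table 2 purely to have a third value to permute.

```python
from itertools import permutations


def apply_or(pos: float, odds_ratio: float) -> float:
    """Equation 1: adjust a probability in (0, 1) by an odds ratio."""
    odds = pos / (1.0 - pos)
    adjusted = odds * odds_ratio
    return adjusted / (1.0 + adjusted)


def apply_all(pos: float, odds_ratios) -> float:
    """Apply several odds ratios one after another."""
    for odds_ratio in odds_ratios:
        pos = apply_or(pos, odds_ratio)
    return pos


# 1. Bounded output: a 50% base with a 2.6x factor stays below 100%,
#    whereas naive multiplication gives 130% before capping.
naive = min(1.0, 0.50 * 2.6)
bounded = apply_or(0.50, 2.6)

# 2. Diminishing returns: the same factor moves a high base less.
lift_low = apply_or(0.24, 2.6) - 0.24
lift_high = apply_or(0.70, 2.6) - 0.70

# 3. Composability: every ordering of the same factors agrees
#    up to floating-point rounding.
factors = (2.6, 1.5, 0.85)
results = [apply_all(0.38, order) for order in permutations(factors)]
spread = max(results) - min(results)

print(f"naive={naive:.3f} bounded={bounded:.3f}")
print(f"lift at 24% base: {lift_low * 100:.1f}pp, at 70% base: {lift_high * 100:.1f}pp")
print(f"max difference across orderings: {spread:.2e}")
```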
04

Worked example: PoS derivation

Rare disease small molecule with genetic validation and orphan designation.

Consider a rare disease small molecule with genetic validation and orphan designation. We derive the Phase II PoS step-by-step.

Phase II PoS derivation — rare disease, small molecule

Base rate (from benchmark matrix)

Phase II PoS = 38.0%

Apply genetic validation (factor 2.6, applied as OR)

odds = 0.38 / (1 − 0.38) = 0.613

odds × 2.6 = 1.594

PoS = 1.594 / (1 + 1.594) = 61.4%

Factor source: Minikel et al. 2024 (RR), applied as OR (see model card).

Apply orphan designation (factor 1.5, applied as OR)

odds = 1.5935 (the adjusted odds from the previous step, equal to 0.6144 / (1 − 0.6144))

odds × 1.5 = 2.390

PoS = 2.390 / (1 + 2.390) = 70.5%

Factor source: Mullard 2016 (RR), applied as OR (see model card).

Adjusted Phase II PoS

70.5%

Note: naive multiplication would yield min(1.0, 0.38 × 2.6 × 1.5) = 100%, which is clearly incorrect. The logistic method produces 70.5%, reflecting appropriate diminishing returns.

05

Guarding against overfitting

Which multipliers are allowed to score, and which are demoted to flags.

Every multiplier in Table 2 adds a degree of freedom, and a stack of adjustable factors can manufacture the appearance of rigor while quietly encoding the author's priors — the central failure mode of any heuristic valuation model. PhaseFolio constrains this in three ways: two structural, one evidentiary.

Structural bounds (Sections 2–3). Multipliers are applied only in log-odds space, so stacked factors can never saturate to 0% or 100% and never overshoot; each factor is applied only to the clinical phases where its source measured the effect; and preclinical and NDA/BLA rates are never adjusted at all. These bounds cap how far the knobs can move any result, regardless of how many fire at once.

The evidentiary gate

A multiplier may only score the engine if a held-out cohort can validate it. A candidate factor earns the right to change a probability only when a backtest cohort containing both approvals and failures shows it discriminates between them — it must fire on known successes as well as known failures, so a skeptic can confirm it tracks outcomes rather than merely labelling the failures after the fact. A signal that fires only on the failures in a cohort, with no approved counterexample, cannot be validated by that cohort; it is demoted to a non-scored, display-only risk flag rather than allowed to move the number.

Worked proof. In the antimicrobial backtest, three candidate antibacterial multipliers were pre-registered on evidence dated before each drug's decision. A pre-publication ablation showed that two of them — a hepatotoxicity mechanism-class prior and a sustained-clinical-response endpoint-fragility prior — fired only on that cohort's failures, with no approved counterexample, so they were demoted to flags. Only the third, single-asset sponsor fragility (which fires on three approvals as well as the failures), was allowed to score. We publish the full ablation rather than the most flattering configuration: scoring only the validatable factor yields a pairwise AUC of 0.629, whereas the unvalidatable pair would have produced a higher but uncheckable 0.797. We report the lower, defensible number.

This gate governs every future candidate multiplier: no factor scores the engine until a cohort with both outcomes can independently confirm it. Until then it may inform a risk flag, but it does not move the valuation.
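
The both-outcomes precondition can be sketched in a few lines. This is an illustration under stated assumptions: `CohortRow` and `gate_disposition` are hypothetical names, the cohort rows are invented placeholders rather than backtest data, and the check covers only the precondition that a factor fires on both approvals and failures. The discrimination test described above would still have to be run on any factor that passes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortRow:
    drug: str
    approved: bool  # True for an approval, False for a failure
    fired: bool     # whether the candidate factor fired on this drug


def gate_disposition(rows: list) -> str:
    """Return "score_eligible" only if the factor fires on both outcomes."""
    fires_on_approval = any(r.fired and r.approved for r in rows)
    fires_on_failure = any(r.fired and not r.approved for r in rows)
    if fires_on_approval and fires_on_failure:
        return "score_eligible"
    return "flag_only"


# Invented placeholder cohorts, for illustration only.
fires_on_failures_only = [
    CohortRow("drug_a", approved=True, fired=False),
    CohortRow("drug_b", approved=False, fired=True),
    CohortRow("drug_c", approved=False, fired=True),
]
fires_on_both = [
    CohortRow("drug_a", approved=True, fired=True),
    CohortRow("drug_b", approved=False, fired=True),
    CohortRow("drug_c", approved=True, fired=False),
]

print(gate_disposition(fires_on_failures_only))
print(gate_disposition(fires_on_both))
```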

Full worked detail: backtest methodology → · antimicrobial Sprint-1 forensics →

06

Drug-specific clinical signals — biomarker quality

The first drug-specific multiplier to clear the governance gate. Engine 2.6.0, 2026-05-28.

Through engine 2.5.x, every multiplier in Table 2 keyed on attributes of the program (modality, biomarker strategy, orphan designation, indication-level genetic validation). Engine 2.6.0 introduces a second class of signal — drug-specific clinical attributes extracted from each asset's underlying evidence (pivotal-program publications, FDA labels, sponsor disclosures, registry records). The first such signal to clear the multiplier-governance gate of Section 5 is biomarker quality.

Definition. Biomarker quality refines the biomarker-strategy axis of the benchmark matrix (Section 1) by asking what kind of biomarker a program is built on, not merely whether one is present. Three scoring buckets ship in engine 2.6.0:

Bucket | Multiplier | Definition
genomic_validated | 1.35× | Sequenced, mechanism-anchored DNA/RNA alteration (EGFR exon 19 del, BRAF V600E, ALK fusion, MSI-H) used as enrichment.
protein_only | 0.85× | Protein-expression biomarker without a genomic anchor (HER2 IHC alone, PD-L1 TPS, serum protein).
unknown | 1.00× | Not extracted or not applicable. No adjustment.

Table 3. Biomarker-quality buckets shipped in engine 2.6.0. Applied only to oncology solid tumor at Phase II and Phase III. Hematologic oncology, immunology, neurology, and every other indication carry no biomarker-quality multiplier. Preclinical, Phase I, and NDA/BLA stages are not adjusted (Section 2 design decision).

Source and estimand. The 1.35× / 0.85× values are reported as relative response/success ratios in Schwaederle et al. 2016 (a meta-analysis of phase I precision-medicine trials covering 13,203 patients in 346 studies) and applied through the OR-style logit path of Sections 2 and 3, the same conservative dispatch used by every other multiplier in Table 2.

Cohort validation. A 43-drug oncology-solid-tumor cohort (50% of the 85-drug Phase 0 cohort universe, with approvals drawn from FDA approvals 2018–2024 and failures from public discontinuations) scored biomarker_quality against the engine's baseline. The signal cleared the governance gate of Section 5:

  • Cohort N: 43 oncology solid tumor; bucket-level minimum N ≥ 6 with ≥ 3 sponsors per bucket (the antimicrobial Sprint-1 precedent).
  • Pairwise AUC: baseline 0.618 → biomarker_quality alone 0.670 (+5.2pp). Stable across the v2, v3, and v4 extractor rounds at 28, 35, and 43 drugs respectively.
  • The signal fires on both approvals and failures — the both-outcomes gate.
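
The two quantitative checks in this list can be sketched as below. The document does not define "pairwise AUC", so this assumes the usual concordance reading: the share of (approval, failure) pairs in which the approval receives the higher score, with ties counted as one half. The function names and the toy scores are hypothetical and are not cohort data.

```python
from collections import defaultdict


def pairwise_auc(approved_scores, failed_scores) -> float:
    """Share of (approval, failure) pairs ranked correctly; ties count one half."""
    wins = 0.0
    for a in approved_scores:
        for f in failed_scores:
            if a > f:
                wins += 1.0
            elif a == f:
                wins += 0.5
    return wins / (len(approved_scores) * len(failed_scores))


def buckets_meet_minimums(rows, min_n: int = 6, min_sponsors: int = 3) -> bool:
    """Check every bucket has at least min_n drugs from at least min_sponsors sponsors.

    rows is an iterable of (bucket, sponsor) pairs, one per drug.
    """
    drugs = defaultdict(int)
    sponsors = defaultdict(set)
    for bucket, sponsor in rows:
        drugs[bucket] += 1
        sponsors[bucket].add(sponsor)
    return all(
        drugs[b] >= min_n and len(sponsors[b]) >= min_sponsors for b in drugs
    )


# Toy scores, for illustration only.
auc = pairwise_auc([0.7, 0.6, 0.4], [0.5, 0.4, 0.2])
print(f"pairwise AUC on toy scores: {auc:.3f}")

toy_rows = [("genomic_validated", s) for s in ("s1", "s2", "s3", "s1", "s2", "s3")]
print(buckets_meet_minimums(toy_rows))
```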

Honest disclosure — literature anchor is conservative

The cohort fit yielded a genomic_validated odds ratio of approximately 5.59, markedly higher than the 1.35× we ship. A likely explanation is that the Schwaederle (2016) anchor, now roughly a decade old, understates the discrimination of modern targeted oncology. We ship the lower literature anchor anyway, on the same discipline as the antimicrobial backtest (Section 5): we publish the defensible, externally citable value rather than the higher cohort-derived value until the cohort is large enough to justify a recalibration. Closing this gap is on the Phase 2 roadmap.

Human-review gate. Every extracted drug-specific signal carries a reviewed_at timestamp. Unreviewed signals do not score the engine — the engine treats reviewed_at IS NULL as inert. A human reviewer (CMO-grade by methodology design) must approve, reject, or amend each signal before it can influence rNPV. Reviewer identity, decision, and timestamp are stamped on every signed export under engine 2.6.0. A 10% deterministic second-pass audit (minimum 10 rows per cohort, seeded for reproducibility) checks reviewer drift; disagreement above 5% triggers a re-extract with a stricter prompt and a re-review.
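
A minimal sketch of the review gate and the second-pass audit sample follows. The names `Signal`, `effective_multiplier`, and `audit_sample` are hypothetical, and seeding with `random.Random(seed)` is an assumption about how "seeded for reproducibility" might be realized. The sketch also models only the `reviewed_at` timestamp, not the approve / reject / amend decision itself.

```python
import math
import random
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Bucket multipliers copied from Table 3.
BUCKET_MULTIPLIER = {"genomic_validated": 1.35, "protein_only": 0.85, "unknown": 1.00}


@dataclass(frozen=True)
class Signal:
    drug: str
    bucket: str
    reviewed_at: Optional[datetime]  # None mirrors reviewed_at IS NULL


def effective_multiplier(signal: Signal) -> float:
    """Unreviewed signals are inert and contribute a neutral 1.00x."""
    if signal.reviewed_at is None:
        return 1.00
    return BUCKET_MULTIPLIER[signal.bucket]


def audit_sample(signals, seed: int, rate: float = 0.10, minimum: int = 10):
    """Pick a reproducible audit sample: 10% of rows, but at least 10."""
    rows = list(signals)
    size = min(len(rows), max(minimum, math.ceil(rate * len(rows))))
    return random.Random(seed).sample(rows, size)


# Placeholder cohort of 43 unreviewed rows, for illustration only.
cohort = [Signal(f"drug_{i}", "unknown", None) for i in range(43)]
sample = audit_sample(cohort, seed=7)
print(len(sample))
```

With 43 rows, 10% rounds up to 5, so the minimum of 10 governs the sample size; the same seed always returns the same rows.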

07

Phase 1 objective response rate — extracted but not scored

The governance gate working as designed: a signal that looked promising at 28 drugs failed at 43 and was demoted.

Phase 1 objective response rate (ORR) is the second drug-specific signal PhaseFolio extracts under engine 2.6.0. It is captured, surfaced in the dossier, and stamped on signed exports — but it does not score the engine in 2.6.0. This section documents why, in the same place the engine documents its scoring multipliers.

What is extracted. For each oncology-solid-tumor program with Phase 1 efficacy data, the extractor captures: ORR (percent); the modality class the program belongs to (antibody_targeted, small_molecule_targeted, immune_checkpoint, cytotoxic); the source type (FDA label, pivotal paper, registry, sponsor disclosure); a verbatim source excerpt; and a citable URL. Modality-specific high/low thresholds are pre-registered:

Modality class | High ≥ | Low <
antibody_targeted | 40% | 15%
small_molecule_targeted | 50% | 20%
immune_checkpoint | 30% | 10%
cytotoxic | 45% | 25%

Table 4. Pre-registered modality-conditional Phase 1 ORR thresholds. The reported ORR is adjusted toward blinded-independent-central-review (BICR) values — investigator-assessed ORR can run modestly higher than central review (Zhang et al. 2017 reports central assessment slightly lower than local, with high concordance) — and is further discounted for Phase 1 winner's-curse (Vreman et al. 2019). The adjusted value classifies into high / mid / low buckets per modality.
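
The bucket classification in Table 4 can be sketched as a small lookup. The names `ORR_THRESHOLDS` and `classify_orr` are hypothetical. The sizes of the central-review and winner's-curse adjustments are not given in this document, so the function takes an already adjusted ORR as input rather than guessing at them.

```python
# (high_at_or_above, low_below) in percent, copied from Table 4.
ORR_THRESHOLDS = {
    "antibody_targeted": (40.0, 15.0),
    "small_molecule_targeted": (50.0, 20.0),
    "immune_checkpoint": (30.0, 10.0),
    "cytotoxic": (45.0, 25.0),
}


def classify_orr(adjusted_orr: float, modality_class: str) -> str:
    """Classify an adjusted Phase 1 ORR (percent) as high, mid, or low."""
    high, low = ORR_THRESHOLDS[modality_class]
    if adjusted_orr >= high:
        return "high"
    if adjusted_orr < low:
        return "low"
    return "mid"


print(classify_orr(42.0, "antibody_targeted"))
print(classify_orr(42.0, "small_molecule_targeted"))
print(classify_orr(8.0, "immune_checkpoint"))
```

The same adjusted ORR can land in different buckets depending on modality class, which is the point of making the thresholds modality-conditional.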

Why it does not score in engine 2.6.0. The Phase 0 validation backtest scaled from 28 to 43 drugs (50% cohort coverage) and found:

  • biomarker_quality alone: pairwise AUC +5.2pp over baseline — stable across extractor rounds.
  • phase1_orr alone: marginally validatable. Small positive lift at lower sample size, not robust at 43 drugs.
  • biomarker_quality + phase1_orr combined: at 28 drugs the combined AUC showed +8.8pp lift; at 43 drugs (50% cohort coverage) the combined AUC fell below baseline (-0.3pp).

The combined +8.8pp at 28 drugs was a small-sample artifact. At higher coverage the two signals double-count correlated information: a high Phase 1 ORR is strongly conditioned on the biomarker that already scores the program, so adding phase1_orr on top of biomarker_quality re-weights the same evidence twice. The combined score actively degraded discrimination.

Governance decision — engine 2.6.0

  1. Ship biomarker_quality only as the scoring drug-specific multiplier.
  2. Continue to extract, display, and stamp phase1_orr on signed exports — the data is captured for diligence transparency and for the engine 2.7.0 recalibration cycle.
  3. Defer scoring phase1_orr to engine 2.7.0 pending either (a) a larger cohort that supports a recalibrated independent lift, or (b) a conditional-multiplier framework that applies phase1_orr only when biomarker_quality is unknown, avoiding the double-count.

This is the multiplier-governance gate working as designed. A signal that looked promising at 28 drugs failed cohort validation at 43 drugs and was demoted to flag-only. We disclose the demotion in the same place the engine documents its scored multipliers; we do not publish only the larger combined number.

08

Non-scored risk flags and provisional disclosure

Six diligence-aid signals that inform but do not move the rNPV. Plus the authorship and outside-review disclosure for engine 2.6.0.

PhaseFolio extracts additional drug-specific attributes that inform diligence but do not score the engine. Each is rendered in the dossier as a flag (informational, positive, or warning) with a citable source; none move the rNPV value.

Signal | Rule | Flag → severity
Sponsor prior approvals (count) | 0 / 1–3 / ≥ 4 | NONE (info) / SOME (info) / HIGH (positive)
Grade 3+ adverse-event rate | Modality-thresholded (≥ 30% / 35% / 50%) | G3_PLUS_AE_ELEVATED (warning) / SAFETY_PROFILE_CLEAN (positive)
Trial randomization | Yes / no | TRIAL_RANDOMIZED (info) / TRIAL_SINGLE_ARM (info)
Primary endpoint type | Surrogate / clinical | PRIMARY_ENDPOINT_SURROGATE (info) / PRIMARY_ENDPOINT_CLINICAL (info)
Sample-size target | < 60 patients | SAMPLE_SIZE_UNDERPOWERED (warning) / SAMPLE_SIZE_ADEQUATE (info)
Indication-specific surrogacy R² | < 0.40 | SURROGACY_R2_LOW (warning) / no flag above

Table 5. Non-scored drug-specific risk flags surfaced in the dossier and stamped on signed exports. Each would need an independent both-outcomes cohort validation to earn the right to move the number. They are useful diligence anchors — and they appear in the signed export so a reader can verify which fired — but the rNPV math is unchanged whether they are present or absent.
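
Several of the Table 5 rules are simple enough to sketch directly. The function names below are hypothetical, and the flag strings and thresholds are copied from the table. The Grade 3+ adverse-event rule is left out because the table does not say which modality each of its three thresholds belongs to.

```python
from typing import List, Optional, Tuple

Flag = Tuple[str, str]  # (flag name, severity)


def sponsor_flag(prior_approvals: int) -> Flag:
    """0 / 1-3 / 4 or more prior approvals."""
    if prior_approvals == 0:
        return ("NONE", "info")
    if prior_approvals <= 3:
        return ("SOME", "info")
    return ("HIGH", "positive")


def randomization_flag(randomized: bool) -> Flag:
    name = "TRIAL_RANDOMIZED" if randomized else "TRIAL_SINGLE_ARM"
    return (name, "info")


def sample_size_flag(target_patients: int) -> Flag:
    """Targets below 60 patients are flagged as underpowered."""
    if target_patients < 60:
        return ("SAMPLE_SIZE_UNDERPOWERED", "warning")
    return ("SAMPLE_SIZE_ADEQUATE", "info")


def surrogacy_flag(r_squared: float) -> Optional[Flag]:
    """Warn below 0.40; no flag at or above it."""
    if r_squared < 0.40:
        return ("SURROGACY_R2_LOW", "warning")
    return None


def dossier_flags(prior_approvals, randomized, target_patients, r_squared) -> List[Flag]:
    """Collect the flags that fire. None of them alters any probability."""
    candidates = [
        sponsor_flag(prior_approvals),
        randomization_flag(randomized),
        sample_size_flag(target_patients),
        surrogacy_flag(r_squared),
    ]
    return [flag for flag in candidates if flag is not None]


flags = dossier_flags(prior_approvals=0, randomized=False, target_patients=45, r_squared=0.55)
for name, severity in flags:
    print(f"{name} ({severity})")
```

The functions return labels only. Nothing in this sketch feeds back into a PoS or rNPV calculation, which mirrors the non-scored status of these signals.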

8.1 Authorship, AI assistance, and outside review

The drug-specific clinical signal layer was developed by PhaseFolio's non-MD founder using Anthropic Claude Opus 4.7 as the extraction engine, with an independent adversarial subagent reviewing 28 of the extractions for self-consistency (14 flagged for human spot-audit, two certain errors corrected before ship, five likely flagged, seven soft). The Phase 0 GO recommendation and the underlying validation data were prepared for two outside reviewers as of the methodology@2026-05-28 ship date and are being routed to them as part of this release:

  • A HEOR / governance reviewer — was the multiplier-governance gate of Section 5 applied honestly to phase1_orr, given the small-sample artifact at 28 drugs and the demotion at 43?
  • An oncology clinical reviewer — are the five validatable buckets (across biomarker_quality and phase1_orr) and the 85-drug cohort itself defensible to a clinical eye?

Provisional disclosure

This methodology version (methodology@2026-05-29) is a citation-accuracy correction of methodology@2026-05-28 and is published provisional pending outside review (feedback window from 2026-05-28). Per substrate doctrine — no version is ever retroactively changed or invalidated; older versions remain valid forever and continue to verify at /verify — corrections ship as a subsequent methodology version, as this one does. The superseded methodology@2026-05-28 stamp is durable; any export issued under it remains forever-resolvable.

8.2 What's deliberately out of scope for Phase 1

Five constructs evaluated in the Phase 0 research are deliberately deferred to Phase 2 and beyond, in the spirit of “ship the disciplined subset, defer the rest”:

  • Hierarchical Bayesian PoS model. Deferred to engine 4.x or 5.x, when per-indication cohort N reaches roughly 500. The current logistic-OR transform is the right method at current cohort sizes.
  • Phase 2 readout-quality multiplier. Phase 2 ORR and randomized-vs-single-arm carry independent signal, but the AMR-Sprint-1 precedent requires a per-bucket cohort that current data does not yet support.
  • Mechanism-class hepatotoxicity prior. The antimicrobial Sprint-1 ablation already demoted this to a flag for that cohort; it remains a flag here.
  • Sustained-clinical-response endpoint-fragility prior. Same disposition.
  • Real-world-evidence calibration. Out of scope until a registry-grade RWE source clears the same governance gate as the literature anchors.

The roadmap is published openly: a signal becoming scored later is the normal path, not a methodology break. A signal staying non-scored is also a defensible disposition — transparency about what does not yet earn its way into the math is the point of this section.

§

References

01 Thomas, D.W., Burns, J., Audette, J., Carroll, A., Dow-Hygelund, C., & Hay, M. (2021). Clinical Development Success Rates and Contributing Factors 2011–2020. BIO, QLS Advisors, Informa Pharma Intelligence.

02 Wong, C.H., Siah, K.W., & Lo, A.W. (2019). Estimation of clinical trial success rates and related parameters. Biostatistics, 20(2), 273–286.

03 Citeline (2024). Pharma Intelligence Global Clinical Trials Database. Modality-specific pipeline data used to calibrate transition rates for bispecifics, ADCs, cell therapy, and gene therapy.

04 Minikel, E.V., Painter, J.L., Dong, C.C., & Nelson, M.R. (2024). Refining the impact of genetic evidence on clinical success. Nature, 629, 624–629.

05 Zhou, Y., Zhang, Y., Xu, H., et al. (2025). Dynamic clinical trial success rates for drugs in the 21st century. Nature Communications, 16, 9537.

06 Mullard, A. (2016). Parsing clinical success rates. Nature Reviews Drug Discovery, 15, 447.

07 Parker, J.L., Kuzulugil, S.S., Pereverzev, K., et al. (2021). Does biomarker use in oncology improve clinical trial failure risk? A large-scale analysis. Cancer Medicine, 10(6), 1955–1963.

08 Tufts Center for the Study of Drug Development / NEWDIGS (2023). Cell and Gene Therapy Success Rates.

09 Schwaederle, M., Zhao, M., Lee, J.J., Lazar, V., Leyland-Jones, B., Schilsky, R.L., Mendelsohn, J., & Kurzrock, R. (2016). Association of biomarker-based treatment strategies with response rates and progression-free survival in refractory malignant neoplasms: a meta-analysis. JAMA Oncology, 2(11), 1452–1459. Pooled n = 13,203 patients across 346 phase I studies; basis for the biomarker_quality multiplier (genomic_validated 1.35×, protein_only 0.85×).

10 Haslam, A., Olivier, T., Powell, K., Tuia, J., & Prasad, V. (2022). Eventual success rate and predictors of success for oncology drugs tested in phase I trials. International Journal of Cancer, 152(2), 276–282. Basis for the phase1_orr modality-conditional thresholds used in the engine 2.6.0 flag (currently non-scored per Section 7).

11 Vreman, R.A., Bouvy, J.C., Bloem, L.T., Hövels, A.M., Mantel-Teeuwisse, A.K., Leufkens, H.G.M., & Goettsch, W.G. (2019). Weighing of evidence by health technology assessment bodies: retrospective study of reimbursement recommendations for conditionally approved drugs. Clinical Pharmacology & Therapeutics, 105(3), 684–691. Source for the Phase 1 winner's-curse discount applied in the phase1_orr extractor (Section 7).

12 Zhang, J., et al. (2017). Evaluation bias in objective response rate and disease control rate between blinded independent central review and local assessment: a study-level pooled analysis of phase III randomized control trials. Annals of Translational Medicine, 5(24), 481. Basis for the conservative central-review adjustment in the phase1_orr extractor (Section 7); the study reports central assessment slightly lower than local, with high concordance.

Methodology version: methodology@2026-05-29 · Last updated: 2026-05-29