
Backtest Methodology

PhaseFolio validates probability-of-success predictions against historical drug outcomes using held-out cohorts whose fates are now known. Three cohorts are published: rheumatoid arthritis (n=16, AUC 0.625), non-small-cell lung cancer (n=59, AUC 0.709), and antimicrobial (n=36, AUC 0.629). Each page leads with the strongest signal in its cohort — pairwise AUC where discrimination is solid, Wilson-CI accuracy at the optimal Youden cutoff at small n. The antimicrobial cohort's pre-Sprint-1 discrimination gap (AUC 0.524; engine PoS identical for successes and failures) was closed to 0.629 by one cohort-validatable scored multiplier (single-asset sponsor fragility); two other candidate multipliers were deliberately demoted to non-scored risk flags after a pre-publication ablation, and the full ablation (baseline 0.524 / M3-only 0.631 / M1+M2-only 0.797 / all-three 0.782) is published rather than only the largest number. Cohorts differ in decision anchor (Phase 2 for RA/NSCLC, Phase 3 for antimicrobial) by a single disclosed rule: anchor at the earliest decision point at which the cohort's failure population is registry-observable. Full antimicrobial Sprint-1 forensics, the C. difficile sub-cohort, and per-drug ledgers are published at /research/backtests/antimicrobial.

What a backtest measures

Two questions: discrimination and absolute calibration.

A drug-stage backtest takes a fixed cohort of drugs that entered a clinical phase by a cutoff date and asks two distinct questions: did the model rank successes above failures (discrimination), and when the model said “30% PoS,” did roughly 30% of those drugs actually succeed (absolute calibration)? These are different questions, answered by different metrics, with different sensitivities to cohort construction.

Wilson score interval on accuracy

Why a small-sample binary accuracy needs a confidence interval.

Beyond ranking, we report a binary call accuracy: the model “calls approved” whenever its predicted cumulative PoS at entry meets or exceeds a cutoff, and the call is correct if and only if it matches the observed outcome. We report this at two cutoffs: the conventional ≥50% classifier midpoint, and the ≥40% optimal-Youden cutoff identified by the threshold sweep.
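As an illustration of the call rule and the threshold sweep described above, here is a minimal sketch. The function names, the candidate cutoffs, and the eight-drug toy cohort are hypothetical and are not taken from any published ledger. The comparison uses ≥, following the cutoff notation above, and Youden's J is computed as sensitivity + specificity − 1.

```python
from typing import List, Sequence

def call_accuracy(pos: Sequence[float], approved: Sequence[bool], cutoff: float) -> float:
    """Share of drugs where the call (PoS >= cutoff) matches the observed outcome."""
    correct = sum((p >= cutoff) == a for p, a in zip(pos, approved))
    return correct / len(pos)

def youden_j(pos: Sequence[float], approved: Sequence[bool], cutoff: float) -> float:
    """Youden's J = sensitivity + specificity - 1 at the given cutoff."""
    true_pos = sum(1 for p, a in zip(pos, approved) if a and p >= cutoff)
    true_neg = sum(1 for p, a in zip(pos, approved) if not a and p < cutoff)
    n_pos = sum(1 for a in approved if a)
    n_neg = len(approved) - n_pos
    return true_pos / n_pos + true_neg / n_neg - 1.0

def best_cutoff(pos: Sequence[float], approved: Sequence[bool], candidates: List[float]) -> float:
    """Candidate cutoff with the highest Youden's J (first one wins on ties)."""
    return max(candidates, key=lambda c: youden_j(pos, approved, c))

# Hypothetical toy cohort: four approvals followed by four failures.
toy_pos = [0.62, 0.55, 0.45, 0.41, 0.38, 0.30, 0.52, 0.20]
toy_approved = [True, True, True, True, False, False, False, False]

print(call_accuracy(toy_pos, toy_approved, 0.50))            # 5 of 8 calls correct
print(call_accuracy(toy_pos, toy_approved, 0.40))            # 7 of 8 calls correct
print(best_cutoff(toy_pos, toy_approved, [0.3, 0.4, 0.5, 0.6]))  # 0.4
```

In this toy cohort the 0.40 cutoff wins the sweep because it recovers the two approvals scored between 0.40 and 0.50 without adding any false positives.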

At small cohort sizes a raw percentage like “75% accurate” is misleading because it implies a precision the data cannot support. We therefore wrap every accuracy figure in a Wilson score interval at 95% confidence. The Wilson interval is the standard binomial-proportion CI for small n: it is asymmetric around the point estimate, never extends below 0 or above 1, and behaves sensibly when the observed proportion is at the boundary (which it often is for small biotech cohorts).

Equation 1 — Wilson score interval

CI = ( p̂ + z²/(2n) ± z × √[ p̂(1−p̂)/n + z²/(4n²) ] ) / ( 1 + z²/n )

where p̂ = observed accuracy (correct calls / cohort size), n = cohort size, and z = 1.96 for a 95% CI.
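Equation 1 can be checked directly. The sketch below implements it as written; the function name wilson_interval is illustrative and is not a PhaseFolio API. It is evaluated on the two accuracy figures quoted in the cohort table (12/16 and 25/36).

```python
import math
from typing import Tuple

def wilson_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (Equation 1)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = correct / n
    centre = p_hat + z * z / (2 * n)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - half_width) / denom, (centre + half_width) / denom

ra_low, ra_high = wilson_interval(12, 16)   # RA accuracy, 12/16
am_low, am_high = wilson_interval(25, 36)   # antimicrobial accuracy, 25/36
print(f"RA: {ra_low:.1%} to {ra_high:.1%}")
print(f"Antimicrobial: {am_low:.1%} to {am_high:.1%}")
```

Rounded to whole percentages, these reproduce the published intervals of 51–90% and 53–82%.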

Discrimination vs. absolute calibration

Why each backtest page leads with the metric that’s strongest in its cohort.

Discrimination (AUC) measures pairwise ranking. The area under the ROC curve is the probability that a randomly chosen approved drug received a higher predicted PoS than a randomly chosen failed drug. AUC ranges from 0 to 1: 0.5 means no skill, 1.0 means perfect ranking, and values below 0.5 mean the ranking is systematically inverted. It is invariant to the absolute level of predicted probabilities: a model that systematically under-predicts by 20 percentage points across the board can still achieve perfect AUC if its ordering is correct.
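The pairwise definition translates into a few lines of code. The following is a sketch, not the engine's implementation. Counting ties as half a win is the conventional choice (Hanley & McNeil) and is an assumption here, and the PoS values are hypothetical. The number of pairs is simply approved × failed, which is how the NSCLC cohort's 41 approved and 18 failed drugs give 738 pairs.

```python
from itertools import product
from typing import Sequence

def pairwise_auc(approved_pos: Sequence[float], failed_pos: Sequence[float]) -> float:
    """Fraction of (approved, failed) pairs ranked correctly; ties count as 0.5."""
    pairs = list(product(approved_pos, failed_pos))
    score = sum(1.0 if a > f else 0.5 if a == f else 0.0 for a, f in pairs)
    return score / len(pairs)

# Hypothetical PoS values for three approved and two failed drugs.
approved = [0.60, 0.45, 0.30]
failed = [0.35, 0.25]

auc = pairwise_auc(approved, failed)          # 5 of 6 pairs ranked correctly
shifted = pairwise_auc([p - 0.20 for p in approved],
                       [p - 0.20 for p in failed])
print(auc, shifted)                           # the uniform shift leaves AUC unchanged
print(41 * 18)                                # 738 pairs in the NSCLC cohort
```

Under the tie convention assumed here, a model that assigns every drug the same PoS scores exactly 0.5, since every pair is a tie.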

Absolute calibration measures whether predicted probabilities match observed frequencies. When the model says 30% PoS, do roughly 30% of those drugs ultimately succeed? This is the metric a valuation tool is judged on most directly — calibration drives sizing and discounting decisions, not ranking. Calibration is sensitive to cohort selection: a cohort weighted toward registry-visible survivors shows points above the diagonal independent of engine accuracy.

Both questions matter. Each backtest page leads with whichever metric is the strongest signal in its cohort: AUC when discrimination is solid (≥ 0.70 cutoff), Wilson-CI accuracy at the optimal Youden cutoff when cohort size makes a single-number AUC less informative.

Calibration plot

Predicted vs. observed by quintile — with the survivorship caveat.

The calibration plot bins predicted probabilities into quintiles and overlays observed approval frequency. Perfect calibration sits on the diagonal; points above the diagonal indicate the cohort’s observed approval rate exceeds what the engine predicts at that bucket.
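The binning behind such a plot can be sketched as follows. This version assumes rank-based quintiles (five bins of roughly equal count); the text does not say whether the bins are equal-count or equal-width, so treat that as an assumption. The ten-drug cohort is hypothetical.

```python
from typing import List, Sequence, Tuple

def calibration_by_quintile(
    pos: Sequence[float], approved: Sequence[bool], bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted PoS, observed approval rate, count) per rank-based bin."""
    order = sorted(range(len(pos)), key=lambda i: pos[i])
    n = len(order)
    rows = []
    for b in range(bins):
        idx = order[b * n // bins:(b + 1) * n // bins]
        if not idx:
            continue  # fewer drugs than bins: skip empty bins
        mean_pred = sum(pos[i] for i in idx) / len(idx)
        observed = sum(1 for i in idx if approved[i]) / len(idx)
        rows.append((mean_pred, observed, len(idx)))
    return rows

# Hypothetical cohort of ten drugs.
toy_pos = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
toy_approved = [False, False, False, True, False, True, True, False, True, True]

rows = calibration_by_quintile(toy_pos, toy_approved)
for mean_pred, observed, count in rows:
    side = "above" if observed > mean_pred else "on or below"
    print(f"predicted {mean_pred:.3f}  observed {observed:.2f}  n={count}  ({side} diagonal)")
```

With only two drugs per bin, each observed rate can take just three values (0, 0.5, 1), which is one reason the small cohorts lead with interval-wrapped accuracy rather than the calibration plot.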

Calibration plots inherit cohort selection bias. PhaseFolio’s cohorts are built from drugs whose Phase 2 entry could be reliably identified in public registries — a survivor-biased subset of the universe of all programs that ever entered Phase 2. Engine PoS values are calibrated to the population base rate (BIO/QLS 2021); the cohort’s observed approval rate is higher than the population because invisibly-failed programs are absent. The vertical gap between the engine’s prediction and the cohort’s observed rate therefore reflects cohort survivorship at least as much as engine miscalibration. Pairwise AUC, by contrast, is invariant to this bias provided successes and failures are equally well-represented in the cohort.
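The size of the survivorship effect can be illustrated with simple arithmetic. The sketch below assumes that every success is registry-visible and that only a fraction of failures are. Both the visibility fraction and the 10% base rate are hypothetical inputs chosen for illustration, not PhaseFolio or BIO/QLS estimates.

```python
def observed_approval_rate(base_rate: float,
                           failure_visibility: float,
                           success_visibility: float = 1.0) -> float:
    """Approval rate seen in a cohort when some programs never reach the registry."""
    visible_successes = base_rate * success_visibility
    visible_failures = (1.0 - base_rate) * failure_visibility
    return visible_successes / (visible_successes + visible_failures)

# Hypothetical inputs: 10% population base rate.
full_view = observed_approval_rate(0.10, failure_visibility=1.0)
partial_view = observed_approval_rate(0.10, failure_visibility=0.25)
print(f"all failures visible:     {full_view:.1%}")
print(f"a quarter of them visible: {partial_view:.1%}")
```

Under these assumptions a perfectly calibrated 10% prediction would sit against an observed rate of roughly 31%, so the point lands well above the diagonal even though the engine is not miscalibrated.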

Published cohorts

Three held-out cohorts, side by side — full per-cohort detail one click away.

Metric	Rheumatoid arthritis	NSCLC	Antimicrobial
Cohort size	16 (Phase 2 entrants)	59 (41 approved / 18 failed)	36 (25 approved / 11 not)
Decision anchor	Phase 2 entry	Phase 2 entry	Phase 3 entry
Lead signal	Wilson-CI accuracy	Pairwise AUC	Pairwise AUC + gap disclosure
Pairwise AUC	0.625	0.709 (738 pairs)	0.629 (was 0.524)
Secondary metric	12/16 = 75.0% acc · 95% CI 51–90% (≥40% Youden)	mean PoS 8.3% vs 3.6% · sep 4.7pp	25/36 = 69.4% · 95% CI 53–82% · sep 0.7pp
Status	Directional (small n)	PASS (≥0.70)	PASS (≥0.60) post-Sprint-1
Engine	1.0.0	1.0.0	1.0.0 + AMR multipliers + Sprint-1 M3
Full results	View backtest →	View backtest →	View backtest →

Success criterion is indication-specific FDA approval for all three. Per-drug ledgers and quintile calibration plots are in the intelligence dashboard. The antimicrobial cohort is LLM CMO-grade verified (Claude Opus 4.7 acting in a chief-medical-officer reviewer role, not a human medical officer) against ClinicalTrials.gov NCT records, FDA approval letters, and SEC 8-K filings. RA (0.625) and NSCLC (0.709) were re-run as regression post-Sprint-1 and are number-identical — the antibacterial multipliers are no-ops outside the antimicrobial cohort.

The Engine row names the version that computed each prediction — the 1.0.0 base path (BIO/QLS 2021 transition probabilities), with the antimicrobial cohort additionally carrying the Sprint-1 M3 scored multiplier — not the latest release. The published AUCs are unchanged through the current engine (2.6.0): the scored backtest path consumes only base PoS, not the 2.0.0–2.6.0 revenue/deal, Monte-Carlo-distribution, or oncology drug-specific-layer changes (the last scores only when fed a per-drug biomarker classification these cohorts do not supply). RA was re-run on 2026-05-29 under the current engine and reproduced 0.625 to the digit.

Antimicrobial Sprint-1 — substrate-honest summary (2026-05-16)

Pre-Sprint-1 the antibacterial Phase 3-entry PoS was well-calibrated as a point estimate (mean ~0.91 vs observed 25/36 = 69.4%) but did not discriminate approved from failed (AUC 0.524; mean PoS identical for successes and failures). Sprint-1 tested three candidate antibacterial multipliers, each pre-registered on evidence dated before each drug’s decision date. A pre-publication ablation then decided which could legitimately score the engine:

Configuration	AUC	Scoring decision
Baseline (no Sprint-1)	0.524	—
M3 only	0.631	Cohort-validatable → scores
M1 + M2 only	0.797	Fires only on failures → does not score
All three	0.782	Not shipped

Decision: only M3 scores. Single-asset sponsor fragility (M3) is the sole scored Sprint-1 signal because it is the only one the cohort can validate — it fires on three approvals (plazomicin, eravacycline cIAI, lefamulin) as well as failures. Final shipped scored AUC = 0.629 (PASS ≥0.60), after a same-day LPAD-gate fix (−0.002, immaterial). M1 (hepatotoxicity class) and M2 (SCR endpoint fragility) fire only on this cohort’s failures with no approved counterexample, so the cohort structurally cannot self-validate them; they were demoted to non-scored risk flags (raising risk-flag sensitivity 72.7% → 90.9% without inflating the scored AUC). We publish the full ablation, not just the largest number (0.782), because a headline that is mostly an unvalidatable imported prior is not one a CMO advisor should be asked to trust.
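One way to read the “cohort-validatable” test is as a check that a multiplier fires on at least one approval and at least one failure. The sketch below encodes that reading. It is an assumption about how the rule could be operationalised, not the published Sprint-1 procedure, and the firing patterns are hypothetical rather than taken from the per-drug ledger.

```python
from typing import Sequence

def fires_on_both_classes(fires: Sequence[bool], approved: Sequence[bool]) -> bool:
    """True if the signal fires on at least one approved and one failed drug."""
    on_approved = any(f and a for f, a in zip(fires, approved))
    on_failed = any(f and not a for f, a in zip(fires, approved))
    return on_approved and on_failed

# Hypothetical six-drug cohort: three approvals followed by three failures.
outcomes = [True, True, True, False, False, False]

mixed_signal = [True, False, False, True, True, False]           # fires on both classes
failure_only_signal = [False, False, False, True, True, False]   # fires on failures only

print(fires_on_both_classes(mixed_signal, outcomes))          # True: the cohort can test it
print(fires_on_both_classes(failure_only_signal, outcomes))   # False: no approved counterexample
```

A signal that fires only on failures lifts the AUC of the cohort it was examined on almost by construction, which is why a high ablation number alone says little about whether the signal generalises.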

Full Sprint-1 forensics — per-multiplier rationale, the C. difficile sub-cohort that does not separate, the LPAD-gate fix, and the per-drug ledger →

Sample limitations

What these cohorts can and cannot tell you.

Cross-cohort comparability and anchor selection. All three cohorts use indication-specific FDA approval as the success criterion but differ in decision anchor by design: RA and NSCLC at Phase 2 entry, the antimicrobial cohort at Phase 3. One rule drives this — anchor at the earliest decision point at which the cohort’s failure population is observable in public registries, so the cohort is not survivorship-truncated on the failure side. Oncology and RA Phase-2 failures are densely registered, so Phase-2 anchoring is unbiased there; antibacterial Phase-2 deaths are mostly small-biotech business discontinuations that are not registry-observable. A reproducible scan of the antimicrobial substrate (4,102 trials / 81 distinct drugs) finds only 7 Phase-2-terminal programs, ≤4 outside the cohort, none registry-flagged as failed — effectively zero clean Phase-2 antibacterial failures, so a Phase-2-anchored antibacterial cohort would be survivorship-fatal. Phase 3 is the earliest anchor at which that universe is small, bounded and FDA-traceable (hence primary-source-complete at n=36). The per-indication anchor difference is a disclosed consequence of data observability, not an inconsistency; cohorts are published per-indication, not aggregated into one calibration plot. Full treatment.
Backtest coverage of a specific scenario. The backtests validate the default BIO/QLS-2021 benchmark probabilities, and only within the three published cohorts (rheumatoid arthritis, NSCLC, antimicrobial). A scenario inherits that validation only where its stage PoS values are the benchmark defaults; any value adjusted away from the default is outside the backtested regime by definition. PhaseFolio does not infer whether a given asset falls inside a backtested cohort — scenarios are classified by broad indication bucket (e.g. Immunology, Oncology — Solid Tumor, Infectious Disease), not the sub-indication the cohorts are defined at — so the platform makes no automated per-scenario “validated” claim. Signed exports carry a per-stage provenance line (benchmark default vs user-adjusted) so a reviewer can see exactly which inputs this backtest evidence bears on; whether a cohort applies to the asset in hand is a judgment for the reader, not an automated assertion.
Survivor bias in source data. Cohorts are built from drugs whose Phase 2 entry could be reliably identified in public registries. Programs that died before public disclosure are unrepresented; this biases observed approval rates upward by an unknown amount and inflates points above the diagonal in calibration plots independent of engine accuracy.
Wide confidence bands at small n. The 95% Wilson interval on RA accuracy spans roughly (51%, 90%). The point estimate alone is not a trustworthy summary; the interval is the right object to cite. NSCLC at n=59 supports tighter intervals.
Modifier sparsity. Within each cohort, several modifier combinations appear once or zero times. The backtest cannot distinguish whether the genetic-validation modifier or the orphan-designation modifier is doing more work; cohorts are too small for sub-stratification.
Discrimination ≠ calibration. AUC is a ranking metric; absolute calibration is a sizing metric. A model with poor calibration can still have respectable AUC. Use AUC for pick-the-winner questions; use the calibration plot for size-the-bet questions, with the survivorship caveat.
Engine evolution & remediation. The antimicrobial build used a corrected trial-duration computation; an audit then found the prior method had under-recorded durations in the RA enrichment substrate (NSCLC was checked and was unaffected). The RA substrate has been recomputed and the root cause fixed at source. This field is not consumed by the scored backtest path, so published AUCs (RA 0.625 / NSCLC 0.709 / antimicrobial 0.629), which derive from cohort-level stage assumptions, are unchanged — the correction affects only customer-facing duration figures in exports. The build also added first-class antibacterial support (an infectious-disease endpoint-tier taxonomy and the correct FDA review-division mapping). Named here rather than applied silently — a methodology worth trusting names its own corrections.

References

01 Wilson, E.B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158), 209–212.

02 Brown, L.D., Cai, T.T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science, 16(2), 101–133.

03 Hanley, J.A., & McNeil, B.J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36.

04 Thomas, D.W., Burns, J., Audette, J., Carroll, A., Dow-Hygelund, C., & Hay, M. (2021). Clinical Development Success Rates and Contributing Factors 2011–2020. BIO, QLS Advisors, Informa Pharma Intelligence.

05 Youden, W.J. (1950). Index for rating diagnostic tests. Cancer, 3(1), 32–35.

Methodology version: methodology@2026-06-07 · Last updated: 2026-06-07 · Version history →