This report back-tests PhaseFolio's rNPV engine on a retrospective calibration cohort of 16 historical RA drugs, using indication-specific transition rates from 679 curated clinical trials. The AUC of 0.625 is an early directional signal at n=16, not a confirmatory result: the Wilson 95% accuracy intervals include chance.
The model achieved a pairwise AUC of 0.625 (passing the 0.60 threshold), meaning a randomly chosen eventual success outranks a randomly chosen failure 62.5% of the time. The phase-controlled AUC of 0.65 (target 0.55) confirms the signal holds within decision phase, controlling for the structural advantage later-stage decisions carry. Risk flag sensitivity reached 87.5% (7/8 failures flagged). At the best operating point — a PoS cutoff of 30% — the model achieved 62.5% accuracy with 66.7% precision and 50% recall. Discrimination passes, but calibration is weak at n=16: the separation gap (+8.4pp) and the false-confidence rate at the 25% cut (50%) both fail their targets (see §6). Every PoS multiplier is held to a validation gate: a factor may score the engine only if a held-out cohort with both approvals and failures can confirm it, otherwise it is demoted to a display-only flag rather than moving the number (see §2.4).
95% Wilson confidence intervals (n=16). Conventional ≥50% cut: 9/16 correct calls → 56.3% [33.2%–76.9%]. Optimal ≥30% cut: 10/16 correct calls → 62.5% [38.6%–81.5%]. Wilson is preferred over normal-approximation at small N because it does not produce nonsensical bounds at the extremes. AUC point estimates are reported without an interval here — small-N AUC requires a different methodology (DeLong or bootstrap), which we report in the methodology appendix rather than inline.
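The two intervals quoted above can be checked with a few lines of Python. This is a minimal sketch of the standard Wilson score interval; the function name `wilson_interval` is illustrative and is not taken from PhaseFolio's code.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Correct calls at the conventional (>=50%) and optimal (>=30%) cuts, n = 16.
for correct in (9, 10):
    low, high = wilson_interval(correct, 16)
    print(f"{correct}/16 -> [{low:.1%}, {high:.1%}]")
```

Both intervals contain 50%, which is the sense in which the accuracy figures include chance.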
The back-test simulates the decision an investor or founder would have faced at the time — using only information available at each drug's go/no-go moment. No post-hoc data (trial results, FDA decisions, commercial outcomes) leaks into the inputs. This is not a prediction of the future; it is a reconstruction of the past with the tools available today.
The back-test uses a two-tier PoS system:
Target validation multiplier:
| Prior Class Approvals | Multiplier | Rationale |
|---|---|---|
| 0 (unvalidated) | 0.60x | No proof this mechanism works in RA |
| 1 (single proof) | 1.0x | Baseline |
| 2+ (validated) | 1.15x | Multiple approvals confirm pathway |
Time-gated academic multipliers:
| Multiplier | Value | Available |
|---|---|---|
| Orphan Drug | 1.5x | Always |
| Biomarker Enrichment | 1.5x | After 2015 |
| Companion Diagnostic | 2.0x | After 2015 |
| Genetic Association | 2.6x | After 2024 |
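A minimal sketch of how the two tables above might combine. The multiplier values are copied from the tables; everything else is an assumption made for illustration: the function names, the purely multiplicative combination, the cap at 1.0, and the reading of "After 2015" as a decision year strictly greater than 2015.

```python
from datetime import date

# name -> (multiplier, gate year); None means the multiplier is always available.
ACADEMIC_MULTIPLIERS = {
    "orphan_drug": (1.5, None),
    "biomarker_enrichment": (1.5, 2015),
    "companion_diagnostic": (2.0, 2015),
    "genetic_association": (2.6, 2024),
}

def target_validation_multiplier(prior_class_approvals: int) -> float:
    """0 approvals -> 0.60x, 1 -> 1.0x, 2 or more -> 1.15x (values from the table)."""
    if prior_class_approvals == 0:
        return 0.60
    if prior_class_approvals == 1:
        return 1.0
    return 1.15

def adjusted_pos(base_pos: float, prior_class_approvals: int,
                 factors: list[str], decision_date: date) -> float:
    pos = base_pos * target_validation_multiplier(prior_class_approvals)
    for name in factors:
        multiplier, gate_year = ACADEMIC_MULTIPLIERS[name]
        # ASSUMPTION: "After YYYY" is read as decision year > YYYY.
        if gate_year is None or decision_date.year > gate_year:
            pos *= multiplier
    return min(pos, 1.0)  # ASSUMPTION: a probability is capped at 1.0
```

Under these assumptions, a biomarker-enriched program with a 2013 decision date receives no biomarker uplift, while the same program assessed in 2016 does. The gate keeps the reconstruction limited to evidence that existed at the decision date.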
Six risk flags are evaluated for each drug. Four affect PoS calculations via multiplicative adjustments; two are display-only informational flags.
| Flag | Multiplier | Trigger |
|---|---|---|
| SAFETY_CLASS_SIGNAL | 0.80x | Class safety concerns at decision date |
| LIMITED_TRIAL_DATA | 0.90x | <3 trials found |
| HIGH_COMPETITION | 0.90x | >5 same-class competitors |
| LATE_ENTRANT | 0.90x | >2 same-class drugs already approved |
| FIRST_IN_CLASS_RISK | display only | No prior approval in class |
| NOVEL_MODALITY | display only | <3 RA approvals for modality |
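The split between scoring and display-only flags can be sketched as follows. The flag names and multipliers come from the table; the function name and the assumption that scoring flags compound multiplicatively with no floor are illustrative.

```python
SCORING_FLAGS = {
    "SAFETY_CLASS_SIGNAL": 0.80,
    "LIMITED_TRIAL_DATA": 0.90,
    "HIGH_COMPETITION": 0.90,
    "LATE_ENTRANT": 0.90,
}
DISPLAY_ONLY_FLAGS = {"FIRST_IN_CLASS_RISK", "NOVEL_MODALITY"}

def apply_risk_flags(pos: float, flags: list[str]) -> float:
    """Multiply PoS by each scoring flag; display-only flags leave it unchanged."""
    unknown = set(flags) - set(SCORING_FLAGS) - DISPLAY_ONLY_FLAGS
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    for flag in flags:
        pos *= SCORING_FLAGS.get(flag, 1.0)  # display-only flags map to 1.0
    return pos
```

For a drug carrying HIGH_COMPETITION, NOVEL_MODALITY, and SAFETY_CLASS_SIGNAL, the combined factor under this sketch is 0.90 × 0.80 = 0.72, with NOVEL_MODALITY contributing nothing to the number.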
Which multipliers are allowed to score. Every scoring factor above adds a degree of freedom, so PhaseFolio holds each to a validation gate: a multiplier may score the engine only if a held-out cohort containing both approvals and failures can validate it; one that cannot is demoted to a display-only flag rather than allowed to move the number. (This report already separates four scoring flags from two display-only ones.) The gate is worked end-to-end on the antimicrobial cohort, where a pre-publication ablation demoted two of three candidate multipliers and we published the lower, defensible AUC of 0.629 rather than the most flattering 0.797. See the multiplier-governance gate and the antimicrobial Sprint-1 forensics.
Stage costs and durations are based on DiMasi et al. (2016) and Wouters et al. (2020) estimates, adjusted for inflation and phase-specific complexity. WACC is set at 10% (industry standard per Damodaran). Peak revenue estimates are sourced from analyst consensus at the decision date. All figures are expressed in nominal USD at the decision date.
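For readers unfamiliar with rNPV, the sketch below shows the generic textbook calculation that these inputs feed: each stage cost and each post-launch cash flow is discounted at WACC and weighted by the probability of reaching it. It is not PhaseFolio's engine; the function signature, the convention of paying each stage cost at the start of the stage, and the year-end cash-flow timing are all assumptions.

```python
def rnpv(stages, launch_cash_flows, wacc=0.10):
    """Generic risk-adjusted NPV.

    stages: list of (cost, duration_years, transition_probability), in order.
    launch_cash_flows: net cash flow for each year after approval.
    """
    value = 0.0
    elapsed_years = 0.0
    p_reach = 1.0  # probability of reaching the current stage
    for cost, duration_years, p_success in stages:
        # ASSUMPTION: the full stage cost is paid when the stage starts.
        value -= p_reach * cost / (1 + wacc) ** elapsed_years
        elapsed_years += duration_years
        p_reach *= p_success
    for year, cash_flow in enumerate(launch_cash_flows, start=1):
        value += p_reach * cash_flow / (1 + wacc) ** (elapsed_years + year)
    return value
```

With a single hypothetical stage costing 100, lasting one year, with a 50% success probability, followed by one cash flow of 220, the result is -100 + 0.5 × 220 / 1.1² ≈ -9.09.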
HIGH: Structured data (PoS benchmarks, stage costs, WACC) comes from peer-reviewed academic sources.

MEDIUM: Competitive density counts and target validation status are manually curated from FDA/CT.gov data.

LOW: Peak revenue estimates rely on analyst consensus, which varies significantly by source and vintage.
ClinicalTrials.gov provides structured trial metadata (phase, status, enrollment, dates), but lacks the drug-level fields critical for computing transition rates: drug class, mechanism of action, molecular target, modality, published efficacy data, and FDA regulatory linkage. Intervention names are inconsistent ("Adalimumab" vs "adalimumab" vs "Humira"), and there is no way to determine which trials belong to the same drug program without domain knowledge.
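The naming problem can be illustrated with a small normalizer. This is a sketch only: the alias map is a hypothetical stand-in for the curated domain knowledge the paragraph describes, and case folding by itself cannot link a brand name to its generic.

```python
# HYPOTHETICAL curated alias map (brand name -> generic name), keyed in lower case.
ALIASES = {"humira": "adalimumab"}

def normalize_intervention(name: str) -> str:
    """Case-fold, collapse whitespace, then resolve known brand names."""
    key = " ".join(name.casefold().split())
    return ALIASES.get(key, key)

names = ["Adalimumab", "adalimumab", "Humira"]
print({normalize_intervention(n) for n in names})
```

Case folding resolves the first two spellings, but "Humira" only maps to the same drug because the alias table says so, which is why the mapping needs manual curation.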
| Data Source | Rows | Key Fields |
|---|---|---|
| ClinicalTrials.gov studies | 192,411 | NCT identifier, phase, recruitment status, study type, enrollment, dates |
| Trial condition mappings | 420,940 | NCT identifier, raw condition text, normalized indication |
| Trial intervention records | 424,618 | NCT identifier, intervention type, intervention name, normalized modality |
| FDA application records | 6,309 | application number, first approval date, normalized indication |
| FDA–trial cross-links | 1,879 | application number, NCT identifier, link method |
Filtering for RA (condition text matching "rheumatoid arthritis") identified 1,304 unique interventional trials across all phases.
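A minimal sketch of this filter, assuming a case-insensitive substring match on the raw condition text and de-duplication by NCT identifier. The rows and identifiers below are made up for illustration and the field layout is hypothetical.

```python
# HYPOTHETICAL rows: (trial id, raw condition text, study type).
ROWS = [
    ("EXAMPLE-1", "Rheumatoid Arthritis", "INTERVENTIONAL"),
    ("EXAMPLE-1", "Active rheumatoid arthritis", "INTERVENTIONAL"),  # same trial, second mapping
    ("EXAMPLE-2", "Psoriatic Arthritis", "INTERVENTIONAL"),
    ("EXAMPLE-3", "Rheumatoid Arthritis", "OBSERVATIONAL"),
]

def ra_interventional_trials(rows) -> set[str]:
    """Unique trial ids whose condition text mentions rheumatoid arthritis."""
    return {
        trial_id
        for trial_id, condition, study_type in rows
        if "rheumatoid arthritis" in condition.casefold()
        and study_type == "INTERVENTIONAL"
    }

print(len(ra_interventional_trials(ROWS)))
```

Collecting identifiers in a set is what makes the count one of unique trials, since a single trial can carry several condition mappings.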
Each trial was enriched through a systematic, multi-tier process designed to maximize data quality while preventing hallucination.
| Source | Data Provided | Confidence |
|---|---|---|
| ClinicalTrials.gov | Phase, status, enrollment, dates, sponsor, structured fields | High |
| FDA Drugs@FDA | Application numbers, approval dates, regulatory status | High |
| PubMed | Efficacy data, outcome summaries, safety findings | Medium |
| Web Search | Press releases, analyst reports, pipeline updates | Low |
Confidence score = weighted coverage across sources (0–1 scale). All 679 RA trials achieved "full" enrichment level (4 sources consulted).
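One way to read "weighted coverage" is the share of total source weight that was actually consulted. The sketch below follows that reading; the weights are hypothetical placeholders, since the report does not state them, and the function name is illustrative.

```python
# HYPOTHETICAL weights; the report only says the score is a weighted coverage on a 0-1 scale.
SOURCE_WEIGHTS = {
    "clinicaltrials_gov": 0.4,
    "drugs_at_fda": 0.3,
    "pubmed": 0.2,
    "web_search": 0.1,
}

def confidence_score(consulted: set[str]) -> float:
    """Weight of the consulted sources divided by the total weight."""
    covered = sum(w for source, w in SOURCE_WEIGHTS.items() if source in consulted)
    return covered / sum(SOURCE_WEIGHTS.values())
```

Under this reading, a trial with all four sources consulted scores 1.0, which would correspond to the "full" enrichment level.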
Of the 1,304 raw RA trials, 625 were not enriched because they lacked drug-level metadata (non-drug interventions, unmappable entries, duplicate substudies). To verify this filtering was outcome-agnostic, we compared completion-to-termination ratios:
| Phase | Raw Completion Rate | Enriched Completion Rate | Difference |
|---|---|---|---|
| Phase 1 | 88.3% (166/188) | 87.8% (79/90) | -0.5pp |
| Phase 2 | 77.8% (242/311) | 77.3% (102/132) | -0.5pp |
| Phase 3 | 91.6% (285/311) | 91.7% (232/253) | +0.1pp |
| Phase 4 | 85.1% (149/175) | 83.1% (108/130) | -2.0pp |
No evidence of survivorship bias. Completion rates differ by at most 2.0pp (Phase 4) between the raw and enriched datasets, and by 0.5pp or less in Phases 1 to 3. The enrichment process removed trials by data availability, not by outcome.
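The comparison can be recomputed from the counts printed in the table. In this sketch the differences are taken between rates rounded to one decimal, which appears to be how the table was built; on unrounded rates the Phase 4 difference is closer to -2.1pp. The dictionary layout and function names are illustrative.

```python
# (numerator, denominator) pairs copied from the table above.
COUNTS = {
    "Phase 1": {"raw": (166, 188), "enriched": (79, 90)},
    "Phase 2": {"raw": (242, 311), "enriched": (102, 132)},
    "Phase 3": {"raw": (285, 311), "enriched": (232, 253)},
    "Phase 4": {"raw": (149, 175), "enriched": (108, 130)},
}

def rate_pct(completed: int, total: int) -> float:
    """Completion rate in percent, rounded to one decimal as in the table."""
    return round(100 * completed / total, 1)

def difference_pp(phase: str) -> float:
    """Enriched minus raw completion rate, in percentage points."""
    counts = COUNTS[phase]
    return round(rate_pct(*counts["enriched"]) - rate_pct(*counts["raw"]), 1)

for phase in COUNTS:
    print(phase, difference_pp(phase))
```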
| Metric | Value |
|---|---|
| Enriched RA trials | 679 |
| Distinct drugs | 71 |
| Drug classes | 32 |
| Columns per trial | 45 |
| Outcome summary coverage | 100% |
| Drug class / MoA / target coverage | 99.9% |
| FDA linkage | 73% |
| Patent data | 68% |
| Quantitative efficacy data | 55% |
| Drug-level transitions: P1→P2 | 37 drugs |
| Drug-level transitions: P2→P3 | 50 drugs |
| Drug-level transitions: P3→Approval | 35 drugs |
| Drug | Class | Sponsor | Decision Date | Decision Phase | FDA Approval |
|---|---|---|---|---|---|
| Adalimumab | TNF inhibitor | Abbott/AbbVie | Jan 1999 | Phase 2 | Dec 2002 |
| Etanercept | TNF inhibitor | Immunex/Amgen | Jan 1996 | Phase 2 | Nov 1998 |
| Rituximab | CD20 mAb | Genentech/Roche | Jan 2002 | Phase 2 | Feb 2006 |
| Abatacept | CTLA-4 fusion | BMS | Jan 2002 | Phase 2 | Dec 2005 |
| Tofacitinib | JAK inhibitor | Pfizer | Jan 2009 | Phase 2 | Nov 2012 |
| Baricitinib | JAK inhibitor | Lilly/Incyte | Jan 2013 | Phase 2 | Jun 2018 |
| Sarilumab | IL-6R mAb | Sanofi/Regeneron | Jan 2013 | Phase 2 | May 2017 |
| Upadacitinib | JAK inhibitor | AbbVie | Jan 2016 | Phase 2 | Aug 2019 |
| Drug | Class | Sponsor | Decision Date | Decision Phase | Failure Stage |
|---|---|---|---|---|---|
| Atacicept | BAFF/APRIL inhibitor | Merck Serono | Jan 2008 | Phase 1 | Phase 2 terminated |
| Tabalumab | BAFF mAb | Lilly | Jan 2012 | Phase 2 | Phase 3 failed |
| Fostamatinib | SYK inhibitor | Rigel | Jan 2010 | Phase 2 | Phase 3 failed |
| Ocrelizumab | CD20 mAb | Roche/Genentech | Jan 2007 | Phase 2 | Phase 3 terminated |
| Decernotinib | JAK3 inhibitor | Vertex | Jan 2014 | Phase 2 | Phase 3 not initiated |
| Vobarilizumab | IL-6R nanobody | Ablynx | Jan 2015 | Phase 2 | Phase 3 not initiated |
| Filgotinib | JAK1 inhibitor | Gilead/Galapagos | Jan 2019 | Phase 3 | FDA rejected |
| Peficitinib | JAK inhibitor | Astellas | Jan 2016 | Phase 3 | Not filed in US |
Drugs were selected to span the full history of RA targeted therapy (1996-2019), covering multiple modalities (small molecule, monoclonal antibody, fusion protein, nanobody) and mechanisms (TNF, IL-6, JAK, CD20, BAFF, SYK, CTLA-4). The 8/8 approved/failed split balances the two outcome classes. Every drug reached at least Phase 2 in RA, which provides sufficient clinical data for reconstruction; atacicept is the only drug whose decision point falls earlier, at Phase 1.
| Drug | Outcome | Decision Phase | PoS | rNPV | MC P50 | Risk Flags | Correct? |
|---|---|---|---|---|---|---|---|
| Adalimumab | Approved | Phase 2 | 57.8% | $573M | $1.0B | NOVEL_MODALITY LIMITED_TRIAL_DATA | Yes |
| Etanercept | Approved | Phase 2 | 44.4% | $272M | -$67M | FIRST_IN_CLASS NOVEL_MODALITY LIMITED_TRIAL_DATA | Yes |
| Ocrelizumab | Failed | Phase 2 | 39.5% | $1.5B | -$116M | LIMITED_TRIAL_DATA SAFETY_CLASS_SIGNAL | No |
| Filgotinib | Failed | Phase 3 | 39.3% | $2.3B | -$29M | HIGH_COMPETITION NOVEL_MODALITY SAFETY_CLASS_SIGNAL | No |
| Rituximab | Approved | Phase 2 | 36.3% | $2.0B | -$102M | FIRST_IN_CLASS NOVEL_MODALITY | Yes |
| Sarilumab | Approved | Phase 2 | 31.6% | $689M | -$138M | (none) | Yes |
| Peficitinib | Failed | Phase 3 | 27.3% | $253M | -$28M | HIGH_COMPETITION NOVEL_MODALITY SAFETY_CLASS_SIGNAL | No |
| Fostamatinib | Failed | Phase 2 | 26.6% | $306M | -$134M | FIRST_IN_CLASS NOVEL_MODALITY | No |
| Abatacept | Approved | Phase 2 | 25.7% | $349M | -$120M | FIRST_IN_CLASS NOVEL_MODALITY LIMITED_TRIAL_DATA | Yes |
| Tabalumab | Failed | Phase 2 | 25.1% | $495M | -$161M | FIRST_IN_CLASS | No |
| Tofacitinib | Approved | Phase 2 | 25.0% | $556M | -$144M | FIRST_IN_CLASS NOVEL_MODALITY LIMITED_TRIAL_DATA | Yes |
| Baricitinib | Approved | Phase 2 | 24.4% | $303M | -$161M | NOVEL_MODALITY SAFETY_CLASS_SIGNAL | Yes |
| Decernotinib | Failed | Phase 2 | 13.7% | $116M | -$178M | NOVEL_MODALITY SAFETY_CLASS_SIGNAL | No |
| Upadacitinib | Approved | Phase 2 | 13.4% | $697M | -$199M | HIGH_COMPETITION NOVEL_MODALITY SAFETY_CLASS_SIGNAL | Yes |
| Vobarilizumab | Failed | Phase 2 | 11.7% | $103M | -$159M | FIRST_IN_CLASS NOVEL_MODALITY | No |
| Atacicept | Failed | Phase 1 | 7.9% | $6M | -$88M | FIRST_IN_CLASS NOVEL_MODALITY LIMITED_TRIAL_DATA | No |
Note: the "Correct?" column records whether the rNPV sign matches the outcome. Every drug has a positive rNPV, so the column reduces to "approved", and rNPV sign alone cannot separate successes from failures. The real discrimination is in the PoS ranking, which is why the pairwise and phase-controlled AUCs, not the rNPV sign, are the metrics used to judge the model.
| Metric | Score | Target | Result |
|---|---|---|---|
| Pairwise AUC | 0.625 (40/64 pairs) | 0.60 | Pass |
| Phase-Controlled AUC | 0.65 | 0.55 | Pass |
| Separation Gap | +8.4pp (32.3% vs 23.9%) | 10pp | Fail |
| Risk Flag Sensitivity | 87.5% (7/8) | 70% | Pass |
| Risk Flag Enrichment | 1.0 (2.3 vs 2.3) | >1.0 | Fail |
| Directional Accuracy | 62.5% (40/64) | 60% | Pass |
| False Confidence (≥25%) | 50.0% (5/10) | <20% | Fail |
| False Confidence (≥60%) | n/a (0/0; no drug scored ≥60%) | <20% | Pass (vacuous) |
| Best Threshold Accuracy | 62.5% at PoS 30% | -- | -- |
The pairwise AUC of 0.625 is the headline discrimination metric: it passes the 0.60 target and measures the probability that a randomly chosen eventual success carries a higher PoS than a randomly chosen failure. The phase-controlled AUC of 0.65 indicates that the signal holds within decision phase, removing the structural advantage that later-stage decisions have over earlier ones (fewer remaining stages = mechanically higher cumulative PoS). At n=16 this is an early directional signal: Wilson 95% accuracy intervals on the conventional and optimal cuts both include chance-level performance, so discrimination is suggestive, not confirmatory.
The honest counterweight: calibration and separation are weak at this sample size. The separation gap between success and failure means is only +8.4pp against a 10pp target, and the false-confidence rate at the 25% PoS cut is 50% (5 of 10 above-threshold calls were failures) against a 20% target. Both fail. The story is discrimination passing while calibration lags — the expected shape for a directionally sound model that needs a larger, multi-indication cohort before its absolute PoS levels can be trusted.
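The discrimination metrics can be recomputed from the results table. The sketch below counts success/failure pairs directly. The report does not define the phase-controlled AUC, so restricting pairs to drugs with the same decision phase is an assumption, although it does reproduce the reported 0.65. Names such as `COHORT` and `pairwise_auc` are illustrative.

```python
from statistics import mean

# (drug, approved, decision phase, PoS %) copied from the results table.
COHORT = [
    ("Adalimumab", True, 2, 57.8), ("Etanercept", True, 2, 44.4),
    ("Ocrelizumab", False, 2, 39.5), ("Filgotinib", False, 3, 39.3),
    ("Rituximab", True, 2, 36.3), ("Sarilumab", True, 2, 31.6),
    ("Peficitinib", False, 3, 27.3), ("Fostamatinib", False, 2, 26.6),
    ("Abatacept", True, 2, 25.7), ("Tabalumab", False, 2, 25.1),
    ("Tofacitinib", True, 2, 25.0), ("Baricitinib", True, 2, 24.4),
    ("Decernotinib", False, 2, 13.7), ("Upadacitinib", True, 2, 13.4),
    ("Vobarilizumab", False, 2, 11.7), ("Atacicept", False, 1, 7.9),
]

def pairwise_auc(rows, same_phase_only=False):
    """Return (wins, pairs) over all success/failure pairs; ties count as half a win."""
    wins, pairs = 0.0, 0
    for _, ok_s, phase_s, pos_s in rows:
        if not ok_s:
            continue
        for _, ok_f, phase_f, pos_f in rows:
            if ok_f or (same_phase_only and phase_s != phase_f):
                continue
            pairs += 1
            wins += 1.0 if pos_s > pos_f else 0.5 if pos_s == pos_f else 0.0
    return wins, pairs

wins, pairs = pairwise_auc(COHORT)
print(f"pairwise AUC: {wins:.0f}/{pairs} = {wins / pairs:.3f}")          # 40/64 = 0.625
wins_p, pairs_p = pairwise_auc(COHORT, same_phase_only=True)
print(f"phase-controlled AUC: {wins_p:.0f}/{pairs_p} = {wins_p / pairs_p:.3f}")  # 26/40 = 0.650

gap = mean(p for _, ok, _, p in COHORT if ok) - mean(p for _, ok, _, p in COHORT if not ok)
print(f"separation gap: {gap:+.1f}pp")
```

Because every approved drug in the cohort had a Phase 2 decision, only the five Phase 2 failures contribute to the within-phase comparison, which leaves 40 pairs.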
| PoS Cutoff | Accuracy | Precision | Recall | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|
| 30.0% (best) | 62.5% | 66.7% | 50.0% | 4 | 6 | 2 | 4 |
| 35.0% | 56.3% | 60.0% | 37.5% | 3 | 6 | 2 | 5 |
| 40.0% | 62.5% | 100.0% | 25.0% | 2 | 8 | 0 | 6 |
| 45.0% | 56.3% | 100.0% | 12.5% | 1 | 8 | 0 | 7 |
| 50.0% | 56.3% | 100.0% | 12.5% | 1 | 8 | 0 | 7 |
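The threshold table follows mechanically from the PoS column. A minimal sketch, assuming a drug is called a success when its PoS is at or above the cutoff; the names `POS_OUTCOME` and `confusion` are illustrative.

```python
# (PoS %, approved) copied from the results table.
POS_OUTCOME = [
    (57.8, True), (44.4, True), (39.5, False), (39.3, False),
    (36.3, True), (31.6, True), (27.3, False), (26.6, False),
    (25.7, True), (25.1, False), (25.0, True), (24.4, True),
    (13.7, False), (13.4, True), (11.7, False), (7.9, False),
]

def confusion(cutoff: float) -> tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) when PoS >= cutoff is treated as a predicted success."""
    tp = sum(1 for pos, ok in POS_OUTCOME if pos >= cutoff and ok)
    fp = sum(1 for pos, ok in POS_OUTCOME if pos >= cutoff and not ok)
    fn = sum(1 for pos, ok in POS_OUTCOME if pos < cutoff and ok)
    tn = sum(1 for pos, ok in POS_OUTCOME if pos < cutoff and not ok)
    return tp, tn, fp, fn

for cutoff in (30, 35, 40, 45, 50):
    tp, tn, fp, fn = confusion(cutoff)
    accuracy = (tp + tn) / len(POS_OUTCOME)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn)
    print(f"{cutoff}%: acc={accuracy:.3f} prec={precision:.3f} rec={recall:.3f} "
          f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

The 30% and 40% cutoffs tie on accuracy (10 of 16 correct each) but trade recall for precision in opposite directions.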
Atacicept received the lowest PoS in the cohort (7.9%), with 3 risk flags and a 0.60x target validation multiplier (no prior BAFF/APRIL approvals in RA). The Monte Carlo distribution was heavily skewed negative: P10 = -$244M, P90 = -$31M, with a 92.1% probability of a negative outcome.
Outcome: Phase 2 terminated due to severe immunoglobulin reduction and fatal infections. The model correctly identified this as the highest-risk drug in the cohort.
Why this works: Atacicept combined an unvalidated mechanism (0.60x), a novel modality with no RA track record, limited trial data, and an early decision phase (Phase 1). Every signal aligned in the same direction — the model's conviction matched reality.
Filgotinib carried one of the highest PoS values (39.3%) among the failed drugs. The model flagged HIGH_COMPETITION and SAFETY_CLASS_SIGNAL, but the 39.3% PoS, driven by the validated JAK pathway (tofacitinib and baricitinib already approved), placed it above six of the eight approved drugs in the ranking.
Outcome: FDA rejected over testicular toxicity concerns — a drug-specific safety signal that class-level modeling cannot capture. The SAFETY_CLASS_SIGNAL flag was present (reflecting the JAK class's known cardiovascular and thrombotic risks), but the specific reproductive toxicity was unique to filgotinib.
Model limitation: Class-level safety flags capture systemic risks (e.g., JAK inhibitors and cardiovascular events), but drug-specific toxicities remain outside the model's scope. This is inherent to any model that operates at the mechanism level rather than the molecule level.
A central methodological choice in this back-test is replacing static BIO/QLS NDA/BLA transition rates with rates computed from the enrichment corpus. This is not a refinement — it is a fundamentally different measurement.
| Source | NDA/BLA Rate | What It Measures |
|---|---|---|
| BIO/QLS 2021 | 91% | "Given filing, did NDA succeed?" (regulatory rubber-stamp rate) |
| Computed (enrichment corpus) | ~42% | "Given Phase 3, did drug get FDA approval?" (real-world outcome rate) |
The BIO/QLS rate of 91% measures a near-certainty: once a company files an NDA, it almost always gets approved. But the investment decision happens before filing — often years before. The relevant question is whether a drug in Phase 3 will ever reach and pass the NDA stage. Many drugs complete Phase 3 but never file (commercial viability, safety signals, competitive landscape shifts). The computed rate captures this full attrition.
Combined with drug-level counting (tracking individual drugs across phases, not trial counts) and time-gating (only using data available at decision date), this is a central source of the model's discriminative signal — pairwise AUC 0.625 and phase-controlled AUC 0.65.
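Drug-level counting with a time gate can be sketched as below. The trial records are hypothetical, and the rule used here (a drug "advanced" if it has any trial in the next phase that started on or before the decision date) is a simplifying assumption rather than the report's stated definition.

```python
from datetime import date

# HYPOTHETICAL records: (drug, phase, trial start date).
TRIALS = [
    ("drug_a", 1, date(2005, 3, 1)),
    ("drug_a", 1, date(2005, 9, 1)),  # second Phase 1 trial for the same drug
    ("drug_a", 2, date(2007, 1, 1)),
    ("drug_b", 1, date(2006, 6, 1)),
    ("drug_c", 1, date(2008, 2, 1)),
    ("drug_c", 2, date(2011, 5, 1)),  # starts after the 2010 decision date used below
]

def transition_counts(trials, from_phase, to_phase, decision_date):
    """Return (drugs that advanced, drugs that entered from_phase), counting each drug once."""
    visible = [t for t in trials if t[2] <= decision_date]  # time gate
    entered = {drug for drug, phase, _ in visible if phase == from_phase}
    advanced = {drug for drug, phase, _ in visible if phase == to_phase} & entered
    return len(advanced), len(entered)

print(transition_counts(TRIALS, 1, 2, date(2010, 1, 1)))
print(transition_counts(TRIALS, 1, 2, date(2012, 1, 1)))
```

Counting drugs instead of trials keeps a program with many Phase 1 studies from inflating the denominator. One limitation of this simple version is that drugs still in progress at the decision date are counted as not advanced.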
Production status (as of this writing): the computed indication-specific transition rates described in this section are a research approach. Current production uses static BIO/QLS 2021 base rates.
Enriched trials data: 679 trials, 71 drugs, 45 structured columns. Drug-level transitions: P1 to P2 (37 drugs), P2 to P3 (50 drugs), P3 to Approval (35 drugs). 3-tier fallback: drug-class (n>=5) then RA-overall then BIO/QLS 2021.
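The 3-tier fallback can be sketched as a lookup chain. The tier order and the n>=5 rule come from the line above; the function name, data layout, and the drug-class rates in the usage lines are hypothetical. The 0.42 and 0.91 figures are the computed and BIO/QLS NDA/BLA rates quoted earlier.

```python
def pick_rate(transition, drug_class, class_rates, ra_rates, bio_rates, min_n=5):
    """Return (rate, tier) using drug-class, then RA-overall, then BIO/QLS 2021.

    class_rates maps (drug_class, transition) -> (rate, number of drugs observed).
    """
    class_entry = class_rates.get((drug_class, transition))
    if class_entry is not None and class_entry[1] >= min_n:
        return class_entry[0], "drug-class"
    if transition in ra_rates:
        return ra_rates[transition], "RA-overall"
    return bio_rates[transition], "BIO/QLS 2021"

# HYPOTHETICAL drug-class entries; only the RA-overall and BIO/QLS values are from the report.
class_rates = {("class_x", "NDA/BLA"): (0.60, 7), ("class_y", "NDA/BLA"): (0.50, 3)}
ra_rates = {"NDA/BLA": 0.42}
bio_rates = {"NDA/BLA": 0.91}

print(pick_rate("NDA/BLA", "class_x", class_rates, ra_rates, bio_rates))
print(pick_rate("NDA/BLA", "class_y", class_rates, ra_rates, bio_rates))
```

The minimum-count rule keeps a class with only a handful of observed drugs from overriding the broader RA-wide estimate.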
| PoS Bucket | Drugs | Predicted Midpoint | Actual Success Rate | Gap |
|---|---|---|---|---|
| 0-15% | 4 | 7.5% | 25.0% | 17.5pp |
| 15-30% | 6 | 22.5% | 50.0% | 27.5pp |
| 30-50% | 5 | 40.0% | 60.0% | 20.0pp |
| 50%+ | 1 | 75.0% | 100.0% | 25.0pp |
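With 16 drugs, calibration buckets are sparse (the 50%+ bucket holds a single drug). In every bucket the observed success rate exceeds the predicted midpoint, by 17.5 to 27.5pp, so the model underestimates PoS across the board in this cohort, which is consistent with a conservative model. The cohort's 8/8 design fixes the overall success rate at 50%, so part of this gap may reflect cohort construction rather than model bias. Cross-indication expansion will improve statistical power.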
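The bucket table can be rebuilt from the per-drug PoS values. A minimal sketch, assuming lower-inclusive bucket edges and treating the 50%+ bucket as 50 to 100% so that its midpoint is 75%; the names are illustrative.

```python
# (PoS %, approved) copied from the results table.
POS_OUTCOME = [
    (57.8, True), (44.4, True), (39.5, False), (39.3, False),
    (36.3, True), (31.6, True), (27.3, False), (26.6, False),
    (25.7, True), (25.1, False), (25.0, True), (24.4, True),
    (13.7, False), (13.4, True), (11.7, False), (7.9, False),
]
BUCKETS = [(0, 15), (15, 30), (30, 50), (50, 100)]

def calibration_rows():
    """Return (drugs, predicted midpoint %, actual success %, gap pp) per bucket."""
    rows = []
    for low, high in BUCKETS:
        outcomes = [ok for pos, ok in POS_OUTCOME if low <= pos < high]
        midpoint = (low + high) / 2
        actual = 100 * sum(outcomes) / len(outcomes)
        rows.append((len(outcomes), midpoint, actual, actual - midpoint))
    return rows

for (low, high), row in zip(BUCKETS, calibration_rows()):
    print(f"{low}-{high}%", row)
```

Every gap is positive, which is the numerical form of the underestimation described above.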