The cohort is 59 historical non-small-cell lung cancer (NSCLC) drugs (Phase 2 entrants 1979–2024; 41 approved, 18 failed), evaluated against PhaseFolio's rNPV engine. Pairwise AUC of 0.709 across 738 ranking pairs (523 concordant) is the strongest discrimination signal in the published cohorts. Absolute calibration trails discrimination, a consequence of registry-survivor cohort construction that is disclosed below.
Key finding: the engine ranks NSCLC successes above failures in 523 of 738 success-failure pairs (pairwise AUC 0.709), the strongest discrimination signal in the published PhaseFolio cohorts. Phase-controlled AUC matches at 0.709, ruling out a structural NDA/BLA advantage. Risk-flag sensitivity is 100%: each of the 18 failed drugs carried at least one model-emitted risk flag at decision time.
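The pairwise AUC can be read as a concordance count over every (approved, failed) pair: 41 approvals times 18 failures gives the 738 pairs. A minimal Python sketch follows, assuming ties count as half a concordant pair (the tie rule behind the published figure is not stated). The function name `pairwise_auc` and the toy scores are illustrative, not the PhaseFolio implementation.

```python
from itertools import product

def pairwise_auc(success_scores, failure_scores):
    """Fraction of (success, failure) pairs in which the success scores higher.

    Ties count as half a concordant pair (an assumption, not a documented rule).
    """
    pairs = list(product(success_scores, failure_scores))
    if not pairs:
        raise ValueError("need at least one success and one failure")
    concordant = sum(
        1.0 if s > f else 0.5 if s == f else 0.0 for s, f in pairs
    )
    return concordant / len(pairs)

# Toy predicted PoS values, not cohort data: 5 of the 6 pairs are concordant.
toy_auc = pairwise_auc([0.12, 0.08, 0.05], [0.06, 0.03])

# The published figure is the same ratio taken over the full cohort.
total_pairs = 41 * 18          # 738
published_auc = 523 / total_pairs
```

Because the statistic depends only on the ordering of scores within each pair, it says nothing about whether the predicted levels themselves are correct.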
Raw ClinicalTrials.gov data lacks the drug-level structure needed for decision-point reconstruction. The NSCLC enrichment pipeline transformed 5,167 raw NSCLC trials into a structured cohort with mechanism, target, FDA linkage, and outcome data.
Survivor-bias check: completion-to-termination ratios in the enriched 5,167-trial dataset match the raw CT.gov NSCLC corpus to within 2.3pp at every phase (Phase 1, 2, 3, and 4). The cohort itself remains a registry-survivor subset of all programs that ever entered Phase 2: programs that died before public disclosure are unrepresented. This is a property of the source data, not of the enrichment pass, and it inflates observed approval rates in calibration plots regardless of engine accuracy (see Limitations).
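One way to picture the per-phase check is to compare, for each phase, the completed share of trials in the enriched set against the raw corpus and take the largest gap. The sketch below assumes the compared quantity is completed / (completed + terminated), which the text does not spell out; the counts are hypothetical, not the real CT.gov figures.

```python
def completed_share(completed, terminated):
    """Completed trials as a percentage of completed plus terminated trials."""
    return 100.0 * completed / (completed + terminated)

def max_phase_gap_pp(enriched, raw):
    """Largest absolute per-phase difference, in percentage points."""
    return max(
        abs(completed_share(*enriched[phase]) - completed_share(*raw[phase]))
        for phase in raw
    )

# Hypothetical (completed, terminated) counts per phase.
raw = {"Phase 1": (400, 100), "Phase 2": (900, 300), "Phase 3": (300, 60)}
enriched = {"Phase 1": (395, 105), "Phase 2": (890, 310), "Phase 3": (295, 65)}

gap = max_phase_gap_pp(enriched, raw)
within_tolerance = gap <= 2.3   # the tolerance reported for the NSCLC cohort
```

A check of this kind only shows that enrichment did not add selection on top of the registry; it cannot detect programs that never reached the registry at all.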
Each drug is evaluated using only information available before its real-world decision point. No future data leaks into the model.
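A point-in-time constraint of this kind is commonly enforced by filtering every input on its publication date. The sketch below is an assumption about how such a filter could look; the `Evidence` record, the `visible_at` helper, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Evidence:
    description: str
    published: date

def visible_at(evidence, decision_date):
    """Keep only evidence published strictly before the decision date."""
    return [e for e in evidence if e.published < decision_date]

# Hypothetical records for a single drug.
records = [
    Evidence("Phase 1 safety readout", date(2010, 3, 1)),
    Evidence("Phase 2 interim analysis", date(2011, 9, 15)),
    Evidence("Phase 3 topline result", date(2014, 6, 30)),
]
decision = date(2012, 1, 1)
usable = visible_at(records, decision)   # the Phase 3 result is excluded
```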
Which multipliers are allowed to score. Each modifier in step 3 adds a degree of freedom, so PhaseFolio holds every scoring multiplier to a validation gate: a factor may move a probability only if a held-out cohort containing both approvals and failures can validate it; one that cannot is demoted to a non-scored, display-only risk flag. The gate is worked end-to-end on the antimicrobial cohort, where a pre-publication ablation demoted two of three candidate multipliers and the published scored AUC is the defensible 0.629 from the one validatable factor, not the uncheckable 0.797 the unvalidated pair would have shown. See the multiplier-governance gate and the antimicrobial Sprint-1 forensics.
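The precondition of the gate can be sketched as a simple check on the held-out cohort. This models only the requirement that both outcome classes are present; the statistical test a factor must then pass is not described here, so it is left out. All names and counts are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CandidateFactor:
    name: str                 # hypothetical label
    held_out_approvals: int   # approved drugs in the held-out cohort carrying the factor
    held_out_failures: int    # failed drugs in the held-out cohort carrying the factor

def gate(factor: CandidateFactor) -> str:
    """A factor may score only if both outcomes are available to validate it."""
    both_outcomes = factor.held_out_approvals > 0 and factor.held_out_failures > 0
    return "scored multiplier" if both_outcomes else "display-only risk flag"

candidates = [
    CandidateFactor("factor_a", held_out_approvals=9, held_out_failures=6),
    CandidateFactor("factor_b", held_out_approvals=7, held_out_failures=0),
    CandidateFactor("factor_c", held_out_approvals=0, held_out_failures=4),
]
decisions = {c.name: gate(c) for c in candidates}
```

In this toy run two of three candidates are demoted, the same shape as the antimicrobial ablation described above.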
Bars show the model's predicted cumulative probability of success at the decision point, sorted within group. Top 12 of 41 approved + top 12 of 18 failed shown for readability; full 59-drug cohort table follows.
The full validation scorecard — passes and fails. The engine is strong on the metrics that matter for a ranking screen (discrimination, failure flagging, no false confidence) and weak on absolute-level metrics (separation gap, threshold accuracy), which is the under-prediction documented in Calibration, below.
| Metric | Score | Target | Result |
|---|---|---|---|
| Pairwise AUC | 0.709 (523/738 pairs) | ≥0.60 | Pass |
| Phase-Controlled AUC | 0.709 | ≥0.55 | Pass |
| Separation Gap | +4.7pp (8.3% vs 3.6%) | ≥10pp | Fail |
| Risk Flag Sensitivity | 100% (18/18) | ≥70% | Pass |
| Risk Flag Enrichment | 1.04 (3.0 vs 2.9) | >1.0 | Pass |
| False Confidence (>25% PoS) | 0% (0/2) | <20% | Pass |
| Best Threshold Accuracy | 32.2% at PoS 30% | — | — |
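Two of the scorecard rows reduce to short arithmetic. The sketch below assumes the separation gap is the difference of mean predicted PoS between groups and that sensitivity is the share of failed drugs with at least one flag, as the row labels suggest; the function names and toy inputs are illustrative.

```python
from statistics import mean

def separation_gap_pp(success_pos, failure_pos):
    """Mean predicted PoS of successes minus failures, in percentage points."""
    return 100.0 * (mean(success_pos) - mean(failure_pos))

def risk_flag_sensitivity(failure_flag_counts):
    """Share of failed drugs that carried at least one risk flag."""
    flagged = sum(1 for n in failure_flag_counts if n >= 1)
    return flagged / len(failure_flag_counts)

# Toy values, not cohort data: means of 8.0% and 3.0%.
toy_gap = separation_gap_pp([0.10, 0.08, 0.06], [0.04, 0.02])
toy_sensitivity = risk_flag_sensitivity([3, 1, 4, 2])

# Published group means from the scorecard.
published_gap = round(8.3 - 3.6, 1)        # 4.7pp
meets_gap_target = published_gap >= 10     # target is at least 10pp
```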
A binary “invest if predicted PoS ≥ cutoff” rule performs poorly on this cohort: best accuracy 32.2% at a 30% cutoff. Because the engine predicts almost every drug below any reasonable cutoff, the rule correctly passes on all 18 failures but flags only 1 of 41 eventual approvals as “invest.” The actionable signal here is the ranking (pairwise AUC 0.709), not an absolute cutoff — the same reason absolute calibration trails discrimination.
| PoS Cutoff | Accuracy | Precision | Recall | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|
| 30% (best) | 32.2% | 100% | 2.4% | 1 | 18 | 0 | 40 |
| 35% | 30.5% | n/a | 0.0% | 0 | 18 | 0 | 41 |
| 40% | 30.5% | n/a | 0.0% | 0 | 18 | 0 | 41 |
| 42% | 30.5% | n/a | 0.0% | 0 | 18 | 0 | 41 |
| 45% | 30.5% | n/a | 0.0% | 0 | 18 | 0 | 41 |
| 48% | 30.5% | n/a | 0.0% | 0 | 18 | 0 | 41 |
| 50% | 30.5% | n/a | 0.0% | 0 | 18 | 0 | 41 |
| 55% | 30.5% | n/a | 0.0% | 0 | 18 | 0 | 41 |
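Each row of the threshold table follows from its four confusion counts. The sketch below recomputes two rows from the counts given in the table; `threshold_metrics` is an illustrative helper, not part of the engine.

```python
def threshold_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall for an 'invest if PoS >= cutoff' rule."""
    total = tp + tn + fp + fn
    predicted_positive = tp + fp
    return {
        "accuracy": (tp + tn) / total,
        # Precision has no defined value when nothing clears the cutoff.
        "precision": tp / predicted_positive if predicted_positive else None,
        "recall": tp / (tp + fn),
    }

row_30 = threshold_metrics(tp=1, tn=18, fp=0, fn=40)   # 30% cutoff
row_35 = threshold_metrics(tp=0, tn=18, fp=0, fn=41)   # 35% cutoff and above
```

At the 30% cutoff accuracy is 19/59 and recall is 1/41. From 35% upward no drug clears the cutoff, so accuracy falls back to the 18/59 share of failures and precision has no defined value, which the sketch returns as `None`.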

Full 59-drug cohort (41 approved, 18 failed):

| Drug | Brand or code name | Mechanism | Outcome |
|---|---|---|---|
| etoposide | VePesid | Topoisomerase II inhibitor | Approved |
| gemcitabine | Gemzar | Nucleoside analog chemotherapy | Approved |
| docetaxel | Taxotere | Taxane chemotherapy | Approved |
| cisplatin | Platinol | Platinum chemotherapy | Approved |
| carboplatin | Paraplatin | Platinum chemotherapy | Approved |
| paclitaxel | Taxol | Taxane chemotherapy | Approved |
| osimertinib | Tagrisso | EGFR tyrosine kinase inhibitor | Approved |
| erlotinib | Tarceva | EGFR tyrosine kinase inhibitor | Approved |
| vinorelbine | Navelbine | Vinca alkaloid chemotherapy | Approved |
| bevacizumab | Avastin | Anti-VEGF biologic | Approved |
| pemetrexed | Alimta | Antifolate chemotherapy | Approved |
| gefitinib | Iressa | EGFR tyrosine kinase inhibitor | Approved |
| mobocertinib | Exkivity | EGFR exon 20 insertion inhibitor | Approved |
| cemiplimab | Libtayo | Anti-PD-1 checkpoint inhibitor | Approved |
| lorlatinib | Lorbrena | ALK tyrosine kinase inhibitor | Approved |
| brigatinib | Alunbrig | ALK tyrosine kinase inhibitor | Approved |
| selpercatinib | Retevmo | RET selective inhibitor | Approved |
| capmatinib | Tabrecta | MET tyrosine kinase inhibitor | Approved |
| amivantamab | Rybrevant | EGFR/MET bispecific antibody | Approved |
| tepotinib | Tepmetko | MET tyrosine kinase inhibitor | Approved |
| entrectinib | Rozlytrek | ROS1/NTRK tyrosine kinase inhibitor | Approved |
| trastuzumab deruxtecan | Enhertu | HER2-directed ADC | Approved |
| larotrectinib | Vitrakvi | NTRK kinase inhibitor | Approved |
| ipilimumab | Yervoy | CTLA-4 checkpoint inhibitor | Approved |
| pralsetinib | Gavreto | RET selective inhibitor | Approved |
| crizotinib | Xalkori | ALK tyrosine kinase inhibitor | Approved |
| adagrasib | Krazati | KRAS G12C inhibitor | Approved |
| dacomitinib | Vizimpro | Pan-HER tyrosine kinase inhibitor | Approved |
| sotorasib | Lumakras | KRAS G12C inhibitor | Approved |
| tislelizumab | Tevimbra | Anti-PD-1 checkpoint inhibitor | Approved |
| afatinib | Gilotrif | EGFR tyrosine kinase inhibitor | Approved |
| ramucirumab | Cyramza | Anti-VEGFR2 monoclonal antibody | Approved |
| necitumumab | Portrazza | Anti-EGFR monoclonal antibody | Approved |
| datopotamab deruxtecan | Datroway | TROP2-directed ADC | Approved |
| pembrolizumab | Keytruda | Anti-PD-1 checkpoint inhibitor | Approved |
| ceritinib | Zykadia | ALK tyrosine kinase inhibitor | Approved |
| atezolizumab | Tecentriq | Anti-PD-L1 checkpoint inhibitor | Approved |
| durvalumab | Imfinzi | Anti-PD-L1 checkpoint inhibitor | Approved |
| alectinib | Alecensa | ALK tyrosine kinase inhibitor | Approved |
| nab-paclitaxel | Abraxane | Taxane chemotherapy (albumin-bound) | Approved |
| nivolumab | Opdivo | Anti-PD-1 checkpoint inhibitor | Approved |
| mage-a3 vaccine | MAGE-A3 ASCI | MAGE-A3 cancer vaccine | Failed (Ph 3) |
| figitumumab | Figitumumab (CP-751,871) | Anti-IGF-1R antibody | Failed (Ph 3) |
| aflibercept (nsclc) | Zaltrap (NSCLC arm; the approved colorectal cancer indication is separate) | VEGF trap (recombinant fusion protein) | Failed (Ph 3) |
| cabiralizumab | FPA008 | Anti-CSF1R antibody (TAM modulation) | Failed (Ph 2) |
| belagenpumatucel-l | Lucanix | TGF-beta2 antisense allogeneic tumor… | Failed (Ph 3) |
| veliparib (nsclc) | Veliparib (NSCLC) | PARP inhibitor | Failed (Ph 3) |
| cixutumumab | IMC-A12 | Anti-IGF-1R antibody | Failed (Ph 2) |
| dalotuzumab | MK-0646 | Anti-IGF-1R antibody | Failed (Ph 2) |
| rociletinib | CO-1686 | 3rd-gen EGFR T790M TKI | Failed (Ph 2) |
| selumetinib (nsclc) | Selumetinib (NSCLC arm; later approved as Koselugo in a different indication) | MEK1/2 inhibitor | Failed (Ph 3) |
| stimuvax | Stimuvax (tecemotide / L-BLP25) | MUC1 cancer vaccine | Failed (Ph 3) |
| talactoferrin | Talactoferrin alfa | Recombinant lactoferrin oral immunom… | Failed (Ph 3) |
| demcizumab | OMP-21M18 | Anti-DLL4 antibody (Notch pathway) | Failed (Ph 2) |
| bavituximab | Bavituximab (PGN401) | Anti-phosphatidylserine antibody | Failed (Ph 3) |
| custirsen | OGX-011 | Clusterin antisense oligonucleotide | Failed (Ph 3) |
| ganetespib | STA-9090 | HSP90 inhibitor | Failed (Ph 3) |
| patritumab | U3-1287 / patritumab | Anti-HER3 antibody | Failed (Ph 2) |
| tergenpumatucel-l | HyperAcute Lung | Allogeneic whole-cell vaccine | Failed (Ph 2) |
Discrimination (does the engine rank successes above failures?) and absolute calibration (does a predicted 20% correspond to a 20% real-world rate?) answer different questions and have different sensitivities to how the cohort was built. NSCLC discrimination is strong — pairwise AUC 0.709. Absolute calibration shows large positive gaps in every bucket: predicted midpoints sit well below observed approval rates.
| Predicted PoS Bucket | Drugs | Predicted Midpoint | Actual Approval Rate | Gap |
|---|---|---|---|---|
| 0-15% | 51 | 7.5% | 64.7% | +57.2pp |
| 15-30% | 7 | 22.5% | 100.0% | +77.5pp |
| 30-50% | 1 | 40.0% | 100.0% | +60.0pp |
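The Gap column is the observed approval rate minus the bucket midpoint. A short sketch reproduces it; the count of 33 approvals in the 0-15% bucket is inferred from 64.7% of 51 drugs rather than stated directly, and `calibration_gap_pp` is an illustrative name.

```python
def calibration_gap_pp(low, high, approved, total):
    """Observed approval rate minus predicted bucket midpoint, in pp."""
    midpoint = (low + high) / 2
    observed = 100.0 * approved / total
    return round(observed - midpoint, 1)

# (bucket low %, bucket high %, approved, total drugs)
buckets = [
    (0, 15, 33, 51),   # 33 approvals inferred from the 64.7% rate
    (15, 30, 7, 7),
    (30, 50, 1, 1),
]
gaps = [calibration_gap_pp(*b) for b in buckets]

# All failures must sit in the lowest bucket for the totals to add up.
failures_in_lowest = buckets[0][3] - buckets[0][2]
```

The two upper buckets hold only 8 drugs between them, so their 100% observed rates carry wide uncertainty.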
The gap is primarily a cohort-construction artifact, not an engine error. This is a registry-survivor set of Phase 2 entrants, so its observed approval rate is inflated relative to the population base rate the engine is calibrated to: programs that died before public disclosure are unrepresented. The checkpoint-inhibitor under-prediction documented in the case study is a second contributor. Crucially, all 18 failures fall in the lowest predicted-PoS bucket (0-15%), which is why ranking stays reliable (AUC 0.709) even though observed approval rates sit well above predicted levels in every bucket. We lead with discrimination because it is robust to this level shift; absolute calibration at this cohort size and construction is not. See the backtest methodology for the full discrimination-vs-calibration framing.
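The robustness of discrimination to a level shift can be shown directly: pairwise AUC depends only on rank order, so any strictly increasing rescaling of the predictions leaves it unchanged. The sketch below uses toy scores and a hypothetical `recalibrate` function; it is not a proposed recalibration of the engine.

```python
def pairwise_auc(successes, failures):
    """Concordant share of (success, failure) pairs; ties count as half."""
    wins = sum((s > f) + 0.5 * (s == f) for s in successes for f in failures)
    return wins / (len(successes) * len(failures))

def recalibrate(p):
    """Hypothetical level shift; strictly increasing, so order is preserved."""
    return 5.0 * p + 0.30

# Toy predicted PoS values, not cohort data.
successes = [0.12, 0.08, 0.05]
failures = [0.06, 0.03]

auc_before = pairwise_auc(successes, failures)
auc_after = pairwise_auc(
    [recalibrate(p) for p in successes],
    [recalibrate(p) for p in failures],
)
```

The shifted predictions are five times larger plus an offset, yet both AUC values are identical, which is why a ranking metric cannot reveal or penalize a calibration offset.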
Discrimination strong, calibration trails. Pairwise AUC of 0.709 validates that the engine reliably ranks NSCLC successes above failures. The separation gap (4.7pp between mean predicted PoS for successes and failures) and the 0–15% calibration bucket (51 drugs, 64.7% actual approval rate against a 7.5% predicted midpoint) reflect two distinct effects: (1) cohort survivor bias from registry-visible Phase 2 entrants inflates observed approval rates above the population base rate the engine is calibrated to, and (2) the checkpoint-inhibitor class shift documented in the case study. Class-specific NSCLC modifier tables and cohort expansion to registry-invisible Phase 2 programs are the planned next steps.
Engine version: PhaseFolio rNPV engine 1.0.0 (base BIO/QLS PoS path; published AUC unchanged through 2.6.0) · substrate methodology version: methodology@2026-06-17 · cohort built by the PhaseFolio AI enrichment pipeline (Claude agents cross-referencing ClinicalTrials.gov, FDA Drugs@FDA, PubMed, and web sources; no human medical officer), anchored to a 91-entry curated drug seed and checked for survivor bias to within 2.3pp at every phase against the raw CT.gov NSCLC corpus.