PhaseFolio Validation Study

Back-Test Results: NSCLC Drug Cohort

59 historical non-small-cell lung cancer drugs (Phase 2 entrants 1979–2024) evaluated against PhaseFolio's rNPV engine. Pairwise AUC of 0.709 across 738 ranking pairs (523 concordant) is the strongest discrimination signal in the published cohorts. Absolute calibration trails discrimination — an honest consequence of registry-survivor cohort construction, disclosed below.

2026-05-16 · 59 drugs (41 approved, 18 failed) · 10,000 MC iterations per drug

Pairwise AUC

0.709

target ≥0.60 · 523/738 concordant pairs

PASS

Phase-Controlled AUC

0.709

target ≥0.55 · controls for structural NDA/BLA advantage

PASS

Risk Flag Sensitivity

100%

18/18 failures flagged · target ≥70%

PASS

Separation Gap

4.7pp

successes 8.3% vs failures 3.6% · target ≥10pp

FAIL

Key finding: Pairwise AUC of 0.709 on 738 ranking pairs is the strongest discrimination signal in the published PhaseFolio cohorts: the engine ranks an NSCLC success above an NSCLC failure in 70.9% of success/failure pairs. Phase-controlled AUC matches at 0.709 because Phase 2 is the only phase with both successes and failures in this cohort, so the signal is not explained by a structural NDA/BLA advantage. Risk-flag sensitivity is 100%: every one of the 18 failed drugs carried at least one model-emitted risk flag at decision time.

Data Foundation

How We Built the Dataset

Raw ClinicalTrials.gov data lacks the drug-level structure needed for decision-point reconstruction. The NSCLC enrichment pipeline transformed 5,167 raw NSCLC trials into a structured cohort with mechanism, target, FDA linkage, and outcome data.

Ingest Raw CT.gov Data

192,411 interventional studies ingested via the ClinicalTrials.gov API. Linked condition mappings (420K rows) and intervention data (424K rows) are retained in the platform's data store.

Filter for NSCLC

5,167 unique NSCLC trials identified by condition text matching across Phase 1 through Phase 4, with enrollment windows spanning 1979 to 2024.

Cross-Reference 4 Data Sources

Each trial enriched by AI agent cross-referencing: ClinicalTrials.gov (structured fields), FDA Drugs@FDA (regulatory data + approval dates), PubMed (published efficacy), and web search (press releases, analyst reports). Confidence score computed per trial.

Drug-Class & MoA Mapping

Pharmacology mapping per drug: drug class (100% coverage), mechanism of action (57.8%), FDA application linkage (39.1%), modality, target. A curated seed list of 91 NSCLC drug entries anchored the canonical mapping.

Cohort Derivation (Approvals + Failures)

Approvals auto-derived from the platform's commercial-profile dataset where an NSCLC indication and FDA approval date are present (41 drugs). Failures manually curated from terminated Phase 2/3 NSCLC programs in the enrichment corpus (18 drugs). Combined cohort: 59 drugs.

Survivor Bias Verification

Completion-to-termination ratios compared between raw CT.gov NSCLC (5,167) and the enriched dataset. Survivorship gap ≤2.3pp at every clinical phase — confirming the enrichment process did not selectively retain successful trials.

5,167

NSCLC Trials

59

Cohort Drugs

41 / 18

Approved / Failed

738

Ranking Pairs

Drug Class Coverage: 100%

Mechanism of Action: 57.8%

FDA Application Linkage: 39.1%

Quantitative Efficacy Data: 5.3%

Survivor bias verified within ≤2.3pp at every phase: completion-to-termination ratios in the enriched 5,167-trial dataset match the raw CT.gov NSCLC corpus across Phase 1, 2, 3, and 4. The cohort itself remains a registry-survivor subset of all programs that ever entered Phase 2: programs that died before public disclosure are unrepresented. This is a property of the source data, not of the enrichment pass, and it inflates observed approval rates in calibration plots independent of engine accuracy (see Limitations).

Methodology

How the Back-Test Works

Each drug is evaluated using only information available before its real-world decision point. No future data leaks into the model.

Reconstruct Decision Point

For each drug, the decision date is the earliest Phase 3 NSCLC trial start minus 12 months (or the FDA approval date minus 4 years where no Phase 3 trial is registered). This date defines the information frontier the model is allowed to see.
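The decision-date rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PhaseFolio's implementation: the function names `minus_months` and `decision_date` are hypothetical, and clamping the day to the end of a shorter month is an assumption the source does not specify.

```python
import calendar
from datetime import date
from typing import Optional, Sequence


def minus_months(d: date, months: int) -> date:
    """Shift a date back by whole months, clamping the day to the target month's length."""
    total = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def decision_date(phase3_starts: Sequence[date],
                  fda_approval: Optional[date]) -> Optional[date]:
    """Earliest Phase 3 start minus 12 months, else FDA approval minus 4 years."""
    if phase3_starts:
        return minus_months(min(phase3_starts), 12)
    if fda_approval is not None:
        return minus_months(fda_approval, 48)
    return None  # no anchor available: the drug cannot be placed on the timeline
```

With made-up inputs, a drug whose Phase 3 trials started on 2015-06-15 and 2014-03-01 would get a decision date of 2013-03-01.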

Apply BIO/QLS Base Rates

Phase-by-phase transition probabilities sourced from BIO/QLS 2021 oncology cohort (12,728+ stage transitions). Same base table used in production rNPV.

Apply Modifiers via Logistic Path

Genetic validation, biomarker strategy, orphan designation, first-in-class flags applied through the log-odds path to keep PoS bounded in [0,1]. Multipliers gated by source-publication date so post-decision evidence cannot leak.
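One common way to apply multipliers on a log-odds path is to treat each one as an odds multiplier, which keeps the adjusted probability strictly inside (0, 1). The sketch below assumes that interpretation; the source does not give the exact functional form. The `Modifier` type, its field names, and the choice to treat evidence published on the decision date as visible are all hypothetical.

```python
import math
from datetime import date
from typing import Iterable, NamedTuple


class Modifier(NamedTuple):
    name: str
    odds_multiplier: float  # hypothetical: >1 raises PoS, <1 lowers it
    published: date         # publication date of the supporting evidence


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def inv_logit(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def adjust_pos(base_pos: float, modifiers: Iterable[Modifier],
               decision_date: date) -> float:
    """Apply date-gated modifiers in log-odds space."""
    if not 0.0 < base_pos < 1.0:
        raise ValueError("base_pos must be strictly between 0 and 1")
    z = logit(base_pos)
    for m in modifiers:
        # Gate: evidence published after the decision date is ignored.
        # Counting same-day evidence as visible is an assumption.
        if m.published <= decision_date:
            z += math.log(m.odds_multiplier)
    return inv_logit(z)
```

Because the adjustment is additive in log-odds, even a very large multiplier moves the result toward 1 without ever exceeding it, which a direct multiplication of probabilities would not guarantee.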

Risk-Flag Emission

Generic risk flags (HIGH_COMPETITION, LIMITED_TRIAL_DATA, FIRST_IN_CLASS_RISK) emitted from enrichment-corpus counts and cohort metadata at the decision date. Class-specific NSCLC risk tables not yet populated.

Run rNPV Engine + Monte Carlo

10,000 iterations per drug with Bernoulli stage gates. Same production engine used by PhaseFolio customers. Per-drug output: predicted cumulative PoS, rNPV, eNPV, MC percentiles.
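The Bernoulli stage-gate idea can be illustrated with a small simulation: each iteration draws one pass/fail per stage, and a drug succeeds only if every gate passes. This is a sketch of the sampling step only, under stated assumptions. It does not model cash flows, rNPV, or eNPV, the function name is hypothetical, and any stage probabilities passed in are the caller's placeholders rather than BIO/QLS values.

```python
import random
from typing import Sequence


def simulate_cumulative_pos(stage_probs: Sequence[float],
                            iterations: int = 10_000,
                            seed: int = 0) -> float:
    """Estimate cumulative PoS as the share of iterations passing every stage gate."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    successes = 0
    for _ in range(iterations):
        # all() stops at the first failed gate, mirroring a program halting there.
        if all(rng.random() < p for p in stage_probs):
            successes += 1
    return successes / iterations
```

For independent gates the estimate converges on the product of the stage probabilities. The simulation becomes useful once per-iteration quantities such as value percentiles are tracked alongside the pass/fail outcome.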

Score Against Actual Outcomes

Pairwise AUC over all 41×18 = 738 success/failure pairs. Phase-controlled AUC within Phase 2 (the only phase with both successes and failures in the cohort). Risk-flag sensitivity = fraction of failures flagged at decision.
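The pairwise AUC described above is the fraction of success/failure pairs in which the success received the higher predicted PoS. A minimal sketch follows; the function name is hypothetical, and counting tied pairs as non-concordant is an assumption, since the source does not state how ties are handled.

```python
from itertools import product
from typing import Sequence, Tuple


def pairwise_auc(success_scores: Sequence[float],
                 failure_scores: Sequence[float]) -> Tuple[int, int, float]:
    """Return (concordant pairs, total pairs, AUC) over all success x failure pairs."""
    pairs = list(product(success_scores, failure_scores))
    if not pairs:
        raise ValueError("need at least one success and one failure")
    # A pair is concordant when the success outranks the failure (ties excluded).
    concordant = sum(1 for s, f in pairs if s > f)
    return concordant, len(pairs), concordant / len(pairs)
```

Applied to the counts reported for this cohort, 41 approvals and 18 failures give 41 × 18 = 738 pairs, and 523 concordant pairs give 523 / 738 ≈ 0.709.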

Which multipliers are allowed to score. Each modifier in step 3 adds a degree of freedom, so PhaseFolio holds every scoring multiplier to a validation gate: a factor may move a probability only if a held-out cohort containing both approvals and failures can validate it; one that cannot is demoted to a non-scored, display-only risk flag. The gate is worked end-to-end on the antimicrobial cohort, where a pre-publication ablation demoted two of three candidate multipliers and the published scored AUC is the defensible 0.629 from the one validatable factor, not the uncheckable 0.797 the unvalidated pair would have shown. See the multiplier-governance gate and the antimicrobial Sprint-1 forensics.
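The precondition behind the gate can be sketched as a simple check. This illustration covers only the eligibility test (a held-out cohort must contain both outcome classes, since discrimination cannot be measured on one class alone), not the validation itself. The function name, outcome labels, and return values are hypothetical.

```python
from typing import Iterable


def multiplier_role(heldout_outcomes: Iterable[str]) -> str:
    """Classify a candidate multiplier by whether a held-out cohort could validate it."""
    classes = set(heldout_outcomes)
    if {"approved", "failed"} <= classes:
        return "eligible_for_validation"
    # Without both classes no held-out check is possible: show the flag, do not score it.
    return "display_only"
```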

Results

Predicted Cumulative PoS by Drug

Bars show the model's predicted cumulative probability of success at the decision point, sorted within group. Top 12 of 41 approved + top 12 of 18 failed shown for readability; full 59-drug cohort table follows.

Approved — top 12 of 41

Drug	Brand	Mechanism	Predicted PoS
etoposide	VePesid	Topoisomerase II inhibitor	33.4%
gemcitabine	Gemzar	Nucleoside analog chemotherapy	25.4%
docetaxel	Taxotere	Taxane chemotherapy	21.0%
cisplatin	Platinol	Platinum chemotherapy	19.4%
carboplatin	Paraplatin	Platinum chemotherapy	19.4%
paclitaxel	Taxol	Taxane chemotherapy	19.4%
osimertinib	Tagrisso	EGFR tyrosine kinase inhibitor	18.0%
erlotinib	Tarceva	EGFR tyrosine kinase inhibitor	15.7%
vinorelbine	Navelbine	Vinca alkaloid chemotherapy	12.7%
bevacizumab	Avastin	Anti-VEGF biologic	12.6%
pemetrexed	Alimta	Antifolate chemotherapy	11.9%
gefitinib	Iressa	EGFR tyrosine kinase inhibitor	11.7%

Failed — top 12 of 18

Drug	Brand	Mechanism	Predicted PoS
mage-a3 vaccine	MAGE-A3 ASCI	MAGE-A3 cancer vaccine	6.8%
figitumumab	Figitumumab (CP-751,871)	Anti-IGF-1R antibody	5.4%
aflibercept (nsclc)	Zaltrap (NSCLC arm) / approved Zaltrap CRC is separate	VEGF trap (recombinant fusion protein)	5.4%
cabiralizumab	FPA008	Anti-CSF1R antibody (TAM modulation)	5.2%
belagenpumatucel-l	Lucanix	TGF-beta2 antisense allogeneic tumor…	4.8%
veliparib (nsclc)	Veliparib (NSCLC)	PARP inhibitor	4.3%
cixutumumab	IMC-A12	Anti-IGF-1R antibody	4.0%
dalotuzumab	MK-0646	Anti-IGF-1R antibody	4.0%
rociletinib	CO-1686	3rd-gen EGFR T790M TKI	3.7%
selumetinib (nsclc)	Selumetinib (NSCLC arm) / later Koselugo (different indication)	MEK1/2 inhibitor	3.3%
stimuvax	Stimuvax (tecemotide / L-BLP25)	MUC1 cancer vaccine	3.1%
talactoferrin	Talactoferrin alfa	Recombinant lactoferrin oral immunom…	3.1%

Mean PoS (approved): 8.3% · Mean PoS (failed): 3.6% · Separation: +4.7pp · Pairwise AUC: 0.709

Scorecard

Aggregate Accuracy Metrics

The full validation scorecard — passes and fails. The engine is strong on the metrics that matter for a ranking screen (discrimination, failure flagging, no false confidence) and weak on absolute-level metrics (separation gap, threshold accuracy), which is the under-prediction documented in Calibration, below.

Metric	Score	Target	Result
Pairwise AUC	0.709 (523/738 pairs)	≥0.60	Pass
Phase-Controlled AUC	0.709	≥0.55	Pass
Separation Gap	+4.7pp (8.3% vs 3.6%)	≥10pp	Fail
Risk Flag Sensitivity	100% (18/18)	≥70%	Pass
Risk Flag Enrichment	1.04 (3.0 vs 2.9)	>1.0	Pass
False Confidence (>25% PoS)	0% (0/2)	<20%	Pass
Best Threshold Accuracy	32.2% at PoS 30%	—	—

Go / No-Go Threshold Analysis — why we rank, not threshold

A binary “invest if predicted PoS ≥ cutoff” rule performs poorly on this cohort: best accuracy 32.2% at a 30% cutoff. Because the engine predicts almost every drug below any reasonable cutoff, the rule correctly passes on all 18 failures but flags only 1 of 41 eventual approvals as “invest.” The actionable signal here is the ranking (pairwise AUC 0.709), not an absolute cutoff — the same reason absolute calibration trails discrimination.

PoS Cutoff	Accuracy	Precision	Recall	TP	TN	FN
30% (best)	32.2%	100%	2.4%	1	18	40
35%	30.5%	0%	0.0%	0	18	41
40%	30.5%	0%	0.0%	0	18	41
42%	30.5%	0%	0.0%	0	18	41
45%	30.5%	0%	0.0%	0	18	41
48%	30.5%	0%	0.0%	0	18	41
50%	30.5%	0%	0.0%	0	18	41
55%	30.5%	0%	0.0%	0	18	41
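The table's accuracy, precision, and recall follow directly from the confusion counts. The sketch below recomputes them; the function name is hypothetical, and the false-positive count of 0 is inferred from the statement that the rule correctly passes on all 18 failures. Precision is returned as `None` when no drug is flagged, because the ratio is undefined there; the table prints 0% for those rows.

```python
from typing import Dict, Optional


def threshold_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, Optional[float]]:
    """Accuracy, precision, and recall (in percent) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    flagged = tp + fp     # drugs the rule marks as "invest"
    positives = tp + fn   # drugs that were eventually approved
    return {
        "accuracy": round(100.0 * (tp + tn) / total, 1),
        "precision": round(100.0 * tp / flagged, 1) if flagged else None,
        "recall": round(100.0 * tp / positives, 1) if positives else None,
    }


# 30% cutoff: one approval flagged, all 18 failures correctly passed over.
at_30 = threshold_metrics(tp=1, tn=18, fp=0, fn=40)
# 35% cutoff and above: nothing is flagged at all.
at_35 = threshold_metrics(tp=0, tn=18, fp=0, fn=41)
```

At the 30% cutoff this gives (1 + 18) / 59 = 32.2% accuracy and 1 / 41 = 2.4% recall. At 35% and above, accuracy falls to 18 / 59 = 30.5%, which is simply the share of failures in the cohort.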

Cohort

59-Drug NSCLC Back-Test Cohort

Drug	Brand	Mechanism	Outcome
etoposide	VePesid	Topoisomerase II inhibitor	Approved
gemcitabine	Gemzar	Nucleoside analog chemotherapy	Approved
docetaxel	Taxotere	Taxane chemotherapy	Approved
cisplatin	Platinol	Platinum chemotherapy	Approved
carboplatin	Paraplatin	Platinum chemotherapy	Approved
paclitaxel	Taxol	Taxane chemotherapy	Approved
osimertinib	Tagrisso	EGFR tyrosine kinase inhibitor	Approved
erlotinib	Tarceva	EGFR tyrosine kinase inhibitor	Approved
vinorelbine	Navelbine	Vinca alkaloid chemotherapy	Approved
bevacizumab	Avastin	Anti-VEGF biologic	Approved
pemetrexed	Alimta	Antifolate chemotherapy	Approved
gefitinib	Iressa	EGFR tyrosine kinase inhibitor	Approved
mobocertinib	Exkivity	EGFR exon 20 insertion inhibitor	Approved
cemiplimab	Libtayo	Anti-PD-1 checkpoint inhibitor	Approved
lorlatinib	Lorbrena	ALK tyrosine kinase inhibitor	Approved
brigatinib	Alunbrig	ALK tyrosine kinase inhibitor	Approved
selpercatinib	Retevmo	RET selective inhibitor	Approved
capmatinib	Tabrecta	MET tyrosine kinase inhibitor	Approved
amivantamab	Rybrevant	EGFR/MET bispecific antibody	Approved
tepotinib	Tepmetko	MET tyrosine kinase inhibitor	Approved
entrectinib	Rozlytrek	ROS1/NTRK tyrosine kinase inhibitor	Approved
trastuzumab deruxtecan	Enhertu	HER2-directed ADC	Approved
larotrectinib	Vitrakvi	NTRK kinase inhibitor	Approved
ipilimumab	Yervoy	CTLA-4 checkpoint inhibitor	Approved
pralsetinib	Gavreto	RET selective inhibitor	Approved
crizotinib	Xalkori	ALK tyrosine kinase inhibitor	Approved
adagrasib	Krazati	KRAS G12C inhibitor	Approved
dacomitinib	Vizimpro	Pan-HER tyrosine kinase inhibitor	Approved
sotorasib	Lumakras	KRAS G12C inhibitor	Approved
tislelizumab	Tevimbra	PD-1 checkpoint inhibitor	Approved
afatinib	Gilotrif	EGFR tyrosine kinase inhibitor	Approved
ramucirumab	Cyramza	Anti-VEGFR2 monoclonal antibody	Approved
necitumumab	Portrazza	Anti-EGFR monoclonal antibody	Approved
datopotamab deruxtecan	Datroway	TROP2-directed ADC	Approved
pembrolizumab	Keytruda	Anti-PD-1 checkpoint inhibitor	Approved
ceritinib	Zykadia	ALK tyrosine kinase inhibitor	Approved
atezolizumab	Tecentriq	Anti-PD-L1 checkpoint inhibitor	Approved
durvalumab	Imfinzi	Anti-PD-L1 checkpoint inhibitor	Approved
alectinib	Alecensa	ALK tyrosine kinase inhibitor	Approved
nab-paclitaxel	Abraxane	Taxane chemotherapy (albumin-bound)	Approved
nivolumab	Opdivo	Anti-PD-1 checkpoint inhibitor	Approved
mage-a3 vaccine	MAGE-A3 ASCI	MAGE-A3 cancer vaccine	Failed (Ph 3)
figitumumab	Figitumumab (CP-751,871)	Anti-IGF-1R antibody	Failed (Ph 3)
aflibercept (nsclc)	Zaltrap (NSCLC arm) / approved Zaltrap CRC is separate	VEGF trap (recombinant fusion protein)	Failed (Ph 3)
cabiralizumab	FPA008	Anti-CSF1R antibody (TAM modulation)	Failed (Ph 2)
belagenpumatucel-l	Lucanix	TGF-beta2 antisense allogeneic tumor…	Failed (Ph 3)
veliparib (nsclc)	Veliparib (NSCLC)	PARP inhibitor	Failed (Ph 3)
cixutumumab	IMC-A12	Anti-IGF-1R antibody	Failed (Ph 2)
dalotuzumab	MK-0646	Anti-IGF-1R antibody	Failed (Ph 2)
rociletinib	CO-1686	3rd-gen EGFR T790M TKI	Failed (Ph 2)
selumetinib (nsclc)	Selumetinib (NSCLC arm) / later Koselugo (different indication)	MEK1/2 inhibitor	Failed (Ph 3)
stimuvax	Stimuvax (tecemotide / L-BLP25)	MUC1 cancer vaccine	Failed (Ph 3)
talactoferrin	Talactoferrin alfa	Recombinant lactoferrin oral immunom…	Failed (Ph 3)
demcizumab	OMP-21M18	Anti-DLL4 antibody (Notch pathway)	Failed (Ph 2)
bavituximab	Bavituximab (PGN401)	Anti-phosphatidylserine antibody	Failed (Ph 3)
custirsen	OGX-011	Clusterin antisense oligonucleotide	Failed (Ph 3)
ganetespib	STA-9090	HSP90 inhibitor	Failed (Ph 3)
patritumab	U3-1287 / patritumab	Anti-HER3 antibody	Failed (Ph 2)
tergenpumatucel-l	HyperAcute Lung	Allogeneic whole-cell vaccine	Failed (Ph 2)

Case Studies

Deep Dives

Strongest No-Go Signal

Tergenpumatucel-L

Allogeneic whole-cell NSCLC vaccine · NewLink Genetics · Decision: 2009

PhaseFolio assigned the lowest cumulative PoS in the cohort (1%) with multiple risk flags reflecting first-in-class allogeneic-vaccine modality, limited prior precedent in NSCLC, and the small clinical footprint visible at decision time. Monte Carlo distribution skewed heavily negative.

Actual outcome: Phase 2 terminated for lack of overall-survival benefit; program discontinued. Vaccine modality has yet to deliver an NSCLC approval.

1%

Predicted PoS

Failed

Phase 2

0.709

Cohort AUC

Honest Limitation

Checkpoint Inhibitors Under-Predicted

Anti-PD-1 / PD-L1 modality class · Identified in this cohort

Nivolumab (Opdivo), atezolizumab (Tecentriq), and durvalumab (Imfinzi) each received a predicted cumulative PoS of 1–2% at their NSCLC Phase 2 decision points, placing them at the bottom of the ranking. All three were approved. The model penalized them through the cumulative-LoA prior on the “biologic” modality bucket, without crediting the structural rise in success rates that the checkpoint-inhibitor class was producing in oncology.

We disclose this rather than hide it: pairwise AUC stays strong because most checkpoint-inhibitor approvals still rank above the model's correctly-flagged failures. Class-specific NSCLC modifier tables (analogous to RA's anti-TNF / JAK / IL-6 tables) are the planned correction.

1–2%

Predicted PoS

3 / 3

Approved (FDA)

Open

Modifier Table

Calibration

Discrimination vs. Absolute Calibration

Discrimination (does the engine rank successes above failures?) and absolute calibration (does a predicted 20% correspond to a 20% real-world rate?) answer different questions and have different sensitivities to how the cohort was built. NSCLC discrimination is strong — pairwise AUC 0.709. Absolute calibration shows large positive gaps in every bucket: predicted midpoints sit well below observed approval rates.

Predicted PoS Bucket	Drugs	Predicted Midpoint	Actual Approval Rate	Gap
0-15%	51	7.5%	64.7%	+57.2pp
15-30%	7	22.5%	100.0%	+77.5pp
30-50%	1	40.0%	100.0%	+60.0pp

The gap is primarily a cohort-construction artifact, not an engine error: this is a registry-survivor set of Phase 2 entrants, whose observed approval rate is inflated relative to the population base rate the engine is calibrated to, because programs that died before public disclosure are unrepresented. The checkpoint-inhibitor under-prediction documented above is a second contributor. Crucially, all 18 failures fall in the lowest predicted-PoS bucket, which is why ranking stays reliable (AUC 0.709) even though observed approval rates sit well above predicted levels in every bucket. We lead with discrimination because it is robust to this level shift; absolute calibration at this cohort size and construction is not. See the backtest methodology for the full discrimination-vs-calibration framing.
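The gap column in the calibration table is the actual approval rate minus the bucket's predicted midpoint. The sketch below reproduces it. The per-bucket approval counts (33, 7, 1) are not listed in the table; they are inferred from the drug counts and the reported approval rates, with the 18 failures all sitting in the 0-15% bucket. The names `BUCKETS` and `calibration_gaps` are hypothetical.

```python
from typing import List, Tuple

# (bucket label, drugs in bucket, approvals in bucket [inferred], predicted midpoint in %)
BUCKETS: List[Tuple[str, int, int, float]] = [
    ("0-15%", 51, 33, 7.5),
    ("15-30%", 7, 7, 22.5),
    ("30-50%", 1, 1, 40.0),
]


def calibration_gaps(buckets: List[Tuple[str, int, int, float]]
                     ) -> List[Tuple[str, float, float]]:
    """Return (label, actual approval rate %, gap in pp) for each bucket."""
    rows = []
    for label, drugs, approved, midpoint in buckets:
        actual = 100.0 * approved / drugs
        rows.append((label, round(actual, 1), round(actual - midpoint, 1)))
    return rows


gaps = calibration_gaps(BUCKETS)
```

For the lowest bucket this gives 33 / 51 = 64.7% actual against a 7.5% midpoint, a gap of +57.2pp, matching the table.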

Limitations

Discrimination strong, calibration trails. Pairwise AUC of 0.709 validates that the engine ranks NSCLC successes above failures reliably. The separation gap (4.7pp between mean predicted PoS for successes vs failures) and the 0–15% calibration bucket (51 drugs predicted, 64.7% actual approval rate) reflect two distinct effects: (1) cohort survivor bias from registry-visible Phase 2 entrants inflates observed approval rates above the population base rate the engine is calibrated to, and (2) the checkpoint-inhibitor class shift documented in the case study. Class-specific NSCLC modifier tables and cohort expansion to registry-invisible Phase 2 programs are the planned next steps. See the backtest methodology for full discrimination-vs-calibration framing.

Engine version: PhaseFolio rNPV engine 1.0.0 (base BIO/QLS PoS path; published AUC unchanged through 2.6.0) · substrate methodology version: methodology@2026-07-28-v2 · cohort built by the PhaseFolio AI enrichment pipeline (Claude agents cross-referencing ClinicalTrials.gov, FDA Drugs@FDA, PubMed, and web sources; no human medical officer), anchored to a 91-entry curated drug seed and survivor-bias-verified within ≤2.3pp at every phase against the raw CT.gov NSCLC corpus.