PhaseFolio Validation Study

Back-Test Report: Rheumatoid Arthritis Drug Cohort

A retrospective back-test of PhaseFolio's rNPV engine against a cohort of 16 historical RA drugs, using indication-specific transition rates from 679 curated clinical trials. The AUC of 0.625 is an early directional signal at n=16, not a confirmatory result: the Wilson 95% accuracy intervals span chance.

Date: 2026-05-29
Cohort: 16 drugs (8 approved, 8 failed)
Data: 679 enriched trials, 71 drugs
Simulations: 160,000 Monte Carlo iterations (16 drugs × 10,000)

1. Executive Summary

The model achieved a pairwise AUC of 0.625 (passing the 0.60 threshold), meaning a randomly chosen eventual success outranks a randomly chosen failure 62.5% of the time. The phase-controlled AUC of 0.65 (target 0.55) confirms the signal holds within decision phase, controlling for the structural advantage later-stage decisions carry. Risk flag sensitivity reached 87.5% (7/8 failures flagged). At the best operating point — a PoS cutoff of 30% — the model achieved 62.5% accuracy with 66.7% precision and 50% recall. Discrimination passes, but calibration is weak at n=16: the separation gap (+8.4pp) and the false-confidence rate at the 25% cut (50%) both fail their targets (see §6). Every PoS multiplier is held to a validation gate: a factor may score the engine only if a held-out cohort with both approvals and failures can confirm it, otherwise it is demoted to a display-only flag rather than moving the number (see §2.4).

Pairwise AUC: 0.625 (target: 0.60)
Phase-Controlled AUC: 0.65 (target: 0.55)
Risk Flag Sensitivity: 87.5% (target: 70%)
Best-Threshold Accuracy: 62.5% (at PoS 30%)

95% Wilson confidence intervals (n=16). Conventional ≥50% cut: 9/16 correct calls → 56.3% [33.2%–76.9%]. Optimal ≥30% cut: 10/16 correct calls → 62.5% [38.6%–81.5%]. Wilson is preferred over normal-approximation at small N because it does not produce nonsensical bounds at the extremes. AUC point estimates are reported without an interval here — small-N AUC requires a different methodology (DeLong or bootstrap), which we report in the methodology appendix rather than inline.
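The two intervals quoted above can be reproduced with a few lines of standard-library Python. This is a minimal sketch of the textbook Wilson score interval; the function name `wilson_interval` is illustrative and not part of PhaseFolio's codebase.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Correct calls out of 16 at the two cuts discussed in the text
for label, correct in [(">=50% cut", 9), (">=30% cut", 10)]:
    lo, hi = wilson_interval(correct, 16)
    print(f"{label}: {correct}/16 = {correct / 16:.1%} [{lo:.1%}, {hi:.1%}]")
```

Both intervals contain 50%, which is the sense in which they "span chance" for a balanced 8/8 cohort.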

2. Methodology

2.1 Core Principle: No Future Information

The back-test simulates the decision an investor or founder would have faced at the time — using only information available at each drug's go/no-go moment. No post-hoc data (trial results, FDA decisions, commercial outcomes) leaks into the inputs. This is not a prediction of the future; it is a reconstruction of the past with the tools available today.

2.2 How the Back-Test Works

Curate clinical trial data

679 RA trials enriched from CT.gov + FDA + PubMed + web. 71 distinct drugs, 45 structured fields per trial in the enrichment corpus.

Compute drug-level transition rates

Time-gated rates from the enrichment corpus. Drug-level counting (did drug X advance?). 3-tier fallback: drug-class (n>=5) then RA-overall then BIO/QLS 2021.

Reconstruct the decision point

Identify what was known at each drug’s go/no-go moment. Phase completed, costs, competitive landscape, target validation history.

Apply target validation multiplier

Count prior FDA approvals in same drug class: 0 approvals = 0.60x, 1 = 1.0x, 2+ = 1.15x. Applied via logistic adjustment.

Adjust for competitive density

Count same-class competitors at decision date. 0-3: no adjustment, 4-6: 0.95x, 7-10: 0.90x, 11+: 0.85x.

Run the rNPV engine

Stage costs, durations, probability-weighted cash flows, peak revenue, WACC. Same production engine used by PhaseFolio customers.

Run Monte Carlo

10,000 iterations with rpNPV mode (Bernoulli stage gates). Produces P10/P50/P90 distribution and P(negative) probability.

Score against outcomes

Pairwise AUC, phase-controlled AUC, threshold sweep, risk flag metrics. Compare to known approval/failure outcomes.
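The Monte Carlo step can be sketched as follows. This is an illustration of Bernoulli stage gates under simplifying assumptions, not the production engine: stage costs are paid in full at stage start, the commercial value is collapsed into one `launch_value` at launch, and every input number below is hypothetical.

```python
import random

def simulate_rpnpv(stages, launch_value, wacc=0.10, iterations=10_000, seed=0):
    """Monte Carlo with Bernoulli stage gates.

    stages: ordered list of (cost, duration_years, transition_prob).
    launch_value: commercial cash flows already discounted to the launch date.
    Returns one simulated NPV per iteration, in the units of the inputs.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(iterations):
        npv, t, alive = 0.0, 0.0, True
        for cost, duration, prob in stages:
            npv -= cost / (1 + wacc) ** t   # assumed: cost paid at stage start
            t += duration
            if rng.random() >= prob:        # Bernoulli gate: program stops here
                alive = False
                break
        if alive:
            npv += launch_value / (1 + wacc) ** t
        results.append(npv)
    return results

def percentile(sorted_values, q):
    """Nearest-rank percentile on a pre-sorted list, q in [0, 100]."""
    idx = min(len(sorted_values) - 1, int(q / 100 * len(sorted_values)))
    return sorted_values[idx]

# Hypothetical asset, $M: (cost, years, transition probability) per stage
stages = [(40, 2.0, 0.35), (150, 3.0, 0.55), (5, 1.0, 0.90)]
npvs = sorted(simulate_rpnpv(stages, launch_value=3000))
p10, p50, p90 = (percentile(npvs, q) for q in (10, 50, 90))
p_negative = sum(v < 0 for v in npvs) / len(npvs)
```

In a gate-based simulation of this kind, any asset whose cumulative PoS is below 50% has a failed program as its median path, so a negative P50 can sit next to a positive probability-weighted rNPV.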

2.3 PoS Sources

The back-test draws PoS from two sources, combined through the 3-tier fallback in §2.2 (drug class, then RA overall, then BIO/QLS 2021):

Computed rates (primary): Drug-level transition rates from the enrichment corpus, time-gated. Used when n>=5 drugs at that phase.
BIO/QLS 2021 benchmarks (fallback): Static academic rates for immunology. Used for early decision dates (pre-2005) or small samples.

Target validation multiplier:

Prior Class Approvals	Multiplier	Rationale
0 (unvalidated)	0.60x	No proof this mechanism works in RA
1 (single proof)	1.0x	Baseline
2+ (validated)	1.15x	Multiple approvals confirm pathway
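The report states that this multiplier is "applied via logistic adjustment" but does not give the formula. One plausible reading, sketched here as an assumption, is to add ln(multiplier) to the log-odds of PoS, which is equivalent to multiplying the odds and keeps the result inside (0, 1). The function names are illustrative.

```python
import math

def target_validation_multiplier(prior_class_approvals: int) -> float:
    """Multiplier tiers from the table above."""
    if prior_class_approvals == 0:
        return 0.60
    if prior_class_approvals == 1:
        return 1.0
    return 1.15

def adjust_pos_logit(pos: float, multiplier: float) -> float:
    """Assumed form of the adjustment: shift the log-odds by ln(multiplier)."""
    logit = math.log(pos / (1 - pos)) + math.log(multiplier)
    return 1 / (1 + math.exp(-logit))

# Hypothetical base rates, for illustration only
unvalidated = adjust_pos_logit(0.30, target_validation_multiplier(0))
validated = adjust_pos_logit(0.90, target_validation_multiplier(3))
```

Under this reading a 30% base PoS with the 0.60x multiplier becomes about 20.5% rather than the 18% that plain multiplication would give, and a 90% base PoS with 1.15x stays below 100% instead of overshooting to 103.5%.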

Time-gated academic multipliers:

Multiplier	Value	Available
Orphan Drug	1.5x	Always
Biomarker Enrichment	1.5x	After 2015
Companion Diagnostic	2.0x	After 2015
Genetic Association	2.6x	After 2024

2.4 Risk Flags

Six risk flags are evaluated for each drug. Four affect PoS calculations via multiplicative adjustments; two are display-only informational flags.

Flag	Multiplier	Trigger
SAFETY_CLASS_SIGNAL	0.80x	Class safety concerns at decision date
LIMITED_TRIAL_DATA	0.90x	<3 trials found
HIGH_COMPETITION	0.90x	>5 same-class competitors
LATE_ENTRANT	0.90x	>2 same-class drugs already approved
FIRST_IN_CLASS_RISK	display only	No prior approval in class
NOVEL_MODALITY	display only	<3 RA approvals for modality
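A minimal sketch of how the four scoring flags could combine. The report says the adjustments are multiplicative but does not say how several flags stack or in which space they are applied; the simple product below is an assumption, and the names `SCORING_FLAGS` and `combined_flag_multiplier` are illustrative.

```python
# Multipliers copied from the table above
SCORING_FLAGS = {
    "SAFETY_CLASS_SIGNAL": 0.80,
    "LIMITED_TRIAL_DATA": 0.90,
    "HIGH_COMPETITION": 0.90,
    "LATE_ENTRANT": 0.90,
}
DISPLAY_ONLY_FLAGS = {"FIRST_IN_CLASS_RISK", "NOVEL_MODALITY"}

def combined_flag_multiplier(flags):
    """Product of the scoring-flag multipliers (assumed stacking rule).

    Display-only flags are accepted but leave the multiplier unchanged.
    """
    multiplier = 1.0
    for flag in flags:
        if flag in SCORING_FLAGS:
            multiplier *= SCORING_FLAGS[flag]
        elif flag not in DISPLAY_ONLY_FLAGS:
            raise ValueError(f"unknown flag: {flag}")
    return multiplier

# A drug carrying two scoring flags and one display-only flag
example = combined_flag_multiplier(
    ["HIGH_COMPETITION", "NOVEL_MODALITY", "SAFETY_CLASS_SIGNAL"]
)
```

Keeping display-only flags in a separate set makes the demotion described in the next paragraph a one-line change: move the flag from one collection to the other.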

Which multipliers are allowed to score. Every scoring factor above adds a degree of freedom, so PhaseFolio holds each to a validation gate: a multiplier may score the engine only if a held-out cohort containing both approvals and failures can validate it; one that cannot is demoted to a display-only flag rather than allowed to move the number. (This report already separates four scoring flags from two display-only ones.) The gate is worked end-to-end on the antimicrobial cohort, where a pre-publication ablation demoted two of three candidate multipliers and we published the lower, defensible AUC of 0.629 rather than the most flattering 0.797. See the multiplier-governance gate and the antimicrobial Sprint-1 forensics.

2.5 Data Sources

Stage costs and durations in this backtest are PhaseFolio-authored scenario assumptions. They were set manually, are not benchmark-derived, and should not be read as observed rheumatoid-arthritis cohort averages. WACC is set at 10% (industry standard per Damodaran). Peak revenue estimates are sourced from analyst consensus at the decision date. All figures are expressed in nominal USD at the decision date.
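For orientation, the inputs above enter a risk-adjusted NPV roughly as follows. This is the standard textbook form, not a statement of PhaseFolio's exact implementation; the symbols $CF_t$, $P_t$, $p_k$ and $r$ are introduced here for illustration.

```latex
\mathrm{rNPV} \;=\; \sum_{t=0}^{T} \frac{P_t \, CF_t}{(1 + r)^{t}},
\qquad
P_t \;=\; \prod_{k \,:\, \text{stage } k \text{ completed before } t} p_k
```

Here $CF_t$ is the net cash flow in year $t$ (negative stage costs during development, positive revenue after launch), $p_k$ is the transition probability of stage $k$, $P_t$ is the cumulative probability that the program is still alive when $CF_t$ occurs, and $r$ is the discount rate (the 10% WACC stated above).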

2.6 Confidence Tiers

HIGH: PoS benchmarks and the WACC reference come from published sources.
MEDIUM: Competitive density counts and target validation status are manually curated from FDA/CT.gov data.
LOW: Stage costs are manually set PhaseFolio assumptions, while peak revenue relies on analyst consensus; neither is an observed RA cohort benchmark.

3. Data Enrichment Pipeline

3.1 Why Raw CT.gov Data Is Insufficient

ClinicalTrials.gov provides structured trial metadata (phase, status, enrollment, dates), but lacks the drug-level fields critical for computing transition rates: drug class, mechanism of action, molecular target, modality, published efficacy data, and FDA regulatory linkage. Intervention names are inconsistent ("Adalimumab" vs "adalimumab" vs "Humira"), and there is no way to determine which trials belong to the same drug program without domain knowledge.

3.2 Raw Data Scope

Data Source	Rows	Key Fields
ClinicalTrials.gov studies	192,411	NCT identifier, phase, recruitment status, study type, enrollment, dates
Trial condition mappings	420,940	NCT identifier, raw condition text, normalized indication
Trial intervention records	424,618	NCT identifier, intervention type, intervention name, normalized modality
FDA application records	6,309	application number, first approval date, normalized indication
FDA–trial cross-links	1,879	application number, NCT identifier, link method

Filtering for RA (condition text matching "rheumatoid arthritis") identified 1,304 unique interventional trials across all phases.

3.3 9-Phase Enrichment Process

Each trial was enriched through a systematic, multi-tier process designed to maximize data quality while preventing hallucination.

Discovery & Scoping

Profile the trial universe: count by phase/status, identify top drugs and drug classes. For RA: 1,304 trials, hundreds of unique interventions.

Initial Ingestion

ClinicalTrials.gov studies loaded into the enrichment corpus with base CT.gov fields (NCT identifier, phase, status, enrollment, dates, sponsor). Starting confidence score: 0.20.

Tier 1 — Bulk Clinical Enrichment

Drug name consolidation (e.g., “Humira” → “Adalimumab” using INN standard). Primary endpoint extraction from CT.gov outcome measures. Trial duration calculation.

Tier 2 — Drug-Class Knowledge Enrichment

Most intensive phase. Batched by drug class (Anti-TNF first with ~180 trials, then JAK ~120, IL-6, Anti-CD20, etc.). For each drug: drug class, mechanism of action, molecular target, modality, route of administration, dosing regimen. For each trial: comparator, control type, line of therapy, patient population, combination therapy. 32 drug classes identified and consolidated.

Tier 3 — Published Outcomes & Efficacy

Terminated/withdrawn trials: automated from CT.gov’s stop-reason field. Phase 3 pivotal trials: manually mapped from published literature (ARMADA, RAPID, OPTION, ATTRACT, etc.). Extension studies and regional registration trials: batch-processed by title patterns. Strict anti-hallucination rules enforced.

Drug Commercial Profiles

19 drug profiles created with peak revenue, patent expiry, biosimilar status, line-of-therapy positioning. Held in a separate commercial-profile dataset to avoid redundancy (one drug can have dozens of trials).

Cross-Source Backfill

FDA application IDs and approval dates linked via the FDA-trial cross-link set. Patent and exclusivity data from the FDA Orange Book.

Outcome Summary Completion

Active/recruiting trials receive status-based summaries. Unknown-status trials receive generic summaries. Target: 100% outcome-summary coverage.

Verification & Anti-Hallucination Checks

Random sample spot checks (10-20 trials per batch). Drug class distribution sanity checks. Cross-reference FDA approval dates against known dates. Verify no future information leakage into outcome data.

3.4 Four Data Sources Per Trial

Source	Data Provided	Confidence
ClinicalTrials.gov	Phase, status, enrollment, dates, sponsor, structured fields	High
FDA Drugs@FDA	Application numbers, approval dates, regulatory status	High
PubMed	Efficacy data, outcome summaries, safety findings	Medium
Web Search	Press releases, analyst reports, pipeline updates	Low

Confidence score = weighted coverage across sources (0–1 scale). All 679 RA trials achieved "full" enrichment level (4 sources consulted).

3.5 Survivorship Bias Verification

Of the 1,304 raw RA trials, 625 were not enriched because they lacked drug-level metadata (non-drug interventions, unmappable entries, duplicate substudies). To verify this filtering was outcome-agnostic, we compared completion-to-termination ratios:

Phase	Raw Completion Rate	Enriched Completion Rate	Difference
Phase 1	88.3% (166/188)	87.8% (79/90)	-0.5pp
Phase 2	77.8% (242/311)	77.3% (102/132)	-0.5pp
Phase 3	91.6% (285/311)	91.7% (232/253)	+0.1pp
Phase 4	85.1% (149/175)	83.1% (108/130)	-2.0pp

No evidence of survivorship bias. Completion rates in the raw and enriched datasets are within about 2pp of each other at every phase, and within about 0.5pp for Phases 1 to 3. This indicates that the enrichment process removed trials by data availability, not by outcome.
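The comparison in the table can be recomputed directly from the counts. The sketch below copies the numerators and denominators from the table; the variable names are illustrative.

```python
# (completed, denominator) per phase, copied from the table above
raw = {"Phase 1": (166, 188), "Phase 2": (242, 311),
       "Phase 3": (285, 311), "Phase 4": (149, 175)}
enriched = {"Phase 1": (79, 90), "Phase 2": (102, 132),
            "Phase 3": (232, 253), "Phase 4": (108, 130)}

def completion_rate(completed: int, total: int) -> float:
    """Completion rate in percent."""
    return 100 * completed / total

diffs = {}
for phase in raw:
    r = completion_rate(*raw[phase])
    e = completion_rate(*enriched[phase])
    diffs[phase] = e - r  # enriched minus raw, in percentage points
    print(f"{phase}: raw {r:.1f}%  enriched {e:.1f}%  diff {diffs[phase]:+.1f}pp")

max_abs_diff = max(abs(d) for d in diffs.values())
```

Computed from unrounded rates, the Phase 4 difference is about -2.1pp; the table's -2.0pp comes from subtracting the rounded percentages.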

3.6 Final Dataset

Metric	Value
Enriched RA trials	679
Distinct drugs	71
Drug classes	32
Columns per trial	45
Outcome summary coverage	100%
Drug class / MoA / target coverage	99.9%
FDA linkage	73%
Patent data	68%
Quantitative efficacy data	55%
Drug-level transitions: P1→P2	37 drugs
Drug-level transitions: P2→P3	50 drugs
Drug-level transitions: P3→Approval	35 drugs

4. Drug Cohort

4.1 Approved Drugs

Drug	Class	Sponsor	Decision Date	Decision Phase	FDA Approval
Adalimumab	TNF inhibitor	Abbott/AbbVie	Jan 1999	Phase 2	Dec 2002
Etanercept	TNF inhibitor	Immunex/Amgen	Jan 1996	Phase 2	Nov 1998
Rituximab	CD20 mAb	Genentech/Roche	Jan 2002	Phase 2	Feb 2006
Abatacept	CTLA-4 fusion	BMS	Jan 2002	Phase 2	Dec 2005
Tofacitinib	JAK inhibitor	Pfizer	Jan 2009	Phase 2	Nov 2012
Baricitinib	JAK inhibitor	Lilly/Incyte	Jan 2013	Phase 2	Jun 2018
Sarilumab	IL-6R mAb	Sanofi/Regeneron	Jan 2013	Phase 2	May 2017
Upadacitinib	JAK inhibitor	AbbVie	Jan 2016	Phase 2	Aug 2019

4.2 Failed Drugs

Drug	Class	Sponsor	Decision Date	Decision Phase	Failure Stage
Atacicept	BAFF/APRIL inhibitor	Merck Serono	Jan 2008	Phase 1	Phase 2 terminated
Tabalumab	BAFF mAb	Lilly	Jan 2012	Phase 2	Phase 3 failed
Fostamatinib	SYK inhibitor	Rigel	Jan 2010	Phase 2	Phase 3 failed
Ocrelizumab	CD20 mAb	Roche/Genentech	Jan 2007	Phase 2	Phase 3 terminated
Decernotinib	JAK3 inhibitor	Vertex	Jan 2014	Phase 2	Phase 3 not initiated
Vobarilizumab	IL-6R nanobody	Ablynx	Jan 2015	Phase 2	Phase 3 not initiated
Filgotinib	JAK1 inhibitor	Gilead/Galapagos	Jan 2019	Phase 3	FDA rejected
Peficitinib	JAK inhibitor	Astellas	Jan 2016	Phase 3	Not filed in US

4.3 Selection Rationale

Drugs were selected to span the full history of RA targeted therapy (1996-2019), covering multiple modalities (small molecule, monoclonal antibody, fusion protein, nanobody) and mechanisms (TNF, IL-6, JAK, CD20, BAFF, SYK, CTLA-4). The 8/8 approved/failed split balances the two outcome classes. All decision points sit at Phase 2 or later in RA, except atacicept, which is evaluated at Phase 1, so each drug has sufficient clinical data for reconstruction.

5. Results Summary

Drug	Outcome	Decision Phase	PoS	rNPV	MC P50	Risk Flags	Correct?
Adalimumab	Approved	Phase 2	57.8%	$573M	$1.0B	NOVEL_MODALITY LIMITED_TRIAL_DATA	Yes
Etanercept	Approved	Phase 2	44.4%	$272M	-$67M	FIRST_IN_CLASS NOVEL_MODALITY LIMITED_TRIAL_DATA	Yes
Ocrelizumab	Failed	Phase 2	39.5%	$1.5B	-$116M	LIMITED_TRIAL_DATA SAFETY_CLASS_SIGNAL	No
Filgotinib	Failed	Phase 3	39.3%	$2.3B	-$29M	HIGH_COMPETITION NOVEL_MODALITY SAFETY_CLASS_SIGNAL	No
Rituximab	Approved	Phase 2	36.3%	$2.0B	-$102M	FIRST_IN_CLASS NOVEL_MODALITY	Yes
Sarilumab	Approved	Phase 2	31.6%	$689M	-$138M	(none)	Yes
Peficitinib	Failed	Phase 3	27.3%	$253M	-$28M	HIGH_COMPETITION NOVEL_MODALITY SAFETY_CLASS_SIGNAL	No
Fostamatinib	Failed	Phase 2	26.6%	$306M	-$134M	FIRST_IN_CLASS NOVEL_MODALITY	No
Abatacept	Approved	Phase 2	25.7%	$349M	-$120M	FIRST_IN_CLASS NOVEL_MODALITY LIMITED_TRIAL_DATA	Yes
Tabalumab	Failed	Phase 2	25.1%	$495M	-$161M	FIRST_IN_CLASS	No
Tofacitinib	Approved	Phase 2	25.0%	$556M	-$144M	FIRST_IN_CLASS NOVEL_MODALITY LIMITED_TRIAL_DATA	Yes
Baricitinib	Approved	Phase 2	24.4%	$303M	-$161M	NOVEL_MODALITY SAFETY_CLASS_SIGNAL	Yes
Decernotinib	Failed	Phase 2	13.7%	$116M	-$178M	NOVEL_MODALITY SAFETY_CLASS_SIGNAL	No
Upadacitinib	Approved	Phase 2	13.4%	$697M	-$199M	HIGH_COMPETITION NOVEL_MODALITY SAFETY_CLASS_SIGNAL	Yes
Vobarilizumab	Failed	Phase 2	11.7%	$103M	-$159M	FIRST_IN_CLASS NOVEL_MODALITY	No
Atacicept	Failed	Phase 1	7.9%	$6M	-$88M	FIRST_IN_CLASS NOVEL_MODALITY LIMITED_TRIAL_DATA	No

Note: "Correct?" records whether the rNPV sign matches the outcome. Every drug in the cohort has a positive rNPV, so the column reduces to "approved" and carries no discriminative information on its own. The discrimination is in the PoS ranking, not the rNPV sign, which is why the AUC metrics in §6 are the primary measures.

6. Aggregate Accuracy Metrics

Metric	Score	Target	Result
Pairwise AUC	0.625 (40/64 pairs)	0.60	Pass
Phase-Controlled AUC	0.65	0.55	Pass
Separation Gap	+8.4pp (32.3% vs 23.9%)	10pp	Fail
Risk Flag Sensitivity	87.5% (7/8)	70%	Pass
Risk Flag Enrichment	1.0 (2.3 vs 2.3)	>1.0	Fail
Directional Accuracy	62.5% (40/64)	60%	Pass
False Confidence (≥25%)	50.0% (5/10)	<20%	Fail
False Confidence (≥60%)	n/a (0/0)	<20%	Pass (no drug reached 60% PoS)
Best Threshold Accuracy	62.5% at PoS 30%	--	--

The pairwise AUC of 0.625 is the headline discrimination metric: it passes the 0.60 target and measures the probability that a randomly chosen eventual success carries a higher PoS than a randomly chosen failure. The phase-controlled AUC of 0.65 indicates the signal holds within decision phase, removing the structural advantage that later-stage decisions have over earlier ones (fewer remaining stages = mechanically higher cumulative PoS). In this cohort it rests entirely on Phase 2 pairs, because the Phase 1 and Phase 3 decision points contain only failures. At n=16 this is an early directional signal: the Wilson 95% accuracy intervals on the conventional and optimal cuts both include chance-level performance, so discrimination is suggestive, not confirmatory.

The honest counterweight: calibration and separation are weak at this sample size. The separation gap between success and failure means is only +8.4pp against a 10pp target, and the false-confidence rate at the 25% PoS cut is 50% (5 of 10 above-threshold calls were failures) against a 20% target. Both fail. The story is discrimination passing while calibration lags — the expected shape for a directionally sound model that needs a larger, multi-indication cohort before its absolute PoS levels can be trusted.
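The discrimination and separation figures can be recomputed from the PoS column of the §5 results table. In the sketch below, "phase-controlled" is taken to mean that only pairs sharing a decision phase are counted; the report does not spell out its definition, so treat that as an assumption (it does reproduce the reported 0.65).

```python
from itertools import product

# (PoS %, decision phase) copied from the §5 results table
approved = [(57.8, 2), (44.4, 2), (36.3, 2), (31.6, 2),
            (25.7, 2), (25.0, 2), (24.4, 2), (13.4, 2)]
failed = [(39.5, 2), (39.3, 3), (27.3, 3), (26.6, 2),
          (25.1, 2), (13.7, 2), (11.7, 2), (7.9, 1)]

def pairwise_auc(pos_list, neg_list, same_phase_only=False):
    """Share of (approved, failed) pairs where the approved drug ranks higher.

    Ties count as half a win. Returns (auc, wins, pairs).
    """
    wins, pairs = 0.0, 0
    for (p_a, ph_a), (p_f, ph_f) in product(pos_list, neg_list):
        if same_phase_only and ph_a != ph_f:
            continue
        pairs += 1
        wins += 1.0 if p_a > p_f else 0.5 if p_a == p_f else 0.0
    return wins / pairs, wins, pairs

auc, wins, pairs = pairwise_auc(approved, failed)
pc_auc, pc_wins, pc_pairs = pairwise_auc(approved, failed, same_phase_only=True)

# Separation gap: mean PoS of approvals minus mean PoS of failures, in pp
gap = sum(p for p, _ in approved) / len(approved) - sum(p for p, _ in failed) / len(failed)
```

On these inputs the pairwise count is 40 of 64 pairs, the same-phase count is 26 of 40 pairs (all Phase 2), and the gap is about 8.4pp.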

Go/No-Go Threshold Analysis

PoS Cutoff	Accuracy	Precision	Recall	TP	TN	FP	FN
30.0% (best)	62.5%	66.7%	50.0%	4	6	2	4
35.0%	56.3%	60.0%	37.5%	3	6	2	5
40.0%	62.5%	100.0%	25.0%	2	8	0	6
45.0%	56.3%	100.0%	12.5%	1	8	0	7
50.0%	56.3%	100.0%	12.5%	1	8	0	7
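The sweep above can be reproduced from the PoS values and outcomes in the §5 results table. This sketch assumes a "go" call whenever PoS is at or above the cutoff; the function name `confusion_at` is illustrative.

```python
# (PoS %, approved?) copied from the §5 results table
cohort = [(57.8, True), (44.4, True), (39.5, False), (39.3, False),
          (36.3, True), (31.6, True), (27.3, False), (26.6, False),
          (25.7, True), (25.1, False), (25.0, True), (24.4, True),
          (13.7, False), (13.4, True), (11.7, False), (7.9, False)]

def confusion_at(cutoff):
    """Call 'go' when PoS >= cutoff; return (TP, TN, FP, FN)."""
    tp = sum(1 for pos, ok in cohort if pos >= cutoff and ok)
    fp = sum(1 for pos, ok in cohort if pos >= cutoff and not ok)
    fn = sum(1 for pos, ok in cohort if pos < cutoff and ok)
    tn = sum(1 for pos, ok in cohort if pos < cutoff and not ok)
    return tp, tn, fp, fn

for cutoff in (30, 35, 40, 45, 50):
    tp, tn, fp, fn = confusion_at(cutoff)
    accuracy = (tp + tn) / len(cohort)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn)
    print(f"{cutoff}%: acc {accuracy:.1%}  prec {precision:.1%}  "
          f"rec {recall:.1%}  TP={tp} TN={tn} FP={fp} FN={fn}")
```

The 30% and 40% cutoffs tie on accuracy at 62.5%; the 30% cutoff recovers twice as many approvals (recall 50.0% versus 25.0%), at the cost of two false positives.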

7. Case Study: Atacicept (Model's Strongest Signal)

Atacicept
BAFF/APRIL inhibitor · Merck Serono · Decision: January 2008
Outcome: Failed
PoS: 7.9% · rNPV: $6M · MC P50: -$88M · P(negative): 92.1%

Atacicept received the lowest PoS in the cohort (7.9%) with 3 risk flags and a 0.60x target validation multiplier (no prior BAFF/APRIL approvals in RA). The Monte Carlo distribution heavily skewed negative: P10 = -$244M, P90 = -$31M, with 92.1% probability of negative outcome.

Outcome: Phase 2 terminated due to severe immunoglobulin reduction and fatal infections. The model correctly identified this as the highest-risk drug in the cohort.

Why this works: Atacicept combined an unvalidated mechanism (0.60x), a novel modality with no RA track record, limited trial data, and an early decision phase (Phase 1). Every signal aligned in the same direction — the model's conviction matched reality.

8. Case Study: Filgotinib (Model's Edge Case)

Filgotinib
JAK1-selective · Gilead/Galapagos · Decision: January 2019 (Phase 3)
Outcome: Failed
PoS: 39.3% · rNPV: $2.3B · MC P50: -$29M

Filgotinib carried one of the highest PoS values (39.3%) among the failed drugs. The model flagged HIGH_COMPETITION and SAFETY_CLASS_SIGNAL, but the 39% PoS — driven by the validated JAK pathway (tofacitinib and baricitinib already approved) — placed it above several successful drugs in the ranking.

Outcome: FDA rejected over testicular toxicity concerns — a drug-specific safety signal that class-level modeling cannot capture. The SAFETY_CLASS_SIGNAL flag was present (reflecting the JAK class's known cardiovascular and thrombotic risks), but the specific reproductive toxicity was unique to filgotinib.

Model limitation: Class-level safety flags capture systemic risks (e.g., JAK inhibitors and cardiovascular events), but drug-specific toxicities remain outside the model's scope. This is inherent to any model that operates at the mechanism level rather than the molecule level.

9. Computed Transition Rates

A central methodological choice in this back-test is replacing static BIO/QLS NDA/BLA transition rates with rates computed from the enrichment corpus. This is not a refinement — it is a fundamentally different measurement.

Two Different Questions

Source	NDA/BLA Rate	What It Measures
BIO/QLS 2021	91%	"Given filing, did NDA succeed?" (regulatory rubber-stamp rate)
Computed (enrichment corpus)	~42%	"Given Phase 3, did drug get FDA approval?" (real-world outcome rate)

The BIO/QLS rate of 91% measures a near-certainty: once a company files an NDA, it almost always gets approved. But the investment decision happens before filing — often years before. The relevant question is whether a drug in Phase 3 will ever reach and pass the NDA stage. Many drugs complete Phase 3 but never file (commercial viability, safety signals, competitive landscape shifts). The computed rate captures this full attrition.

Combined with drug-level counting (tracking individual drugs across phases, not trial counts) and time-gating (only using data available at decision date), this is a central source of the model's discriminative signal — pairwise AUC 0.625 and phase-controlled AUC 0.65.

Production status (as of this writing): the computed indication-specific transition rates described in this section are a research approach. Current production uses static BIO/QLS 2021 base rates.

Enriched trials data: 679 trials, 71 drugs, 45 structured columns. Drug-level transitions: P1 to P2 (37 drugs), P2 to P3 (50 drugs), P3 to Approval (35 drugs). 3-tier fallback: drug-class (n>=5) then RA-overall then BIO/QLS 2021.
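Drug-level counting, time-gating and the 3-tier fallback can be sketched together. Everything here is illustrative: the `DrugRecord` fields are hypothetical, the same n>=5 threshold is assumed for the RA-overall tier, and programs still unresolved at the decision date are counted as not advanced, which is a simplification.

```python
from datetime import date
from typing import NamedTuple, Optional

class DrugRecord(NamedTuple):
    name: str
    drug_class: str
    reached_phase3: Optional[date]  # date the drug entered Phase 3, if ever
    approved: Optional[date]        # FDA approval date, if ever

def transition_rate(records, as_of):
    """Drug-level P3 -> Approval rate using only events dated before `as_of`.

    Returns (rate or None, number of drugs at risk).
    """
    at_risk = [r for r in records if r.reached_phase3 and r.reached_phase3 < as_of]
    if not at_risk:
        return None, 0
    advanced = sum(1 for r in at_risk if r.approved and r.approved < as_of)
    return advanced / len(at_risk), len(at_risk)

def pos_with_fallback(records, drug_class, as_of, benchmark, min_n=5):
    """3-tier fallback: drug class (n >= min_n), then RA overall, then benchmark."""
    in_class = [r for r in records if r.drug_class == drug_class]
    rate, n = transition_rate(in_class, as_of)
    if n >= min_n:
        return rate, "drug-class"
    rate, n = transition_rate(records, as_of)
    if n >= min_n:
        return rate, "RA-overall"
    return benchmark, "benchmark"
```

Because each drug is counted once regardless of how many trials it ran, a program with a dozen Phase 3 studies carries the same weight as one with a single pivotal trial. The `as_of` filter is what keeps approvals dated after the decision point out of the rate.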

10. Calibration

PoS Bucket	Drugs	Predicted Midpoint	Actual Success Rate	Gap
0-15%	4	7.5%	25.0%	17.5pp
15-30%	6	22.5%	50.0%	27.5pp
30-50%	5	40.0%	60.0%	20.0pp
50%+	1	75.0%	100.0%	25.0pp

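With 16 drugs, calibration buckets are sparse. In every bucket the actual success rate exceeds the predicted midpoint (by 17.5pp to 27.5pp), so the model underestimates PoS across the range, which is consistent with a conservative model. Part of that gap may reflect cohort construction: the 8/8 split fixes the cohort's overall success rate at 50%. Cross-indication expansion will improve statistical power.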
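The bucket table can be rebuilt from the §5 results. This sketch follows the table in comparing each bucket's actual success rate with the bucket midpoint; the names `BUCKETS` and `calibration_rows` are illustrative.

```python
# (PoS %, approved?) copied from the §5 results table
cohort = [(57.8, True), (44.4, True), (39.5, False), (39.3, False),
          (36.3, True), (31.6, True), (27.3, False), (26.6, False),
          (25.7, True), (25.1, False), (25.0, True), (24.4, True),
          (13.7, False), (13.4, True), (11.7, False), (7.9, False)]
BUCKETS = [(0, 15), (15, 30), (30, 50), (50, 100)]

def calibration_rows(cohort, buckets):
    """Per bucket: (label, n drugs, predicted midpoint %, actual success %, gap pp)."""
    rows = []
    for lo, hi in buckets:
        outcomes = [ok for pos, ok in cohort if lo <= pos < hi]
        if not outcomes:
            continue  # skip empty buckets rather than divide by zero
        midpoint = (lo + hi) / 2
        actual = 100 * sum(outcomes) / len(outcomes)
        rows.append((f"{lo}-{hi}%", len(outcomes), midpoint, actual, actual - midpoint))
    return rows

rows = calibration_rows(cohort, BUCKETS)
for label, n, midpoint, actual, gap in rows:
    print(f"{label}: n={n}  predicted {midpoint:.1f}%  actual {actual:.1f}%  gap {gap:+.1f}pp")
```

Using the bucket midpoint as the prediction follows the table; the mean predicted PoS of the drugs in each bucket is a common alternative and gives a different gap when a bucket is wide or thinly populated.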

11. Limitations

Sample size (n=16) — This is a proof of concept, not a powered validation study. Statistical significance requires cross-indication expansion.
Single indication (RA only) — Results may not generalize to oncology, rare disease, or CNS indications where PoS dynamics differ substantially.
Cost/revenue estimates are manual — Stage costs are PhaseFolio-authored scenario assumptions and peak revenue relies on analyst consensus, introducing subjectivity.
Class-level safety, not drug-level — The SAFETY_CLASS_SIGNAL flag captures mechanism-level risks but cannot detect molecule-specific toxicities (see: filgotinib).
Competitive density is count-based — The model counts competitors but does not assess differentiation, market positioning, or pricing dynamics.
Phase 3 cohort has only failures — Both Phase 3 decision-point drugs (filgotinib, peficitinib) failed, preventing within-phase discrimination testing at Phase 3.
No detectable survivorship bias in data — Checked: completion rates in the raw 1,304 and enriched 679 trial sets are within about 2pp of each other at every phase, which argues against systematic exclusion of failed trials.

12. Next Steps

Cross-indication expansion — Repeat the back-test for oncology (lung, breast), rare disease, and CNS cohorts. Target: n>=50 drugs across 4+ indications.
Drug commercial profiles — Integrate commercial-profile data (peak revenue, LOE dates, biosimilar entry) for automated revenue estimation.
Molecule-level safety signals — Incorporate FDA adverse event data (FAERS) to supplement class-level safety flags with drug-specific signal detection.
Prospective validation — Identify 10-15 drugs currently in Phase 2/3 and track model predictions against real-world outcomes over 3-5 years.
Calibration improvement — Apply Platt scaling or isotonic regression to recalibrate PoS outputs once cross-indication data provides sufficient sample size.
Competitive landscape integration — Replace count-based competitor density with the CT.gov landscape data (trial velocity, enrollment rates, phase distribution).