SEC EDGAR Deal-Term Extraction
This is methodology@2026-05-25 as it was on 2026-05-25: the immutable record under which any signed export stamped methodology@2026-05-25 was computed. It is intentionally the citation prose, not the current presentation, and it will never change. The methodology has since advanced (current is methodology@2026-06-17).
We extract structured deal terms (upfront, milestones, royalty %) from public SEC filings, specifically 8-K Item 1.01 (Entry into a Material Definitive Agreement), for a curated cohort of public biotech sponsors. v0 ships ingestion, extraction, and storage with full prompt/classifier version stamping and source-quote capture; 10-K Exhibit 10.x is scoped for v1. Every extracted field carries a confidence score; values below 0.7 are flagged for human review.
1. What this section covers
PhaseFolio ingests deal-term data from SEC EDGAR for a curated cohort of public biotech sponsors. The extracted fields (upfront cash, near-term and total milestones, royalty rate range, equity, counterparty, asset, indication, effective date, territory, exclusivity) will feed comparable-deal benchmarking inside the rNPV engine and the diligence dossier. This page documents how that pipeline works at v0, what it does and does not claim, and how every model-driven step is versioned.
v0 surface posture. Customer-facing surfacing (dossier consumption, share-link visibility) is pending counsel sign-off of the legal-posture memo. This methodology page is published now as the substrate trust artifact; the data behind it remains internal at v0.
2. Data source
SEC EDGAR, accessed via the SEC's official endpoints (data.sec.gov and www.sec.gov). The SEC expressly permits free access to and reuse of EDGAR public filing content; PhaseFolio extracts factual deal terms from these public filings. The SEC's fair-access policy requires a descriptive User-Agent containing a contact email and compliance with a rate cap of 10 requests per second; the pipeline implements both.
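The two fair-access requirements can be illustrated with a minimal sketch. This is not the pipeline's actual client: `RateLimiter`, `build_headers`, and `fetch` are hypothetical names, and the User-Agent layout (application name followed by a contact email) is an assumed format.

```python
import time
import urllib.request

MAX_REQUESTS_PER_SECOND = 10  # rate cap stated in the text above


class RateLimiter:
    """Spaces consecutive calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second=MAX_REQUESTS_PER_SECOND,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock      # injectable so the limiter can be tested offline
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed, then record its time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def build_headers(app_name, contact_email):
    """Descriptive User-Agent carrying a contact email (assumed layout)."""
    return {"User-Agent": f"{app_name} {contact_email}"}


def fetch(url, limiter, headers):
    """Fetch one URL while honoring the limiter. Performs real network I/O."""
    limiter.wait()
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

Spacing every request by a fixed minimum interval is deliberately conservative: it never bursts, so it stays under the cap even when many sponsors are fetched back to back.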
No AGPL- or other copyleft-licensed code sits in the ingestion path: the pipeline does not depend on the edgartools or sec-edgar-mcp projects.
3. Cohort (v0)
v0 ingests the AMR-indication biotech sponsor cohort — approximately 23 public companies resolved from the enriched antimicrobial trial dataset. The cohort is the union of trial sponsors whose programs surface in PhaseFolio's enriched AMR trials table. NSCLC and RA cohort expansion is scoped for v1.
4. Forms covered (v0 vs v1)
- v0 (this release): Form 8-K Item 1.01 ("Entry into a Material Definitive Agreement") filed on or after 2020-01-01. These are the press-release-grade summaries that announce a license, collaboration, or asset-purchase agreement.
- v1 (deferred): Form 10-K Exhibit 10.x license agreement bodies. Exhibits 10.x contain the operative contract text (and any redactions permitted under Reg S-K Item 601). They will warrant a supplemental counsel review at v1 because of the different content shape, redaction handling, and longer document budget.
- v1 (deferred): Form 4 (insider transactions) and XBRL financial fact extraction for comparator financials.
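The v0 selection rule in the first bullet can be sketched as a filter over a sponsor's submissions JSON. The layout assumed here (parallel arrays under `filings.recent` named `accessionNumber`, `form`, `filingDate`, and `items`) reflects our understanding of the data.sec.gov response and should be checked against a live payload. `select_v0_filings` is a hypothetical helper, and excluding amendments (8-K/A) is an assumption this page does not confirm.

```python
from datetime import date

V0_START_DATE = date(2020, 1, 1)  # "filed on or after 2020-01-01"


def select_v0_filings(submissions):
    """Return 8-K filings that list Item 1.01 and were filed on or after the start date."""
    recent = submissions["filings"]["recent"]
    rows = zip(recent["accessionNumber"], recent["form"],
               recent["filingDate"], recent["items"])
    selected = []
    for accession, form, filing_date, items in rows:
        if form != "8-K":  # assumption: amendments (8-K/A) are not in scope
            continue
        if date.fromisoformat(filing_date) < V0_START_DATE:
            continue
        # `items` is assumed to be a comma-separated string such as "1.01,9.01".
        item_codes = {code.strip() for code in items.split(",")}
        if "1.01" not in item_codes:
            continue
        selected.append({"accession": accession, "filing_date": filing_date})
    return selected
```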
5. Extraction pipeline
- Per-CIK submissions fetch. For each cohort sponsor, the pipeline pulls the SEC submissions JSON and filters to 8-K filings on or after 2020-01-01.
- Cheap regex classifier. Each candidate filing is passed through a deterministic classifier (classifier@2026-05-25-v1) that asks "does this 8-K describe a license / collaboration / asset-purchase deal at all?" This fast pre-filter drops obvious non-deal Item 1.01 filings before any LLM cost is incurred.
- LLM-assisted structured extraction. Filings that pass the classifier are passed to a versioned extractor prompt (deal_terms_extractor@2026-05-25-v1) running on Anthropic Claude with prompt caching. The prompt is split into a stable cached system block and a per-filing user block so that the cached portion does not double-send across calls.
- Provenance capture. Every extraction stores:
  - The extracted field values (upfront, milestones, royalty range, etc.).
  - A source quote of at most 500 characters from the filing.
  - The source section (8-K Item 1.01).
  - The extractor prompt version.
  - The model-self-reported confidence score per field.
- Confidence gating. Extractions with confidence below the threshold (NEEDS_REVIEW_CONFIDENCE_THRESHOLD = 0.7) are flagged for human review. The first cohort run includes manual QA on all extractions per the v0 pilot-run report; the threshold may be re-calibrated and the change recorded in the Decision Log.
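The pre-filter, provenance record, and confidence gate described above can be sketched together. This is an illustration under stated assumptions, not the production code: the regex is a hypothetical stand-in for the unpublished rule set, and `Extraction`, `looks_like_deal`, and the field names are invented for the example. Only the version stamps, the 500-character quote limit, and the 0.7 threshold come from this page.

```python
import re
from dataclasses import dataclass

EXTRACTOR_VERSION = "deal_terms_extractor@2026-05-25-v1"
NEEDS_REVIEW_CONFIDENCE_THRESHOLD = 0.7
MAX_QUOTE_CHARS = 500

# Hypothetical pattern; the real classifier rules are not published here.
DEAL_PATTERN = re.compile(
    r"\b(licen[cs]e|collaboration|asset purchase)\s+agreement\b",
    re.IGNORECASE,
)


def looks_like_deal(text):
    """Cheap deterministic pre-filter run before any LLM call."""
    return DEAL_PATTERN.search(text) is not None


@dataclass
class Extraction:
    values: dict       # field name -> extracted value (None when not disclosed)
    confidence: dict   # field name -> model-self-reported score in [0, 1]
    source_quote: str  # at most 500 characters from the filing
    source_section: str = "8-K Item 1.01"
    extractor_version: str = EXTRACTOR_VERSION

    def __post_init__(self):
        if len(self.source_quote) > MAX_QUOTE_CHARS:
            raise ValueError("source quote exceeds 500 characters")

    def fields_needing_review(self):
        """Names of fields whose confidence is strictly below the threshold."""
        return sorted(
            name for name, score in self.confidence.items()
            if score < NEEDS_REVIEW_CONFIDENCE_THRESHOLD
        )
```

Rejecting an over-long quote at construction time, rather than silently truncating it, keeps the stored evidence identical to what the extractor returned.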
6. Fields extracted
- Upfront cash (USD).
- Near-term milestones — sum of development and regulatory milestones (USD).
- Total milestones — sum of dev + reg + commercial milestones (USD).
- Royalty rate range — low % and high %; royalty tiers when disclosed.
- Equity component when present.
- Counterparty (the deal partner; not the filing CIK).
- Asset / program being licensed.
- Indication (PhaseFolio taxonomy when mappable; verbatim otherwise).
- Effective date, territory, exclusivity.
Qualitative royalty bands are mapped to numeric industry anchors ("low single" → 3%, "mid teens" → 15%, etc.). The anchors approximate the deal-term tables in BIO Industry Analysis 2021 and Tufts CSDD. Re-anchoring requires an EXTRACTOR_VERSION bump.
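As a sketch of how such a band lookup might work, the table below holds only the two anchors quoted on this page; the remaining bands are deliberately left out rather than guessed. `anchor_for_band` and its phrase normalization (case, hyphens, a trailing "digit" or "digits") are assumptions made for illustration.

```python
# Only the two anchors stated in the text; other bands are intentionally omitted.
ROYALTY_BAND_ANCHORS_PCT = {
    "low single": 3.0,
    "mid teens": 15.0,
}


def anchor_for_band(phrase):
    """Map a qualitative band phrase to its anchor percentage, or None if unknown."""
    words = phrase.lower().replace("-", " ").split()
    # Treat "low single digits" the same as "low single" (assumed normalization).
    if words and words[-1] in ("digit", "digits"):
        words = words[:-1]
    return ROYALTY_BAND_ANCHORS_PCT.get(" ".join(words))
```

Returning `None` for an unrecognized band, instead of a nearest guess, matches the page's stance that undisclosed numbers are never inferred.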
7. Storage
Extracted data lives in a private Supabase schema (sec) with four tables: filings (one row per fetched filing + its metadata), deal_terms (one row per extraction), ingestion_runs (one row per pipeline run), and cohort_seed (the resolved sponsor list). The raw filing HTML is retained inside our private database; no raw filing text is redistributed publicly.
8. Admin operations
Four admin endpoints (X-Admin-Key gated, per-IP rate-limited) drive the pipeline: seed-cohort, ingest, runs, and stats. They are not exposed to general users.
9. What we do not claim
- v0 is 8-K-only. 10-K Exhibit 10.x extraction is v1 scope. Press-release-grade 8-K summaries omit operative contract clauses (true-up provisions, anti-stacking, step-downs, change-of-control); deal terms drawn from 8-Ks should be read as headlines, not as the full economic shape of the deal.
- Redactions are not reconstructed. Reg S-K Item 601 permits issuers to redact specified terms in filed exhibits. Redacted fields surface as null in our schema; we never infer numbers that the issuer chose to redact.
- No raw text redistribution. A citation, an accession-number link to EDGAR, and a quote of at most 500 characters is the unit of evidence we expose — not the raw filing.
- LLM extraction can be wrong. Every field carries a confidence score; values below 0.7 are flagged for human review and excluded from any aggregate until reviewed.
- The cohort is curated, not exhaustive. v0 covers the AMR cohort only. Conclusions drawn from v0 generalize only within that cohort.
- Sponsor pagination cap. The default per-sponsor filing limit is 200; sponsors with more than 200 filings since 2020 may have older 8-Ks dropped at v0. Each drop is logged per sponsor, and the cap will be lifted as the pipeline matures.
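The pagination cap in the last bullet can be sketched as follows. This assumes the newest filings are the ones kept, which is how we read "older 8-Ks dropped"; `apply_filing_cap`, the logger name, and the `filing_date` key are hypothetical.

```python
import logging

DEFAULT_PER_SPONSOR_FILING_LIMIT = 200  # default cap stated in the text
logger = logging.getLogger("sec_ingest")  # hypothetical logger name


def apply_filing_cap(cik, filings, limit=DEFAULT_PER_SPONSOR_FILING_LIMIT):
    """Keep the newest `limit` filings and log how many older ones were dropped."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    newest_first = sorted(filings, key=lambda f: f["filing_date"], reverse=True)
    kept, dropped = newest_first[:limit], newest_first[limit:]
    if dropped:
        logger.warning(
            "CIK %s: dropped %d filings filed on or before %s (cap %d)",
            cik, len(dropped), dropped[0]["filing_date"], limit,
        )
    return kept, len(dropped)
```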
10. Versioning
This benchmark is governed by three version stamps that must move together when their underlying artifact changes:
- Dataset: sec_deal_terms@2026-Q2 (bucket 4, CalVer; see docs/versioning.md).
- Extractor prompt: deal_terms_extractor@2026-05-25-v1 (any prompt-text, schema, or rule change requires a bump).
- Classifier: classifier@2026-05-25-v1 (any rule-set change requires a bump).
Old extractions remain verifiable against the prompt version they were produced under: each row records its extractor_version, and because the raw filing HTML is retained (section 7), a filing can be re-extracted under a newer prompt without being re-fetched.
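One way to use the per-row stamp is to find rows produced under an older prompt. The sketch below assumes the `name@YYYY-MM-DD-vN` shape seen in the extractor and classifier stamps; the dataset stamp (`sec_deal_terms@2026-Q2`) follows a different scheme and is not handled. `parse_stamp` and `stale_rows` are hypothetical helpers, and rows are assumed to be dicts with an `extractor_version` key.

```python
import re
from datetime import date

STAMP_PATTERN = re.compile(
    r"^(?P<name>[a-z_]+)@(?P<day>\d{4}-\d{2}-\d{2})-v(?P<rev>\d+)$"
)


def parse_stamp(stamp):
    """Split 'name@YYYY-MM-DD-vN' into (name, date, revision)."""
    match = STAMP_PATTERN.match(stamp)
    if match is None:
        raise ValueError(f"unrecognized version stamp: {stamp!r}")
    return match["name"], date.fromisoformat(match["day"]), int(match["rev"])


def stale_rows(rows, current_stamp):
    """Rows whose extractor_version predates current_stamp for the same artifact."""
    name, day, rev = parse_stamp(current_stamp)
    stale = []
    for row in rows:
        row_name, row_day, row_rev = parse_stamp(row["extractor_version"])
        if row_name == name and (row_day, row_rev) < (day, rev):
            stale.append(row)
    return stale
```

Ordering by date first and revision second means a same-day `-v2` supersedes `-v1`, while any later date supersedes both.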
Key facts
| Forms covered (v0) | Form 8-K Item 1.01 only, filed on or after 2020-01-01. 10-K Exhibit 10.x = v1 scope. |
| Cohort (v0) | AMR-indication biotech sponsors (~23 companies resolved from enriched antimicrobial trial dataset). NSCLC + RA = v1. |
| Data source | SEC EDGAR (data.sec.gov + www.sec.gov), public SEC filings. SEC expressly permits free access/reuse of EDGAR public filing content; PhaseFolio extracts factual deal terms from these public filings. |
| License posture | No AGPL- or copyleft-licensed code in ingestion path. No dependency on edgartools or sec-edgar-mcp. |
| Extractor prompt | deal_terms_extractor@2026-05-25-v1 (Anthropic Claude, prompt-cached, function-split to avoid double-send). |
| Classifier | classifier@2026-05-25-v1 (cheap regex pre-filter to drop non-deal Item 1.01 filings before LLM cost). |
| Review threshold | NEEDS_REVIEW_CONFIDENCE_THRESHOLD = 0.7. First cohort runs are manually QA'd on all extractions per v0 pilot-run report. |
| Dataset version | sec_deal_terms@2026-Q2 (bucket-4 CalVer). |
| Customer-facing posture | v0 surface is INTERNAL only. Customer-facing surfacing (dossier consumption, share-link visibility) gated on counsel sign-off of the legal-posture memo. |
References
01 US Securities and Exchange Commission: EDGAR Full-Text Search and submissions JSON endpoints (data.sec.gov, www.sec.gov). Public SEC filing source.
02 SEC Fair Access Policy: User-Agent + 10 req/s rate cap.
03 Regulation S-K Item 601: Exhibits (including the redaction provisions for material agreements filed as Exhibits 10.x).
04 BIO Industry Analysis 2021: Clinical Development Success Rates and Contributing Factors.
05 Tufts Center for the Study of Drug Development: deal-term and royalty-rate tables.
Frozen snapshot · methodology version: methodology@2026-05-25 · Last updated: 2026-05-25