Deterministic Scoring Overview

What it is

Deterministic scoring converts extracted SEC filing sections into a 0–100 disclosure risk matrix using only:

  1. Text metrics — word-list ratios, specificity proxies, readability

  2. Section diffs — change vs prior comparable filing

  3. Boolean flags — phrase-pattern risk events

  4. Language deltas — tone ratio shifts vs prior section

No LLM. Fully reproducible given the same version strings and input text.
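As a rough illustration of what a word-list ratio, a language delta, and a phrase-pattern flag look like, here is a minimal sketch. The word list, the regular expression, and every function name below are hypothetical stand-ins chosen for this example; they are not the project's actual lists, patterns, or API.

```python
import re

# Hypothetical word list and phrase pattern, for illustration only.
UNCERTAINTY_WORDS = frozenset({"may", "could", "might", "uncertain", "approximately"})
GOING_CONCERN = re.compile(
    r"\bsubstantial doubt\b.{0,80}?\bgoing concern\b",
    re.IGNORECASE | re.DOTALL,
)


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_list_ratio(text, word_list):
    """Share of tokens that appear in word_list (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in word_list)
    return hits / len(tokens)


def ratio_delta(current_text, prior_text, word_list):
    """Language delta: current ratio minus the prior section's ratio."""
    return word_list_ratio(current_text, word_list) - word_list_ratio(prior_text, word_list)


def has_flag(text, pattern):
    """Boolean flag: True when the phrase pattern occurs anywhere in the text."""
    return pattern.search(text) is not None
```

Because each function depends only on its input text and fixed lists, the same input always yields the same output, which is the property the reproducibility claim rests on.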

What it is not

  • Not a buy/sell signal or investment advice

  • Not a substitute for reading the filing

  • Not full S&P 500 index coverage in empirical cohorts — see What This Does and Does Not Claim

Pipeline

flowchart TB
  ingest["Ingest (HTML or EDGAR)"]
  extract["extract_sections_from_html()"]
  metrics["compute_section_metrics()"]
  aggregate["aggregate_deterministic_matrix()"]
  output["ScoreResult JSON"]

  ingest --> extract
  extract --> metrics
  metrics --> aggregate
  aggregate --> output

  subgraph deterministic ["Deterministic stage"]
    metrics
  end

Text equivalent:

ingest (HTML or EDGAR)
    ↓
extract sections (Item 1A, MD&A, …)
    ↓
deterministic stage
  • text metrics (tone, boilerplate, specificity, …)
  • boolean risk flags
  • section diffs vs prior comparable filing
    ↓
aggregate
  • 9 weighted component scores (0–100)
  • overall disclosure risk score + confidence

CLI, Python SDK, HTTP API, and MCP all call this same pipeline.

How to read a score

Before diving into formulas, see Understanding Scores for an annotated JSON walkthrough, component plain-English names, and the score scale.

Artifact versions

Version strings appear in every score response. The canonical lists live in Versioning and Reproducibility and in the Changelog; they are not duplicated here.

Score scale

All component scores use 0–100. Higher values mean more disclosure risk or deterioration, except for specificity_quality_score (higher = better specificity).

Range     Interpretation
0–25      Low concern
26–50     Moderate
51–75     Elevated
76–100    High

Specificity inversion: Most components rise when language gets worse. specificity_quality_score is the opposite — a higher value means the filing is more specific (numbers, named entities, concrete detail). It is returned in components but is not part of the headline overall_disclosure_risk_score weights.
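The scale can be expressed as a small lookup. This is a sketch only: the function name is hypothetical, and treating each upper bound as inclusive for fractional scores (so 25.5 falls in "Moderate") is an assumption, since the ranges are stated for whole numbers. The bands describe risk-oriented components; they should not be applied to specificity_quality_score, where higher is better.

```python
def score_band(score):
    """Map a 0-100 risk-oriented component score to its interpretation band.

    Assumption: upper bounds are inclusive (25 -> "Low concern", 25.5 -> "Moderate").
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be within 0-100, got {score!r}")
    if score <= 25:
        return "Low concern"
    if score <= 50:
        return "Moderate"
    if score <= 75:
        return "Elevated"
    return "High"
```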

Component scores

Nine deterministic components feed the headline overall_disclosure_risk_score (weights in methodology/aggregation). specificity_quality_score is also returned but excluded from headline weights; see Score scale for its inversion.

Plain English                         JSON field                     Primary section(s)
Risk-factor tone & volatility         risk_factor_intensity_score    Item 1A
Year-over-year disclosure change      disclosure_change_score        Item 1A, MD&A
MD&A uncertainty & demand stress      mdna_uncertainty_score         Item 7 (10-K) / Item 2 (10-Q)
Legal & regulatory risk language      legal_regulatory_risk_score    Item 1A (+ flags)
Liquidity & covenant stress           liquidity_stress_score         MD&A (+ flags)
Boilerplate & vague risk language     boilerplate_risk_score         Item 1A
Internal controls weakness signals    internal_controls_risk_score   Controls disclosure + Item 1A
Material event severity (diff-only)   event_severity_score           Item 1A
Cross-section negative tone           tone_negativity_score          Item 1A + MD&A

Headline score: weighted mean of the nine components listed in COMPONENT_WEIGHTS. specificity_quality_score is computed and returned in components but carries no headline weight.

Blend formulas and weights: Aggregation.

Headline score and confidence

overall = weighted mean of present headline components (weights renormalized)
coverage = (# non-null headline components) / 9
confidence = blend of extraction confidence, diff confidence, and coverage
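The first two lines above can be sketched as follows. The function name and the example weights are hypothetical; the real weights live in COMPONENT_WEIGHTS and are not reproduced here. The confidence blend is left out because its coefficients are not stated on this page.

```python
from typing import Mapping, Optional, Tuple


def headline_score(
    components: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Tuple[Optional[float], float]:
    """Return (overall, coverage) for the headline components.

    Only components that have a weight and a non-null value contribute.
    Dividing by the sum of the weights that are present is the renormalization.
    """
    present = {
        name: components[name]
        for name in weights
        if components.get(name) is not None
    }
    # Denominator is the number of weighted components (nine in this document).
    coverage = len(present) / len(weights) if weights else 0.0
    total_weight = sum(weights[name] for name in present)
    if total_weight <= 0:
        return None, coverage
    overall = sum(weights[name] * value for name, value in present.items()) / total_weight
    return overall, coverage


# Placeholder weights for three components; NOT the real COMPONENT_WEIGHTS.
example_weights = {
    "risk_factor_intensity_score": 2.0,
    "mdna_uncertainty_score": 1.0,
    "liquidity_stress_score": 1.0,
}
example_components = {
    "risk_factor_intensity_score": 80.0,
    "mdna_uncertainty_score": 40.0,
    "liquidity_stress_score": None,     # null component: dropped, weights renormalized
    "specificity_quality_score": 90.0,  # returned, but has no headline weight
}
overall, coverage = headline_score(example_components, example_weights)
# overall = (2*80 + 1*40) / 3, coverage = 2/3
```

A null component lowers coverage instead of pulling the mean toward zero, which is why null and zero are kept distinct throughout.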

To see provenance, pass include=provenance on HTTP requests or inspect the CLI JSON output.

Prior filing rules

When prior HTML is supplied (CLI --prior-html, HTTP compare=prior, or EDGAR prior resolution):

  • Match by same section name between current and prior extractions

  • Prior filing is resolved as the same ticker, same form type, earlier filing date (see EDGAR resolver)

  • No prior section → disclosure_change_score = null for that section (not zero)

Never compare 10-K sections to 10-Q sections for primary scores.
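A minimal sketch of these rules, under stated assumptions: the Filing type and both function names are hypothetical, and picking the most recent earlier filing is an assumption (the rules above only require an earlier filing date). See the EDGAR resolver for the actual behavior.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Mapping, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Filing:
    # Hypothetical record type for illustration.
    ticker: str
    form_type: str  # e.g. "10-K" or "10-Q"
    filing_date: date


def resolve_prior(current: Filing, candidates: Sequence[Filing]) -> Optional[Filing]:
    """Pick a prior filing: same ticker, same form type, earlier filing date.

    Assumption: among eligible filings, the most recent one is used.
    """
    eligible = [
        f for f in candidates
        if f.ticker == current.ticker
        and f.form_type == current.form_type  # never pair 10-K with 10-Q
        and f.filing_date < current.filing_date
    ]
    return max(eligible, key=lambda f: f.filing_date, default=None)


def pair_sections(
    current_sections: Mapping[str, str],
    prior_sections: Mapping[str, str],
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Match sections by name; a missing prior section stays None (null, not zero)."""
    return {
        name: (text, prior_sections.get(name))
        for name, text in current_sections.items()
    }
```

Keeping the unmatched prior as None lets the diff step report a null disclosure_change_score for that section instead of a misleading zero.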

Required sections by form

Form    Required for full coverage
10-K    item_1a_risk_factors, item_7_mdna
10-Q    item_1a_risk_factors, item_2_mdna

Missing sections → lower score_coverage_ratio, component nulls, reduced confidence. Section names: Section Taxonomy.
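A pre-flight check for required sections might look like the sketch below. The section names come from the table above; the constant and function names are hypothetical, and treating an empty extraction as missing is an assumption. The function only reports which sections are absent; it does not compute score_coverage_ratio.

```python
from typing import List, Mapping, Optional

# Required sections per form, taken from the table above.
REQUIRED_SECTIONS = {
    "10-K": ("item_1a_risk_factors", "item_7_mdna"),
    "10-Q": ("item_1a_risk_factors", "item_2_mdna"),
}


def missing_required_sections(
    form_type: str,
    extracted: Mapping[str, Optional[str]],
) -> List[str]:
    """List required sections that are absent or empty in the extraction result."""
    try:
        required = REQUIRED_SECTIONS[form_type]
    except KeyError:
        raise ValueError(f"unsupported form type: {form_type!r}") from None
    # Assumption: an empty string or None counts as missing.
    return [name for name in required if not extracted.get(name)]
```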

HTTP API

Endpoint                                       Purpose
GET /v1/company/{ticker}/disclosure-metrics    Raw metrics, flags, diffs
GET /v1/company/{ticker}/disclosure-matrix     Component scores from deterministic scoring
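To show how the two paths are formed, here is a small URL builder. The base URL and the function name are hypothetical placeholders. The include=provenance and compare=prior parameters are mentioned earlier on this page, but which endpoint accepts each one is an assumption to verify against the API reference.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.example.com"  # hypothetical host; substitute your deployment

RESOURCES = ("disclosure-metrics", "disclosure-matrix")


def disclosure_url(ticker, resource, base_url=BASE_URL, **params):
    """Build a GET URL for one of the two endpoints listed above."""
    if resource not in RESOURCES:
        raise ValueError(f"unknown resource: {resource!r}")
    path = f"/v1/company/{quote(ticker, safe='')}/{resource}"
    # Sort parameters so the same inputs always produce the same URL.
    query = urlencode(sorted(params.items()))
    return f"{base_url}{path}" + (f"?{query}" if query else "")


matrix_url = disclosure_url("ABC", "disclosure-matrix", include="provenance", compare="prior")
```

The resulting string can be passed to any HTTP client; authentication and response handling are outside the scope of this sketch.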