Diff Engine

What this page answers: How year-over-year section diffs are matched and scored, and when change components are null.


Inputs	Current and prior section text (matched by `section_name`)
Outputs	Lexical/semantic similarity, change score, language deltas
Version	Bundled with `section_extractor_v1` prior matching rules

In plain terms

The diff engine compares each section’s text to the same section in a prior comparable filing. It measures lexical and semantic similarity, detects new or intensified risk topics, and produces a 0–100 change score plus language deltas that feed aggregation.

When you’ll see this

CLI: automatic when --prior-html is set; EDGAR ticker flows resolve prior by default
HTTP: compare=prior on metrics and matrix routes (default)
Components affected: disclosure_change_score, event_severity_score, and diff inputs to risk_factor_intensity_score and internal_controls_risk_score
Null signal: no prior section → disclosure_change_score is null, not zero

Implemented in the diff_engine module.

Full specification

The diff engine quantifies meaningful qualitative change between the current section text and the text of the same section in a prior comparable filing. Prior sections are matched by name in pipeline.py (_prior_by_name).

Prior section selection

When scoring with prior HTML (CLI --prior-html, HTTP compare=prior, or EDGAR prior resolution):

Case	Prior text
Prior section exists with same `section_name`	Prior section `cleaned_text`
No prior HTML or no matching section	`prior_text = None`
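The two cases in the table can be sketched as a dictionary lookup keyed on `section_name`. This is a minimal illustration, not the code in pipeline.py: the `Section` dataclass and both helper functions are hypothetical names, and only `section_name`, `cleaned_text`, and the `None` result come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Section:
    # Hypothetical container; only the two field names come from this page.
    section_name: str
    cleaned_text: str


def index_prior_sections(prior_sections: Optional[list]) -> dict:
    """Map section_name -> cleaned_text; empty when there is no prior HTML."""
    return {s.section_name: s.cleaned_text for s in (prior_sections or [])}


def select_prior_text(current: Section, prior_index: dict) -> Optional[str]:
    # None means "no comparable prior section", which is not the same as "".
    return prior_index.get(current.section_name)


prior = [Section("Item 1A", "Risk factors from last year.")]
index = index_prior_sections(prior)

matched = select_prior_text(Section("Item 1A", "Risk factors this year."), index)
unmatched = select_prior_text(Section("Item 7", "MD&A this year."), index)
no_prior = select_prior_text(Section("Item 1A", "Any text."), index_prior_sections(None))
```

Returning `None` rather than an empty string keeps the "no prior" case distinguishable from a prior section that exists but is empty.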

EDGAR ticker flows resolve the prior filing as same ticker, same form type, earlier filing date via resolve_prior_filing().

Never compare 10-K to 10-Q for primary scores.

Outputs

SectionDiffResult(
    lexical_similarity: float | None,      # 0–1
    semantic_similarity: float | None,     # 0–1
    length_change_pct: float | None,
    new_topics: list[str],
    removed_topics: list[str],
    intensified_topics: list[str],
    disclosure_change_score: float | None, # 0–100 (v1 formula)
    disclosure_change_score_v2: float | None, # sentence-aligned v2 formula
    added_sentence_count: int,
    removed_sentence_count: int,
    changed_numeric_count: int,
    added_risk_language_score: float | None,
    diff_evidence: dict,
    diff_summary: str,
    confidence_score: float,
    language_deltas: dict[str, float],
)

Similarity

Measure	Method
`lexical_similarity`	TF-IDF cosine similarity (sklearn, max 2000 features)
`semantic_similarity`	Embedding cosine similarity via `embedding_service.semantic_similarity()`
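To make the lexical measure concrete, here is a dependency-free approximation of TF-IDF cosine similarity over two texts. It is a stand-in for illustration only: the engine uses sklearn, whereas this sketch reimplements smoothed IDF by hand, ignores the 2000-feature cap, and uses a hypothetical function name.

```python
import math
import re
from collections import Counter

# Words of two or more characters, lowercased before matching.
TOKEN = re.compile(r"\b\w\w+\b")


def tfidf_cosine(current: str, prior: str) -> float:
    """Cosine similarity of TF-IDF vectors fitted on just these two texts."""
    docs = [Counter(TOKEN.findall(text.lower())) for text in (current, prior)]
    vocab = set(docs[0]) | set(docs[1])
    n_docs = len(docs)
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in d for d in docs))) + 1
        for term in vocab
    }
    vectors = [{term: count * idf[term] for term, count in d.items()} for d in docs]
    dot = sum(w * vectors[1].get(term, 0.0) for term, w in vectors[0].items())
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vectors]
    if norms[0] == 0 or norms[1] == 0:
        return 0.0  # an empty text shares nothing with the other one
    return dot / (norms[0] * norms[1])


same = tfidf_cosine("liquidity risk increased", "liquidity risk increased")
disjoint = tfidf_cosine("liquidity risk increased", "cyber incidents occurred")
partial = tfidf_cosine("liquidity risk increased", "liquidity risk declined")
```

Identical texts score close to 1, texts with no shared terms score 0, and partial overlap lands in between.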

Topic detection

Topics come from keyword clusters in TOPIC_KEYWORDS (dictionaries.py):

New topics — in current, not in prior
Removed topics — in prior, not in current
Intensified topics — shared topics where current intensity > 1.2× prior (SEVERITY_WORDS weighting)
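The three categories above can be sketched as set logic over per-topic intensities. This sketch assumes the intensities have already been computed (how TOPIC_KEYWORDS and SEVERITY_WORDS produce them is not specified here), and `classify_topics` and the sample topic names are hypothetical.

```python
def classify_topics(current: dict, prior: dict, ratio: float = 1.2):
    """Split topics into new, removed, and intensified lists.

    `current` and `prior` map topic name -> intensity; a topic counts as
    present when it has an entry. Only the 1.2x threshold comes from this page.
    """
    new_topics = sorted(t for t in current if t not in prior)
    removed_topics = sorted(t for t in prior if t not in current)
    intensified_topics = sorted(
        t for t in current if t in prior and current[t] > ratio * prior[t]
    )
    return new_topics, removed_topics, intensified_topics


cur = {"cybersecurity": 3.0, "liquidity": 2.5, "supply_chain": 1.0}
pri = {"liquidity": 2.0, "litigation": 1.0, "supply_chain": 1.0}
new_topics, removed_topics, intensified_topics = classify_topics(cur, pri)
```

Here `liquidity` is intensified because 2.5 exceeds 1.2 × 2.0, while `supply_chain` is unchanged and appears in none of the lists.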

Change score formula

combined_sim = 0.6 × semantic_similarity + 0.4 × lexical_similarity

new_topic_score      = min(1.0, len(new_topics) / 3)
intensified_score    = min(1.0, len(intensified) / 3)
length_component     = clamp(length_change_pct, 0, 1)

disclosure_change_score =
    40 × (1 − combined_sim)
  + 20 × length_component
  + 20 × new_topic_score
  + 20 × intensified_score

clamp to [0, 100]
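The formula translates directly into a small function. This is a sketch under two assumptions that the page does not state: `length_change_pct` is expressed as a fraction (0.25 for a 25% increase), and a missing similarity yields a null score. The function name is hypothetical.

```python
from typing import Optional


def clamp(value: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, value))


def change_score_v1(
    semantic_similarity: Optional[float],
    lexical_similarity: Optional[float],
    length_change_pct: float,
    new_topic_count: int,
    intensified_count: int,
) -> Optional[float]:
    # Assumption: without both similarities there is no comparable prior.
    if semantic_similarity is None or lexical_similarity is None:
        return None
    combined_sim = 0.6 * semantic_similarity + 0.4 * lexical_similarity
    new_topic_score = min(1.0, new_topic_count / 3)
    intensified_score = min(1.0, intensified_count / 3)
    length_component = clamp(length_change_pct, 0.0, 1.0)
    score = (
        40 * (1 - combined_sim)
        + 20 * length_component
        + 20 * new_topic_score
        + 20 * intensified_score
    )
    return clamp(score, 0.0, 100.0)


# 40 * (1 - 0.72) + 20 * 0.25 + 20 * (1 / 3) + 0, roughly 22.87
example = change_score_v1(0.8, 0.6, 0.25, 1, 0)
ceiling = change_score_v1(0.0, 0.0, 2.0, 5, 4)
missing = change_score_v1(None, None, 0.0, 0, 0)
```

The four weights sum to 100, so the final clamp only guards against out-of-range inputs such as a similarity above 1.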

V2 sentence alignment (experimental)

disclosure_change_score_v2 supplements v1 with:

TF-IDF sentence alignment (align_sentences in text_matching.py)
Added-language risk density on unmatched current sentences
Numeric token add/remove/change detection (extract_numeric_tokens)
Richer diff_evidence provenance

v1 disclosure_change_score is unchanged for backward compatibility. score_deterministic_v2 may use v2 diff scores when present.
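As a rough illustration of the numeric-token idea, the sketch below pulls number-like tokens from two texts and reports which were added or removed. It is not the project's extract_numeric_tokens: the regular expression, the function names, and the multiset comparison are all assumptions, and pairing an added token with a removed one as a "changed" value would additionally require the sentence alignment.

```python
import re
from collections import Counter

# Assumed token shape: optional "$", digits with optional thousands groups,
# optional decimal part, optional trailing "%".
NUMERIC = re.compile(r"\$?\d+(?:,\d{3})*(?:\.\d+)?%?")


def numeric_tokens_sketch(text: str) -> list:
    return NUMERIC.findall(text)


def numeric_diff_sketch(current: str, prior: str):
    cur = Counter(numeric_tokens_sketch(current))
    pri = Counter(numeric_tokens_sketch(prior))
    added = sorted((cur - pri).elements())    # in current, not in prior
    removed = sorted((pri - cur).elements())  # in prior, not in current
    return added, removed


prior_text = "Revenue was $1,200 million, up 8% in 2023."
current_text = "Revenue was $1,250.5 million, up 12% in 2023."
added, removed = numeric_diff_sketch(current_text, prior_text)
```

The shared token `2023` cancels out, leaving only the figures that differ between the two sentences.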

Language deltas

Per-ratio change vs prior section (percentage-point scale):

negative_language_delta     = (neg_cur − neg_prior) × 100
uncertainty_language_delta  = (unc_cur − unc_prior) × 100
legal_language_delta        = (lit_cur − lit_prior) × 100
constraining_language_delta = (con_cur − con_prior) × 100

Used by aggregation: uncertainty delta boosts disclosure_change_score; legal delta feeds legal_regulatory_risk_score.
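A minimal sketch of the delta computation, assuming each section's language ratios arrive as a dictionary keyed by the abbreviations used in the formulas (`neg`, `unc`, `lit`, `con`). The dictionary layout and the function name are hypothetical.

```python
# Maps the ratio abbreviations from the formulas to the output field names.
DELTA_NAMES = {
    "neg": "negative_language_delta",
    "unc": "uncertainty_language_delta",
    "lit": "legal_language_delta",
    "con": "constraining_language_delta",
}


def language_deltas(current_ratios: dict, prior_ratios: dict) -> dict:
    """Difference of ratios, scaled to percentage points."""
    return {
        name: (current_ratios[key] - prior_ratios[key]) * 100
        for key, name in DELTA_NAMES.items()
    }


cur = {"neg": 0.031, "unc": 0.018, "lit": 0.010, "con": 0.006}
pri = {"neg": 0.024, "unc": 0.020, "lit": 0.010, "con": 0.004}
deltas = language_deltas(cur, pri)
```

A negative ratio rising from 2.4% to 3.1% of tokens gives a `negative_language_delta` of about 0.7 percentage points, and a falling uncertainty ratio gives a negative delta.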

Diff confidence

When prior text exists:

confidence = clamp(0.7 + semantic_similarity × 0.2 − |Δ uncertainty_ratio|, 0.4, 0.95)

No prior → disclosure_change_score = null, confidence_score = 0.2, summary "No prior comparable filing section available."

Never substitute 0 for null — missing prior is not “no change.”
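The confidence rule and the no-prior fallback can be sketched together. The sketch assumes that Δ uncertainty_ratio is measured on the raw ratio scale (not percentage points) and that a missing `semantic_similarity` signals a missing prior. The function name is hypothetical.

```python
from typing import Optional

NO_PRIOR_CONFIDENCE = 0.2


def diff_confidence(
    semantic_similarity: Optional[float],
    uncertainty_ratio_delta: float = 0.0,
) -> float:
    # Assumption: semantic_similarity is None exactly when there is no prior text.
    if semantic_similarity is None:
        return NO_PRIOR_CONFIDENCE
    raw = 0.7 + semantic_similarity * 0.2 - abs(uncertainty_ratio_delta)
    return max(0.4, min(0.95, raw))  # keep the result within [0.4, 0.95]


typical = diff_confidence(0.8, 0.02)   # 0.7 + 0.16 - 0.02
floored = diff_confidence(0.0, -0.5)   # raw value 0.2 is lifted to the 0.4 floor
no_prior = diff_confidence(None)
```

Under these assumptions a comparison that did run never reports confidence below 0.4, so the value 0.2 only appears when there was no prior section.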

Interpretation bands

Score	Meaning
0–25	Minor wording; high similarity
26–50	Moderate topic or tone shift
51–75	New or intensified risk topics
76–100	Major semantic shift or multiple new topics
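A small lookup makes the bands usable in reports. The table lists integer edges only, so this sketch assumes fractional scores fall into the band whose upper edge they do not exceed (25.5 counts as 26–50). The function name and the no-prior label are hypothetical.

```python
from typing import Optional

# (inclusive upper edge, meaning), taken from the table above.
BANDS = [
    (25, "Minor wording; high similarity"),
    (50, "Moderate topic or tone shift"),
    (75, "New or intensified risk topics"),
    (100, "Major semantic shift or multiple new topics"),
]


def interpret_change_score(score: Optional[float]) -> str:
    if score is None:
        # A null score means no prior section, not "no change".
        return "No prior comparable section"
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for upper, meaning in BANDS:
        if score <= upper:
            return meaning
    raise AssertionError("unreachable: 100 is the last upper edge")


labels = [interpret_change_score(s) for s in (0, 25, 25.5, 75, 100, None)]
```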

Section use in aggregation

Component	Primary diff input
`disclosure_change_score`	Item 1A diff (60%) + MD&A diff (40%), plus uncertainty delta boost
`risk_factor_intensity_score`	Item 1A diff (25% of tone blend)
`internal_controls_risk_score`	Controls section diff
`event_severity_score`	Item 1A diff only
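The 60/40 weighting in the first row can be sketched as follows. Two things are assumptions rather than documented behaviour: the blend stays null when either section lacks a prior, and the uncertainty delta boost is left out because its size is not given on this page. The function name is hypothetical.

```python
from typing import Optional


def blend_disclosure_change(
    item_1a_score: Optional[float],
    mdna_score: Optional[float],
) -> Optional[float]:
    # Assumption: a missing section diff keeps the blend null instead of
    # falling back to the other section or to zero.
    if item_1a_score is None or mdna_score is None:
        return None
    # Weights from the table: Item 1A 60%, MD&A 40%. Boost not applied here.
    return 0.6 * item_1a_score + 0.4 * mdna_score


blended = blend_disclosure_change(40.0, 20.0)  # 24 + 8, about 32
missing = blend_disclosure_change(40.0, None)
```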

API exposure

GET /v1/company/{ticker}/disclosure-metrics returns per-section diffs and language_deltas when compare=prior (default).
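For reference, a request URL for this route can be assembled with the standard library. The host `https://api.example.com` is a placeholder, not the real deployment, and the helper name is hypothetical. Only the path and the `compare=prior` parameter come from this page.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.example.com"  # placeholder host, replace with your deployment


def disclosure_metrics_url(ticker: str, compare: str = "prior") -> str:
    """Build the GET URL for per-section diffs and language deltas."""
    path = f"/v1/company/{quote(ticker, safe='')}/disclosure-metrics"
    return f"{BASE_URL}{path}?{urlencode({'compare': compare})}"


url = disclosure_metrics_url("AAPL")
```

Because `compare=prior` is the default on this route, passing it explicitly only documents the intent.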