Diff Engine
What this page answers: How year-over-year section diffs are matched, scored, and when change components are null.
| Inputs | Current and prior section text (matched by name) |
| Outputs | Lexical/semantic similarity, change score, language deltas |
| Version | Bundled with |
In plain terms
The diff engine compares each section’s text to the same section in a prior comparable filing. It measures lexical and semantic similarity, detects new or intensified risk topics, and produces a 0–100 change score plus language deltas that feed aggregation.
When you’ll see this
CLI: automatic when --prior-html is set; EDGAR ticker flows resolve prior by default
HTTP: compare=prior on metrics and matrix routes (default)
Components affected: disclosure_change_score, event_severity_score, and diff inputs to risk_factor_intensity_score and internal_controls_risk_score
Null signal: no prior section → disclosure_change_score is null, not zero
Implemented in the diff_engine module.
Full specification
Quantifies meaningful qualitative change between current and prior comparable section text. Prior sections are matched by name in pipeline.py (_prior_by_name).
Prior section selection
When scoring with prior HTML (CLI --prior-html, HTTP compare=prior, or EDGAR prior resolution):
| Case | Prior text |
|---|---|
| Prior section exists with same name | Prior section |
| No prior HTML or no matching section | |
EDGAR ticker flows resolve the prior filing as same ticker, same form type, earlier filing date via resolve_prior_filing().
Never compare 10-K to 10-Q for primary scores.
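A minimal sketch of the name-based lookup described above. The Section dataclass and prior_text_by_name helper are hypothetical stand-ins, not the actual _prior_by_name implementation; the point is only that a missing prior yields None rather than an empty string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Section:
    # Hypothetical shape; the real section model may carry more fields.
    name: str
    text: str


def prior_text_by_name(
    current: Section, prior_sections: Optional[list[Section]]
) -> Optional[str]:
    """Return the prior text for the section with the same name, else None."""
    if not prior_sections:  # no prior HTML was supplied or resolved
        return None
    by_name = {s.name: s.text for s in prior_sections}
    return by_name.get(current.name)  # None when no section name matches
```

Returning None instead of "" keeps the "no prior" case distinguishable from "prior section was empty" further down the pipeline.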
Outputs
SectionDiffResult(
lexical_similarity: float | None, # 0–1
semantic_similarity: float | None, # 0–1
length_change_pct: float | None,
new_topics: list[str],
removed_topics: list[str],
intensified_topics: list[str],
disclosure_change_score: float | None, # 0–100 (v1 formula)
disclosure_change_score_v2: float | None, # sentence-aligned v2 formula
added_sentence_count: int,
removed_sentence_count: int,
changed_numeric_count: int,
added_risk_language_score: float | None,
diff_evidence: dict,
diff_summary: str,
confidence_score: float,
language_deltas: dict[str, float],
)
Similarity
| Measure | Method |
|---|---|
| lexical_similarity | TF-IDF cosine similarity (sklearn, max 2000 features) |
| semantic_similarity | Embedding cosine similarity via |
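To show what a cosine score in the 0–1 range looks like, here is a dependency-free sketch. It uses raw term counts instead of the sklearn TF-IDF weighting named in the table, so it is a simplified stand-in and will not reproduce the engine's values; term_counts and cosine_similarity are illustrative names.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Lowercase word counts; a crude stand-in for a TF-IDF vector."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine between two sparse count vectors (0 = disjoint, 1 = identical direction)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0  # empty text has no direction
```

The semantic measure follows the same cosine step, applied to embedding vectors rather than term vectors.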
Topic detection
Topics come from keyword clusters in TOPIC_KEYWORDS (dictionaries.py):
New topics: in current, not in prior
Removed topics: in prior, not in current
Intensified topics: shared topics where current intensity > 1.2× prior (SEVERITY_WORDS weighting)
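The three categories above reduce to set operations plus one ratio test. The sketch below assumes per-topic intensities have already been computed (how SEVERITY_WORDS feeds that number is not shown here); classify_topics and INTENSIFY_RATIO are hypothetical names.

```python
INTENSIFY_RATIO = 1.2  # threshold stated in the text


def classify_topics(
    current: dict[str, float], prior: dict[str, float]
) -> tuple[list[str], list[str], list[str]]:
    """Split topics into (new, removed, intensified).

    Each dict maps a detected topic to its intensity in that section.
    """
    new = sorted(current.keys() - prior.keys())
    removed = sorted(prior.keys() - current.keys())
    intensified = sorted(
        topic
        for topic in current.keys() & prior.keys()
        if current[topic] > INTENSIFY_RATIO * prior[topic]
    )
    return new, removed, intensified
```

For example, a shared topic moving from intensity 2.0 to 3.0 clears the 2.4 threshold and counts as intensified; one that stays at 1.0 does not.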
Change score formula
combined_sim = 0.6 × semantic_similarity + 0.4 × lexical_similarity
new_topic_score = min(1.0, len(new_topics) / 3)
intensified_score = min(1.0, len(intensified) / 3)
length_component = clamp(length_change_pct, 0, 1)
disclosure_change_score =
40 × (1 − combined_sim)
+ 20 × length_component
+ 20 × new_topic_score
+ 20 × intensified_score
clamp to [0, 100]
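The formula above translates directly into code. This is a sketch, not the module's implementation: the function name is illustrative, inputs are assumed non-null (a prior exists), and length_change_pct is assumed to be a fraction (0.25 for a 25% change).

```python
def clamp(value: float, lo: float, hi: float) -> float:
    """Restrict value to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def disclosure_change_score_v1(
    semantic_similarity: float,
    lexical_similarity: float,
    length_change_pct: float,  # assumed fractional, e.g. 0.25
    new_topics: list[str],
    intensified_topics: list[str],
) -> float:
    combined_sim = 0.6 * semantic_similarity + 0.4 * lexical_similarity
    new_topic_score = min(1.0, len(new_topics) / 3)
    intensified_score = min(1.0, len(intensified_topics) / 3)
    length_component = clamp(length_change_pct, 0.0, 1.0)
    raw = (
        40 * (1 - combined_sim)
        + 20 * length_component
        + 20 * new_topic_score
        + 20 * intensified_score
    )
    return clamp(raw, 0.0, 100.0)
```

Worked example: semantic 0.9 and lexical 0.8 give combined_sim = 0.86, contributing 40 × 0.14 = 5.6; a 25% length change adds 5.0; one new topic adds 20 × 1/3 ≈ 6.67; no intensified topics add 0. The total is about 17.3, which falls in the lowest interpretation band.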
V2 sentence alignment (experimental)
disclosure_change_score_v2 supplements v1 with:
TF-IDF sentence alignment (align_sentences in text_matching.py)
Added-language risk density on unmatched current sentences
Numeric token add/remove/change detection (extract_numeric_tokens)
Richer diff_evidence provenance
v1 disclosure_change_score is unchanged for backward compatibility. score_deterministic_v2 may use v2 diff scores when present.
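As an illustration of the numeric-token idea only, the sketch below finds numbers that appear in one text but not the other. The regex and the numeric_tokens / numeric_diff helpers are assumptions made for this example; the real extract_numeric_tokens may tokenize differently and also classifies changed values.

```python
import re
from collections import Counter

# Hypothetical pattern: integers with optional thousands groups,
# an optional decimal part, and an optional trailing percent sign.
_NUMERIC = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?%?")


def numeric_tokens(text: str) -> Counter:
    """Count each numeric token found in the text."""
    return Counter(_NUMERIC.findall(text))


def numeric_diff(current: str, prior: str) -> tuple[list[str], list[str]]:
    """Return (added, removed) numeric tokens, respecting repeat counts."""
    cur, old = numeric_tokens(current), numeric_tokens(prior)
    added = sorted((cur - old).elements())
    removed = sorted((old - cur).elements())
    return added, removed
```

Counter subtraction keeps multiplicity, so a figure that appears twice in the current text and once in the prior text is reported as added once.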
Language deltas
Per-ratio change vs prior section (percentage-point scale):
negative_language_delta = (neg_cur − neg_prior) × 100
uncertainty_language_delta = (unc_cur − unc_prior) × 100
legal_language_delta = (lit_cur − lit_prior) × 100
constraining_language_delta = (con_cur − con_prior) × 100
Used by aggregation: uncertainty delta boosts disclosure_change_score; legal delta feeds legal_regulatory_risk_score.
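A small sketch of the delta computation, assuming each section already has its four language ratios as fractions of total words. The input key names (negative, uncertainty, legal, constraining) are illustrative; the output keys follow the names used above.

```python
CATEGORIES = ("negative", "uncertainty", "legal", "constraining")


def language_deltas(
    current: dict[str, float], prior: dict[str, float]
) -> dict[str, float]:
    """Percentage-point change of each language ratio versus the prior section."""
    return {
        f"{category}_language_delta": (current[category] - prior[category]) * 100
        for category in CATEGORIES
    }
```

A negative-language ratio moving from 0.030 to 0.045 therefore yields a negative_language_delta of 1.5 percentage points.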
Diff confidence
When prior text exists:
confidence = clamp(0.7 + semantic_similarity × 0.2 − |Δ uncertainty_ratio|, 0.4, 0.95)
No prior → disclosure_change_score = null, confidence_score = 0.2, summary "No prior comparable filing section available."
Never substitute 0 for null — missing prior is not “no change.”
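The two confidence cases can be sketched as one function. The name diff_confidence and the has_prior flag are hypothetical, and the uncertainty delta is assumed to be on the raw ratio scale (0.02, not 2.0 percentage points).

```python
NO_PRIOR_CONFIDENCE = 0.2  # fixed value stated for the no-prior case


def diff_confidence(
    has_prior: bool,
    semantic_similarity: float = 0.0,
    uncertainty_ratio_delta: float = 0.0,  # assumed raw ratio difference
) -> float:
    """Confidence for a section diff; low and fixed when no prior exists."""
    if not has_prior:
        return NO_PRIOR_CONFIDENCE
    raw = 0.7 + semantic_similarity * 0.2 - abs(uncertainty_ratio_delta)
    return max(0.4, min(0.95, raw))  # keep within the [0.4, 0.95] range
```

With a prior, semantic similarity 0.9 and an uncertainty shift of 0.02 give 0.7 + 0.18 − 0.02 = 0.86. Without a prior the function returns 0.2, and the change score itself stays null.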
Interpretation bands
| Score | Meaning |
|---|---|
| 0–25 | Minor wording; high similarity |
| 26–50 | Moderate topic or tone shift |
| 51–75 | New or intensified risk topics |
| 76–100 | Major semantic shift or multiple new topics |
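A possible lookup for these bands, keeping null distinct from zero. The function name is illustrative, and treating each upper bound as inclusive for fractional scores (so 25.4 lands in the 26–50 band) is an assumption, since the table only lists integer ranges.

```python
from typing import Optional

# (inclusive upper bound, meaning) pairs taken from the table above
BANDS = [
    (25, "Minor wording; high similarity"),
    (50, "Moderate topic or tone shift"),
    (75, "New or intensified risk topics"),
    (100, "Major semantic shift or multiple new topics"),
]


def interpret_change_score(score: Optional[float]) -> str:
    """Map a 0-100 change score to its band; None means no prior section."""
    if score is None:
        return "No prior comparable filing section available."
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for upper, meaning in BANDS:
        if score <= upper:
            return meaning
    raise AssertionError("unreachable: score is within 0-100")
```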
Section use in aggregation
| Component | Primary diff input |
|---|---|
| disclosure_change_score | Item 1A diff (60%) + MD&A diff (40%), plus uncertainty delta boost |
|  | Item 1A diff (25% of tone blend) |
| internal_controls_risk_score | Controls section diff |
|  | Item 1A diff only |
API exposure
GET /v1/company/{ticker}/disclosure-metrics returns per-section diffs and language_deltas when compare=prior (default).
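A sketch of building the request URL for this route. The base URL is a placeholder and disclosure_metrics_url is a hypothetical helper; only the path and the compare parameter come from the text above, and the response shape is not assumed here.

```python
from urllib.parse import quote, urlencode


def disclosure_metrics_url(base_url: str, ticker: str, compare: str = "prior") -> str:
    """Build the disclosure-metrics URL; compare defaults to prior."""
    path = f"/v1/company/{quote(ticker, safe='')}/disclosure-metrics"
    query = urlencode({"compare": compare})
    return f"{base_url.rstrip('/')}{path}?{query}"


# Placeholder host for illustration only.
url = disclosure_metrics_url("https://api.example.test", "AAPL")
```

Since compare=prior is the default, passing it explicitly is optional; spelling it out makes it obvious that null change scores mean "no prior found" rather than "diffing disabled".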