Metrics Engine
What this page answers: What raw text signals are computed per section, which artifact versions apply, and which component scores they feed.
Inputs |
Extracted section |
Outputs |
Per-section ratios, flags, MD&A densities ( |
Version |
|
In plain terms
The metrics engine counts tone words, boilerplate phrases, specificity proxies, and risk flags in each extracted section. Those per-section numbers feed the diff engine and aggregation: they are the raw signals behind component scores such as mdna_uncertainty_score and legal_regulatory_risk_score.
When you’ll see this
CLI: disclosure-alpha metrics --html filing.html --form 10-K
Python: compute_section_metrics() in the pipeline module
HTTP: GET /v1/company/{ticker}/disclosure-metrics
Components affected: tone-driven scores (risk_factor_intensity_score, mdna_uncertainty_score, boilerplate_risk_score, …) and flag-boosted scores (legal_regulatory_risk_score, liquidity_stress_score)
Module: text_metrics.py (metrics engine) and the built-in dictionary package (built_in_dictionaries_v3).
Full specification
Computes per-section text metrics, boolean risk flags, and MD&A keyword densities. Output feeds the diff engine and aggregation stage.
Input / output
SectionTextInput(section_name: str, cleaned_text: str) → TextMetricResult
All ratios are per-word unless noted. Scores on the 0–100 scale are capped with min(100, ...).
Tokenization
tokenize_words(text) # alphabetic tokens, lowercased
split_sentences(text) # split on . ! ?
Word counts use alphabetic tokens. Sentence count uses non-empty sentence splits (minimum 1).
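The tokenization rules above can be sketched as follows. This is a minimal illustration and not the module's actual code: the regular expressions are assumptions chosen to match the stated behavior (alphabetic tokens, lowercased; split on `.`, `!`, `?`; sentence count of at least 1), and `sentence_count` is a hypothetical helper name.

```python
import re


def tokenize_words(text: str) -> list[str]:
    """Return lowercased alphabetic tokens; digits and punctuation are dropped."""
    # Assumption: "alphabetic" means runs of a-z after lowercasing.
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text: str) -> list[str]:
    """Split on '.', '!' or '?' and discard empty fragments."""
    return [part.strip() for part in re.split(r"[.!?]", text) if part.strip()]


def sentence_count(text: str) -> int:
    """Hypothetical helper: number of non-empty sentences, never below 1."""
    return max(1, len(split_sentences(text)))
```

Under these assumptions a token such as "Q3" contributes only "q", and a decimal like "3.5" would split a sentence in two. The minimum of 1 keeps per-sentence ratios defined for empty input.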
Base metrics
Counts
Field |
Formula |
|---|---|
|
|
|
|
|
|
Tone ratios
Field |
Formula |
Dictionary |
|---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
For aggregation, ratios are multiplied by 100 before blending into 0–100 component scores.
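As a rough illustration of the scaling step, the sketch below converts a per-word ratio to the 0–100 scale. The function names are hypothetical, and the blending weights used in aggregation are not described on this page, so only the multiply-and-cap step is shown.

```python
def ratio_per_word(hits: int, word_count: int) -> float:
    """Per-word ratio; an empty section yields 0.0 rather than dividing by zero."""
    return hits / word_count if word_count > 0 else 0.0


def to_score_scale(ratio: float) -> float:
    """Multiply a per-word ratio by 100 and cap the result at 100."""
    return min(100.0, ratio * 100.0)
```

For example, 50 dictionary hits in a 400-word section give a ratio of 0.125, which enters blending as 12.5.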
Specificity
Field |
Formula |
|---|---|
|
|
|
|
Boilerplate
Field |
Formula |
|---|---|
|
|
Matched against BOILERPLATE_PHRASES in dictionaries.py.
Readability
Field |
Formula |
|---|---|
|
|
Boolean flags
detect_section_flags(text, section_name) returns all flags defined in FLAG_PATTERNS. Each flag is scoped to specific sections via FLAG_SECTION_SCOPE — out-of-scope sections always return false.
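A minimal sketch of section-scoped flag detection, assuming regex-based patterns. The contents of `FLAG_PATTERNS` and `FLAG_SECTION_SCOPE` below are hypothetical placeholders (including `going_concern_flag` and the section key `item_1a_risk_factors`); the real definitions are not reproduced on this page.

```python
import re

# Hypothetical patterns, for illustration only.
FLAG_PATTERNS = {
    "going_concern_flag": re.compile(r"substantial doubt", re.IGNORECASE),
    "cybersecurity_incident_flag": re.compile(r"unauthorized access", re.IGNORECASE),
}

# Hypothetical scopes: the sections in which each flag may fire.
FLAG_SECTION_SCOPE = {
    "going_concern_flag": {"item_7_mdna", "item_2_mdna"},
    "cybersecurity_incident_flag": {"item_1a_risk_factors"},
}


def detect_section_flags(text: str, section_name: str) -> dict[str, bool]:
    """Return every defined flag; out-of-scope flags are always False."""
    flags: dict[str, bool] = {}
    for name, pattern in FLAG_PATTERNS.items():
        in_scope = section_name in FLAG_SECTION_SCOPE.get(name, set())
        flags[name] = in_scope and pattern.search(text) is not None
    return flags
```

Returning every flag key regardless of scope keeps the output shape identical across sections, which simplifies downstream merging.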
Representative flags used in aggregation (with +15 boost when true):
Flag |
Used in |
|---|---|
|
|
|
|
|
|
|
|
Other flags (e.g. cybersecurity_incident_flag) are computed and exposed via the metrics/flags API but are not blended into the deterministic headline score today.
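The +15 boost can be pictured with the sketch below. It assumes the boost is added to a component's blended score and then capped at 100, in line with the capping rule stated earlier; the function name is hypothetical, and how multiple true flags combine is not specified here.

```python
FLAG_BOOST = 15.0  # boost applied when a contributing flag is true


def apply_flag_boost(base_score: float, flag: bool) -> float:
    """Add the flag boost when the flag is set, then cap at 100."""
    boosted = base_score + FLAG_BOOST if flag else base_score
    return min(100.0, boosted)
```

A base score of 40 with a true flag becomes 55, while a base score of 92 saturates at 100.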
MD&A density packs
compute_density_metrics(text, section_name) runs only on MD&A sections (item_7_mdna, item_2_mdna):
density = min(100, term_hits / word_count × 1000)
| Key | Term pack |
|---|---|
| | |
| | demand decline phrases |
| | margin compression phrases |
| | liquidity / covenant phrases |
Non-MD&A sections return zero densities. Densities are merged into aggregation for mdna_uncertainty_score and liquidity_stress_score — see Aggregation.
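The density formula can be sketched for a single term pack as follows. This is an illustration under assumptions: the function name `pack_density`, the example phrases, and the use of substring counting for phrase matching are all hypothetical; only the formula, the cap, and the MD&A section names come from this page.

```python
import re

MDNA_SECTIONS = {"item_7_mdna", "item_2_mdna"}


def pack_density(text: str, terms: tuple[str, ...], section_name: str) -> float:
    """Hits per 1,000 words for one term pack, capped at 100; 0 outside MD&A."""
    if section_name not in MDNA_SECTIONS:
        return 0.0
    lowered = text.lower()
    word_count = len(re.findall(r"[a-z]+", lowered))
    if word_count == 0:
        return 0.0
    # Assumption: phrases are matched by case-insensitive substring counting.
    term_hits = sum(lowered.count(term) for term in terms)
    return min(100.0, 1000.0 * term_hits / word_count)
```

With one matching phrase in a 100-word MD&A section the density is 10; the same text in any other section yields 0.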
Pipeline output
compute_section_metrics() in pipeline.py returns a MetricsResult with:
| Field | Contents |
|---|---|
| | Per-section |
| | Per-section boolean flag dicts |
| | Per-section density dicts |
| | Per-section change scores (from diff engine) |
| | Per-section tone deltas vs prior |
Exposed via CLI metrics command, Python compute_section_metrics(), and GET /v1/company/{ticker}/disclosure-metrics.
Edge cases
| Case | Behavior |
|---|---|
| Empty section | Ratios 0; specificity 0; flags false; densities 0 |
| Section out of flag scope | Flag forced false |
| Non-MD&A section | All density keys 0 |