Research Foundation

Peer-reviewed finance and accounting literature motivates the deterministic metrics in Disclosure Alpha. This page maps papers to what the open-source engine measures today.

Empirical results on the current release: Evidence and Validation.

Core references

Paper

Citation

Relevance

Loughran & McDonald (2011)

J. Finance 66(1), 35–65. DOI

Finance-specific word lists; tone links to returns, volatility, fraud, material weakness

Loughran & McDonald (2016)

J. Accounting Research 54(4), 1187–1230. DOI

Implementation pitfalls for textual measures

Cohen, Malloy & Nguyen (2020)

J. Finance 75(3), 1371–1415. DOI

Filing text changes predict returns, earnings, bankruptcy

Hope, Hu & Lu (2016)

Rev. Accounting Studies 21(4), 1005–1045. DOI

Specific risk-factor disclosure → market reaction

Lang & Stice-Lawrence (2015)

J. Accounting & Economics 60(2–3), 421–451. DOI

Boilerplate risk-factor language → liquidity effects

Dyer, Lang & Stice-Lawrence (2017)

JAE 64(2), 221–245. DOI

Topic trends and stickiness in 10-K text

Word-list tone ratios

Our metric

Literature category

Finding

negative_word_ratio

LM Negative

Associated with filing-date returns, volatility, fraud samples

uncertainty_word_ratio

LM Uncertainty

Associated with volatility

litigious_word_ratio

LM Litigious

Elevated in litigation / regulatory contexts

modal_word_ratio

LM Modal

Weak commitment language in MD&A

constraining_word_ratio

Custom (LM-adjacent)

Covenant / obligation language → liquidity stress proxy

Implementation: Built-in word lists in src/disclosure_alpha/dictionaries/ (built_in_dictionaries_v3). These are finance-inspired curated lists shipped with the repo — not a redistribution of the full Loughran–McDonald master dictionary.

Specificity

Our metric

Literature basis

numeric_specificity_score

Hope et al. — numeric detail in risk factors; we count numeric tokens per 100 words

company_specificity_score

Hope et al. — entity specificity; we proxy via capitalized terms, geography, and segment phrases

Boilerplate

Our metric

Literature basis

boilerplate_phrase_ratio

Lang & Stice-Lawrence — boilerplate risk-factor language; we use a fixed phrase list matched per section

Our measure is a section-level phrase hit rate, not Lang & Stice-Lawrence’s cross-firm 4-gram frequency measure.

Readability

Our metric

Literature basis

readability_score

Inspired by readability literature (Li 2008; Miller 2010); implemented as a custom heuristic from sentence length and long-word share

Used inside the MD&A uncertainty blend — not a standalone Fog index replication.

Section change

Our metric

Literature basis

lexical_similarity

Cohen et al. — TF-IDF cosine similarity

semantic_similarity

Embedding-based similarity (extension beyond Cohen et al. baseline)

disclosure_change_score

Cohen et al. “changer” concept — weighted composite of similarity, length, and topic shifts

new_topics / intensified_topics

Dyer et al. topic framing — keyword-cluster proxy

language_deltas

Tone change vs prior section: (current_ratio prior_ratio) × 100

Details: Diff Engine.

Boolean flags

Hard-event phrase flags (material weakness, restatement, going concern, etc.) are grounded in SEC disclosure language and supported by LM (2011) and related event-literature. Flags fire only in scoped sections — see Metrics Engine.

Licensing note

Loughran–McDonald word lists are free for academic use; commercial redistribution requires separate permission. Disclosure Alpha ships its own built-in lists to avoid licensing ambiguity. Do not describe scores as “Loughran–McDonald-based” unless you load licensed LM lists yourself.