Research Foundation
Peer-reviewed finance and accounting literature motivates the deterministic metrics in Disclosure Alpha. This page maps papers to what the open-source engine measures today.
For empirical results on the current release, see Evidence and Validation.
Core references

| Paper | Citation | Relevance |
|---|---|---|
| Loughran & McDonald (2011) | J. Finance 66(1), 35–65. DOI | Finance-specific word lists; tone links to returns, volatility, fraud, material weakness |
| Loughran & McDonald (2016) | J. Accounting Research 54(4), 1187–1230. DOI | Implementation pitfalls for textual measures |
| Cohen, Malloy & Nguyen (2020) | J. Finance 75(3), 1371–1415. DOI | Filing text changes predict returns, earnings, bankruptcy |
| Hope, Hu & Lu (2016) | Rev. Accounting Studies 21(4), 1005–1045. DOI | Specific risk-factor disclosure → market reaction |
| Lang & Stice-Lawrence (2015) | J. Accounting & Economics 60(2–3), 110–135. DOI | Boilerplate risk-factor language → liquidity effects |
| Dyer, Lang & Stice-Lawrence (2017) | J. Accounting & Economics 64(2–3), 221–245. DOI | Topic trends and stickiness in 10-K text |

Word-list tone ratios

| Our metric | Literature category | Finding |
|---|---|---|
|  | LM Negative | Associated with filing-date returns, volatility, fraud samples |
|  | LM Uncertainty | Associated with volatility |
|  | LM Litigious | Elevated in litigation / regulatory contexts |
|  | LM Modal | Weak commitment language in MD&A |
|  | Custom (LM-adjacent) | Covenant / obligation language → liquidity stress proxy |

Implementation: Built-in word lists in src/disclosure_alpha/dictionaries/ (built_in_dictionaries_v3). These are finance-inspired curated lists shipped with the repo — not a redistribution of the full Loughran–McDonald master dictionary.
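As a rough illustration of how a word-list tone ratio can be computed, the sketch below divides word-list hits by total tokens. The word list, tokenizer, and function name are hypothetical stand-ins: they are not the lists in src/disclosure_alpha/dictionaries/ and not the Loughran–McDonald dictionary, and the engine's actual normalization may differ.

```python
import re
from typing import AbstractSet

# Hypothetical mini word list, for illustration only.
NEGATIVE_WORDS = frozenset({"loss", "decline", "impairment", "adverse"})

# Assumption: tokens are lowercase alphabetic runs, with an optional apostrophe part.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tone_ratio(text: str, word_list: AbstractSet[str]) -> float:
    """Return the share of tokens found in word_list (0.0 for empty text)."""
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in word_list)
    return hits / len(tokens)


if __name__ == "__main__":
    sample = "Net loss widened after an adverse ruling."
    print(round(tone_ratio(sample, NEGATIVE_WORDS), 4))
```

Dividing by total tokens keeps sections of different lengths comparable; a count-based variant would favor long sections.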
Specificity

| Our metric | Literature basis |
|---|---|
|  | Hope et al.: numeric detail in risk factors; we count numeric tokens per 100 words |
|  | Hope et al.: entity specificity; we proxy via capitalized terms, geography, and segment phrases |

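The numeric-detail count can be sketched as follows. This is a minimal example under stated assumptions: tokens are split on whitespace, and a token counts as numeric if it is a plain number once surrounding punctuation, currency, and percent signs are stripped. The function name and these rules are illustrative, not the engine's actual definition.

```python
import re

# Assumption: digits with optional thousands separators and a decimal part.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

# Characters stripped from both ends of a token before the numeric check.
STRIP_CHARS = "$%()[].,;:"


def numeric_tokens_per_100_words(text: str) -> float:
    """Return numeric tokens per 100 whitespace-delimited tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    numeric = sum(
        1 for token in tokens if NUMBER_RE.fullmatch(token.strip(STRIP_CHARS))
    )
    return 100.0 * numeric / len(tokens)


if __name__ == "__main__":
    print(numeric_tokens_per_100_words("Revenue fell 12% to $1,200 million in 2023."))
```

Under these rules, years and footnote markers also count as numeric tokens; a stricter variant could exclude them.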
Boilerplate

| Our metric | Literature basis |
|---|---|
|  | Lang & Stice-Lawrence: boilerplate risk-factor language; we use a fixed phrase list matched per section |

Our measure is a section-level phrase hit rate, not Lang & Stice-Lawrence’s cross-firm 4-gram frequency measure.
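One way to read "section-level phrase hit rate" is phrase matches per 100 words of a section, sketched below. The phrase list, the per-100-words normalization, and the function name are all assumptions for illustration; the repo's phrase list and exact denominator are not reproduced here.

```python
import re
from typing import Sequence

# Hypothetical phrase list, for illustration only.
BOILERPLATE_PHRASES = (
    "there can be no assurance",
    "could materially and adversely affect",
    "from time to time",
)


def boilerplate_hit_rate(
    section_text: str, phrases: Sequence[str] = BOILERPLATE_PHRASES
) -> float:
    """Return phrase matches per 100 words in one section."""
    # Normalize to lowercase word tokens joined by single spaces so that
    # punctuation and line breaks do not block a phrase match.
    words = re.findall(r"[a-z0-9']+", section_text.lower())
    if not words:
        return 0.0
    normalized = " ".join(words)
    hits = sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", normalized))
        for phrase in phrases
    )
    return 100.0 * hits / len(words)


if __name__ == "__main__":
    text = (
        "There can be no assurance that demand will recover. "
        "From time to time, we face claims."
    )
    print(boilerplate_hit_rate(text))
```

Because the phrase list is fixed, the score depends only on the section itself, unlike a cross-firm n-gram frequency that needs a reference corpus.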
Readability

| Our metric | Literature basis |
|---|---|
|  | Inspired by readability literature (Li 2008; Miller 2010); implemented as a custom heuristic from sentence length and long-word share |

This score is used inside the MD&A uncertainty blend; it is not a standalone replication of the Fog index.
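A heuristic built from sentence length and long-word share might look like the sketch below. The long-word threshold, the sentence-length cap, the equal weighting, and the function names are hypothetical choices made for this example; they are not the engine's parameters and not the Fog index formula.

```python
import re
from typing import Tuple

LONG_WORD_MIN_CHARS = 8    # assumption: "long" means 8 or more letters
MAX_SENTENCE_WORDS = 40.0  # assumption: average length is capped at this value


def readability_inputs(text: str) -> Tuple[float, float]:
    """Return (average words per sentence, share of long words)."""
    # Naive sentence split on ., !, ?; decimals and abbreviations will over-split.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0, 0.0
    avg_sentence_len = len(words) / len(sentences)
    long_share = sum(len(w) >= LONG_WORD_MIN_CHARS for w in words) / len(words)
    return avg_sentence_len, long_share


def complexity_score(text: str, weight: float = 0.5) -> float:
    """Blend both inputs into a 0-1 score; higher means harder to read."""
    avg_sentence_len, long_share = readability_inputs(text)
    length_component = min(avg_sentence_len / MAX_SENTENCE_WORDS, 1.0)
    return weight * length_component + (1.0 - weight) * long_share


if __name__ == "__main__":
    sample = "We sell software. Competition is intensifying considerably."
    print(readability_inputs(sample), round(complexity_score(sample), 4))
```

Bounding the score to the 0-1 range makes it straightforward to combine with other ratio-style metrics in a blend.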
Section change

| Our metric | Literature basis |
|---|---|
|  | Cohen et al.: TF-IDF cosine similarity |
|  | Embedding-based similarity (extension beyond Cohen et al. baseline) |
|  | Cohen et al. “changer” concept: weighted composite of similarity, length, and topic shifts |
|  | Dyer et al. topic framing: keyword-cluster proxy |
|  | Tone change vs prior section: |

Details: Diff Engine.
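To make the TF-IDF cosine similarity concrete, here is a small standard-library sketch comparing a prior-year section with the current one. It assumes a smoothed IDF fitted on just the two texts and a simple alphabetic tokenizer; the Diff Engine's corpus, weighting, and tokenization may differ, and all function names here are illustrative.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors using a smoothed IDF over docs."""
    token_lists = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothing keeps terms shared by every document from getting zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def section_similarity(prior: str, current: str) -> float:
    prior_vec, current_vec = tfidf_vectors([prior, current])
    return cosine(prior_vec, current_vec)


if __name__ == "__main__":
    prior = "We face intense competition in our markets"
    current = "We face intense competition and new regulation"
    print(round(section_similarity(prior, current), 4))
```

A score near 1 indicates little textual change between years; lower scores mark sections that were substantially rewritten.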
Boolean flags
Hard-event phrase flags (material weakness, restatement, going concern, etc.) are grounded in SEC disclosure language and supported by Loughran & McDonald (2011) and the related event literature. Flags fire only in scoped sections; see Metrics Engine.
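Section scoping can be illustrated with a small sketch: each flag has its own phrases and a set of sections where a match is allowed to fire. The flag names, phrases, and section keys below are hypothetical; the actual rules live in the Metrics Engine.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical rules: flag name -> (phrases, sections in which the flag may fire).
FLAG_RULES: Dict[str, Tuple[Tuple[str, ...], FrozenSet[str]]] = {
    "material_weakness": (
        ("material weakness",),
        frozenset({"controls_and_procedures", "risk_factors"}),
    ),
    "going_concern": (
        ("going concern",),
        frozenset({"mdna", "notes"}),
    ),
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not split phrases."""
    return " ".join(text.lower().split())


def fire_flags(sections: Dict[str, str]) -> Dict[str, bool]:
    """Return one boolean per flag, checking only that flag's scoped sections."""
    normalized = {name: normalize(body) for name, body in sections.items()}
    flags = {}
    for flag, (phrases, scoped_sections) in FLAG_RULES.items():
        # Note: plain substring matching does not handle negation
        # (for example "no material weakness").
        flags[flag] = any(
            phrase in normalized.get(section, "")
            for section in scoped_sections
            for phrase in phrases
        )
    return flags


if __name__ == "__main__":
    filing = {
        "risk_factors": "Doubts about our status as a going concern could arise.",
        "controls_and_procedures": "Management identified a material\nweakness.",
    }
    print(fire_flags(filing))
```

In this example the going-concern phrase appears only in a section outside that flag's scope, so the flag stays false while the material-weakness flag fires.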
Licensing note
Loughran–McDonald word lists are free for academic use; commercial redistribution requires separate permission. Disclosure Alpha ships its own built-in lists to avoid licensing ambiguity. Do not describe scores as “Loughran–McDonald-based” unless you load licensed LM lists yourself.