Disclosure Alpha

alwank/disclosure-alpha

section_extractor

Use when: You have raw SEC filing HTML and need structured sections (Item 1A, MD&A, controls) before metrics or scoring.

Start here

extract_sections(): parse a FilingDocument into a list of ExtractedSection objects
FilingDocument: wrapper for the filing HTML, form type, and identifying metadata (CIK, accession number)
ExtractedSection: section name, cleaned text, extraction confidence, and warnings
required_sections_present(): check whether the sections required for a 10-K or 10-Q were extracted

To extract sections from an HTML string in a single call, use extract_sections_from_html() in pipeline.

Required section names for full coverage: item_1a_risk_factors plus the MD&A section for the form type, which is item_7_mdna for a 10-K or item_2_mdna for a 10-Q. See Section Taxonomy.
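The coverage rule above can be sketched as a plain lookup. This is an illustration only, not the library's implementation: REQUIRED_BY_FORM and missing_sections are hypothetical names, and the 10-Q entry assumes item_1a_risk_factors is required for both form types.

```python
# Hypothetical lookup table; the section names come from the text above.
# Assumption: item_1a_risk_factors is required for both form types.
REQUIRED_BY_FORM = {
    "10-K": {"item_1a_risk_factors", "item_7_mdna"},
    "10-Q": {"item_1a_risk_factors", "item_2_mdna"},
}


def missing_sections(form_type: str, section_names: list[str]) -> set[str]:
    """Return the required section names that are absent from section_names."""
    required = REQUIRED_BY_FORM.get(form_type, set())
    return required - set(section_names)


if __name__ == "__main__":
    # A 10-K where only the risk factors were extracted.
    print(sorted(missing_sections("10-K", ["item_1a_risk_factors"])))
```

Returning the missing names rather than a bare boolean makes it easier to log which section failed to extract; required_sections_present() itself returns only a bool.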

Example

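from disclosure_alpha import FilingDocument, extract_sections, required_sections_present

# FilingDocument takes four required fields; the cik and accession_number values are placeholders.
with open("filing.html") as f:
    doc = FilingDocument(cik="0000000000", accession_number="0000000000-00-000000",
                         form_type="10-K", html=f.read())
sections = extract_sections(doc)
print([s.section_name for s in sections])  # names of the extracted sections
print(required_sections_present("10-K", sections))  # whether the required 10-K sections were found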

Full API

class disclosure_alpha.section_extractor.FilingDocument(cik: str, accession_number: str, form_type: str, html: str)

Bases: object

cik : str

accession_number : str

form_type : str

html : str
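Because all four fields are plain strings, it can help to normalize the identifiers before building a document. The sketch below is an assumption-laden illustration: make_document is a hypothetical helper, the FilingDocument class here is a local stand-in that mirrors the documented fields so the snippet runs on its own, and nothing in this page says the library itself requires zero-padded CIKs. The accession number pattern (10 digits, 2 digits, 6 digits) follows the usual EDGAR format.

```python
import re
from dataclasses import dataclass


@dataclass
class FilingDocument:
    """Local stand-in mirroring the documented fields."""
    cik: str
    accession_number: str
    form_type: str
    html: str


CIK_RE = re.compile(r"[0-9]{1,10}")
# EDGAR accession numbers look like NNNNNNNNNN-YY-NNNNNN.
ACCESSION_RE = re.compile(r"[0-9]{10}-[0-9]{2}-[0-9]{6}")


def make_document(cik: str, accession_number: str, form_type: str, html: str) -> FilingDocument:
    """Hypothetical helper: validate identifiers, zero-pad the CIK, build the document."""
    if not CIK_RE.fullmatch(cik):
        raise ValueError(f"CIK must be 1 to 10 digits: {cik!r}")
    if not ACCESSION_RE.fullmatch(accession_number):
        raise ValueError(f"unexpected accession number: {accession_number!r}")
    return FilingDocument(cik.zfill(10), accession_number, form_type, html)


if __name__ == "__main__":
    # Placeholder identifiers, not a real filing.
    doc = make_document("1234567", "0001234567-24-000001", "10-K", "<html></html>")
    print(doc.cik)
```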

class disclosure_alpha.section_extractor.ExtractedSection(section_name: str, raw_text: str, cleaned_text: str, text_hash: str, word_count: int, sentence_count: int, extraction_confidence: float, extraction_method: str, parser_version: str, start_offset: int | None = None, end_offset: int | None = None, warnings: list[str] = <factory>)

Bases: object

section_name : str

raw_text : str

cleaned_text : str

text_hash : str

word_count : int

sentence_count : int

extraction_confidence : float

extraction_method : str

parser_version : str

start_offset : int | None = None

end_offset : int | None = None

warnings : list[str]
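The extraction_confidence and warnings fields lend themselves to a gate before downstream metrics. The following is a minimal sketch under stated assumptions: Section is a local stand-in with a subset of the fields above, usable_sections is a hypothetical helper, the 0.8 threshold is arbitrary, confidence is assumed to lie in [0, 1], and the warning string is made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Local stand-in with a subset of the ExtractedSection fields."""
    section_name: str
    cleaned_text: str
    extraction_confidence: float
    warnings: list[str] = field(default_factory=list)


def usable_sections(sections: list[Section], min_confidence: float = 0.8) -> list[Section]:
    """Keep sections at or above the threshold that carry no warnings."""
    return [
        s for s in sections
        if s.extraction_confidence >= min_confidence and not s.warnings
    ]


if __name__ == "__main__":
    sections = [
        Section("item_1a_risk_factors", "Risk text.", 0.95),
        # Invented warning text, for illustration only.
        Section("item_7_mdna", "MD&A text.", 0.55, ["possible table-of-contents match"]),
    ]
    print([s.section_name for s in usable_sections(sections)])
```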

class disclosure_alpha.section_extractor.ParserBlock(index: int, text: str, normalized_text: str, element_type: str, start_offset: int, end_offset: int, is_toc: bool, is_table: bool, is_title: bool)

Bases: object

index : int

text : str

normalized_text : str

element_type : str

start_offset : int

end_offset : int

is_toc : bool

is_table : bool

is_title : bool
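The boolean flags on ParserBlock suggest a simple way to separate body text from navigation and tabular content. This sketch assumes that use; it is not taken from the parser. ParserBlock here is a local stand-in mirroring the documented fields, body_blocks is a hypothetical helper, and the element_type values ("p", "table") are guesses.

```python
from dataclasses import dataclass


@dataclass
class ParserBlock:
    """Local stand-in mirroring the documented fields."""
    index: int
    text: str
    normalized_text: str
    element_type: str
    start_offset: int
    end_offset: int
    is_toc: bool
    is_table: bool
    is_title: bool


def body_blocks(blocks: list[ParserBlock]) -> list[ParserBlock]:
    """Drop table-of-contents and table blocks, returning the rest ordered by index."""
    kept = [b for b in blocks if not (b.is_toc or b.is_table)]
    return sorted(kept, key=lambda b: b.index)


if __name__ == "__main__":
    blocks = [
        ParserBlock(2, "Revenue table", "revenue table", "table", 40, 60, False, True, False),
        ParserBlock(0, "Item 1A ..... 12", "item 1a 12", "p", 0, 16, True, False, False),
        ParserBlock(1, "Item 1A. Risk Factors", "item 1a risk factors", "p", 17, 38, False, False, True),
    ]
    print([b.index for b in body_blocks(blocks)])
```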

class disclosure_alpha.section_extractor.HeadingCandidate(section_name: str, block_index: int, start_offset: int, end_offset: int, score: float, reasons: list[str] = <factory>, warnings: list[str] = <factory>)

Bases: object

section_name : str

block_index : int

start_offset : int

end_offset : int

score : float

reasons : list[str]

warnings : list[str]
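A heading such as "Item 1A" often appears more than once in a filing, for example in the table of contents and again in the body, so several candidates can share a section_name. One plausible way to resolve them, sketched here as an assumption rather than the extractor's actual logic, is to keep the highest-scoring candidate per name. HeadingCandidate is a local stand-in mirroring the documented fields, best_candidates is a hypothetical helper, and the reasons strings are invented.

```python
from dataclasses import dataclass, field


@dataclass
class HeadingCandidate:
    """Local stand-in mirroring the documented fields."""
    section_name: str
    block_index: int
    start_offset: int
    end_offset: int
    score: float
    reasons: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)


def best_candidates(candidates: list[HeadingCandidate]) -> dict[str, HeadingCandidate]:
    """Map each section name to its highest-scoring candidate (first one wins ties)."""
    best: dict[str, HeadingCandidate] = {}
    for c in candidates:
        current = best.get(c.section_name)
        if current is None or c.score > current.score:
            best[c.section_name] = c
    return best


if __name__ == "__main__":
    candidates = [
        HeadingCandidate("item_1a_risk_factors", 3, 120, 141, 0.30, ["toc entry"]),
        HeadingCandidate("item_1a_risk_factors", 57, 9800, 9821, 0.92, ["title block"]),
    ]
    print(best_candidates(candidates)["item_1a_risk_factors"].block_index)
```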

disclosure_alpha.section_extractor.required_sections_present(form_type: str, extracted: list[ExtractedSection]) → bool

disclosure_alpha.section_extractor.extract_sections(document: FilingDocument) → list[ExtractedSection]