section_extractor
Use when: You have raw SEC filing HTML and need structured sections (Item 1A, MD&A, controls) before metrics or scoring.
Start here
- extract_sections(): parse a FilingDocument into a list of ExtractedSection objects
- FilingDocument: wrapper for HTML + form type + metadata
- ExtractedSection: section name, cleaned text, confidence, warnings
- required_sections_present(): check whether the required 10-K / 10-Q sections were extracted
For a one-liner from an HTML string, use extract_sections_from_html() in pipeline.
Required section names for full coverage: item_1a_risk_factors + item_7_mdna (10-K) or item_2_mdna (10-Q). See Section Taxonomy.
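The coverage rule above can be sketched as a lookup from form type to required section names. This is only an illustration, not the library's implementation: `REQUIRED_BY_FORM` and `missing_sections` are hypothetical names, and the mapping assumes that item_1a_risk_factors is required for both form types, which is one reading of the sentence above.

```python
# Hypothetical sketch of the coverage rule; not the library's own code.
# Assumption: risk factors are required for both forms, and the MD&A item
# is item_7_mdna for a 10-K and item_2_mdna for a 10-Q.
REQUIRED_BY_FORM = {
    "10-K": {"item_1a_risk_factors", "item_7_mdna"},
    "10-Q": {"item_1a_risk_factors", "item_2_mdna"},
}


def missing_sections(form_type: str, section_names: list[str]) -> set[str]:
    """Return the required section names that are absent for this form type."""
    required = REQUIRED_BY_FORM.get(form_type, set())
    return required - set(section_names)


# A 10-K where only the risk factors section was extracted.
print(sorted(missing_sections("10-K", ["item_1a_risk_factors"])))
```

Listing the missing names, rather than returning a single boolean, makes it easier to see which section failed to extract.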
Example
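from disclosure_alpha import FilingDocument, extract_sections, required_sections_present
# FilingDocument declares four required fields; the cik and accession_number values are placeholders.
with open("filing.html") as f:
    doc = FilingDocument(cik="0000000000", accession_number="0000000000-00-000000", form_type="10-K", html=f.read())
sections = extract_sections(doc)
print([s.section_name for s in sections])
print(required_sections_present("10-K", sections))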
Full API
- class disclosure_alpha.section_extractor.FilingDocument(cik: str, accession_number: str, form_type: str, html: str)
  Bases: object
  - cik : str
  - accession_number : str
  - form_type : str
  - html : str
- class disclosure_alpha.section_extractor.ExtractedSection(section_name: str, raw_text: str, cleaned_text: str, text_hash: str, word_count: int, sentence_count: int, extraction_confidence: float, extraction_method: str, parser_version: str, start_offset: int | None = None, end_offset: int | None = None, warnings: list[str] = <factory>)
  Bases: object
  - section_name : str
  - raw_text : str
  - cleaned_text : str
  - text_hash : str
  - word_count : int
  - sentence_count : int
  - extraction_confidence : float
  - extraction_method : str
  - parser_version : str
  - start_offset : int | None = None
  - end_offset : int | None = None
  - warnings : list[str]
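Because each section carries an extraction_confidence and a warnings list, a caller may want to screen sections before passing them on to metrics or scoring. The sketch below uses a stand-in dataclass with a subset of the fields listed above so that it runs without the library. `SectionStub`, `usable_sections`, the 0.8 threshold, and the sample warning text are all hypothetical, and the sketch assumes that confidence lies in the range 0 to 1.

```python
from dataclasses import dataclass, field


@dataclass
class SectionStub:
    """Stand-in with a subset of the ExtractedSection fields (hypothetical)."""
    section_name: str
    cleaned_text: str
    extraction_confidence: float
    warnings: list[str] = field(default_factory=list)


def usable_sections(sections, min_confidence=0.8):
    """Keep sections at or above the threshold that carry no warnings.

    The 0.8 default is an arbitrary illustrative value, not a library setting.
    """
    return [
        s for s in sections
        if s.extraction_confidence >= min_confidence and not s.warnings
    ]


sections = [
    SectionStub("item_1a_risk_factors", "Risk text", 0.95),
    # Illustrative warning string; real warning messages may differ.
    SectionStub("item_7_mdna", "MD&A text", 0.55, ["example warning"]),
]
print([s.section_name for s in usable_sections(sections)])
```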
- class disclosure_alpha.section_extractor.ParserBlock(index: int, text: str, normalized_text: str, element_type: str, start_offset: int, end_offset: int, is_toc: bool, is_table: bool, is_title: bool)
  Bases: object
  - index : int
  - text : str
  - normalized_text : str
  - element_type : str
  - start_offset : int
  - end_offset : int
  - is_toc : bool
  - is_table : bool
  - is_title : bool
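The is_toc and is_table flags suggest that blocks can be screened before their text is joined into a section body. The following is a rough sketch of that idea using a stand-in dataclass. `BlockStub` and `body_text` are hypothetical names, and how the library itself uses these flags is not documented here.

```python
from dataclasses import dataclass


@dataclass
class BlockStub:
    """Stand-in with a subset of the ParserBlock fields (hypothetical)."""
    index: int
    text: str
    is_toc: bool = False
    is_table: bool = False
    is_title: bool = False


def body_text(blocks):
    """Join block text in index order, skipping table-of-contents and table blocks."""
    ordered = sorted(blocks, key=lambda b: b.index)
    return "\n".join(b.text for b in ordered if not (b.is_toc or b.is_table))


blocks = [
    BlockStub(2, "Our business faces several risks."),
    BlockStub(0, "Item 1A. Risk Factors .... 12", is_toc=True),
    BlockStub(1, "Item 1A. Risk Factors", is_title=True),
    BlockStub(3, "Year | Revenue", is_table=True),
]
print(body_text(blocks))
```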
- class disclosure_alpha.section_extractor.HeadingCandidate(section_name: str, block_index: int, start_offset: int, end_offset: int, score: float, reasons: list[str] = <factory>, warnings: list[str] = <factory>)
  Bases: object
  - section_name : str
  - block_index : int
  - start_offset : int
  - end_offset : int
  - score : float
  - reasons : list[str]
  - warnings : list[str]
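Since a filing may yield several heading candidates for the same section (for example, one in the table of contents and one in the body), one plausible way to use the score field is to keep the highest-scoring candidate per section name. The sketch below assumes that a higher score means a better match, which this page does not state. `CandidateStub` and `best_candidates` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CandidateStub:
    """Stand-in with a subset of the HeadingCandidate fields (hypothetical)."""
    section_name: str
    block_index: int
    score: float


def best_candidates(candidates):
    """Return a dict mapping each section name to its highest-scoring candidate.

    Assumes higher score is better; on ties, the earliest candidate is kept.
    """
    best = {}
    for c in candidates:
        current = best.get(c.section_name)
        if current is None or c.score > current.score:
            best[c.section_name] = c
    return best


candidates = [
    CandidateStub("item_1a_risk_factors", block_index=3, score=0.4),
    CandidateStub("item_1a_risk_factors", block_index=41, score=0.9),
    CandidateStub("item_7_mdna", block_index=120, score=0.7),
]
chosen = best_candidates(candidates)
print({name: c.block_index for name, c in chosen.items()})
```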
- disclosure_alpha.section_extractor.required_sections_present(form_type: str, extracted: list[ExtractedSection]) -> bool
- disclosure_alpha.section_extractor.extract_sections(document: FilingDocument) -> list[ExtractedSection]