section_extractor

Use when: You have raw SEC filing HTML and need structured sections (Item 1A, MD&A, controls) before metrics or scoring.

Start here

  • extract_sections() — parse a FilingDocument into ExtractedSection list

  • FilingDocument — wrapper for HTML + form type + metadata

  • ExtractedSection — section name, cleaned text, confidence, warnings

  • required_sections_present() — check whether the sections required for a 10-K or 10-Q were extracted

To go from an HTML string to extracted sections in a single call, use extract_sections_from_html() in the pipeline module.

Required section names for full coverage: item_1a_risk_factors + item_7_mdna (10-K) or item_2_mdna (10-Q). See Section Taxonomy.
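
The coverage rule above can be sketched as a set comparison. This is only an illustration, not the library's implementation: REQUIRED_SECTIONS and missing_required() are hypothetical names, and the mapping assumes item_1a_risk_factors is required for both form types, which is one reading of the sentence above.

```python
from __future__ import annotations

# Hypothetical mapping inferred from the sentence above; not part of disclosure_alpha.
# Assumption: item_1a_risk_factors is required for both 10-K and 10-Q.
REQUIRED_SECTIONS = {
    "10-K": {"item_1a_risk_factors", "item_7_mdna"},
    "10-Q": {"item_1a_risk_factors", "item_2_mdna"},
}


def missing_required(form_type: str, section_names: list[str]) -> set[str]:
    """Return the required section names that are absent from section_names.

    Form types not in the mapping have no known requirements here,
    so they yield an empty set.
    """
    required = REQUIRED_SECTIONS.get(form_type, set())
    return required - set(section_names)
```

Returning the missing names rather than a bare boolean makes it easier to log which section failed to extract; required_sections_present() itself returns only a bool.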

Example

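from disclosure_alpha import FilingDocument, extract_sections, required_sections_present

# Read the filing HTML; the context manager closes the file handle.
with open("filing.html") as f:
    html = f.read()

# cik and accession_number are required fields; the values below are placeholders.
doc = FilingDocument(cik="0000000000", accession_number="0000000000-00-000000", form_type="10-K", html=html)
sections = extract_sections(doc)
print([s.section_name for s in sections])  # names of the extracted sections
print(required_sections_present("10-K", sections))  # bool: required 10-K sections found?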
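
Before passing sections on to metrics or scoring, it may be worth dropping low-quality extractions. The sketch below uses a stand-in dataclass that mirrors only a few ExtractedSection fields so it runs without the library; SectionStub, usable_sections(), and the 0.8 threshold are hypothetical and chosen purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SectionStub:
    """Stand-in mirroring a subset of ExtractedSection fields (illustration only)."""

    section_name: str
    cleaned_text: str
    extraction_confidence: float
    warnings: list[str] = field(default_factory=list)


def usable_sections(sections: list[SectionStub], min_confidence: float = 0.8) -> list[SectionStub]:
    """Keep sections at or above min_confidence that carry no warnings.

    The default threshold is an arbitrary example value, not a library default.
    """
    return [
        s
        for s in sections
        if s.extraction_confidence >= min_confidence and not s.warnings
    ]
```

The same filter should work on real ExtractedSection objects, since it only reads extraction_confidence and warnings.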

Full API

class disclosure_alpha.section_extractor.FilingDocument(cik: str, accession_number: str, form_type: str, html: str)

Bases: object

cik : str
accession_number : str
form_type : str
html : str

class disclosure_alpha.section_extractor.ExtractedSection(section_name: str, raw_text: str, cleaned_text: str, text_hash: str, word_count: int, sentence_count: int, extraction_confidence: float, extraction_method: str, parser_version: str, start_offset: int | None = None, end_offset: int | None = None, warnings: list[str] = <factory>)

Bases: object

section_name : str
raw_text : str
cleaned_text : str
text_hash : str
word_count : int
sentence_count : int
extraction_confidence : float
extraction_method : str
parser_version : str
start_offset : int | None = None
end_offset : int | None = None
warnings : list[str]

class disclosure_alpha.section_extractor.ParserBlock(index: int, text: str, normalized_text: str, element_type: str, start_offset: int, end_offset: int, is_toc: bool, is_table: bool, is_title: bool)

Bases: object

index : int
text : str
normalized_text : str
element_type : str
start_offset : int
end_offset : int
is_toc : bool
is_table : bool
is_title : bool

class disclosure_alpha.section_extractor.HeadingCandidate(section_name: str, block_index: int, start_offset: int, end_offset: int, score: float, reasons: list[str] = <factory>, warnings: list[str] = <factory>)

Bases: object

section_name : str
block_index : int
start_offset : int
end_offset : int
score : float
reasons : list[str]
warnings : list[str]
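
How the extractor chooses among competing HeadingCandidate objects is not documented here. One plausible reading of the score field is that the highest-scoring candidate per section_name wins; the sketch below illustrates that idea with a stand-in dataclass. CandidateStub and best_candidates() are hypothetical names, not part of disclosure_alpha.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CandidateStub:
    """Stand-in mirroring a subset of HeadingCandidate fields (illustration only)."""

    section_name: str
    block_index: int
    score: float


def best_candidates(candidates: list[CandidateStub]) -> dict[str, CandidateStub]:
    """Return the highest-scoring candidate for each section_name.

    Ties keep the candidate seen first, which is an arbitrary choice
    made for this sketch.
    """
    best: dict[str, CandidateStub] = {}
    for cand in candidates:
        current = best.get(cand.section_name)
        if current is None or cand.score > current.score:
            best[cand.section_name] = cand
    return best
```

A table-of-contents entry and the real heading often match the same section name, so a per-section maximum is a natural way to pick between them, assuming the score penalizes table-of-contents blocks.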

disclosure_alpha.section_extractor.required_sections_present(form_type: str, extracted: list[ExtractedSection]) -> bool

disclosure_alpha.section_extractor.extract_sections(document: FilingDocument) -> list[ExtractedSection]