In the evolving landscape of artificial intelligence, combining advanced techniques like Retrieval-Augmented Generation (RAG) and Named Entity Recognition (NER) has opened new avenues for extracting and structuring information from complex documents. This blog delves into the intricacies of building a 10-Q Analyzer—a tool I designed to process SEC 10-Q filings, summarize their content, and extract key financial metrics into a structured JSON format. We will explore the challenges of generating structured outputs compared to traditional Q&A or chat-based RAG systems, the synergy between RAG and NER, and provide a detailed walkthrough of the implementation code.
Github Repository for this project: https://github.com/wisemachine/pb-rag-10q-analyzer
Table of Contents
- Introduction
- Understanding RAG and NER
- Combining RAG with NER for Structured JSON Output
- Challenges of Structured JSON Output vs. Q&A or Chat RAG Systems
- Code Walkthrough
  - Environment Setup
  - Summarization Model Initialization
  - Vector Store Initialization
  - Processing the 10-Q PDF
  - Numeric Extraction with NER and Regex
  - Data Cleaning
  - Saving Output
  - Main Function
- Conclusion
Introduction
SEC 10-Q filings are comprehensive quarterly reports that publicly traded companies must submit, detailing their financial performance. These documents are rich in information but can be lengthy and complex, posing a challenge for retail investors seeking actionable insights. The 10-Q Analyzer project leverages the power of RAG and NER to automate the summarization of these reports and extract key financial metrics, presenting them in a structured JSON format. This approach not only saves time but also enhances the accuracy and accessibility of critical financial data.
Understanding RAG and NER
Retrieval-Augmented Generation (RAG)
RAG is a hybrid approach that combines retrieval-based methods with generative models to enhance the quality and relevance of generated content. In essence, RAG systems retrieve relevant documents or snippets from a large corpus and use this information to generate more accurate and contextually appropriate responses.
Key Components:
- Retriever: Fetches relevant documents from a knowledge base based on the input query.
- Generator: Generates responses by conditioning on both the input query and the retrieved documents.
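To make the retriever/generator split concrete, here is a minimal, hypothetical sketch using the same LangChain primitives that appear later in this post; vector_store (a Chroma store) and llm (a wrapped HuggingFace model) are assumed to be already initialized, and the query text is made up.

# Retriever: fetch the chunks most semantically similar to the query.
query = "What was net income for the quarter?"
docs = vector_store.similarity_search(query, k=3)
context = "\n\n".join(d.page_content for d in docs)

# Generator: condition the model on both the query and the retrieved context.
prompt = f"Using only the context below, answer the question.\n\nContext:\n{context}\n\nQuestion: {query}"
print(llm(prompt))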
Named Entity Recognition (NER)
NER is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into predefined categories such as person names, organizations, locations, monetary values, etc. In the context of financial documents, NER is pivotal for extracting quantitative metrics like revenue, net income, and cash flows.
Key Components:
- Entity Detection: Identifies spans of text that constitute entities.
- Entity Classification: Categorizes detected entities into predefined types.
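As a quick illustration of both steps, SpaCy detects entity spans and assigns each a label such as MONEY or ORG; the sentence below is invented for demonstration.

import spacy

# Assumes the model is installed: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

doc = nlp("Microsoft reported total revenue of $62.0 billion and net income of $21.9 billion.")

# Each detected span (entity detection) carries a predicted label (entity classification).
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output: Microsoft ORG, $62.0 billion MONEY, $21.9 billion MONEY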
Combining RAG with NER for Structured JSON Output
The integration of RAG and NER enables the system to not only generate summaries of the 10-Q filings but also extract and structure key financial metrics. Here’s how the combination works:
- Document Retrieval and Summarization (RAG):
  - Retrieval: The system retrieves relevant sections from the 10-Q PDF using a vector store (e.g., Chroma) that indexes document embeddings.
  - Generation: A generative model (e.g., Flan-T5) summarizes the retrieved content, producing a consolidated summary of the financial report.
- Entity Extraction (NER):
  - Detection and Classification: Using SpaCy’s NER capabilities, the system identifies monetary values and specific financial terms within the summary.
  - Regex Patterns: Complementing NER, regular expressions capture predefined financial metrics, ensuring higher precision in extraction.
- Structured Output:
  - The extracted entities and metrics are compiled into a structured JSON format, making them easy for retail investors to consume and analyze.
Challenges of Structured JSON Output vs. Q&A or Chat RAG Systems
While RAG systems excel in generating natural language responses for Q&A or chat interfaces, extending them to produce structured JSON outputs presents unique challenges:
- Precision vs. Flexibility:
  - Q&A Systems: Focus on generating coherent, contextually relevant text, with flexibility in phrasing.
  - Structured Output: Requires high precision in data extraction and strict adherence to a predefined schema, which limits flexibility in responses.
- Data Validation and Consistency:
  - Extracted data points must conform to expected formats (e.g., numeric values) and remain consistent across documents; a concrete validation sketch follows this list.
- Complexity of Extraction:
  - Financial documents use varied terminology and structures, making it difficult to write extraction rules that generalize well.
- Error Handling:
  - Structured systems must handle missing or malformed data robustly, which requires explicit error detection and correction mechanisms.
- Integration of Multiple Components:
  - Combining RAG with NER and regex-based extraction requires seamless integration so that the summarization and extraction steps complement each other without redundancy or conflict.
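To make the validation and error-handling burden concrete, a structured pipeline typically needs a guard like the hypothetical check below before emitting JSON; a chat-style system has no equivalent requirement.

import re

def validate_entry(entry: dict) -> bool:
    """Hypothetical schema guard for one extracted metric."""
    # Every entry must carry the keys the downstream consumer expects.
    if not {"label", "value", "source"}.issubset(entry):
        return False
    # Only the two known extraction sources are allowed.
    if entry["source"] not in {"regex", "ner"}:
        return False
    # The value must contain at least one parseable numeric substring.
    return re.search(r"\d[\d,\.]*", entry["value"]) is not None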
Code Walkthrough
Let’s explore the implementation of the 10-Q Analyzer project step by step. The code is organized to facilitate readability, maintainability, and scalability.
Environment Setup
import os
import json
import re
import spacy
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
from dotenv import load_dotenv
# Load environment variables
load_dotenv()
DB_FOLDER = "chroma_db"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
Explanation:
- Imports: The necessary libraries for NLP tasks, document loading, text splitting, vector storage, and model deployment are imported.
- Environment Variables: Using dotenv, environment variables are loaded to manage configuration securely.
- Tokenizer Parallelism: Setting TOKENIZERS_PARALLELISM to false prevents warnings and performance issues related to parallel tokenization.
Summarization Model Initialization
def initialize_summarizer_model():
print("Initializing summarization model...")
model_name = "google/flan-t5-base" # Alternatively, "google/flan-t5-large" for larger capacity
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
summarizer_pipeline = pipeline(
"summarization",
model=model,
tokenizer=tokenizer,
framework="pt",
max_length=150,
min_length=50
)
summarizer = HuggingFacePipeline(pipeline=summarizer_pipeline)
print("Summarization model loaded successfully!")
return summarizer
Explanation:
- Model Selection: Uses the flan-t5-base model from Google for summarization; the larger flan-t5-large can be substituted if computational resources permit.
- Pipeline Creation: Sets up a HuggingFace summarization pipeline with specified maximum and minimum summary lengths.
- Wrapper: Wraps the pipeline in HuggingFacePipeline from LangChain for seamless integration with other LangChain components.
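As a quick sanity check (not part of the project code), the wrapped summarizer can be called directly on a short passage, just as process_10q later calls it on each chunk; the sample text is invented.

summarizer = initialize_summarizer_model()

sample = (
    "Total revenue for the quarter was $62.0 billion, an increase of 18 percent year over year, "
    "driven primarily by growth in cloud services. Operating income was $27.0 billion, and the "
    "company returned $8.4 billion to shareholders through dividends and share repurchases."
)
print(summarizer(sample))  # prints a summary bounded by the min_length/max_length configured above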
Vector Store Initialization
def initialize_vector_store():
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
if not os.path.exists(DB_FOLDER):
os.makedirs(DB_FOLDER)
vector_store = Chroma(persist_directory=DB_FOLDER, embedding_function=embedding_model)
return vector_store, embedding_model
Explanation:
- Embedding Model: Uses all-MiniLM-L6-v2 from Sentence Transformers to generate embeddings that capture the semantic meaning of text.
- Chroma Vector Store: Initializes Chroma as the vector store to persist and manage document embeddings, enabling efficient retrieval in the RAG process (a small retrieval example follows below).
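Once the store has been populated (see add_documents_to_store later in the walkthrough), it can be queried for semantically similar chunks, which is the retrieval half of RAG. An illustrative snippet, with a made-up query:

vector_store, _ = initialize_vector_store()

# Retrieve the three chunks most similar to a cash-flow question.
hits = vector_store.similarity_search("cash flow from operating activities", k=3)
for doc in hits:
    print(doc.page_content[:200])  # preview each retrieved chunk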
Processing the 10-Q PDF
def process_10q(ticker, summarizer):
folder_path = os.path.join("10q_docs", ticker)
if not os.path.exists(folder_path):
raise FileNotFoundError(f"No folder found for ticker: {ticker}")
pdf_files = [f for f in os.listdir(folder_path) if f.endswith(".pdf")]
if not pdf_files:
raise FileNotFoundError(f"No PDF files found in folder: {folder_path}")
pdf_path = os.path.join(folder_path, pdf_files[0])
print(f"Processing PDF: {pdf_path}")
loader = PyPDFLoader(pdf_path)
documents = loader.load()
# Summarize with chunk approach
full_text = "\n".join([doc.page_content for doc in documents])
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = text_splitter.split_text(full_text)
print("Summarizing the full document in chunks...")
chunk_summaries = []
for i, chunk in enumerate(chunks):
print(f"Summarizing chunk {i + 1}/{len(chunks)}...")
try:
summary = summarizer(chunk)
if summary:
chunk_summaries.append(summary)
except Exception as e:
print(f"Failed to summarize chunk {i + 1}: {e}")
continue
consolidated_summary = " ".join(chunk_summaries) if chunk_summaries else ""
# Vector store (optional RAG usage)
vector_store, _ = initialize_vector_store()
add_documents_to_store(vector_store, documents)
return consolidated_summary, vector_store
Explanation:
- PDF Loading: Uses PyPDFLoader from LangChain to load the PDF document corresponding to the provided stock ticker.
- Text Splitting: Employs RecursiveCharacterTextSplitter to divide the full text into manageable chunks (500 characters with 100-character overlap) so that summarization stays within model input limits.
- Summarization: Iterates through each chunk, generating summaries with the initialized summarizer, and handles exceptions so a single failed chunk does not abort the run.
- Consolidated Summary: Combines the individual chunk summaries into a single, comprehensive summary of the entire 10-Q report.
- Vector Store Population: Initializes the vector store and adds the document text through add_documents_to_store, enabling retrieval-based augmentation if needed (a possible implementation is sketched below).
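The helper add_documents_to_store is not reproduced in this post. Under the assumption that it simply chunks the loaded pages and indexes them in Chroma, a minimal version might look like this:

def add_documents_to_store(vector_store, documents):
    """Hypothetical helper: split the loaded pages and index the chunks in Chroma."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    chunks = splitter.split_documents(documents)
    vector_store.add_documents(chunks)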
Numeric Extraction with NER and Regex
Loading the SpaCy Model
def load_spacy_model():
print("Loading spaCy model for NER...")
try:
nlp = spacy.load("en_core_web_lg")
except OSError:
print("SpaCy model 'en_core_web_lg' not found. Install via: `python -m spacy download en_core_web_lg`.")
raise
print("spaCy model loaded.")
return nlp
Explanation:
- Model Loading: Attempts to load the large English SpaCy model (en_core_web_lg) for robust NER, and prints installation instructions if the model is not found.
Extracting Numerics
def extract_flexible_numerics(summary_text, nlp):
"""
Extract numeric references from summary_text using:
- Regex patterns for certain financial keywords
- spaCy NER for MONEY entities
Store them in a list for a more flexible result
"""
doc = nlp(summary_text)
extracted_numerics = []
# 1) Regex patterns for certain financial metrics
regex_patterns = {
"Regex_Revenue": r"(?:revenue(?:s)?|total\s+revenue)\s?(?:of|was|=)?\s?\$?([\d,\.]+)",
"Regex_OperatingIncome": r"(?:operating\s+income|income\s+from\s+operations)\s?(?:of|was|=)?\s?\$?([\d,\.]+)",
"Regex_NetIncome": r"(?:net\s+income|net\s+earnings)\s?(?:of|was|=)?\s?\$?([\d,\.]+)",
"Regex_CostOfRevenue": r"(?:cost\s+of\s+revenue(?:s)?|cost\s+of\s+sales)\s?(?:of|was|=)?\s?\$?([\d,\.]+)",
"Regex_TotalAssets": r"(?:total\s+assets)\s?(?:of|were|=)?\s?\$?([\d,\.]+)",
"Regex_TotalLiabilities": r"(?:total\s+liabilities)\s?(?:of|were|=)?\s?\$?([\d,\.]+)",
"Regex_StockholdersEquity": r"(?:stockholders'?(\s+)?equity|shareholders'?(\s+)?equity)\s?(?:of|was|=)?\s?\$?([\d,\.]+)",
"Regex_CF_Operating": r"(?:cash\s+flow\s+from\s+operating\s+activities)\s?(?:of|=)?\s?\$?([\d,\.]+)",
"Regex_CF_Financing": r"(?:cash\s+flow\s+from\s+financing\s+activities)\s?(?:of|=)?\s?\$?([\d,\.]+)",
"Regex_CF_Investing": r"(?:cash\s+flow\s+from\s+investing\s+activities)\s?(?:of|=)?\s?\$?([\d,\.]+)",
}
for label, pattern in regex_patterns.items():
matches = re.findall(pattern, summary_text, flags=re.IGNORECASE)
for m in matches:
numeric_val = m.replace(",", "")
extracted_numerics.append({
"label": label,
"value": numeric_val,
"source": "regex"
})
# 2) spaCy NER - record all money-like entities
for ent in doc.ents:
if ent.label_ == "MONEY":
ent_text = ent.text.strip().replace(",", "")
extracted_numerics.append({
"label": "MONEY_NER",
"value": ent_text,
"source": "ner",
"context": ent.sent.text.strip() # optional context
})
return {"extracted_numerics": extracted_numerics}
Explanation:
- Regex Patterns: Defines a set of regular expressions targeting specific financial metrics such as revenue, operating income, net income, etc. These patterns capture numerical values associated with these metrics.
- Regex Extraction: Iterates through each pattern, finds all matches in the summary text, cleans the numeric values by removing commas, and appends them to the extracted_numerics list with the appropriate label.
- NER Extraction: Uses SpaCy’s NER to identify all entities labeled MONEY. These are also cleaned and appended to the extracted_numerics list, along with the containing sentence as context for later reference.
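For a summary sentence such as “Net income was $24.3 billion.” (invented for illustration), the function would return something along these lines; the exact NER span can vary by spaCy version:

{
    "extracted_numerics": [
        {"label": "Regex_NetIncome", "value": "24.3", "source": "regex"},
        {
            "label": "MONEY_NER",
            "value": "$24.3 billion",
            "source": "ner",
            "context": "Net income was $24.3 billion."
        }
    ]
}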
Data Cleaning
def clean_numeric_entries(extracted_dict):
"""
Takes a dict with key 'extracted_numerics' (a list of label/value dicts).
- Remove empty or '.' values
- Extract first numeric substring if multiple
- Deduplicate (label, numeric_str)
- Convert to float then back to string (strip .0 if integer)
"""
numerics = extracted_dict.get("extracted_numerics", [])
cleaned = []
seen = set() # track (label, numeric_str)
for entry in numerics:
raw_val = entry.get("value", "").strip()
label = entry.get("label", "").strip()
source = entry.get("source", "")
context = entry.get("context", "")
# Ignore trivial placeholders
if not raw_val or raw_val in {".", "$", "$ ", " "}:
continue
# Find the *first* numeric substring
match = re.search(r"[\d\.]+", raw_val)
if not match:
continue # skip if no numeric substring
numeric_str = match.group(0) # e.g. '24667.00'
# Attempt float parse
try:
fval = float(numeric_str)
except ValueError:
continue # skip if parse fails
# Convert back to string, removing trailing .0 if integral
if fval.is_integer():
numeric_str = str(int(fval))
else:
numeric_str = str(fval)
# Deduplicate
key = (label, numeric_str)
if key in seen:
continue
seen.add(key)
cleaned.append({
"label": label,
"value": numeric_str,
"source": source,
"context": context
})
return {"extracted_numerics": cleaned}
Explanation:
- Trivial Placeholder Removal: Filters out entries with empty values or placeholders like ‘.’, ‘$’, etc.
- Numeric Extraction: Ensures only the first numeric substring is considered, handling cases where multiple numbers might be present.
- Type Conversion: Converts numeric strings to floats to facilitate validation and then back to strings, removing unnecessary decimal points for integer values.
- Deduplication: Eliminates duplicate entries based on the combination of label and numeric value to ensure the uniqueness of data points.
- Contextual Information: Retains the context sentence for NER-extracted values to provide additional reference, which can be useful for further analysis or verification.
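A small, invented example shows the effect of the cleaning pass:

raw = {
    "extracted_numerics": [
        {"label": "Regex_Revenue", "value": "24667.00", "source": "regex"},
        {"label": "MONEY_NER", "value": "$", "source": "ner"},              # dropped: placeholder
        {"label": "Regex_Revenue", "value": "24667", "source": "regex"},    # dropped: duplicate once cleaned
        {"label": "MONEY_NER", "value": "$24.3 billion", "source": "ner",
         "context": "Net income was $24.3 billion."},
    ]
}

print(clean_numeric_entries(raw))
# {'extracted_numerics': [
#     {'label': 'Regex_Revenue', 'value': '24667', 'source': 'regex', 'context': ''},
#     {'label': 'MONEY_NER', 'value': '24.3', 'source': 'ner',
#      'context': 'Net income was $24.3 billion.'}]}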
Saving Output
def save_output_flexible(ticker, consolidated_summary, extracted_dict):
summary_file = f"{ticker}_10q_summary.json"
numerics_file = f"{ticker}_10q_flex_numerics.json"
with open(summary_file, "w") as f:
json.dump({"ConsolidatedSummary": consolidated_summary}, f, indent=4)
print(f"Summary saved to {summary_file}")
with open(numerics_file, "w") as f:
json.dump(extracted_dict, f, indent=4)
print(f"Extracted numerics saved to {numerics_file}")
Explanation:
- JSON Output: Saves the consolidated summary and the cleaned numeric insights into separate JSON files named based on the stock ticker.
- Structured Storage: Ensures that the outputs are stored in a structured format, facilitating easy access, analysis, and integration with other systems or tools.
Main Function
def main():
ticker = input("Enter the stock ticker symbol (e.g., MSFT): ").strip().upper()
# 1) Initialize summarizer
summarizer = initialize_summarizer_model()
# 2) Summarize 10-Q
consolidated_summary, vector_store = process_10q(ticker, summarizer)
# 3) Load spaCy & do flexible numeric extraction
nlp = load_spacy_model()
extracted_dict = extract_flexible_numerics(consolidated_summary, nlp)
# 4) Clean the extracted numerics
cleaned_dict = clean_numeric_entries(extracted_dict)
# 5) Save
save_output_flexible(ticker, consolidated_summary, cleaned_dict)
if __name__ == "__main__":
main()
Explanation:
- User Input: Prompts the user to enter a stock ticker symbol, ensuring that the input is standardized by stripping whitespace and converting to uppercase.
- Pipeline Execution: Sequentially executes the steps—initializing the summarizer, processing the 10-Q PDF, extracting numerics using NER and regex, cleaning the extracted data, and saving the outputs.
- Orchestration: The main function orchestrates the entire workflow, ensuring each component runs in the correct sequence to produce the desired outputs.
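For the script to find the filing, process_10q expects the PDF under a 10q_docs/<TICKER> folder; for a ticker like MSFT the project layout would look roughly like this (the PDF file name is illustrative):

pb-rag-10q-analyzer/
├── 10q_docs/
│   └── MSFT/
│       └── msft_q2_fy25_10q.pdf
├── chroma_db/                      # created automatically on first run
├── MSFT_10q_summary.json           # written by save_output_flexible
└── MSFT_10q_flex_numerics.json     # written by save_output_flexible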
Conclusion
Building a RAG and NER-powered system for extracting structured financial data from complex documents like SEC 10-Q filings presents both significant opportunities and challenges. While RAG excels in summarizing and retrieving relevant information, integrating it with NER and regex-based extraction enables the generation of precise, structured JSON outputs that are invaluable for retail investors. This combination enhances data accessibility, accuracy, and usability, empowering investors to make informed decisions based on reliable and actionable insights.
The 10-Q Analyzer project exemplifies the effective integration of these technologies, demonstrating how AI can transform the way we interact with and interpret financial data. By addressing the challenges inherent in structured data extraction and leveraging the strengths of both RAG and NER, such systems can serve as powerful tools in the arsenal of investors, analysts, and financial professionals.
For those looking to implement similar systems, this walkthrough provides a comprehensive guide to understanding the underlying concepts and practical steps involved. As AI technologies continue to advance, the potential applications and efficiencies achievable through such integrations will only grow, paving the way for more sophisticated and impactful solutions across various domains.
Additional Resources
- LangChain Documentation
- HuggingFace Transformers
- SpaCy Documentation
- Chroma Vector Store
- Google Flan-T5 Model
By meticulously integrating RAG and NER, the 10-Q Analyzer not only automates the extraction of critical financial data but also ensures that the outputs are structured and reliable. This blend of technologies serves as a blueprint for developing sophisticated AI-driven data extraction systems tailored to specific domain needs.