In the evolving landscape of artificial intelligence, combining advanced techniques like Retrieval-Augmented Generation (RAG) and Named Entity Recognition (NER) has opened new avenues for extracting and structuring information from complex documents. This blog delves into the intricacies of building a 10-Q Analyzer—a tool I designed to process SEC 10-Q filings, summarize their content, and extract key financial metrics into a structured JSON format. We will explore the challenges of generating structured outputs compared to traditional Q&A or chat-based RAG systems, the synergy between RAG and NER, and provide a detailed walkthrough of the implementation code.
Github Repository for this project: https://github.com/wisemachine/pb-rag-10q-analyzer
Table of Contents
- Introduction
- Understanding RAG and NER
- Combining RAG with NER for Structured JSON Output
- Challenges of Structured JSON Output vs. Q&A or Chat RAG Systems
- Code Walkthrough
  - Environment Setup
  - Summarization Model Initialization
  - Vector Store Initialization
  - Processing the 10-Q PDF
  - Numeric Extraction with NER and Regex
  - Data Cleaning
  - Saving Output
  - Main Function
- Conclusion
Introduction
SEC 10-Q filings are comprehensive quarterly reports that publicly traded companies must submit, detailing their financial performance. These documents are rich in information but can be lengthy and complex, posing a challenge for retail investors seeking actionable insights. The 10-Q Analyzer project leverages the power of RAG and NER to automate the summarization of these reports and extract key financial metrics, presenting them in a structured JSON format. This approach not only saves time but also enhances the accuracy and accessibility of critical financial data.
Understanding RAG and NER
Retrieval-Augmented Generation (RAG)
RAG is a hybrid approach that combines retrieval-based methods with generative models to enhance the quality and relevance of generated content. In essence, RAG systems retrieve relevant documents or snippets from a large corpus and use this information to generate more accurate and contextually appropriate responses.
Key Components:
- Retriever: Fetches relevant documents from a knowledge base based on the input query.
- Generator: Generates responses by conditioning on both the input query and the retrieved documents.
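To make the retriever/generator split concrete, here is a minimal, hypothetical sketch using the same LangChain primitives that appear later in this post; vector_store (a Chroma store) and llm (a wrapped HuggingFace model) are assumed to be already initialized, and the query text is made up.

# Retriever: fetch the chunks most semantically similar to the query.
query = "What was net income for the quarter?"
docs = vector_store.similarity_search(query, k=3)
context = "\n\n".join(d.page_content for d in docs)

# Generator: condition the model on both the query and the retrieved context.
prompt = f"Using only the context below, answer the question.\n\nContext:\n{context}\n\nQuestion: {query}"
print(llm(prompt))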
Named Entity Recognition (NER)
NER is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into predefined categories such as person names, organizations, locations, monetary values, etc. In the context of financial documents, NER is pivotal for extracting quantitative metrics like revenue, net income, and cash flows.
Key Components:
- Entity Detection: Identifies spans of text that constitute entities.
- Entity Classification: Categorizes detected entities into predefined types.
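As a quick illustration of both steps, SpaCy detects entity spans and assigns each a label such as MONEY or ORG; the sentence below is invented for demonstration.

import spacy

# Assumes the model is installed: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

doc = nlp("Microsoft reported total revenue of $62.0 billion and net income of $21.9 billion.")

# Each detected span (entity detection) carries a predicted label (entity classification).
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output: Microsoft ORG, $62.0 billion MONEY, $21.9 billion MONEY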
Combining RAG with NER for Structured JSON Output
The integration of RAG and NER enables the system to not only generate summaries of the 10-Q filings but also extract and structure key financial metrics. Here’s how the combination works:
- Document Retrieval and Summarization (RAG):
  - Retrieval: The system retrieves relevant sections from the 10-Q PDF using a vector store (e.g., Chroma) that indexes document embeddings.
  - Generation: A generative model (e.g., Flan-T5) summarizes the retrieved content, producing a consolidated summary of the financial report.
- Entity Extraction (NER):
  - Detection and Classification: Using SpaCy’s NER capabilities, the system identifies monetary values and specific financial terms within the summary.
  - Regex Patterns: Complementing NER, regular expressions capture predefined financial metrics, ensuring higher precision in extraction.
- Structured Output:
  - The extracted entities and metrics are compiled into a structured JSON format, making them easy for retail investors to consume and analyze.
Challenges of Structured JSON Output vs. Q&A or Chat RAG Systems
While RAG systems excel in generating natural language responses for Q&A or chat interfaces, extending them to produce structured JSON outputs presents unique challenges:
- Precision vs. Flexibility:
  - Q&A Systems: Focus on generating coherent, contextually relevant text, with flexibility in phrasing.
  - Structured Output: Requires high precision in data extraction and strict adherence to a predefined schema, which limits flexibility in responses.
- Data Validation and Consistency:
  - Extracted data points must conform to expected formats (e.g., numeric values) and remain consistent across documents; a concrete validation sketch follows this list.
- Complexity of Extraction:
  - Financial documents use varied terminology and structures, making it difficult to write extraction rules that generalize well.
- Error Handling:
  - Structured systems must handle missing or malformed data robustly, which requires explicit error detection and correction mechanisms.
- Integration of Multiple Components:
  - Combining RAG with NER and regex-based extraction requires seamless integration so that the summarization and extraction steps complement each other without redundancy or conflict.
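To make the validation and error-handling burden concrete, a structured pipeline typically needs a guard like the hypothetical check below before emitting JSON; a chat-style system has no equivalent requirement.

import re

def validate_entry(entry: dict) -> bool:
    """Hypothetical schema guard for one extracted metric."""
    # Every entry must carry the keys the downstream consumer expects.
    if not {"label", "value", "source"}.issubset(entry):
        return False
    # Only the two known extraction sources are allowed.
    if entry["source"] not in {"regex", "ner"}:
        return False
    # The value must contain at least one parseable numeric substring.
    return re.search(r"\d[\d,\.]*", entry["value"]) is not None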
Code Walkthrough
Let’s explore the implementation of the 10-Q Analyzer project step by step. The code is organized to facilitate readability, maintainability, and scalability.
Environment Setup
import os
import json
import re
import spacy
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
from dotenv import load_dotenv
# Load environment variables
load_dotenv()
DB_FOLDER = "chroma_db"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
Explanation:
- Imports: The necessary libraries for NLP tasks, document loading, text splitting, vector storage, and model deployment are imported.
- Environment Variables: Using dotenv, environment variables are loaded to manage configuration securely.
- Tokenizer Parallelism: Setting TOKENIZERS_PARALLELISM to false prevents warnings and performance issues related to parallel tokenization.
Summarization Model Initialization
def initialize_summarizer_model():
print("Initializing summarization model...")
model_name = "google/flan-t5-base" # Alternatively, "google/flan-t5-large" for larger capacity
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
summarizer_pipeline = pipeline(
"summarization",
model=model,
tokenizer=tokenizer,
framework="pt",
max_length=150,
min_length=50
)
summarizer = HuggingFacePipeline(pipeline=summarizer_pipeline)
print("Summarization model loaded successfully!")
return summarizer
Explanation:
- Model Selection: Uses the flan-t5-base model from Google for summarization; the larger flan-t5-large can be substituted if computational resources permit.
- Pipeline Creation: Sets up a HuggingFace summarization pipeline with specified maximum and minimum summary lengths.
- Wrapper: Wraps the pipeline in HuggingFacePipeline from LangChain for seamless integration with other LangChain components.
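As a quick sanity check (not part of the project code), the wrapped summarizer can be called directly on a short passage, just as process_10q later calls it on each chunk; the sample text is invented.

summarizer = initialize_summarizer_model()

sample = (
    "Total revenue for the quarter was $62.0 billion, an increase of 18 percent year over year, "
    "driven primarily by growth in cloud services. Operating income was $27.0 billion, and the "
    "company returned $8.4 billion to shareholders through dividends and share repurchases."
)
print(summarizer(sample))  # prints a summary bounded by the min_length/max_length configured above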
Vector Store Initialization
def initialize_vector_store():
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
if not os.path.exists(DB_FOLDER):
os.makedirs(DB_FOLDER)
vector_store = Chroma(persist_directory=DB_FOLDER, embedding_function=embedding_model)
return vector_store, embedding_model
Explanation:
- Embedding Model: Uses all-MiniLM-L6-v2 from Sentence Transformers to generate embeddings that capture the semantic meaning of text.
- Chroma Vector Store: Initializes Chroma as the vector store to persist and manage document embeddings, enabling efficient retrieval in the RAG process (a small retrieval example follows below).
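Once the store has been populated (see add_documents_to_store later in the walkthrough), it can be queried for semantically similar chunks, which is the retrieval half of RAG. An illustrative snippet, with a made-up query:

vector_store, _ = initialize_vector_store()

# Retrieve the three chunks most similar to a cash-flow question.
hits = vector_store.similarity_search("cash flow from operating activities", k=3)
for doc in hits:
    print(doc.page_content[:200])  # preview each retrieved chunk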
Processing the 10-Q PDF
def process_10q(ticker, summarizer):
folder_path = os.path.join("10q_docs", ticker)
if not os.path.exists(folder_path):
raise FileNotFoundError(f"No folder found for ticker: {ticker}")
pdf_files = [f for f in os.listdir(folder_path) if f.endswith(".pdf")]
if not pdf_files:
raise FileNotFoundError(f"No PDF files found in folder: {folder_path}")
pdf_path = os.path.join(folder_path, pdf_files[0])
print(f"Processing PDF: {pdf_path}")
loader = PyPDFLoader(pdf_path)
documents = loader.load()
# Summarize with chunk approach
full_text = "\n".join([doc.page_content for doc in documents])
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = text_splitter.split_text(full_text)
print("Summarizing the full document in chunks...")
chunk_summaries = []
for i, chunk in enumerate(chunks):
print(f"Summarizing chunk {i + 1}/{len(chunks)}...")
try:
summary = summarizer(chunk)
if summary:
chunk_summaries.append(summary)
except Exception as e:
print(f"Failed to summarize chunk {i + 1}: {e}")
continue
consolidated_summary = " ".join(chunk_summaries) if chunk_summaries else ""
# Vector store (optional RAG usage)
vector_store, _ = initialize_vector_store()
add_documents_to_store(vector_store, documents)
return consolidated_summary, vector_store
Explanation:
- PDF Loading: Uses PyPDFLoader from LangChain to load the PDF document corresponding to the provided stock ticker.
- Text Splitting: Employs RecursiveCharacterTextSplitter to divide the full text into manageable chunks (500 characters with 100-character overlap) so that summarization stays within model input limits.
- Summarization: Iterates through each chunk, generating summaries with the initialized summarizer, and handles exceptions so a single failed chunk does not abort the run.
- Consolidated Summary: Combines the individual chunk summaries into a single, comprehensive summary of the entire 10-Q report.
- Vector Store Population: Initializes the vector store and adds the document text through add_documents_to_store, enabling retrieval-based augmentation if needed (a possible implementation is sketched below).
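The helper add_documents_to_store is not reproduced in this post. Under the assumption that it simply chunks the loaded pages and indexes them in Chroma, a minimal version might look like this:

def add_documents_to_store(vector_store, documents):
    """Hypothetical helper: split the loaded pages and index the chunks in Chroma."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    chunks = splitter.split_documents(documents)
    vector_store.add_documents(chunks)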
Numeric Extraction with NER and Regex
Loading the SpaCy Model
def load_spacy_model():
print("Loading spaCy model for NER...")
try:
nlp = spacy.load("en_core_web_lg")
except OSError:
print("SpaCy model 'en_core_web_lg' not found. Install via: `python -m spacy download en_core_web_lg`.")
raise
print("spaCy model loaded.")
return nlp
Explanation:
- Model Loading: Attempts to load the large English SpaCy model (en_core_web_lg) for robust NER, and prints installation instructions if the model is not found.
Extracting Numerics
def extract_flexible_numerics(summary_text, nlp):
"""
Extract numeric references from summary_text using:
- Regex patterns for certain financial keywords
- spaCy NER for MONEY entities
Store them in a list for a more flexible result
"""
doc = nlp(summary_text)
extracted_numerics = []
# 1) Regex patterns for certain financial metrics
regex_patterns = {
"Regex_Revenue": r"(?:revenue(?:s)?|total\s+revenue)\s?(?:of|was|=)?\s?\$?([\d,\.]+)",
"Regex_OperatingIncome": r"(?:operating\s+income|income\s+from\s+operations)\s?(?:of|was|=)?\s?\$?([\d,\.]+)",
"Regex_NetIncome": r"(?:net\s+income|net\s+earnings)\s?(?:of|was|=)?\s?\$?([\d,\.]+)",
"Regex_CostOfRevenue": r"(?:cost\s+of\s+revenue(?:s)?|cost\s+of\s+sales)\s?(?:of|was|=)?\s?\$?([\d,\.]+)",
"Regex_TotalAssets": r"(?:total\s+assets)\s?(?:of|were|=)?\s?\$?([\d,\.]+)",
"Regex_TotalLiabilities": r"(?:total\s+liabilities)\s?(?:of|were|=)?\s?\$?([\d,\.]+)",
"Regex_StockholdersEquity": r"(?:stockholders'?(\s+)?equity|shareholders'?(\s+)?equity)\s?(?:of|was|=)?\s?\$?([\d,\.]+)",
"Regex_CF_Operating": r"(?:cash\s+flow\s+from\s+operating\s+activities)\s?(?:of|=)?\s?\$?([\d,\.]+)",
"Regex_CF_Financing": r"(?:cash\s+flow\s+from\s+financing\s+activities)\s?(?:of|=)?\s?\$?([\d,\.]+)",
"Regex_CF_Investing": r"(?:cash\s+flow\s+from\s+investing\s+activities)\s?(?:of|=)?\s?\$?([\d,\.]+)",
}
for label, pattern in regex_patterns.items():
matches = re.findall(pattern, summary_text, flags=re.IGNORECASE)
for m in matches:
numeric_val = m.replace(",", "")
extracted_numerics.append({
"label": label,
"value": numeric_val,
"source": "regex"
})
# 2) spaCy NER - record all money-like entities
for ent in doc.ents:
if ent.label_ == "MONEY":
ent_text = ent.text.strip().replace(",", "")
extracted_numerics.append({
"label": "MONEY_NER",
"value": ent_text,
"source": "ner",
"context": ent.sent.text.strip() # optional context
})
return {"extracted_numerics": extracted_numerics}
Explanation:
- Regex Patterns: Defines a set of regular expressions targeting specific financial metrics such as revenue, operating income, net income, etc. These patterns capture numerical values associated with these metrics.
- Regex Extraction: Iterates through each pattern, finds all matches in the summary text, cleans the numeric values by removing commas, and appends them to the extracted_numerics list with the appropriate label.
- NER Extraction: Uses SpaCy’s NER to identify all entities labeled MONEY. These are also cleaned and appended to the extracted_numerics list, along with the containing sentence as context for later reference.
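For a summary sentence such as “Net income was $24.3 billion.” (invented for illustration), the function would return something along these lines; the exact NER span can vary by spaCy version:

{
    "extracted_numerics": [
        {"label": "Regex_NetIncome", "value": "24.3", "source": "regex"},
        {
            "label": "MONEY_NER",
            "value": "$24.3 billion",
            "source": "ner",
            "context": "Net income was $24.3 billion."
        }
    ]
}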
Data Cleaning
def clean_numeric_entries(extracted_dict):
"""
Takes a dict with key 'extracted_numerics' (a list of label/value dicts).
- Remove empty or '.' values
- Extract first numeric substring if multiple
- Deduplicate (label, numeric_str)
- Convert to float then back to string (strip .0 if integer)
"""
numerics = extracted_dict.get("extracted_numerics", [])
cleaned = []
seen = set() # track (label, numeric_str)
for entry in numerics:
raw_val = entry.get("value", "").strip()
label = entry.get("label", "").strip()
source = entry.get("source", "")
context = entry.get("context", "")
# Ignore trivial placeholders
if not raw_val or raw_val in {".", "$", "$ ", " "}:
continue
# Find the *first* numeric substring
match = re.search(r"[\d\.]+", raw_val)
if not match:
continue # skip if no numeric substring
numeric_str = match.group(0) # e.g. '24667.00'
# Attempt float parse
try:
fval = float(numeric_str)
except ValueError:
continue # skip if parse fails
# Convert back to string, removing trailing .0 if integral
if fval.is_integer():
numeric_str = str(int(fval))
else:
numeric_str = str(fval)
# Deduplicate
key = (label, numeric_str)
if key in seen:
continue
seen.add(key)
cleaned.append({
"label": label,
"value": numeric_str,
"source": source,
"context": context
})
return {"extracted_numerics": cleaned}
Explanation:
- Trivial Placeholder Removal: Filters out entries with empty values or placeholders like ‘.’, ‘$’, etc.
- Numeric Extraction: Ensures only the first numeric substring is considered, handling cases where multiple numbers might be present.
- Type Conversion: Converts numeric strings to floats to facilitate validation and then back to strings, removing unnecessary decimal points for integer values.
- Deduplication: Eliminates duplicate entries based on the combination of label and numeric value to ensure the uniqueness of data points.
- Contextual Information: Retains the context sentence for NER-extracted values to provide additional reference, which can be useful for further analysis or verification.
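A small, invented example shows the effect of the cleaning pass:

raw = {
    "extracted_numerics": [
        {"label": "Regex_Revenue", "value": "24667.00", "source": "regex"},
        {"label": "MONEY_NER", "value": "$", "source": "ner"},              # dropped: placeholder
        {"label": "Regex_Revenue", "value": "24667", "source": "regex"},    # dropped: duplicate once cleaned
        {"label": "MONEY_NER", "value": "$24.3 billion", "source": "ner",
         "context": "Net income was $24.3 billion."},
    ]
}

print(clean_numeric_entries(raw))
# {'extracted_numerics': [
#     {'label': 'Regex_Revenue', 'value': '24667', 'source': 'regex', 'context': ''},
#     {'label': 'MONEY_NER', 'value': '24.3', 'source': 'ner',
#      'context': 'Net income was $24.3 billion.'}]}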
Saving Output
def save_output_flexible(ticker, consolidated_summary, extracted_dict):
summary_file = f"{ticker}_10q_summary.json"
numerics_file = f"{ticker}_10q_flex_numerics.json"
with open(summary_file, "w") as f:
json.dump({"ConsolidatedSummary": consolidated_summary}, f, indent=4)
print(f"Summary saved to {summary_file}")
with open(numerics_file, "w") as f:
json.dump(extracted_dict, f, indent=4)
print(f"Extracted numerics saved to {numerics_file}")
Explanation:
- JSON Output: Saves the consolidated summary and the cleaned numeric insights into separate JSON files named based on the stock ticker.
- Structured Storage: Ensures that the outputs are stored in a structured format, facilitating easy access, analysis, and integration with other systems or tools.
Main Function
def main():
ticker = input("Enter the stock ticker symbol (e.g., MSFT): ").strip().upper()
# 1) Initialize summarizer
summarizer = initialize_summarizer_model()
# 2) Summarize 10-Q
consolidated_summary, vector_store = process_10q(ticker, summarizer)
# 3) Load spaCy & do flexible numeric extraction
nlp = load_spacy_model()
extracted_dict = extract_flexible_numerics(consolidated_summary, nlp)
# 4) Clean the extracted numerics
cleaned_dict = clean_numeric_entries(extracted_dict)
# 5) Save
save_output_flexible(ticker, consolidated_summary, cleaned_dict)
if __name__ == "__main__":
main()
Explanation:
- User Input: Prompts the user to enter a stock ticker symbol, ensuring that the input is standardized by stripping whitespace and converting to uppercase.
- Pipeline Execution: Sequentially executes the steps—initializing the summarizer, processing the 10-Q PDF, extracting numerics using NER and regex, cleaning the extracted data, and saving the outputs.
- Orchestration: The main function orchestrates the entire workflow, ensuring each component runs in the correct sequence to produce the desired outputs.
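For the script to find the filing, process_10q expects the PDF under a 10q_docs/<TICKER> folder; for a ticker like MSFT the project layout would look roughly like this (the PDF file name is illustrative):

pb-rag-10q-analyzer/
├── 10q_docs/
│   └── MSFT/
│       └── msft_q2_fy25_10q.pdf
├── chroma_db/                      # created automatically on first run
├── MSFT_10q_summary.json           # written by save_output_flexible
└── MSFT_10q_flex_numerics.json     # written by save_output_flexible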
Conclusion
Building a RAG and NER-powered system for extracting structured financial data from complex documents like SEC 10-Q filings presents both significant opportunities and challenges. While RAG excels in summarizing and retrieving relevant information, integrating it with NER and regex-based extraction enables the generation of precise, structured JSON outputs that are invaluable for retail investors. This combination enhances data accessibility, accuracy, and usability, empowering investors to make informed decisions based on reliable and actionable insights.
The 10-Q Analyzer project exemplifies the effective integration of these technologies, demonstrating how AI can transform the way we interact with and interpret financial data. By addressing the challenges inherent in structured data extraction and leveraging the strengths of both RAG and NER, such systems can serve as powerful tools in the arsenal of investors, analysts, and financial professionals.
For those looking to implement similar systems, this walkthrough provides a comprehensive guide to understanding the underlying concepts and practical steps involved. As AI technologies continue to advance, the potential applications and efficiencies achievable through such integrations will only grow, paving the way for more sophisticated and impactful solutions across various domains.
Additional Resources
- LangChain Documentation
- HuggingFace Transformers
- SpaCy Documentation
- Chroma Vector Store
- Google Flan-T5 Model
By meticulously integrating RAG and NER, the 10-Q Analyzer not only automates the extraction of critical financial data but also ensures that the outputs are structured and reliable. This blend of technologies serves as a blueprint for developing sophisticated AI-driven data extraction systems tailored to specific domain needs.