prateekralhan / FinancialQnA

A comprehensive comparative analysis system that implements and evaluates two approaches for answering questions based on company financial statements

Home Page: https://huggingface.co/spaces/vibertron/Financial_QnA

Repository: https://github.com/prateekralhan/FinancialQnA

S2-24_AIMLCZG521 - Conversational AI | BITS Pilani WILP

Group No. - 110

| Name | Student ID | Contribution % |
|------|------------|----------------|
| JOSHI NIRANJAN SURYAKANT | 2023AC05011 | 100% |
| PRATEEK RALHAN | 2023AC05673 | 100% |
| KESHARKAR SURAJ SANJAY | 2023AD05004 | 100% |
| SAURABH SUNIT JOTSHI | 2023AC05565 | 100% |
| KILLI SATYA PRAKASH | 2023AC05066 | 100% |

📊 Financial QA System: RAG vs Fine-tuning Comparison

Project Status: Active

A comprehensive comparative analysis system that implements and evaluates two approaches for answering questions based on company financial statements:

  1. Retrieval-Augmented Generation (RAG) Chatbot: Combines document retrieval and generative response
  2. Fine-Tuned Language Model (FT) Chatbot: Directly fine-tunes a small open-source language model on financial Q&A

👉 🎬 Live WebApp 🔗: https://huggingface.co/spaces/vibertron/Financial_QnA

👉 📝 Architecture Summary Document

🎯 Objective

Develop and compare two systems for answering questions over a company's financial statements (covering the last two years), using the same financial data for both methods, and perform a detailed comparison of accuracy, speed, and robustness.

✨ Key Features

๐Ÿ” RAG System

  • Hybrid Retrieval: Combines dense (vector) and sparse (BM25) retrieval methods
  • Memory-Augmented Retrieval: Persistent memory bank for frequently asked questions
  • Advanced Guardrails: Input and output validation systems
  • Multi-source Retrieval: FAISS vector database + ChromaDB integration
  • Document Chunking: Intelligent text segmentation with configurable chunk sizes

🎯 Fine-Tuned System

  • Continual Learning: Incremental fine-tuning without catastrophic forgetting
  • Domain Adaptation: Specialized for financial Q&A domain
  • Efficient Training: Optimized hyperparameters for small models
  • Confidence Scoring: Built-in confidence estimation
  • Model Persistence: Save and load fine-tuned models

📊 Evaluation & Comparison

  • Comprehensive Metrics: Accuracy, response time, confidence, factuality
  • Visualization: Interactive charts and performance comparisons
  • Test Suite: Diverse question types (relevant high/low confidence, irrelevant)
  • ROUGE Scoring: Text similarity metrics for quality assessment

🖥️ User Interface

  • Streamlit Web App: Modern, responsive interface
  • Real-time Comparison: Side-by-side RAG vs Fine-tuned results
  • Interactive QA: Ask questions and get instant responses
  • Performance Dashboard: Live metrics and visualizations

๐Ÿ—๏ธ System Architecture

Financial QA System
├── Data Processing
│   ├── PDF Extraction (pdfplumber, PyPDF2)
│   ├── Text Cleaning & Segmentation
│   ├── Q&A Pair Generation
│   └── Chunking for RAG
├── RAG System
│   ├── Hybrid Retrieval (FAISS + BM25)
│   ├── Memory-Augmented Retrieval
│   ├── Response Generation (DistilGPT2)
│   └── Guardrails (Input/Output)
├── Fine-Tuned System
│   ├── Continual Learning
│   ├── Domain Adaptation
│   ├── Model Training & Persistence
│   └── Confidence Estimation
├── Evaluation System
│   ├── Performance Metrics
│   ├── Comparative Analysis
│   ├── Visualization Generation
│   └── Results Export
└── User Interface
    ├── Streamlit Web App
    ├── Interactive QA
    ├── System Comparison
    └── Performance Dashboard
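
The Data Processing stage above can be pictured with a short sketch. The snippet below is only an illustration of pdfplumber-based extraction followed by simple overlapping chunking for RAG; the function names, chunk size, and overlap values are assumptions, not the actual contents of data_processor.py.

# Illustrative sketch of PDF extraction and chunking (the real data_processor.py may differ).
from typing import List

import pdfplumber

def extract_text(pdf_path: str) -> str:
    """Pull raw text from every page of a financial statement PDF."""
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            pages.append(page.extract_text() or "")
    return "\n".join(pages)

def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> List[str]:
    """Split text into overlapping word-level chunks for RAG indexing."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks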

🚀 Installation

Prerequisites

  • Python 3.8+
  • CUDA-compatible GPU (optional, for faster training)

1. Clone the Repository

git clone <repository-url>
cd financial-qa-system

2. Create Virtual Environment

python -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

4. Download Required Models

The system will automatically download required models on first run:

  • all-MiniLM-L6-v2 (sentence embeddings)
  • distilgpt2 (generation model)
  • distilbert-base-uncased (classification)
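
As a rough sketch, these downloads go through the standard Hugging Face loaders; the exact calls in this repository may differ, but the idea is:

# Rough sketch of how the three models are typically loaded; each call downloads
# and caches the model on first use (actual loading code in src/ may differ).
from sentence_transformers import SentenceTransformer
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

embedder = SentenceTransformer("all-MiniLM-L6-v2")               # dense sentence embeddings
gen_tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
generator = AutoModelForCausalLM.from_pretrained("distilgpt2")   # answer generation
cls_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
classifier = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")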

📖 Usage

Command Line Interface

1. Data Processing Only

python main.py data

2. RAG System Only

python main.py rag

3. Fine-tuning Only

python main.py fine-tune

4. Comprehensive Evaluation

python main.py evaluate

5. Web Interface

python main.py interface

6. Complete Pipeline

python main.py all

Web Interface

  1. Start the interface:

    python main.py interface
  2. Open your browser and navigate to the displayed URL

  3. Select your preferred system:

    • RAG System
    • Fine-tuned System
    • Both (Comparison)
  4. Ask questions and view results in real-time

📊 System Comparison

RAG System Strengths

  • Adaptability: Easy to update with new documents
  • Factual Grounding: Direct access to source documents
  • Transparency: Clear source attribution
  • Flexibility: Handles diverse question types

Fine-Tuned System Strengths

  • Speed: Faster inference after training
  • Fluency: More natural, coherent responses
  • Efficiency: Lower computational overhead
  • Specialization: Domain-specific knowledge

Trade-offs

  • RAG: Higher accuracy, slower response, more resource-intensive
  • Fine-tuned: Lower accuracy, faster response, more efficient

🔧 Configuration

Training Parameters

from dataclasses import dataclass

@dataclass
class TrainingConfig:
    model_name: str = "distilgpt2"
    learning_rate: float = 5e-5
    batch_size: int = 4
    num_epochs: int = 3
    max_length: int = 512
    warmup_steps: int = 100
    weight_decay: float = 0.01
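
As a hedged illustration, these fields map naturally onto Hugging Face TrainingArguments; the mapping below is a sketch and may not match the repository's trainer setup exactly (output_dir is a hypothetical path).

# Illustrative mapping from TrainingConfig to Hugging Face TrainingArguments.
from transformers import TrainingArguments

config = TrainingConfig()
training_args = TrainingArguments(
    output_dir="fine_tuned_model",          # hypothetical output path
    learning_rate=config.learning_rate,
    per_device_train_batch_size=config.batch_size,
    num_train_epochs=config.num_epochs,
    warmup_steps=config.warmup_steps,
    weight_decay=config.weight_decay,
)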

RAG Parameters

  • Chunk Size: Configurable text segmentation (100-400 tokens)
  • Top-K Retrieval: Number of chunks to retrieve (default: 5)
  • Dense Weight: Weight for vector similarity vs BM25 (default: 0.7)
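
A minimal sketch of how the dense weight and top-k above might be applied, assuming both score lists are already normalized to [0, 1] and aligned by chunk index (the actual fusion in rag_system.py may differ):

# Weighted fusion of dense (vector) and sparse (BM25) scores; illustrative only.
def fuse_scores(dense_scores, bm25_scores, dense_weight=0.7, top_k=5):
    fused = [dense_weight * d + (1.0 - dense_weight) * b
             for d, b in zip(dense_scores, bm25_scores)]
    # Return the indices of the top_k highest-scoring chunks.
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:top_k]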

📈 Evaluation Metrics

Performance Metrics

  • Accuracy: Correct answer rate
  • Response Time: Average inference speed
  • Confidence: Model confidence scores
  • Factuality: Response reliability assessment

Quality Metrics

  • ROUGE Scores: Text similarity metrics
  • Source Attribution: Document source tracking
  • Validation Status: Input/output guardrail results
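
For reference, ROUGE can be computed with the rouge-score package; this is a generic example, not necessarily the scorer configuration used in evaluation_system.py, and the sample strings are made up:

# Generic ROUGE computation (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score("Total revenue was $391.0 billion in 2024.",   # reference answer
                      "Revenue in 2024 was $391.0 billion.")         # system answer
print(scores["rougeL"].fmeasure)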

๐Ÿ“ Project Structure

financial-qa-system/
├── src/
│   ├── __init__.py
│   ├── data_processor.py      # Document processing & Q&A generation
│   ├── rag_system.py          # RAG implementation
│   ├── fine_tune_system.py    # Fine-tuning implementation
│   ├── evaluation_system.py   # Evaluation & comparison
│   └── interface.py           # Streamlit web interface
├── financial_statements/      # Input PDF documents
├── processed_data/            # Processed texts & Q&A pairs
├── evaluation_results/        # Evaluation outputs & visualizations
├── main.py                    # Main execution script
├── requirements.txt           # Python dependencies
└── README.md                  # This file

🧪 Testing

Test Questions Categories

  1. Relevant, High-Confidence: Clear facts in financial data
  2. Relevant, Low-Confidence: Ambiguous or sparse information
  3. Irrelevant: Questions outside financial scope

Example Test Questions

  • "What was the company's revenue in 2024?"
  • "What are the total assets?"
  • "What type of company is this?"
  • "What is the capital of France?" (irrelevant)

🔒 Guardrails

Input Guardrails

  • Relevance Check: Validates financial/company-related queries
  • Harmful Content: Filters potentially dangerous inputs
  • Query Validation: Ensures proper question format

Output Guardrails

  • Factuality Check: Detects hallucinated responses
  • Confidence Threshold: Flags low-confidence outputs
  • Contradiction Detection: Identifies conflicting statements
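
A toy version of an input relevance check and an output confidence threshold is sketched below; the keyword list and the 0.5 threshold are assumptions, and the project's actual guardrails are more sophisticated:

# Toy guardrail checks; keyword list and threshold are illustrative assumptions.
FINANCIAL_KEYWORDS = {"revenue", "profit", "assets", "liabilities", "income", "cash", "sales"}

def validate_input(question: str) -> bool:
    """Input guardrail: accept only queries that look finance-related."""
    return any(word in question.lower() for word in FINANCIAL_KEYWORDS)

def validate_output(answer: str, confidence: float, threshold: float = 0.5) -> bool:
    """Output guardrail: reject empty answers and low-confidence generations."""
    return bool(answer.strip()) and confidence >= threshold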

🚀 Advanced Features

Memory-Augmented Retrieval

  • Persistent memory bank for frequent Q&A patterns
  • Automatic categorization and retrieval
  • Confidence-based response selection
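
Conceptually, the memory bank can be as simple as a persisted mapping from frequent questions to high-confidence answers; the sketch below (file path, schema, and 0.8 threshold are assumptions) illustrates the idea, not the actual implementation:

# Minimal sketch of a persistent memory bank for frequent Q&A pairs.
import json
import os

MEMORY_PATH = "processed_data/memory_bank.json"   # hypothetical location

def load_memory() -> dict:
    if not os.path.exists(MEMORY_PATH):
        return {}
    with open(MEMORY_PATH) as f:
        return json.load(f)

def remember(question: str, answer: str, confidence: float, min_conf: float = 0.8) -> None:
    """Store only high-confidence answers so future lookups can skip retrieval."""
    if confidence < min_conf:
        return
    memory = load_memory()
    memory[question.lower().strip()] = {"answer": answer, "confidence": confidence}
    with open(MEMORY_PATH, "w") as f:
        json.dump(memory, f, indent=2)

def recall(question: str):
    """Return a cached answer for a previously seen question, if any."""
    return load_memory().get(question.lower().strip())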

Continual Learning

  • Incremental fine-tuning on new data
  • Catastrophic forgetting prevention
  • Domain adaptation capabilities

Hybrid Retrieval

  • Dense retrieval (sentence embeddings)
  • Sparse retrieval (BM25)
  • Weighted score fusion

📊 Results Example

| Question | Method | Answer | Confidence | Time (s) | Correct (Y/N) |
|----------|--------|--------|------------|----------|---------------|
| Revenue in 2024? | RAG | $391.0B | 0.93 | 9.11 | Y |
| Revenue in 2024? | Fine-Tune | $391.0B | 0.91 | 21.23 | Y |
| Total sales (iPhones)? | RAG | $182.2B | 0.89 | 4.22 | N |
| Total sales (iPhones)? | Fine-Tune | $201.2B | 0.92 | 44.12 | Y |
| Capital of France? | RAG | (blank response) | 0.35 | 11.2 | Y |
| Capital of France? | Fine-Tune | Paris | 0.22 | 3.47 | N |

๐Ÿ™ Acknowledgments

  • Hugging Face: Transformers library and model hub
  • Sentence Transformers: Embedding models
  • FAISS: Vector similarity search
  • Streamlit: Web interface framework
  • Apple Inc.: Financial statement data for testing

🔮 Future Enhancements

  • Multi-modal Support: Image and table extraction from PDFs
  • Real-time Updates: Live document ingestion and processing
  • Advanced Guardrails: More sophisticated validation systems
  • Model Compression: Quantization and distillation for efficiency
  • API Integration: RESTful API for external applications
  • Multi-language Support: Internationalization capabilities

License: Apache License 2.0

