FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos

Siddhant Sukhani^1,2,*, Yash Bhardwaj^2,*, Riya Bhadani², Veer Kejriwal², Michael Galarnyk², Sudheer Chava²
¹Stanford University ²Georgia Institute of Technology
^*Authors contributed equally
Corresponding author: sukhani@stanford.edu

This repository contains the code and data for the research paper "FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos" that evaluates multimodal large language models (MLLMs) for topic-aligned captioning in short-form financial videos.

Abstract

We evaluate multimodal large language models (MLLMs) for topic-aligned captioning in financial short-form videos (SVs) by testing joint reasoning over transcripts (T), audio (A), and video (V). Using 624 annotated YouTube SVs, we assess all seven modality combinations (T, A, V, TA, TV, AV, TAV) across five topics: main recommendation, sentiment analysis, video purpose, visual analysis, and financial entity recognition. Video alone performs strongly on four of five topics, underscoring its value for capturing visual context and effective cues such as emotions, gestures, and body language. Selective pairs such as TV or AV often surpass TAV, implying that too many modalities may introduce noise. These results establish the first baselines for financial short-form video captioning and illustrate the potential and challenges of grounding complex visual cues in this domain. All code and data can be found in this Github under the CC-BY-NC-SA 4.0 license.

_{Figure 1. End-to-end pipeline: (1) Extract text, audio, and video frames from
VideoConviction, leading to 624 SVs;
(2) prompt MLLMs with all non-empty modality pairs to generate topic-aligned captions;
(3) evaluate captions with G-VEval for accuracy and F1 score on ticker–action pairs.}

Key Contributions and Findings

Proposed the first framework for topic-aligned captioning in financial short-form videos across modality combinations.
Benchmarked five multimodal large language models (Qwen2.5-omni, Gemini 2.0-flash, Gemini 2.5-flash, GPT-4o mini, GPT-5 mini) across seven modality combinations (text, audio, video, and their fusions).
Video modality provides the strongest signal for visual analysis and financial entity recognition, while combining video with text or audio improves performance on main recommendation, sentiment analysis, and video purpose tasks.
Selective modality fusion outperforms full fusion by minimizing noise.
Released benchmarks, an evaluation pipeline using F1-scores and G-VEval, and structured baselines for understanding financial short-form videos.

Repository Structure

repo-root/
├── model_outputs/                  # Model outputs organized by modality combination
│   └── gemini/                      # Example model family
│       └── gemini_2.0_flash/        # Specific model variant
│           ├── T/                   # Transcript only
│           ├── A/                   # Audio only
│           ├── V/                   # Video only
│           ├── TA/                  # Transcript + Audio
│           ├── TV/                  # Transcript + Video
│           ├── AV/                  # Audio + Video
│           └── VAT/                 # Video + Audio + Transcript
│               └── 0CJU8R4oNFk__197.7986688222514__267.77438718612933.json
│                   └── { ... }      # Sample JSON output (see example below)
│
├── model_prompting/                 # Jupyter notebooks for prompting different models
│   ├── gemini_prompting.ipynb
│   ├── gpt_visual_prompting.ipynb
│   └── qwen_2.5_omni_prompting.ipynb
│
└── evaluation/                      # Evaluation scripts and pipelines
    └── eval_pipeline.ipynb

Example JSON (`model_outputs/gemini/gemini_2.0_flash/VAT/0CJU8R4oNFk197.7986688222514267.77438718612933.json`)

{
  "action": "Buy",
  "main_recommendation": "The video recommends buying ZYXI (Xynex) stock, which was previously recommended last October under $10. The stock jumped to over $24 per share, but has come down to $16 per share. Analysts have price targets from $22.50 to $30, projecting a potential 82% return, and the presenter expresses intent to buy the stock at $22 per share.",
  "visual_analysis": "The video includes stock charts showing the historical performance and analyst price targets for ZYXI. The analyst price target chart shows a range of potential price appreciation for the stock. The presenter appears in front of a background with books and framed images, and on-screen text appears at the bottom of the video at the end promoting subscribing to the channel.",
  "sentiment_assessment": "The sentiment is bullish, with the presenter expressing confidence in the potential of ZYXI stock. The tone is positive and encouraging, emphasizing the company's financial position and potential returns. The mention of consecutive profitable quarters and a special dividend contribute to a sense of optimism.",
  "video_purpose": "The primary purpose is informational, providing analysis and a buy recommendation for ZYXI stock. The video shares financial data, analyst price targets, and company performance metrics to support the recommendation. The target audience appears to be retail investors interested in small-cap stocks with growth potential.",
  "key_financial_entities": "Key financial entities include Xynex (ZYXI), trading at $16.62, with analyst price targets ranging from $22.50 to $30. The company has booked 12 consecutive profitable quarters and has $10 million in balance sheet cash with no long-term debt. The presenter mentions a special dividend paid out in the fourth quarter of the previous year and notes that orders were up 65% in the first half of 2019."
}

Installation

Option 1: Using Conda (Recommended)

# Clone the repository
git clone https://github.com/yourusername/FinCap.git
cd FinCap

# Create conda environment
conda env create -f environment.yml
conda activate fincap

Option 2: Using pip

# Clone the repository
git clone https://github.com/yourusername/FinCap.git
cd FinCap

# Create virtual environment
python -m venv fincap_env
source fincap_env/bin/activate  # On Windows: fincap_env\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Ensure that you have FFmpeg installed on your system.

Usage

Model Prompting

The repository includes notebooks for three different model families:

Gemini: model_prompting/gemini_prompting.ipynb
GPT: model_prompting/gpt_visual_prompting.ipynb
Qwen-2.5 Omni: model_prompting/qwen_2.5_omni_prompting.ipynb

Each notebook supports different modality combinations (except OpenAI models, which do not support audio):

T: Transcript only
A: Audio only
V: Video only
TA: Transcript + Audio
TV: Transcript + Video
AV: Audio + Video
VAT: Video + Audio + Transcript

Captions can be generated across five key topics:

Main Recommendation (MR): Stock ticker, action, prices, time frame, rationale
Sentiment Analysis (SA): Tone, confidence, urgency
Video Purpose (VP): Purpose, target audience
Visual Analysis (VIA): Charts, indicators, text, gestures
Financial Entity Recognition (FE): Company name, stock tickers, numerical values

Evaluation

Run the evaluation notebook to assess model performance:

jupyter notebook evaluation/eval_pipeline.ipynb

We evaluate generated captions using G-VEval to calculate accuracy and relevance. Additionally, we also compute F1-score for stock ticker-action extraction.

Citation

If you use this work in your research, please cite:

@inproceedings{sukhani2025fincap,
    title     = {FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos},
    author    = {Sukhani, Siddhant and Bhardwaj, Yash and Bhadani, Riya and Kejriwal, Veer and Galarnyk, Michael and Chava, Sudheer},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    year      = {2025},
    series    = {Short-Form Video Understanding: The Next Frontier in Video Intelligence},
}

License

The dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC-BY-NC-SA 4.0) license, which allows others to share, copy, distribute, and transmit the work, as well as to adapt the work, provided that appropriate credit is given, a link to the license is provided, and any changes made are indicated.

Contact

For questions or issues, please contact sukhani@stanford.edu or open an issue on GitHub.

jon-chun / FinCap

FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos

Abstract

Key Contributions and Findings

Repository Structure

Example JSON (`model_outputs/gemini/gemini_2.0_flash/VAT/0CJU8R4oNFk197.7986688222514267.77438718612933.json`)

Installation

Option 1: Using Conda (Recommended)

Option 2: Using pip

Usage

Model Prompting

Evaluation

Citation

License

Contact

About

Languages

FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos

Abstract

Key Contributions and Findings

Repository Structure

Example JSON (model_outputs/gemini/gemini_2.0_flash/VAT/0CJU8R4oNFk__197.7986688222514__267.77438718612933.json)

Installation

Option 1: Using Conda (Recommended)

Option 2: Using pip

Usage

Model Prompting

Evaluation

Citation

License

Contact

About

Languages

Example JSON (`model_outputs/gemini/gemini_2.0_flash/VAT/0CJU8R4oNFk197.7986688222514267.77438718612933.json`)