ekaj2 / multimodal-gpt

A Screenshot-based Multimodal GPT Assistant

Python sounddevice for recording audio until you stop speaking
Whisper API for transcribing audio
OpenAI TTS for speech
PyWinCtl and pyautogui for screenshots of a specific window
OpenAI Vision API to process the screenshot and answer your prompt

Installation

python -m venv venv
. venv/bin/activate
pip install -r requirements.txt

Run

python main.py

Configuration

All project-wide settings are in settings.py.

About

Other

Languages

Language:Python 100.0%