vishxrad / image2audiostory

This is a Streamlit app that takes an image as input, generates a caption for the image, and then generates a story based on the caption using open-source large language models. The story is then converted to an audio file using the Hugging Face API.

Image to Audio Story

This Streamlit app converts an uploaded image into an audio story. It utilizes machine learning models for image captioning and text generation, along with the Hugging Face API for text-to-speech conversion.

How it Works

  1. Image Upload: Users can upload an image of their choice.
  2. Image Captioning: The app generates a descriptive caption for the uploaded image.
  3. Story Generation: Based on the generated caption, a short story (within 100 words) is created.
  4. Text-to-Speech Conversion: The story is then converted into an audio file using the Hugging Face API.
  5. Output Display: The generated scenario, story, and audio file are displayed to the user (a code sketch of this pipeline follows below).
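The sketch below shows one plausible shape of this pipeline, assuming the transformers library for local captioning and story generation and the Hugging Face Inference API for text-to-speech. Only openai-community/gpt2 is named in this README; the captioning and TTS model ids (Salesforce/blip-image-captioning-base, espnet/kan-bayashi_ljspeech_vits) and the HUGGINGFACEHUB_API_TOKEN variable name are illustrative assumptions, not necessarily what app.py uses.

```python
import os
import requests
from transformers import pipeline

HF_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")  # assumed variable name

def image_to_caption(image_path: str) -> str:
    # Step 2: image captioning with a local transformers pipeline (model id is an assumption).
    captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
    return captioner(image_path)[0]["generated_text"]

def caption_to_story(caption: str) -> str:
    # Step 3: short story generation with GPT-2, the model mentioned in the Run Online note.
    generator = pipeline("text-generation", model="openai-community/gpt2")
    prompt = f"Write a short story of under 100 words about: {caption}\n"
    output = generator(prompt, max_new_tokens=120, do_sample=True)[0]["generated_text"]
    return output[len(prompt):]

def story_to_speech(story: str, out_path: str = "story.flac") -> str:
    # Step 4: text-to-speech through the Hugging Face Inference API (model id is an assumption).
    url = "https://api-inference.huggingface.co/models/espnet/kan-bayashi_ljspeech_vits"
    headers = {"Authorization": f"Bearer {HF_API_TOKEN}"}
    response = requests.post(url, headers=headers, json={"inputs": story})
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)  # the API returns raw audio bytes
    return out_path
```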

Setup

To run the app locally:

  1. Clone this repository.
  2. Install the required dependencies using pip install -r requirements.txt.
  3. Set up environment variables with your Hugging Face API token (one way to do this is sketched after this list).
  4. Run the app with streamlit run app.py.
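For step 3, a local .env file is a common approach. The snippet below is a minimal sketch; the variable name HUGGINGFACEHUB_API_TOKEN and the use of python-dotenv are assumptions rather than something this README specifies.

```python
# Minimal sketch: load the Hugging Face token from a .env file before starting the app.
# The variable name HUGGINGFACEHUB_API_TOKEN and python-dotenv are assumptions.
import os
from dotenv import load_dotenv

load_dotenv()  # reads a line such as: HUGGINGFACEHUB_API_TOKEN=hf_xxx
HF_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")
if not HF_API_TOKEN:
    raise RuntimeError("Set HUGGINGFACEHUB_API_TOKEN before running the app.")
```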

Dependencies

All required packages are listed in requirements.txt and are installed in step 2 of the Setup section above.

Usage

  1. Upload an image using the file uploader.
  2. Wait for the image caption and story to be generated.
  3. Listen to the audio story (a rough sketch of this front end follows below).
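These steps map onto a fairly standard Streamlit layout. The sketch below reuses the illustrative helpers from the pipeline sketch above (image_to_caption, caption_to_story, story_to_speech); the actual widgets and flow in app.py may differ.

```python
import streamlit as st

st.title("Image to Audio Story")
uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    with open(uploaded.name, "wb") as f:
        f.write(uploaded.getvalue())            # save the upload so the models can read it from disk
    st.image(uploaded, caption="Uploaded image")
    scenario = image_to_caption(uploaded.name)  # step 2: caption ("scenario") shown to the user
    st.write(scenario)
    story = caption_to_story(scenario)          # step 3: short story from the caption
    st.write(story)
    audio_path = story_to_speech(story)         # step 4: generate the audio file
    with open(audio_path, "rb") as audio_file:
        st.audio(audio_file.read(), format="audio/flac")  # step 5: play it back
```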

Acknowledgments

Run Online

  • You can run this project online here.
  • Note: We are using GPT-2 (openai-community/gpt2) for the deployed version because it is lighter than LLaMA 3. Another reason is that it is not possible to load sharded model checkpoints when the application is deployed with Streamlit; an Inference API can work around this (a rough sketch follows after this list), but that requires a paid Hugging Face subscription.
  • If the deployed application does not work as intended or shows an error code (which is quite likely, given that I am not using paid subscriptions for accessing the models or for deployment, and there is no dedicated database for storing the uploaded images and the generated audio files), feel free to contact me via email or my socials. If anyone has suggestions on how to rectify this, I would greatly appreciate your assistance.
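For reference, the Inference API route mentioned above offloads text generation to Hugging Face's servers instead of loading model weights inside the Streamlit process, which is what makes sharded checkpoints a problem in the first place. A rough sketch, using GPT-2 as the example model:

```python
import os
import requests

# Standard Hugging Face Inference API call for text generation; a larger model
# (e.g. a LLaMA 3 variant) could be substituted, subject to access and plan limits.
API_URL = "https://api-inference.huggingface.co/models/openai-community/gpt2"
HEADERS = {"Authorization": f"Bearer {os.getenv('HUGGINGFACEHUB_API_TOKEN')}"}

def generate_story(prompt: str) -> str:
    response = requests.post(
        API_URL,
        headers=HEADERS,
        json={"inputs": prompt, "parameters": {"max_new_tokens": 120}},
    )
    response.raise_for_status()
    return response.json()[0]["generated_text"]
```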

How the project looks/works

Demo video: 2024-04-22.19-32-27_1.mp4

License: MIT License


Languages

Language: Python 100.0%