Photo to Recipe generation with Multi-agents

This project uses generative AI agents to generate recipes from food images. Using LangGraph, several LLM-powered tools, and conditional workflows, the application extracts ingredients, retrieves relevant documents, generates recipes, and runs self-supervised checks that correct mistakes in the generated output.

Demo Video

Related Papers

Routing: Adaptive RAG (paper). Routes questions to different types of retrieval.
Self-correction: Self-RAG (paper). Fixes answers that either contain hallucinations or don't answer the question.
LLM Critics Help Catch LLM Bugs: LLM-Critic (paper). This research trains AI "critics" that help humans evaluate code written by other AI models more accurately.

Credits and Inspiration

NVIDIA/GenerativeAIExamples
LangGraph_HandlingAgent_IntermediateSteps
Agent_use_tools_leveraging_NVIDIA_AI_endpoints.ipynb
LangChain NVIDIA Integration
Scenario for Image Assets generation
ElevenLabs for audio in the demo video

How to Run

The project is built with LangChain and LangGraph. Docker Compose is the only tool needed to run it; follow the steps below to get started.

Prerequisites

An NVIDIA API key, provided through a .env file.
Docker and Docker Compose installed on your machine.
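
The API key prerequisite can be checked before starting the containers. The following is a minimal sketch, assuming the .env file uses plain KEY=VALUE lines; the helper names `read_env_file` and `has_api_key` are hypothetical and not part of the project.

```python
from pathlib import Path


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_api_key(path: str = ".env") -> bool:
    """Return True when NVIDIA_API_KEY is present and non-empty."""
    return bool(read_env_file(path).get("NVIDIA_API_KEY"))


if __name__ == "__main__":
    if has_api_key():
        print("NVIDIA_API_KEY found in .env")
    else:
        print("NVIDIA_API_KEY missing from .env")
```

Run from the repository root, this only confirms that the variable is set; it does not validate the key against NVIDIA's service.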

Steps to Run

Clone the Repository:

git clone git@github.com:ttback/photo-to-recipe.git
cd photo-to-recipe

Set NVIDIA_API_KEY in the .env file (see .env.example).
Build and run the Docker containers:

docker compose up

Open the app in a browser at localhost:7860

The images in the images folder can be used to test the basic workflow with a burger, a sushi, and a non-food photo from the NVIDIA image captioning example. The vector DB contains only burger recipes, so the sushi photo exercises the most complete workflow: the initial RAG-based generation is rejected and the ADDA team regenerates the recipe with the non-RAG process.

Key Multi-agent Features

Unsupervised Image Type Detection: Handles food vs. non-food images without user interaction.
Automatic Ingredient Extraction from Food Photo: Uses a recent multi-modal SLM (microsoft/phi-3-vision-128k-instruct) to extract ingredients from the food image.
Document Retrieval: Transforms online web pages into a vector store via LangChain and NVIDIA's embedding model, NV-Embed-QA.
Conditional (RAG or no-RAG) Generation: Checks whether the retrieved documents are relevant to the recipe before proceeding with RAG-based generation. If the web pages have changed content or are unavailable, the ADDA team avoids RAG-based generation.
RAG-based Recipe Generation: Using the retrieved documents, the writer agent generates the recipe.
Automated Hallucination Checker: Agents check whether the generated recipe is grounded in the documents and matches the food and ingredients detected in the input image.
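
The conditional generation step can be illustrated with a small relevance gate. This is a sketch under stated assumptions, not the project's implementation: the 0.5 threshold and the function names `most_documents_relevant` and `choose_generator` are hypothetical, and the keyword check in the usage lines stands in for the LLM-based `relevance_grader`.

```python
from typing import Callable, Sequence


def most_documents_relevant(
    documents: Sequence[str],
    is_relevant: Callable[[str], bool],
    threshold: float = 0.5,  # assumed cut-off, not taken from the project
) -> bool:
    """Return True when more than `threshold` of the documents are graded relevant."""
    if not documents:
        # Nothing retrieved (for example, pages unavailable): skip RAG.
        return False
    relevant = sum(1 for doc in documents if is_relevant(doc))
    return relevant / len(documents) > threshold


def choose_generator(
    documents: Sequence[str],
    is_relevant: Callable[[str], bool],
) -> str:
    """Pick the generator tool name based on the relevance check."""
    if most_documents_relevant(documents, is_relevant):
        return "rag_recipe_generator"
    return "recipe_generator"


# Usage with a keyword check standing in for the LLM grader.
docs = ["classic burger recipe", "cheeseburger with bacon", "garden tips"]
burger_choice = choose_generator(docs, lambda d: "burger" in d)
sushi_choice = choose_generator(docs, lambda d: "sushi" in d)
```

With a burger-only store, a burger query keeps the RAG path while a sushi query falls back to the non-RAG generator, which matches the behavior described for the test images.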

AI Agents and LLM-powered Tools

Role	Description	Tools
Reader	Reads Image Content	image_router ingredients_recognizer image_caption
Searcher	Searches in Archive(VectorDB)	doc_retriever relevance_grader
Writer	Writes Recipe	rag_recipe_generator recipe_generator
Reviewer	Reviews Recipe	hallucination_grader answer_grader

Tools

Tool	Description	Model
`image_router`	Routes the image to the appropriate processing path based on its content.	`microsoft/phi-3-vision-128k-instruct`
`ingredients_recognizer`	Extracts ingredients from the image.	`microsoft/phi-3-vision-128k-instruct`
`image_caption`	Generates a caption for the image.	`microsoft/phi-3-vision-128k-instruct`
`doc_retriever`	Retrieves documents relevant to the question from a vector store built from pages downloaded from food.com.	`NV-Embed-QA`
`relevance_grader`	Grades the relevance of retrieved documents to the question.	`meta/llama3-70b-instruct`
`rag_recipe_generator`	Generates a recipe using RAG on retrieved documents.	`meta/llama3-70b-instruct`
`recipe_generator`	Generates a recipe without using RAG.	`mistralai/mixtral-8x7b-instruct-v0.1`
`hallucination_grader`	Grades the generated recipe for hallucinations.	`meta/llama3-70b-instruct`
`answer_grader`	Grades the generated recipe against the documents and question.	`meta/llama3-70b-instruct`
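
The two tables can be read together as a small registry. The sketch below copies the role, tool, and model names from the tables; the dictionary layout and the `models_for_role` helper are hypothetical and only illustrate how a role maps to the models it relies on.

```python
# Tool name -> model, as listed in the Tools table.
TOOL_MODELS = {
    "image_router": "microsoft/phi-3-vision-128k-instruct",
    "ingredients_recognizer": "microsoft/phi-3-vision-128k-instruct",
    "image_caption": "microsoft/phi-3-vision-128k-instruct",
    "doc_retriever": "NV-Embed-QA",
    "relevance_grader": "meta/llama3-70b-instruct",
    "rag_recipe_generator": "meta/llama3-70b-instruct",
    "recipe_generator": "mistralai/mixtral-8x7b-instruct-v0.1",
    "hallucination_grader": "meta/llama3-70b-instruct",
    "answer_grader": "meta/llama3-70b-instruct",
}

# Role -> tools, as listed in the agents table.
ROLE_TOOLS = {
    "Reader": ["image_router", "ingredients_recognizer", "image_caption"],
    "Searcher": ["doc_retriever", "relevance_grader"],
    "Writer": ["rag_recipe_generator", "recipe_generator"],
    "Reviewer": ["hallucination_grader", "answer_grader"],
}


def models_for_role(role: str) -> set[str]:
    """Return the distinct models used by the tools of one role."""
    return {TOOL_MODELS[tool] for tool in ROLE_TOOLS[role]}


if __name__ == "__main__":
    for role in ROLE_TOOLS:
        print(role, sorted(models_for_role(role)))
```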

Diagram

graph TD
    A[Start] --> B{Is it a food image?}
    B -->|Yes| C[Extract Ingredients]
    B -->|No| D[Image Caption]
    C --> E[Retrieve Recipe Documents]
    E --> F{Are most recipe documents relevant?}
    F -->|Yes| G[Generate Recipe using RAG]
    F -->|No| H[Generate Recipe without RAG]
    G --> I{Is the RAG generation grounded in documents?}
    I -->|Yes| J{Does the RAG generation address the question?}
    I -->|No| H  
    J -->|Yes| K[End]
    J -->|No| H
    D --> L[End]
    H --> K
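
The branching in the diagram can be traced with plain Python. This is a minimal sketch, assuming each decision node is reduced to a boolean; in the application those decisions would come from the LLM-based tools, and the function name `trace_workflow` is hypothetical. It uses no LangGraph APIs and only mirrors the edges drawn above.

```python
def trace_workflow(
    is_food: bool,
    docs_relevant: bool = False,
    grounded: bool = False,
    answers_question: bool = False,
) -> list[str]:
    """Return the sequence of diagram nodes visited for the given decisions."""
    path = ["Start"]
    if not is_food:
        # Non-food images are only captioned.
        return path + ["Image Caption", "End"]

    path += ["Extract Ingredients", "Retrieve Recipe Documents"]
    if docs_relevant:
        path.append("Generate Recipe using RAG")
        # The RAG result is kept only if it passes both review checks.
        if grounded and answers_question:
            return path + ["End"]

    # Irrelevant documents or a rejected RAG result fall back to non-RAG generation.
    path.append("Generate Recipe without RAG")
    return path + ["End"]


if __name__ == "__main__":
    print(trace_workflow(is_food=False))
    print(trace_workflow(is_food=True, docs_relevant=True, grounded=True, answers_question=True))
    print(trace_workflow(is_food=True, docs_relevant=True, grounded=False))
```

Every food-image path that fails a check converges on the non-RAG generator, so the workflow always produces a recipe for a food image.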


MIT License
