SciPhi-AI / R2R

The all-in-one solution for RAG. Build, scale, and deploy state-of-the-art Retrieval-Augmented Generation applications.

Home Page: https://r2r-docs.sciphi.ai/

Embed media like images, audio, 3D, video, etc.?

fire opened this issue

Hi,

I was wondering: is embedding media in scope?

That's definitely in scope. The best way to approach this would be to introduce the necessary embedding providers and to modify or create a new pipeline that shows an example of this in action.
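
To make "introduce the necessary embedding providers" concrete, here is a minimal sketch of a per-modality provider interface. All class and method names below are hypothetical, not R2R's actual provider API:

```python
from abc import ABC, abstractmethod
from typing import List


class MediaEmbeddingProvider(ABC):
    """Hypothetical interface; R2R's real provider classes may differ."""

    @abstractmethod
    def embed(self, data: bytes) -> List[float]:
        """Return one embedding vector for a raw media payload."""


class StubImageProvider(MediaEmbeddingProvider):
    def embed(self, data: bytes) -> List[float]:
        # A real implementation would decode the bytes and run a
        # vision encoder; this stub just returns a fixed-size vector.
        return [0.0] * 512
```

The existing text providers could sit behind the same contract, so a pipeline only needs to know which provider handles which document type.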

I'm happy to team up on this.

I have two primary use cases (plus a someday-maybe third):

  1. The basic use case is taking an image and turning it into an embedding for retrieval, like Stable Diffusion or the various combined vision-text models do. There are a few models that can also do video. (See the sketch after this list.)
  2. My pet emerging-technologies use case is to take a 3D mesh from https://github.com/lucidrains/meshgpt-pytorch and have it auto-complete vertices, or search a database of other embedded meshes using the mesh token embedding.
  3. Someday maybe: audio, speech. I am not familiar at all with this.
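
For use case 1, a combined vision-text encoder is probably the easiest entry point, since text queries and image documents land in the same vector space. A minimal sketch using CLIP through Hugging Face transformers (the model choice here is just an example, not something R2R ships today):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed an image into the shared vision-text space.
image = Image.open("example.jpg")
image_inputs = processor(images=image, return_tensors="pt")
image_vec = model.get_image_features(**image_inputs)  # shape: (1, 512)

# Text queries go through the matching text tower, so they can
# retrieve images from the same index.
text_inputs = processor(text=["a photo of a cat"], return_tensors="pt", padding=True)
text_vec = model.get_text_features(**text_inputs)  # shape: (1, 512)
```

The mesh case is different: MeshGPT's token embeddings live in their own space, so mesh-to-mesh search would need its own index rather than sharing CLIP's.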

For image embedding, do you think we can fit it into the pipeline here [https://github.com/SciPhi-AI/R2R/blob/main/r2r/pipelines/basic/ingestion.py] with a specific embedding provider, or do you think we need to fundamentally rework the structure of the codebase in some way?
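
I haven't dug into ingestion.py deeply yet, so this is naive, but my hope is that no fundamental rework is needed: the pipeline could just dispatch on document type to a modality-specific provider. All names below are hypothetical, not the actual code in ingestion.py:

```python
from typing import Callable, Dict, List

EmbedFn = Callable[[bytes], List[float]]


def stub_text_embed(data: bytes) -> List[float]:
    return [0.0, 1.0]  # stand-in for the existing text provider


def stub_image_embed(data: bytes) -> List[float]:
    return [1.0, 0.0]  # stand-in for e.g. the CLIP sketch above


PROVIDERS: Dict[str, EmbedFn] = {
    "text": stub_text_embed,
    "image": stub_image_embed,
}


def ingest(doc_type: str, payload: bytes) -> dict:
    vector = PROVIDERS[doc_type](payload)
    # Record the modality so search knows which embedding space
    # the vector belongs to.
    return {"vector": vector, "metadata": {"modality": doc_type}}


print(ingest("image", b"...raw image bytes..."))
```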

I think multi-modal is an important use case and I am very interested in figuring out how to best support this.

I don't think I can drive multi-modal too much, but I'll see what spare time I can gather.

The obvious question is what happens when we have two different embedding models (e.g., continuous text vectors versus mesh token integers): how do we sync them?
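
One pattern that could answer this (a suggestion, not anything R2R does today): never compare vectors across spaces. Keep one collection per embedding model, tag every vector with the model that produced it, and link modalities through shared document IDs in metadata rather than through vector math. A toy sketch, with hypothetical names:

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class MultiSpaceStore:
    """Toy store keeping one vector collection per embedding model."""

    def __init__(self) -> None:
        self._spaces: Dict[str, List[Tuple[str, List[float]]]] = defaultdict(list)

    def add(self, model: str, doc_id: str, vector: List[float]) -> None:
        self._spaces[model].append((doc_id, vector))

    def search(self, model: str, query: List[float], k: int = 5):
        # Dot-product scoring within a single space only; vectors from
        # different models are never mixed.
        scored = [
            (sum(a * b for a, b in zip(query, vec)), doc_id)
            for doc_id, vec in self._spaces[model]
        ]
        return sorted(scored, reverse=True)[:k]


store = MultiSpaceStore()
store.add("clip-vit-b32", "doc-1", [0.1, 0.9])
store.add("meshgpt-tokens", "doc-1", [0.7, 0.2, 0.4])  # same doc, different space
print(store.search("clip-vit-b32", [0.2, 0.8]))
```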