Objectives

This notebook provides a step-by-step guide to building a document Q&A engine using multimodal retrieval augmented generation (RAG):
- Extract and store metadata from documents containing both text and images, and generate embeddings for the documents
- Search the metadata with text queries to find similar text or images
- Search the metadata with image queries to find similar images
- Using a text query as input, search for contextual answers that draw on both text and images (a minimal sketch of the embed-and-search flow follows this list)
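
To make the flow concrete, here is a minimal sketch of the core embed-and-search steps, using the Vertex AI multimodal embedding model and a MongoDB Atlas `$vectorSearch` aggregation. The project ID, connection string, database/collection names, index name, and field names below are placeholder assumptions; the rest of the notebook builds these pieces out step by step.

```python
# Minimal sketch (assumes an Atlas collection "documents" with an
# Atlas Vector Search index named "vector_index" on an "embedding" field).
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel
from pymongo import MongoClient

vertexai.init(project="your-gcp-project", location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")
collection = client["rag_demo"]["documents"]

def embed_text(text: str) -> list[float]:
    """Embed a text query into the shared text/image embedding space."""
    return model.get_embeddings(contextual_text=text).text_embedding

def embed_image(path: str) -> list[float]:
    """Embed an image file into the same embedding space."""
    return model.get_embeddings(image=Image.load_from_file(path)).image_embedding

def vector_search(query_vector: list[float], top_k: int = 5):
    """Return the closest stored text chunks or images via Atlas Vector Search."""
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embedding",
                "queryVector": query_vector,
                "numCandidates": 100,
                "limit": top_k,
            }
        },
        {"$project": {"_id": 0, "text": 1, "image_path": 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]
    return list(collection.aggregate(pipeline))

# Example: a text query retrieves both text and image results,
# since both kinds of embeddings live in the same vector space.
hits = vector_search(embed_text("What does the architecture diagram show?"))
```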
References
- Vertex AI multimodal models overview: https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/overview
- MongoDB Atlas Vector Search: https://www.mongodb.com/products/platform/atlas-vector-search
- LangChain blog on semi-structured and multi-modal RAG: https://blog.langchain.dev/semi-structured-multi-modal-rag/
- Unstructured documentation: https://unstructured-io.github.io/unstructured/introduction.html