milvus-io / bootcamp

Dealing with all unstructured data, such as reverse image search, audio search, molecular search, video analysis, question and answer systems, NLP, etc.

Home page: https://milvus.io


Cannot use annotations/citations as CONTEXT

adorosario opened this issue · comments

https://github.com/milvus-io/bootcamp/blame/b794151ccb64ae46419a21cf897282cd8818fd6e/evaluation/evaluate_fiqa_openai.ipynb#L198

Great job with this benchmarking. Nicely done.

One problem, though: you cannot use the annotations/citations from a RAG SaaS agent like OpenAI Assistants or CustomGPT.ai as CONTEXT for benchmarking. RAG SaaS agents do not expose their retrieved contexts, so you cannot "guess" the context by treating annotations as a proxy -- and therefore none of the metrics in your analysis that depend on CONTEXT hold.

Besides that, answer_similarity and answer_correctness (which Ragas provides for end-to-end RAG evaluation) are fine -- those two do not require the retrieved context.
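To make the distinction concrete, here is a minimal pure-Python sketch of the argument. The metric names are real Ragas metrics; the field requirements are an approximation of what each metric consumes, and the helper function itself is hypothetical, not part of Ragas:

```python
# Approximate fields each Ragas metric needs from the evaluation dataset.
# "contexts" is the retrieved context, which a black-box RAG service
# does not expose to the client.
METRIC_REQUIREMENTS = {
    "faithfulness":       {"question", "answer", "contexts"},
    "context_precision":  {"question", "contexts", "ground_truth"},
    "context_recall":     {"contexts", "ground_truth"},
    "answer_similarity":  {"answer", "ground_truth"},
    "answer_correctness": {"question", "answer", "ground_truth"},
}

def computable_metrics(available_fields):
    """Return the metrics whose required fields are all available."""
    return sorted(
        name for name, needed in METRIC_REQUIREMENTS.items()
        if needed <= set(available_fields)
    )

# A black-box RAG service exposes the question and the answer, and the
# benchmark supplies a ground-truth answer -- but never the context.
black_box_fields = {"question", "answer", "ground_truth"}
print(computable_metrics(black_box_fields))
# -> ['answer_correctness', 'answer_similarity']
```

Under these assumptions, only the two end-to-end metrics survive for a black-box service; every context-based metric drops out.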


For example, see a similar comparison of black-box RAG agents from Tonic (only answer similarity is available).

Hi @adorosario, Shahul from Ragas here. I'm not sure I understand your concern correctly, but the context in Ragas refers to the retrieved context. Ragas metrics that require context (like faithfulness) can only be used when that information is available.

@adorosario thanks for your advice. As far as I know, OpenAI provides context: https://platform.openai.com/docs/assistants/how-it-works/message-annotations
so it seems it is not a black-box RAG agent.

@zc277584121

@adorosario thanks for your advice. As far as I know, OpenAI provides context: https://platform.openai.com/docs/assistants/how-it-works/message-annotations so it seems it is not a black-box RAG agent.

Sorry -- I think you might be misunderstanding the difference between "annotations" and "context".

Context is the (possibly) thousands of tokens/words from the knowledge base that are used to create the response. For example, in our RAG platform we can technically use tens of thousands of words of context (think of it as 20-30 pages) to create the response. This CONTEXT is never shown to the client in a black-box RAG service like OpenAI Assistants or CustomGPT.ai.

Annotations are tiny snippets of text from the knowledge base that were used to construct the answer; they typically cover about 10% (if that) of the context that was actually used.

Using the two interchangeably in a black-box RAG setting would be wrong and would skew your benchmarks.
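A rough way to see why is to measure how much of the retrieved context the annotation snippets actually cover. The sketch below uses invented data (the "~10%" figure from above, not real measurements) and a crude word-overlap proxy for coverage:

```python
def annotation_coverage(context, annotations):
    """Fraction of context tokens that also appear in the annotation
    snippets (a crude word-overlap proxy, for illustration only)."""
    context_tokens = context.split()
    if not context_tokens:
        return 0.0
    annotation_tokens = set()
    for snippet in annotations:
        annotation_tokens.update(snippet.split())
    covered = sum(1 for tok in context_tokens if tok in annotation_tokens)
    return covered / len(context_tokens)

# Hypothetical example: a long retrieved context vs. the short
# snippets a black-box service surfaces as annotations.
context = " ".join(f"word{i}" for i in range(1000))       # ~1000 tokens
annotations = [" ".join(f"word{i}" for i in range(100))]  # ~10% of them
print(f"{annotation_coverage(context, annotations):.0%}")
# -> 10%
```

With 90% of the context missing, a metric like faithfulness computed against annotations is judging the answer against a fraction of the evidence the model actually saw.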

@shahules786

Shahul from Ragas here. I'm not sure I understand your concern correctly, but the context in Ragas refers to the retrieved context. Ragas metrics that require context (like faithfulness) can only be used when that information is available.

Shahul -- yes, in a black-box RAG service like OpenAI Assistants or CustomGPT.ai, the retrieved CONTEXT is never available to the client, so any benchmark metric that involves retrieved CONTEXT cannot be computed (which Ragas handles correctly). Using annotations and context interchangeably to compute a metric like faithfulness would be wrong -- that is the reason for raising this issue.