kiminh / LongContext_vs_RAG_NeedleInAHaystack

Comparing the retrieval abilities of GPT-4-Turbo and a RAG system on a toy example across various context lengths

Information Retrieval: Who wins, GPT-4-Turbo or a RAG system based on GPT-4?

This is an extension of the "Needle in a Haystack" test created by Greg Kamradt. The goal is to compare GPT-4-Turbo, with its 128k-token context window, against a RAG system based on GPT-4 for context retrieval. The result is clear: 🏆 **RAG wins** 🏆, and its edge grows with document size.
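The RAG pipeline itself is not spelled out here, so the sketch below is only a rough illustration of what "a RAG system based on GPT-4" could look like: split the same document into chunks, embed them, retrieve the chunks most similar to the question, and answer from those alone. The chunk size, embedding model, and top-k value are assumptions, not the repository's actual pipeline.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def rag_answer(document: str, question: str, chunk_chars: int = 2000, top_k: int = 3) -> str:
    """Naive RAG: fixed-size character chunks, cosine similarity over OpenAI
    embeddings, and a GPT-4 call over only the retrieved chunks."""
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    embedded = client.embeddings.create(model="text-embedding-3-small", input=chunks + [question])
    vectors = np.array([item.embedding for item in embedded.data])
    chunk_vecs, question_vec = vectors[:-1], vectors[-1]
    scores = chunk_vecs @ question_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(question_vec)
    )
    retrieved = "\n\n".join(chunks[i] for i in np.argsort(scores)[-top_k:])
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer using only the provided excerpts."},
            {"role": "user", "content": f"{retrieved}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```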

The Test

  1. Place a random fact or statement (the 'needle') in the middle of a long context window
  2. Ask the model to retrieve this statement
  3. Iterate over various document depths (where the needle is placed) and context lengths to measure performance (see the sketch below)
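A minimal sketch of steps 1 and 2, assuming the `tiktoken` and `openai` packages; the helper names (`insert_needle`, `ask_model`), the prompt wording, and the token-based splicing are illustrative, not the repository's actual code:

```python
import tiktoken
from openai import OpenAI

enc = tiktoken.encoding_for_model("gpt-4")
client = OpenAI()  # reads OPENAI_API_KEY from the environment


def insert_needle(haystack: str, needle: str, context_length: int, depth_percent: int) -> str:
    """Trim the haystack to `context_length` tokens and splice the needle in
    at `depth_percent` of the way through the document."""
    tokens = enc.encode(haystack)[:context_length]
    split = int(len(tokens) * depth_percent / 100)
    return enc.decode(tokens[:split]) + " " + needle + " " + enc.decode(tokens[split:])


def ask_model(context: str, question: str, model: str = "gpt-4-1106-preview") -> str:
    """Ask the model to retrieve the needle using only the provided document."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer the question using only the document below."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```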

The key pieces:

  • needle: The random fact or statement you'll place in your context
  • question_to_ask: The question you'll ask your model which will prompt it to find your needle/statement
  • results_version: Set to 1. If you'd like to run this test multiple times for more data points, change this value to your version number
  • context_lengths (List[int]): The list of various context lengths you'll test. In the original test this was set to 15 evenly spaced iterations between 1K and 128K (the max)
  • document_depth_percents (List[int]): The list of various depths to place your random fact
  • model_to_test: The original test chose gpt-4-1106-preview. You can easily change this to any chat model from OpenAI, or any other model with a bit of code adjustment (a driver loop using these parameters is sketched below)
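A hypothetical driver loop tying these pieces together, reusing the `insert_needle` and `ask_model` helpers sketched above; the filler-document path, the needle and question wording, and the substring-based scoring are illustrative assumptions:

```python
import numpy as np

# Long filler text to serve as the haystack (hypothetical path; the original
# "Needle in a Haystack" test used a corpus of Paul Graham essays).
haystack = open("essays.txt").read()

needle = "The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day."
question_to_ask = "What is the best thing to do in San Francisco?"
results_version = 1
context_lengths = np.linspace(1_000, 128_000, num=15, dtype=int).tolist()
document_depth_percents = np.linspace(0, 100, num=10, dtype=int).tolist()
model_to_test = "gpt-4-1106-preview"

results = []
for context_length in context_lengths:
    for depth_percent in document_depth_percents:
        context = insert_needle(haystack, needle, context_length, depth_percent)
        answer = ask_model(context, question_to_ask, model=model_to_test)
        results.append({
            "version": results_version,
            "context_length": context_length,
            "depth_percent": depth_percent,
            # Crude pass/fail check standing in for a proper grader.
            "score": int("Dolores Park" in answer),
        })
```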

Results Visualization

Results figure (made by pivoting the results, averaging over the multiple runs, and adding labels in Google Slides)
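A hypothetical sketch of that pivot-and-average step, assuming the per-run records collected in the loop above have been loaded into a pandas DataFrame:

```python
import pandas as pd

df = pd.DataFrame(results)

# Average the score across runs, then pivot into a depth-by-context-length
# grid suitable for a heatmap.
heatmap = df.pivot_table(
    index="depth_percent",
    columns="context_length",
    values="score",
    aggfunc="mean",
)
print(heatmap)
```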

Languages

Language: Jupyter Notebook 69.0%, Python 31.0%