Memory leak during back-to-back inferences
jeremyfowers opened this issue · comments
I am experiencing a memory leak while running my application, which runs an MMLU accuracy test on my Radeon 780M iGPU via DirectML.
Each inference adds tens to hundreds of megabytes to the total system memory and total graphics memory utilized, until memory fills up after about 50 inferences and the system crashes.
My system
- Ryzen "Phoenix" 7940HS with Radeon 780M iGPU
- 32 GB system memory
Software
- Phi-3-Mini with AWQ 4-bit weights
- onnxruntime-genai-directml 0.2.0
- onnxruntime-directml 1.18.0
The model is running on the Radeon 780M iGPU.
My Code
I define a generate() function like this, which is meant to return all of the response tokens for the input_ids of a prompt.
def generate(
    input_folder,
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    top_k=50,
    top_p=1.0,
    temperature=0.7,
    pad_token_id=None,
):
    model = og.Model(input_folder)
    params = og.GeneratorParams(model)
    if pad_token_id:
        params.pad_token_id = pad_token_id
    max_length = len(input_ids) + max_new_tokens
    params.input_ids = input_ids
    params.set_search_options(
        do_sample=do_sample,
        top_k=top_k,
        top_p=top_p,
        temperature=temperature,
        max_length=max_length,
        min_length=max_length,
    )
    params.try_graph_capture_with_max_batch_size(1)
    generator = og.Generator(model, params)

    prompt_start_time = time.perf_counter()
    generator.compute_logits()
    generator.generate_next_token()
    prompt_end_time = time.perf_counter()
    time_to_first_token = prompt_end_time - prompt_start_time

    if max_new_tokens > 1:
        token_gen_times = []
        while not generator.is_done():
            token_gen_start_time = time.perf_counter()
            generator.compute_logits()
            generator.generate_next_token()
            token_gen_end_time = time.perf_counter()
            token_gen_times.append(token_gen_end_time - token_gen_start_time)

        if token_gen_times:
            # List will be empty if we generated 1 or 0 tokens, and we don't
            # want a divide-by-zero error in those cases
            avg_token_gen_latency_s = sum(token_gen_times) / len(token_gen_times)
            tokens_per_second = 1 / avg_token_gen_latency_s

    return [generator.get_sequence(0)]
Then, I call generate(tokenizer(prompt), max_new_tokens=1)
dozens of times while running the MMLU accuracy test. Each prompt adds a bit more memory utilization until the system crashes.
Screenshots
Here is a screenshot of system and iGPU memory utilization. It is climbing like a staircase due to the memory leak, when it should be flat.
For reference, here is the exact same MMLU accuracy test code running on a Huggingface Transformers implementation of Phi-3-Mini on CPU. Memory utilization is flat, as expected.
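For anyone who wants to reproduce the staircase measurement programmatically rather than from screenshots, a minimal sketch is below. It is Linux-only (it parses /proc); on Windows, where the screenshots were taken, the analogue would be Task Manager or psutil.Process().memory_info().rss. The function name is hypothetical, not part of any of the libraries above.

```python
import os

def rss_kb():
    """Resident set size of this process in KiB (Linux-only sketch).

    Sampling this once per inference and printing it makes the
    staircase pattern visible without screenshots. Note this only
    covers system memory, not dedicated graphics memory.
    """
    with open(f"/proc/{os.getpid()}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found in /proc/self/status")
```

For a leak like this one, you would expect rss_kb() to grow monotonically across iterations instead of plateauing after warmup.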
The Question
What do I do about this memory leak? Do I need to do some explicit garbage collection in my code to make my generate()
function safe to run many times in a loop?
Since you are using graph capture, can you try deleting the generator object after generation is completed?
onnxruntime-genai/examples/python/phi3-qa.py
Lines 71 to 72 in 8608d13
I already tried putting del generator and del params at the bottom of my function. I still saw a memory leak.
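For reference, the cleanup pattern I tried is equivalent to the sketch below. FakeGenerator is a stand-in for og.Generator (the real object cannot be constructed without a model); the point is only that del plus gc.collect() verifiably drops the Python-side reference, which suggests the retained memory is held on the native side.

```python
import gc
import weakref

class FakeGenerator:
    """Stand-in for og.Generator; only the cleanup pattern matters here."""
    pass

def run_one_inference():
    generator = FakeGenerator()
    probe = weakref.ref(generator)  # lets us check the object is really gone
    # ... compute_logits() / generate_next_token() would run here ...
    del generator                   # what I added at the bottom of generate()
    gc.collect()                    # force collection, in case of cycles
    return probe

probe = run_one_inference()
assert probe() is None  # the Python object is freed, so the leak is elsewhere
```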
There has been a recent ONNX Runtime fix and a recent ONNX Runtime GenAI fix for a memory leak issue with DirectML. These fixes will be in the upcoming ONNX Runtime GenAI v0.3.0 release, which is expected to be released this week, and may fix your issue. In the meantime, you can re-build both ONNX Runtime and ONNX Runtime GenAI using the latest commits on the main branches and see if your issue is resolved.
Thanks for the heads up!
@kunal-vaishnavi are there any updates on the 0.3.0 release?
@kunal-vaishnavi, very nice meeting you in person the other day!
Today I downloaded 0.3.0 and still saw the memory leak during my MMLU test. So, I decided to dig further and found something interesting.
The memory leak presented during MMLU, but not during performance benchmarking. I dug further and found the only meaningful difference between my MMLU and benchmark code was that MMLU delivered a unique prompt on every iteration, whereas my benchmark reused the same prompt across iterations.
Here is quick pseudocode that has no memory leak:
prompt = random_sentence()  # generate a sentence of random words with between 100-200 words
for _ in range(1000):
    input_ids = tokenizer.encode(prompt)
    response = model.generate(input_ids)
And here is pseudocode that does show the memory leak:
for _ in range(1000):
    prompt = random_sentence()  # generate a sentence of random words with between 100-200 words
    input_ids = tokenizer.encode(prompt)
    response = model.generate(input_ids)
The only difference between these two programs is that the plain-text prompt changes between loop iterations.
PS. I still get the memory leak even when I do not call tokenizer.decode(response) at all, which is why I omitted it from the examples.
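For completeness, random_sentence in the repros above is a hypothetical helper along these lines; the exact word list does not matter, only that the prompt text differs on every call.

```python
import random
import string

def random_sentence(min_words=100, max_words=200):
    """Hypothetical stand-in for the helper in the repros above:
    returns a 'sentence' of random lowercase words so that the
    prompt text changes on every call."""
    n_words = random.randint(min_words, max_words)
    words = (
        "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 8)))
        for _ in range(n_words)
    )
    return " ".join(words)
```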
@kunal-vaishnavi, very nice meeting you in person the other day!
Very nice to meet you as well!
The memory leak presented during MMLU, but not during performance benchmarking. I dug further and found the only meaningful difference between my MMLU and benchmark code was that MMLU delivered a unique prompt on every iteration, whereas my benchmark reused the same prompt across iterations.
Thank you for digging further into the memory leak. We will investigate and get back to you.
Thanks @kunal-vaishnavi! Are there any updates?
Hi @jeremyfowers, we are still investigating it. It is taking a bit long, as it involves a few components.
OK, thank you!