microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime

Memory leak during back-to-back inferences

jeremyfowers opened this issue · comments

I am experiencing a memory leak while running my application, which runs an MMLU accuracy test on my Radeon 780M iGPU via DirectML.

Each inference adds tens to hundreds of megabytes to the total system memory and total graphics memory in use, until memory fills up after about 50 inferences and the system crashes.
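(For reference, the growth is easy to see with simple per-inference memory logging. Below is a minimal psutil-based sketch of one way to capture it; this is illustrative only and is not the code that produced the screenshots further down.)

# Illustrative per-inference memory logging (assumes psutil is installed);
# not the instrumentation that produced the screenshots below.
import psutil

_process = psutil.Process()

def log_memory(step):
    rss_mb = _process.memory_info().rss / (1024 ** 2)            # this Python process
    system_used_mb = psutil.virtual_memory().used / (1024 ** 2)  # whole system
    print(f"inference {step}: process {rss_mb:.0f} MB, system used {system_used_mb:.0f} MB")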

My system

  • Ryzen "Phoenix" 7940HS with Radeon 780M iGPU
  • 32 GB system memory

Software

  • Phi-3-Mini with AWQ 4-bit weights
  • onnxruntime-genai-directml 0.2.0
  • onnxruntime-directml 1.18.0

The model is running on the Radeon 780M iGPU.

My Code

I define a generate() function like this, which is meant to return all of the response tokens for the input_ids from a prompt.

import time

import onnxruntime_genai as og


def generate(
    input_folder,
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    top_k=50,
    top_p=1.0,
    temperature=0.7,
    pad_token_id=None,
):
    model = og.Model(input_folder)
    params = og.GeneratorParams(model)

    if pad_token_id:
        params.pad_token_id = pad_token_id

    max_length = len(input_ids) + max_new_tokens

    params.input_ids = input_ids
    params.set_search_options(
        do_sample=do_sample,
        top_k=top_k,
        top_p=top_p,
        temperature=temperature,
        max_length=max_length,
        min_length=max_length,
    )
    params.try_graph_capture_with_max_batch_size(1)

    generator = og.Generator(model, params)

    # Prompt processing: the first compute_logits() runs the prompt, then the
    # first new token is sampled.
    prompt_start_time = time.perf_counter()
    generator.compute_logits()
    generator.generate_next_token()
    prompt_end_time = time.perf_counter()

    time_to_first_token = prompt_end_time - prompt_start_time

    if max_new_tokens > 1:
        token_gen_times = []
        while not generator.is_done():
            token_gen_start_time = time.perf_counter()
            generator.compute_logits()
            generator.generate_next_token()
            token_gen_end_time = time.perf_counter()

            token_gen_times.append(token_gen_end_time - token_gen_start_time)

        if token_gen_times:
            # List will be empty if we generated 1 or 0 tokens, and we don't
            # want a divide-by-zero error in those cases
            avg_token_gen_latency_s = sum(token_gen_times) / len(token_gen_times)
            tokens_per_second = 1 / avg_token_gen_latency_s

    return [generator.get_sequence(0)]

Then, I call generate(tokenizer(prompt), max_new_tokens=1) dozens of times while running the MMLU accuracy test. Each prompt adds a bit more memory utilization until the system crashes.
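Roughly, the calling pattern looks like the sketch below (the model folder path and the list of questions are placeholders standing in for the real MMLU harness):

# Sketch of the MMLU-style loop: a different prompt on every call to generate().
# The model folder path and the questions list are hypothetical placeholders.
import onnxruntime_genai as og

model_folder = "phi-3-mini-awq-dml"
tokenizer = og.Tokenizer(og.Model(model_folder))

questions = [f"Question {i}: pick the best answer (A/B/C/D)." for i in range(100)]

for prompt in questions:
    input_ids = tokenizer.encode(prompt)
    response = generate(model_folder, input_ids, max_new_tokens=1)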

Screenshots

Here is a screenshot of system and iGPU memory utilization. It is climbing like a staircase due to the memory leak, when it should be flat.
[Screenshots: system memory and iGPU memory utilization climbing in a staircase pattern across inferences]

For reference, here is the exact same MMLU accuracy test code running on a Huggingface Transformers implementation of Phi-3-Mini on CPU. Memory utilization is flat, as expected.

[Screenshot: flat memory utilization for the Transformers CPU baseline]

The Question

What do I do about this memory leak? Do I need to do some explicit garbage collection in my code to make my generate() function safe to run many times in a loop?

Since you are using graph capture, can you try deleting the generator object after generation is completed?

# Delete the generator to free the captured graph for the next generator, if graph capture is enabled
del generator

I already tried putting del generator and del params at the bottom of my function, and I still saw the memory leak.
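For concreteness, the cleanup pattern I tried looks roughly like the condensed sketch below (run_once is a simplified, hypothetical stand-in for my generate() function, and the gc.collect() call is an extra speculative step on top of the del statements):

# Condensed, hypothetical stand-in for generate() showing the explicit cleanup
# that was attempted; memory still grew across calls.
import gc
import onnxruntime_genai as og

def run_once(model_folder, prompt):
    model = og.Model(model_folder)
    tokenizer = og.Tokenizer(model)
    input_ids = tokenizer.encode(prompt)

    params = og.GeneratorParams(model)
    params.input_ids = input_ids
    params.set_search_options(max_length=len(input_ids) + 1)

    generator = og.Generator(model, params)
    generator.compute_logits()
    generator.generate_next_token()
    sequence = generator.get_sequence(0)

    # Explicit cleanup at the bottom of the function, as suggested above.
    del generator
    del params
    gc.collect()  # speculative extra step: force a collection before returning
    return sequence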

There has been a recent ONNX Runtime fix and a recent ONNX Runtime GenAI fix for a memory leak issue with DirectML. These fixes will be included in the upcoming ONNX Runtime GenAI v0.3.0 release, which is expected this week, and they may resolve your issue. In the meantime, you can rebuild both ONNX Runtime and ONNX Runtime GenAI from the latest commits on their main branches and see if your issue is resolved.

Thanks for the heads up!

@kunal-vaishnavi are there any updates on the 0.3.0 release?

There have been some last-minute PRs that need to be included in the release such as this one. The changes for the v0.3.0 release branch can be tracked here. Once merged, v0.3.0 should be released by end of this week.

@kunal-vaishnavi, very nice meeting you in person the other day!

Today I downloaded 0.3.0 and still saw the memory leak during my MMLU test. So, I decided to dig further and found something interesting.

The memory leak showed up during MMLU, but not during performance benchmarking. I dug further and found that the only meaningful difference between my MMLU and benchmark code was that MMLU delivered a unique prompt on every iteration, whereas my benchmark reused the same prompt across iterations.

Here is some quick pseudocode that has no memory leak:

prompt = random_sentence()  # generate a sentence of 100-200 random words
for _ in range(1000):
    input_ids = tokenizer.encode(prompt)
    response = model.generate(input_ids)

And here is pseudocode that does show the memory leak:

for _ in range(1000):
    prompt = random_sentence()  # generate a sentence of 100-200 random words
    input_ids = tokenizer.encode(prompt)
    response = model.generate(input_ids)

The only difference between these two programs is that the plain-text prompt changes between loop iterations.

PS: I still get the memory leak even when I do not call tokenizer.decode(response) at all, which is why I omitted it from the examples.
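For completeness, here is a self-contained sketch of the leaking case; random_sentence(), the word list, and the model folder path are simple stand-ins for the real harness:

# Self-contained sketch of the leaking pattern: a *different* prompt on every
# iteration. Hoisting random_sentence() out of the loop gives the flat-memory case.
import random
import onnxruntime_genai as og

WORDS = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta"]

def random_sentence(min_words=100, max_words=200):
    # Stand-in for the real prompt generator: 100-200 random words.
    return " ".join(random.choices(WORDS, k=random.randint(min_words, max_words)))

model_folder = "phi-3-mini-awq-dml"  # hypothetical model folder
tokenizer = og.Tokenizer(og.Model(model_folder))

for _ in range(1000):
    prompt = random_sentence()               # new prompt each iteration -> memory climbs
    input_ids = tokenizer.encode(prompt)
    response = generate(model_folder, input_ids, max_new_tokens=1)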

@kunal-vaishnavi, very nice meeting you in person the other day!

Very nice to meet you as well!

The memory leak showed up during MMLU, but not during performance benchmarking. I dug further and found that the only meaningful difference between my MMLU and benchmark code was that MMLU delivered a unique prompt on every iteration, whereas my benchmark reused the same prompt across iterations.

Thank you for digging further into the memory leak. We will investigate and get back to you.

Thanks @kunal-vaishnavi! Are there any updates?

Hi @jeremyfowers, we are still investigating it. It is taking a bit longer because it involves a few components.

OK, thank you!