ggerganov / ggml

Tensor library for machine learning

gpt-j, starcoder, gptneox examples cause "not enough space in the context's memory pool" for batches >32

ravenscroftj opened this issue

Hi all - I'm working on an issue where users on Apple Silicon (M1) get "ggml_new_tensor_impl: not enough space in the context's memory pool" when they try to use starcoder or gptneox models from turbopilot (ravenscroftj/turbopilot#47).

I've managed to get my hands on a Mac mini for a few days and I can see that the mem_per_token calculation comes out a long way short of the real memory requirements for at least a few of the example programs (I tried gpt-j, gpt-neox and starcoder). The problem occurs once the batch size is big enough for the accelerated matrix operations to kick in (batch sizes of 32 and above, if I recall correctly).

I get similar issues when I compile with -DGGML_OPENBLAS=ON on an Intel machine.

Taking the gpt-neox example to illustrate the problem:

  1. I increased the initial buffer size from 256MB to 512MB - the behaviour is the same with the original value too.
  2. After running the initial call to gpt_neox_eval with 4 dummy tokens and mem_per_token = 0, I get used_mem = 1417200, so mem_per_token is set to 1417200 / 4 = 354300.
  3. I pass in a long source code file with -b 512; by my calculation we need 512 * 354300 = 181401600 bytes (~173MB) of buffer.
  4. The program crashes with: ggml_new_object: not enough space in the context's memory pool (needed 609977808, available 536870912)

This would seem to imply that the real/actual value of mem_per_token should be 609977808 / 512 = 1191362.90625 bytes rather than 354300, I guess?
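
If I read those numbers right, the shortfall looks like a roughly constant overhead rather than something per-token - my guess is that it's the large F32 work/dequantization buffer the BLAS/Accelerate path needs, which scales with the weight matrix sizes rather than with the batch. A quick back-of-the-envelope check (plain arithmetic, nothing ggml-specific):

```cpp
// back-of-the-envelope check on the numbers above
#include <cstdio>

int main() {
    const long long needed        = 609977808; // from the crash message
    const long long mem_per_token = 354300;    // measured with 4 dummy tokens
    const long long n_tokens      = 512;       // batch size passed with -b

    const long long linear_part = mem_per_token*n_tokens; // 181401600 (~173MB)
    const long long shortfall   = needed - linear_part;   // ~409MB

    // if the shortfall really is a fixed overhead (e.g. a dequantised F32
    // weight/work buffer for the BLAS path), no scaling of mem_per_token can
    // ever predict it - it doesn't grow with the number of tokens at all
    printf("linear part: %lld bytes (~%.0f MB)\n", linear_part, linear_part/1048576.0);
    printf("shortfall:   %lld bytes (~%.0f MB)\n", shortfall,   shortfall/1048576.0);
    return 0;
}
```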

The only fix I've been able to identify is to multiply mem_per_token by some fixed, hand-tuned factor per model. I'm sure there must be a nicer way to do this, though?
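
For reference, the stopgap looks roughly like this (a hypothetical sketch - MEM_PER_TOKEN_FUDGE is a constant I made up and tune by hand, it doesn't exist in ggml or the examples):

```cpp
#include <cstddef>

// hypothetical stopgap: inflate the naive estimate by a hand-tuned, per-model factor.
// ~3.5 happens to cover the 609977808 vs 181401600 gap seen above at -b 512, but
// because the missing term is (apparently) constant, the "right" factor also changes
// with the batch size - which is exactly why this is not a real fix
static constexpr double MEM_PER_TOKEN_FUDGE = 3.5;

static size_t buf_size_for_batch(size_t mem_per_token, int n_tokens) {
    return (size_t)(MEM_PER_TOKEN_FUDGE*mem_per_token*n_tokens);
}
```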

I'm guessing that ggml-alloc might help, but I've not quite got my head around how to use it yet. @slaren's suggestion here to update the examples to use ggml-alloc would definitely be appreciated, especially if it helps resolve this problem!

Additional Weirdness/Partial Solution

I had a brainwave and tried running the initial call to gpt_neox_eval with 64 dummy tokens instead, to see if the additional overhead of the BLAS/Accelerate path would show up in the measurement - and it actually works when the buffer is set to 512MB.
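
In code terms the only change is the warm-up call in the example's main() (a sketch - the dummy token values don't matter, only that there are at least 32 of them so the BLAS/Accelerate path actually runs during calibration):

```cpp
// calibration call with >= 32 dummy tokens so the BLAS/Accelerate mul_mat path is
// taken during the measurement and its overhead shows up in used_mem / mem_per_token
std::vector<gpt_vocab::id> warmup_tokens(64, 0); // instead of { 0, 1, 2, 3 }
size_t mem_per_token = 0;
gpt_neox_eval(model, params.n_threads, 0, warmup_tokens, logits, mem_per_token);
```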

When I set the buffer back to 256MB I get:

ggml_new_object: not enough space in the context's memory pool (needed 510481984, available 268435456)

I feel like I'm really close to figuring this thing out but I've spent a few hours and my brain is starting to melt now.

Thanks in advance for any help guys.

The mem_per_token is just an approximation, and in practice it doesn't work very well. ggml-alloc solves this instead by creating a dummy graph without allocating memory for the tensors (using a no_alloc context), and measuring the size of the buffer required to allocate all the tensors in the computation graph. Additionally, it reduces memory usage by allocating tensors based on the evaluation order of the graph. So, for example, if it can determine that a tensor won't be used anymore after some operation, it will reuse its memory for other tensors allocated after that operation.
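
Roughly, the pattern looks like this - a condensed sketch of how llama.cpp uses it, written from memory, so treat the toy graph, the placeholder sizes and build_graph itself as illustrations rather than the exact API (which may also differ between ggml revisions):

```cpp
#include "ggml.h"
#include "ggml-alloc.h"

#include <cstdint>
#include <cstdio>
#include <vector>

static const size_t tensor_alignment = 32;

// stands in for the per-model graph construction (what gpt_neox_eval does today):
// all tensors are created in a no_alloc context, so only metadata is written here
static struct ggml_cgraph * build_graph(struct ggml_context * ctx0, struct ggml_allocr * allocr, int n_tokens) {
    struct ggml_tensor * inp = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, 4096, n_tokens);
    ggml_allocr_alloc(allocr, inp);            // inputs are allocated explicitly
    if (!ggml_allocr_is_measure(allocr)) {
        // only touch inp->data in the real pass - during measurement it is not a valid pointer
        // memcpy(inp->data, ..., ggml_nbytes(inp));
    }

    // in a real model the weights live in their own, permanently allocated context;
    // a weight is created here only to keep the toy self-contained
    struct ggml_tensor * w = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, 4096, 4096);
    ggml_allocr_alloc(allocr, w);

    struct ggml_tensor * cur = ggml_mul_mat(ctx0, w, inp);

    struct ggml_cgraph * gf = ggml_new_graph(ctx0);
    ggml_build_forward_expand(gf, cur);
    return gf;
}

int main() {
    // metadata-only context: with no_alloc = true, no tensor data comes out of this pool
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };

    // 1) measure: build the worst-case graph (largest batch you intend to evaluate)
    //    with a measuring allocator to find the required compute buffer size
    struct ggml_allocr * allocr = ggml_allocr_new_measure(tensor_alignment);
    size_t compute_size = 0;
    {
        struct ggml_context * ctx0 = ggml_init(params);
        struct ggml_cgraph  * gf   = build_graph(ctx0, allocr, /*n_tokens =*/ 512);
        compute_size = ggml_allocr_alloc_graph(allocr, gf) + tensor_alignment;
        ggml_free(ctx0);
    }
    ggml_allocr_free(allocr);
    printf("compute buffer size: %.2f MB\n", compute_size/1024.0/1024.0);

    // 2) allocate the real compute buffer once, sized from the measurement
    std::vector<uint8_t> compute_buffer(compute_size);
    allocr = ggml_allocr_new(compute_buffer.data(), compute_buffer.size(), tensor_alignment);

    // 3) per evaluation: reset the allocator, rebuild the graph for the actual batch,
    //    let the allocator place every tensor inside the fixed buffer, then compute
    {
        ggml_allocr_reset(allocr);
        struct ggml_context * ctx0 = ggml_init(params);
        struct ggml_cgraph  * gf   = build_graph(ctx0, allocr, /*n_tokens =*/ 8);
        ggml_allocr_alloc_graph(allocr, gf);
        // ... run the graph with ggml_graph_compute as usual ...
        ggml_free(ctx0);
    }

    ggml_allocr_free(allocr);
    return 0;
}
```

The key point is that the measuring allocator never commits any real memory - it only simulates the allocation pattern of the graph - so the compute buffer allocated in step 2 is sized for the worst case you actually measured instead of being extrapolated from a small warm-up batch.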

As far as I know, currently the only working example of how to use it is in llama.cpp, but I would be happy to adapt some of the examples here to show how to use it. @ggerganov let me know if that would be desired.

Yes, currently only llama.cpp uses the ggml-alloc mechanism, but it is the recommended way going forward.
An example of how to use ggml-alloc in the ggml repo would be really helpful. I know that the sam.cpp and stable-diffusion.cpp projects would be interested in utilizing ggml-alloc as well.

Probably it's best to start with gpt-2?