ravenscroftj / turbopilot

Turbopilot is an open-source, large-language-model-based code completion engine that runs locally on CPU

ggml_new_tensor_impl: not enough space in the context's memory pool

m1chae1bx opened this issue

I tried writing a few lines of code. I got my first completion working properly.

# Function that prints hello world
def hello_world():
    print('Hello World!')

hello_world()

But when I started adding more code, I got an error from Turbopilot saying the following:

(turbopilot-test) mbonon@mbonon-tm01926-mbp turbopilot-test % ./turbopilot-bin -m stablecode -f ./models/stablecode-instruct-alpha-3b.ggmlv1.q8_0.bin
[2023-08-13 14:14:20.759] [info] Initializing StableLM type model for 'stablecode' model type
[2023-08-13 14:14:20.760] [info] Attempt to load model from stablecode
load_model: loading model from './models/stablecode-instruct-alpha-3b.ggmlv1.q8_0.bin' - please wait ...
load_model: n_vocab = 49152
load_model: n_ctx   = 4096
load_model: n_embd  = 2560
load_model: n_head  = 32
load_model: n_layer = 32
load_model: n_rot   = 20
load_model: par_res = 1
load_model: ftype   = 2007
load_model: qntvr   = 2
load_model: ggml ctx size = 6169.28 MB
load_model: memory_size =  1280.00 MB, n_mem = 131072
load_model: ................................................ done
load_model: model size =  2809.08 MB / num tensors = 388
[2023-08-13 14:14:22.712] [info] Loaded model in 1951.30ms
(2023-08-13 06:14:22) [INFO    ] Crow/1.0 server is running at http://0.0.0.0:18080 using 8 threads
(2023-08-13 06:14:22) [INFO    ] Call `app.loglevel(crow::LogLevel::Warning)` to hide Info level logs.
(2023-08-13 06:19:29) [INFO    ] Request: 127.0.0.1:52574 0x105008200 HTTP/1.1 POST /v1/engines/codegen/completions
(2023-08-13 06:19:31) [INFO    ] Response: 0x105008200 /v1/engines/codegen/completions 200 1
(2023-08-13 06:19:31) [INFO    ] Request: 127.0.0.1:52575 0x131813a00 HTTP/1.1 POST /v1/engines/codegen/completions
(2023-08-13 06:19:32) [INFO    ] Response: 0x131813a00 /v1/engines/codegen/completions 200 1
(2023-08-13 06:19:36) [INFO    ] Request: 127.0.0.1:52577 0x13200fa00 HTTP/1.1 POST /v1/engines/codegen/completions
(2023-08-13 06:19:37) [INFO    ] Response: 0x13200fa00 /v1/engines/codegen/completions 200 1
(2023-08-13 06:19:37) [INFO    ] Request: 127.0.0.1:52578 0x131813a00 HTTP/1.1 POST /v1/engines/codegen/completions
(2023-08-13 06:19:38) [INFO    ] Response: 0x131813a00 /v1/engines/codegen/completions 200 1
(2023-08-13 06:19:43) [INFO    ] Request: 127.0.0.1:52581 0x13180e000 HTTP/1.1 POST /v1/engines/codegen/completions
(2023-08-13 06:19:46) [INFO    ] Response: 0x13180e000 /v1/engines/codegen/completions 200 1
(2023-08-13 06:19:47) [INFO    ] Request: 127.0.0.1:52582 0x13200fa00 HTTP/1.1 POST /v1/engines/codegen/completions
(2023-08-13 06:19:48) [INFO    ] Response: 0x13200fa00 /v1/engines/codegen/completions 200 1
(2023-08-13 06:19:49) [INFO    ] Request: 127.0.0.1:52583 0x132009200 HTTP/1.1 POST /v1/engines/codegen/completions
(2023-08-13 06:19:50) [INFO    ] Response: 0x132009200 /v1/engines/codegen/completions 200 1
(2023-08-13 06:19:50) [INFO    ] Request: 127.0.0.1:52584 0x125809000 HTTP/1.1 POST /v1/engines/codegen/completions
(2023-08-13 06:19:54) [INFO    ] Response: 0x125809000 /v1/engines/codegen/completions 200 1
(2023-08-13 06:19:54) [INFO    ] Request: 127.0.0.1:52586 0x13180e000 HTTP/1.1 POST /v1/engines/codegen/completions
(2023-08-13 06:19:56) [INFO    ] Response: 0x13180e000 /v1/engines/codegen/completions 200 1
(2023-08-13 06:19:56) [INFO    ] Request: 127.0.0.1:52588 0x131814200 HTTP/1.1 POST /v1/engines/codegen/completions
(2023-08-13 06:19:58) [INFO    ] Response: 0x131814200 /v1/engines/codegen/completions 200 1
(2023-08-13 06:19:58) [INFO    ] Request: 127.0.0.1:52589 0x125809000 HTTP/1.1 POST /v1/engines/codegen/completions
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 510492368, available 268435456)
GGML_ASSERT: /Users/mbonon/coding/turbopilot-test/turbopilot/extern/ggml/src/ggml.c:16810: buf
zsh: abort      ./turbopilot-bin -m stablecode -f 

I'm running on a MacBook Pro with Apple M1 Pro chip and 16 GB of memory.

Thanks for opening this issue. I think there's a problem with the way the simple example code in the GGML repo allocates memory for the models. I probably need to dig around in the ggml code and see if I can get it to allocate more memory.

Did this happen after repeated generations appending to the same file? Are you using the fauxcode vscode plugin or the huggingface plugin?

I have the same problem when using Fauxpilot. If I POST the data with curl directly there is no problem, and codegen-serve works normally.
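
For reference, the kind of direct request I mean looks roughly like this (a sketch only: the port and path match the server log above, but I'm assuming the fauxpilot-compatible endpoint accepts an OpenAI-style prompt/max_tokens body, so adjust the fields if yours differ):

# POST a completion request straight to the local turbopilot server
curl --request POST \
     --url http://localhost:18080/v1/engines/codegen/completions \
     --header 'Content-Type: application/json' \
     --data '{"prompt": "def hello_world():", "max_tokens": 64}'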

It happens after repeated generations, I think, as I'm typing; that also matches the number of calls to the completions endpoint in the logs. I'm using the fauxpilot extension.

I also experienced it.
I'm running the stablecode model on a MacBook Air with an M2 chip and 16 GB of memory.
I'm using the fauxcode vscode plugin.

I use Fauxpilot, and when I send a longer prompt it runs into this issue and exits.

I've deployed a change to allow users to specify smaller batch sizes (#59). Normally we set the batch size (the number of tokens we attempt to process in a single forward pass) to 512. However, this is quite memory intensive, especially with the larger models (starcoder/wizardcoder). If you build from main and then pass -b 256 or even -b 128, you might find that this issue goes away.

I will package up a new minor release in the next couple of days that will include this change if you don't want to build from main.
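
For anyone who wants to try this before the release, the workaround looks roughly like the following (a sketch assuming a standard CMake build of this repo with its submodules; check the README for the authoritative build steps, and note that the name/path of a locally built binary may differ from the release binary shown in the log above):

# Build from main (sketch only; see the README for the exact steps)
git clone --recursive https://github.com/ravenscroftj/turbopilot.git
cd turbopilot
cmake -S . -B build
cmake --build build

# Run with a smaller batch size: fewer tokens per forward pass means a smaller
# scratch allocation, at the cost of more passes over the prompt
./turbopilot-bin -m stablecode -f ./models/stablecode-instruct-alpha-3b.ggmlv1.q8_0.bin -b 128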

The problem is still happening for me with stablecode and santacoder. I tried even -b 64 and I still get:

ggml_new_object: not enough space in the context's memory pool (needed 510481728, available 268435456)