ggerganov / ggml

Tensor library for machine learning

ggml : improve memory allocation for weights and similar lists of tensors

slaren opened this issue

There are several patterns used to allocate memory for a list of fixed-size tensors, such as model weights:

  • Manually calculating the number of elements of each tensor and adding them all up
  • Creating the tensors in a no-alloc context, adding them to a list or map, or obtaining them by name from a ggml_context with ggml_get_tensor, summing their sizes and finally allocating them (the ggml_get_tensor variant is $O(N^2)$)
  • Creating the tensors in a no-alloc context, then allocating the weights manually with ggml-alloc, first with a measure allocator and then again with the exact memory requirements (current llama.cpp finetune)
  • Creating the tensors in a no-alloc context, then enumerating the tensors in the context and summing their sizes (new finetune in https://github.com/ggerganov/llama.cpp/pull/3605; see the sketch after this list)
  • Creating a ggml_context with a lot of memory and hoping for the best
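For reference, a minimal sketch of the enumerate-and-sum pattern used by the last approaches (the helper name and the 16-byte alignment are assumptions, not taken from the PR):

```c
// Sketch: walk every tensor created in a no-alloc context and sum the exact
// number of data bytes needed. The helper name and the alignment value are
// assumptions for illustration.
#include "ggml.h"

size_t ctx_tensors_nbytes(struct ggml_context * ctx) {
    const size_t align = 16; // assumed data alignment
    size_t total = 0;
    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL;
         t = ggml_get_next_tensor(ctx, t)) {
        total += (ggml_nbytes(t) + align - 1) & ~(align - 1); // pad each tensor
    }
    return total;
}
```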

This becomes significantly more complicated when the weights have to be split between different backends (current llama.cpp and ggml-backend wip).

For something so basic, this is a lot more complicated than it should be, and we should have a standard way to do it. At the most basic level, it could be simply a function to automatically allocate all the tensors created in a no-alloc context with the exact memory requirements. Support for multiple backends will be more complicated.
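A rough sketch of what that basic helper could look like (the function name, the plain malloc'd buffer, and the direct writes to t->data are illustrative assumptions, not existing ggml API):

```c
// Hypothetical sketch of the proposed helper: give every tensor created in a
// no-alloc context its own slice of one exactly-sized buffer.
#include "ggml.h"
#include <stdint.h>
#include <stdlib.h>

void * alloc_ctx_tensors(struct ggml_context * ctx) {
    const size_t align = 16; // assumed data alignment

    // first pass: exact size of all tensor data
    size_t total = 0;
    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL;
         t = ggml_get_next_tensor(ctx, t)) {
        total += (ggml_nbytes(t) + align - 1) & ~(align - 1);
    }

    uint8_t * buf = malloc(total); // a backend buffer could be used instead
    if (buf == NULL) {
        return NULL;
    }

    // second pass: hand out addresses in the same order
    size_t offset = 0;
    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL;
         t = ggml_get_next_tensor(ctx, t)) {
        t->data = buf + offset;
        offset += (ggml_nbytes(t) + align - 1) & ~(align - 1);
    }

    return buf; // caller owns the buffer and must keep it alive with the tensors
}
```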

This could also be useful for debugging operations in compute contexts, where it might be desirable to allocate memory for every tensor in the graph to be able to inspect the results of each op later.

Yes, we should consolidate the different ways of allocating memory.

> At the most basic level, it could be simply a function to automatically allocate all the tensors created in a no-alloc context with the exact memory requirements.

Either this, or even just a function that returns the required memory for a context by doing a loop similar to the one in ggerganov/llama.cpp#3605, would be helpful.
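For illustration, roughly how such a size-query helper would slot into weight loading (ctx_tensors_nbytes is the hypothetical helper sketched above, not an existing ggml function):

```c
// Sketch: create the weight tensors in a no-alloc context (only metadata is
// stored there), query the exact data size, then allocate once.
#include "ggml.h"
#include <stdlib.h>

size_t ctx_tensors_nbytes(struct ggml_context * ctx); // hypothetical helper sketched above

int main(void) {
    const int n_tensors = 2; // toy model with two weight tensors

    struct ggml_init_params params = {
        /*.mem_size   =*/ n_tensors * ggml_tensor_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true, // metadata only, no data is allocated here
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4096);
    (void) w; (void) b;

    size_t need = ctx_tensors_nbytes(ctx); // exact requirement, no guessing
    void * data = malloc(need);            // or a backend buffer
    // ... assign the tensors' data pointers / load the weights into data ...

    free(data);
    ggml_free(ctx);
    return 0;
}
```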