ggerganov / ggml

Tensor library for machine learning

ggml_allocr_alloc_graph allocated overlapping tensor memory

bssrdf opened this issue

Hi, I have encountered a strange issue using ggml_allocr_alloc_graph to allocate tensor data. When building the graph, I used a no_alloc context and later used ggml_allocr_alloc_graph to allocate all tensors' data. However, I noticed that two particular tensors have exactly the same memory address in their data member. Is this a bug?

You can replicate the issue using my branch here. After building ggml, run ./bin/test-alloc-graph.
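The allocation pattern is roughly the following (a minimal sketch with placeholder tensors and sizes, not the exact code from my branch):

    // build the graph in a no_alloc context: only tensor metadata is created,
    // the data pointers stay NULL for now
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    struct ggml_tensor * c = ggml_add(ctx, a, b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // let ggml-alloc place all tensor data inside one compute buffer
    static uint8_t buf[16*1024*1024];
    struct ggml_allocr * alloc = ggml_allocr_new(buf, sizeof(buf), 32);
    ggml_allocr_alloc_graph(alloc, gf);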

The graph is a simple one:
[image: test-alloc-graph forward graph (dot rendering)]

This is not a bug; it is in fact the main purpose of ggml-alloc. The memory of tensors holding intermediate results is reused as soon as they are no longer needed, in order to reduce the size of the compute buffers. If you want every tensor to have a distinct address, you can use a context without no_alloc, or ggml_backend_alloc_ctx_tensors.
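For example (a rough sketch, assuming a CPU backend and a ctx built with no_alloc as above):

    // allocate a dedicated buffer for every tensor in the context,
    // so no two tensors share an address (at the cost of a larger buffer)
    ggml_backend_t backend = ggml_backend_cpu_init();
    ggml_backend_buffer_t buffer = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // ... build and compute the graph ...

    ggml_backend_buffer_free(buffer);
    ggml_backend_free(backend);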
If you only want to inspect the results of intermediate computations, you can also compute the graph one node at a time, for example:

    for (int i = 0; i < g1->n_nodes; i++) {
        struct ggml_tensor * t1 = g1->nodes[i];
        // compute a single-node view of the graph, so that t1 can be
        // inspected before a later node reuses its memory
        struct ggml_cgraph g1v = ggml_graph_view(g1, i, i + 1);
        ggml_backend_graph_compute(backend, &g1v);
        // ... inspect t1->data here ...
    }

There was also a callback added to ggml_backend_sched for this purpose in ggerganov/llama.cpp#4935.
If you want to keep some of the intermediate results, the recommended approach would be to pre-allocate some tensors in a different buffer and use ggml_cpy to copy the results there. Technically it is also possible to add a dependency at the end of the graph with a no-op such as ggml_scale(ctx, a, 1), but I wouldn't recommend that.
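A rough sketch of the copy approach (here ctx_keep is an assumed second context whose tensors are allocated in their own buffer, and inter is the intermediate tensor you want to preserve):

    // create a destination tensor with the same shape and type in the
    // separately allocated context, and add a copy node to the graph
    struct ggml_tensor * keep = ggml_dup_tensor(ctx_keep, inter);
    struct ggml_tensor * copy = ggml_cpy(ctx, inter, keep);

    // expand the graph so the copy is actually executed; after the graph runs,
    // keep->data still holds the result even if inter's memory was reused
    ggml_build_forward_expand(gf, copy);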

Thanks for the quick response.

Sorry, I am new to ggml. I understand this memory reuse is fine for inference (i.e., the forward computation). How about the backward computation? Won't overwriting that memory defeat the purpose of backpropagation for training? I noticed this behavior while training a VAE.

I don't know much about training, but I believe the training examples in llama.cpp handle this by adding dependencies at the end of the graph with ggml_scale(ctx, a, 1), which may be the best way to do this at the moment if you need to keep a lot of the intermediate results.
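Something along these lines (a sketch; checkpoints is a hypothetical array of intermediate tensors you want to keep alive for the backward pass):

    // give each intermediate tensor a trivial consumer at the end of the graph,
    // so ggml-alloc does not reuse its memory before it is needed again
    for (int i = 0; i < n_checkpoints; i++) {
        ggml_build_forward_expand(gf, ggml_scale(ctx, checkpoints[i], 1.0f));
    }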

Thanks for the suggestions.