GAMES-UChile / mogptk

Multi-Output Gaussian Process Toolkit

Covariance matrix maxing out CUDA memory

sgalee2 opened this issue · comments

I am doing some simple analysis of model covariance matrices with the multioutput spectral mixture kernel. The model in question has

  • 20 channels
  • $\approx$ 100 input points for each channel

This results in a covariance matrix with dimensions less than $10^4 \times 10^4$, which in float64 should be pretty small in storage... However, when I call

model.gpr.kernel(inputs)

the resulting covariance matrix cannot be fully formed before my VRAM (15 GB) becomes entirely occupied.

Any ideas as to why this memory leak is happening?

Thanks (again)!

Thanks for raising this issue. What version of PyTorch are you using? A quick calculation suggests ~31 MiB of memory usage (8 bytes per float64 element of a 2000x2000 Gram matrix). Running a simple test on the CPU shows this gets allocated about 7 times over (due to inefficiencies such as unnecessary copying and non-inplace operations), which we should work on improving.
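
For reference, that back-of-the-envelope calculation is just the following (a minimal sketch in Python, using the 20 channels and ~100 points per channel from the original post):

```python
channels = 20
points_per_channel = 100
n = channels * points_per_channel       # 2000 rows/columns in the Gram matrix
bytes_per_float64 = 8

gram_bytes = n * n * bytes_per_float64
print(f"{gram_bytes / 2**20:.1f} MiB")  # ~30.5 MiB for a single 2000x2000 matrix
```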

The fact that it allocates 15 GB is way too much though. Is this at the first call of that function? Or could it be memory leaks from earlier? If you try with a blank slate, does this happen as well? What is the shape of the inputs tensor you pass?

I don't have a GPU at hand, but try checking the memory usage statistics for CUDA: https://pytorch.org/docs/stable/cuda.html#memory-management Perhaps they can pinpoint a particularly large tensor, or memory that never gets released (a leak).
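
For example (a minimal sketch using PyTorch's built-in CUDA memory statistics; model and inputs are assumed to be the objects from your snippet above):

```python
import torch

# Reset the peak-memory counter, run the suspect call, then inspect usage.
torch.cuda.reset_peak_memory_stats()

K = model.gpr.kernel(inputs)  # the call that exhausts VRAM

print(torch.cuda.memory_allocated() / 2**20, "MiB currently allocated")
print(torch.cuda.max_memory_allocated() / 2**20, "MiB peak during the call")
print(torch.cuda.memory_summary())  # detailed breakdown per allocation pool
```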

This is in PyTorch 1.13.0+cu116. The issue usually arises after more than one call, but not always... I have kept diagnostics open during the whole script, and the occupied VRAM spikes whenever any kernel-related function is called, e.g. model.gpr.kernel(inputs) or model.gpr.kernel.Ksub(inputs).

inputs is roughly a [3000 x 2] array.

This may be due to the use of double precision by default; this has changed as of v0.3.5, which defaults to float32. Unfortunately PyTorch's baseline VRAM usage is quite large, since the whole CUDA context as well as the PyTorch context has to fit in memory.

I've done some tests to check the scalability of the various parameters; see the results below. Each variable is varied while the others are kept constant. For the output dimension, we keep 1600 training points in total and divide them by the number of channels to get the number of training points per channel.

[Figure: exact_mosm — memory and time scaling plots]

Conclusions:

  • There are no memory leaks: memory stays constant over iterations and time is linear in the number of iterations.
  • It is quadratic in the number of data points, both in memory and time. The Cholesky decomposition is O(n^3) but fast in practice, so in this range the O(n^2) cost of the kernel matrix and other calculations dominates; for larger numbers of data points the cubic time should take over.
  • Both the number of input dimensions and the number of mixture components are linear in memory and time, and are very fast in general. Note that these variables depend on the kernel and not the inference model.
  • Over the output dimensions it is quadratic in time but roughly O(1/M) in memory (with the total number of training points held fixed). Both results are interesting. I believe that computing the sub Gram matrix for each channel combination sequentially lets us use less memory, since it is recycled for each combination. Time is quadratic since we need to evaluate M*M sub Gram matrices, with M the number of channels.

Therefore, increasing the number of channels will quickly degrade performance as shown, since it scales quadratically and on top of that is slow in absolute terms (14 seconds for 16 output dimensions, 1600 training points in total, 2 input dimensions, 2 mixture components, and 100 iterations).
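
For anyone who wants to reproduce rough numbers like these on a GPU, here is a minimal sketch of how wall time and peak VRAM per kernel evaluation could be measured; model and inputs are assumed to be set up as in the snippets above, and slicing inputs[:n] is only meant to illustrate varying the data size:

```python
import time

import torch

def measure(fn):
    """Return (seconds, peak MiB of VRAM) for a single call to fn."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.time()
    fn()
    torch.cuda.synchronize()
    return time.time() - start, torch.cuda.max_memory_allocated() / 2**20

# Peak memory and wall time of one kernel evaluation for growing data sizes.
for n in (500, 1000, 2000, 3000):
    secs, mib = measure(lambda: model.gpr.kernel(inputs[:n]))
    print(f"n={n:5d}  {secs:6.2f} s  {mib:8.1f} MiB peak")
```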

the resulting covariance matrix cannot be fully formed before my VRAM (15 GB) becomes entirely occupied.

I've done some small tests regarding memory usage for the MOSM with (output_dims, data_points, input_dims, components):

  • Default memory usage: 14 kB
  • (1, 1, 1, 1): 30 kB
  • (1, 1, 2, 1): 8.5 MB
  • (1, 150, 1, 1): 1.2 MB
  • (1, 150, 2, 1): 99 MB
  • (20, 20, 1, 1): 2.2 MB
  • (20, 20, 2, 1): 10.7 MB
  • (20, 3000, 1, 1): 188 MB
  • (20, 3000, 2, 1): 234 MB

The increase in memory usage for two input dimensions is surprising; otherwise the scaling looks fine to me. I'm not sure why we allocate 234 MB for a 3000x3000 Gram matrix and all intermediate tensors, though. However, it isn't anywhere near 15 GB, @sgalee2! So something must be off in your case, and I can't replicate it.

I've reverted the change, so the default dtype is float64 again to avoid precision errors. But you can set it manually to float32, which would cut the required VRAM by about 50%.
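
A minimal sketch of how to do that, assuming mogptk picks up PyTorch's default dtype for the tensors it creates (if your mogptk version exposes its own dtype setting, prefer that):

```python
import torch

# Make newly created floating-point tensors single precision before building
# the model; this roughly halves the VRAM needed for the Gram matrix and the
# intermediate tensors, at the cost of the precision issues mentioned above.
torch.set_default_dtype(torch.float32)
```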

It occurred to me that perhaps your input data was incorrect and you actually passed 3000 data points per channel (not 150 per channel) for all 20 channels. That would quickly deplete VRAM, since it would require about 18 GB at float64 precision.

I was wrong in my previous post about being able to reduce memory usage further (or so I believe), since PyTorch needs to keep all intermediate results (as calculated in the Exact model or the kernel) around for the backward pass. This is why mini-batching is so important for other uses of the GPU, but it is unavailable in our use case: we must fit the entire data set in the model, which comes down to O(Q * M^2 * N^2) memory usage up to a constant factor (Q components, M channels, N data points per channel). This is why reducing the data set, or using inducing points, has been such a prominent advance in the field.
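
To make that bound concrete, here is a small back-of-the-envelope estimator; only the Q * M^2 * N^2 term comes from the discussion above, and the copies factor is a placeholder for the implementation-dependent constant:

```python
def estimate_gram_memory_bytes(Q, M, N, bytes_per_element=8, copies=1):
    """Rough memory estimate for the full multi-output Gram matrix.

    Q: mixture components, M: channels, N: data points per channel.
    `copies` stands in for the constant factor due to intermediate
    allocations (earlier in the thread ~7 were observed on the CPU).
    """
    return copies * Q * (M * N) ** 2 * bytes_per_element

# Example: 20 channels with 150 points each, one component, float64:
print(estimate_gram_memory_bytes(Q=1, M=20, N=150) / 2**20, "MiB")  # ~68.7 MiB for the matrix alone
```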

Considering this issue closed since there is no bug in the library AFAIK.