ggerganov / llama.cpp

LLM inference in C/C++

repeatability problem with CUDA backend

steampunque opened this issue

Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.

9900k+GTX1070, b2840

There have been several issues opened about this problem over the last couple of weeks, but as of the latest version there still seems to be a repeatability problem with the CUDA backend. To dig into it I ran an automated 100-question benchmark through server across a range of backends, batch sizes, continuous batching on/off, prompt cache on/off, and flash attention on/off to see if I could find any patterns. The short version of the result: the only way to get repeatable results out of the CUDA backend is to configure one slot and turn the prompt cache off. The fully offloaded Vulkan backend does not have this issue and gives repeatable results regardless of batch size, number of slots, and prompt cache on/off. Running CUDA with the model fully on the CPU (NGL=0, so the CPU runs all model layers, but I believe the KV cache is still on the GPU) also does not give repeatable results.

Here is a bunch of data summarizing the testing I did:

b2840

PARAMETERS:
CB = continuous batching     NP = server slots     FATTN = flash attention
NB = batch size              NGL = offloaded layers
CUDA = CUDA backend (default unless otherwise noted)
VULKAN = Vulkan backend
KOMPUTE = Kompute backend (patched to enable Q8_0 and work with b2840)

TEST: accuracy on 100 questions with a self-check of each response (200 total prompts processed)
MODEL: Llama 3 Instruct Q8_0, fully offloaded to the 4070 unless otherwise noted
TEMP: 0
NGL = 33

Outputs: time to finish the run and fractional accuracy. For a repeatable result, the
fractional accuracy will be identical across parameter changes. In each table below, the
first two columns are wall-clock times with FATTN=1 and FATTN=0, followed by the batch
size NB and then the fractional accuracy with FATTN=1 and FATTN=0.
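
For reference, the runs were driven through the llama.cpp server binary. The following is only a hedged sketch of that kind of invocation: the model path is a placeholder and the flags (-m, -ngl, -np, -cb, -b, -fa, --host, --port) are what I understand the server options to be around this build, not a verbatim copy of the command line used.

# Hedged sketch (not the exact command used): launching server with the
# parameters swept in the tables below. Model path is a placeholder.
#   -ngl = NGL (offloaded layers)      -np = NP (server slots)
#   -cb  = CB=1 (continuous batching)  -b  = NB (batch size)   -fa = FATTN=1
./server -m models/llama3-instruct-Q8_0.gguf \
         -ngl 33 -np 2 -cb -b 128 -fa \
         --host 127.0.0.1 --port 8080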

CB=0

NP=1 CUDA
TIME FATTN 1 FATTN 0 NB FATTN 1 FATTN 0
real 1m25.006s 1m25.870s 32 0.540 0.530
real 0m49.284s 0m49.853s 64 0.510 0.520
real 0m41.084s 0m41.605s 128 0.510 0.520
real 0m41.116s 0m41.659s 256 0.510 0.520

NP=2 CUDA
TIME FATTN 1 FATTN 0 NB FATTN 1 FATTN 0
real 1m24.881s 1m26.139s 32 0.520 0.540
real 0m49.176s 0m49.934s 64 0.530 0.510
real 0m41.050s 0m41.725s 128 0.530 0.510
real 0m41.010s 0m41.602s 256 0.530 0.510

CB=1

NP=1 CUDA
TIME FATTN 1 FATTN 0 NB FATTN 1 FATTN 0
real 1m24.962s 1m25.993s 32 0.540 0.530
real 0m49.308s 0m49.952s 64 0.510 0.520
real 0m41.059s 0m41.603s 128 0.510 0.520
real 0m41.108s 0m41.606s 256 0.510 0.520

NP=1 CUDA NGL=0
TIME FATTN 1 FATTN 0 NB FATTN 1 FATTN 0
real 10m16.770s 10m33.766s 32 0.530 0.530
real 7m18.657s 7m18.347s 64 0.510 0.530
real 5m34.666s 5m41.302s 128 0.510 0.530
real 5m37.105s 5m40.395s 256 0.510 0.530

NP=1 VULKAN=1 REPEATABLE
TIME FATTN 1 FATTN 0 NB FATTN 1 FATTN 0
real xxxxxxxxxx 1m57.893s 32 xxxxx 0.520
real xxxxxxxxxx 1m26.054s 64 xxxxx 0.520
real xxxxxxxxxx 1m20.446s 128 xxxxx 0.520
real xxxxxxxxxx 1m20.527s 256 xxxxx 0.520

NP=2 CUDA
TIME FATTN 1 FATTN 0 NB FATTN 1 FATTN 0
real 1m25.016s 1m26.186s 32 0.520 0.540
real 0m49.305s 0m49.933s 64 0.530 0.510
real 0m41.089s 0m41.632s 128 0.530 0.510
real 0m41.094s 0m41.678s 256 0.530 0.510

NP=2 CUDA NGL=0
TIME FATTN 1 FATTN 0 NB FATTN 1 FATTN 0
real 11m19.782s 10m36.166s 32 0.530 0.540
real 7m6.547s 7m8.712s 64 0.530 0.510
real 5m44.976s 6m29.562s 128 0.530 0.520
real 5m41.204s 6m5.406s 256 0.530 0.520

NP=2 VULKAN=1 REPEATABLE
TIME FATTN 1 FATTN 0 NB FATTN 1 FATTN 0
real xxxxxxxxxx 1m57.599s 32 xxxxx 0.520
real xxxxxxxxxx 1m26.540s 64 xxxxx 0.520
real xxxxxxxxxx 1m21.183s 128 xxxxx 0.520
real xxxxxxxxxx 1m21.237s 256 xxxxx 0.520

NP=3 CUDA
TIME FATTN 1 FATTN 0 NB FATTN 1 FATTN 0
real 1m25.218s 1m26.446s 32 0.530 0.530
real 0m49.563s 0m50.385s 64 0.510 0.520
real 0m41.303s 0m41.977s 128 0.510 0.510
real 0m41.287s 0m41.967s 256 0.510 0.510

NP=3 VULKAN=1 REPEATABLE
TIME FATTN 1 FATTN 0 NB FATTN 1 FATTN 0
real xxxxxxxxxx 1m59.059s 32 xxxxx 0.520
real xxxxxxxxxx 1m27.342s 64 xxxxx 0.520
real xxxxxxxxxx 1m21.875s 128 xxxxx 0.520
real xxxxxxxxxx 1m21.494s 256 xxxxx 0.520

NP=3 KOMPUTE=1 F16 NGL=21/33
TIME FATTN 1 FATTN 0 NB FATTN 1 FATTN 0
real xxxxxxxxxx 11m19.714s 32 xxxxx 0.540
real xxxxxxxxxx 10m0.673s 64 xxxxx 0.540
real xxxxxxxxxx 10m50.298s 128 xxxxx 0.530
real xxxxxxxxxx 10m44.947s 256 xxxxx 0.530

NP=3 KOMPUTE=1 Q8_0 NGL=33/33 REPEATABLE
TIME FATTN 1 FATTN 0 NB FATTN 1 FATTN 0
real xxxxxxxxxx 4m16.424s 32 xxxxx 0.520
real xxxxxxxxxx 4m22.849s 64 xxxxx 0.520
real xxxxxxxxxx 4m21.997s 128 xxxxx 0.520
real xxxxxxxxxx 4m23.100s 256 xxxxx 0.520

NP=4 CUDA
TIME FATTN 1 FATTN 0 NB FATTN 1 FATTN 0
real 1m25.203s 1m26.512s 32 0.550 0.520
real 0m49.416s 0m50.446s 64 0.550 0.520
real 0m41.352s 0m42.007s 128 0.510 0.510
real 0m41.379s 0m42.035s 256 0.510 0.510

CACHE=0

NP=4 CUDA
TIME FATTN 1 FATTN 0 NB FATTN 1 FATTN 0
real 1m25.163s 1m26.578s 32 0.550 0.520
real 0m49.402s 0m50.467s 64 0.550 0.520
real 0m41.363s 0m42.028s 128 0.510 0.510
real 0m41.394s 0m41.993s 256 0.510 0.510

A simple way to expose this problem with a single prompt is to send the following prompt to
the Hermes 2 Pro Mistral model at Q8_0 (a different model from the Llama 3 model above,
but it exposes the issue well).

A pond is filling with lillies such that at the end of every day the number of
lillies in the pond doubles. At the end of day 48, the pond is exactly full
of lillies. How many days did it take for the pond to become half full of
lillies? Explain the answer using step-by-step reasoning and prefix the
final answer with ANSWER:
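
One way to send this prompt (saved as pond.txt) to a running server with greedy sampling is a plain completion request; the sketch below is my rough approximation rather than the exact lm wrapper used in the transcripts further down, and the /completion endpoint and the prompt, temperature, n_predict, and cache_prompt fields are the standard server request parameters as I understand them.

# Hedged sketch: send the pond prompt to a running server with greedy sampling
# (temperature 0). cache_prompt toggles the server-side prompt cache per request.
curl -s http://127.0.0.1:8080/completion \
     -H "Content-Type: application/json" \
     -d "$(jq -n --arg p "$(cat pond.txt)" \
           '{prompt: $p, temperature: 0, n_predict: 512, cache_prompt: true}')"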

Using the fully offloaded Vulkan backend, the response should be as follows, and you can toggle the prompt cache on/off,
change the number of slots, and change the batch size from 32 to 256 and still get the identical answer:

Let's work backwards from day 48 to find out when the pond was half full of lillies.

On day 48, the pond is full of lillies.

On day 47, half of the lillies from day 48 would fill the pond, so the pond is half full.

Now, we need to find out how many days it took for the pond to become half full starting from the first day.

Since the number of lillies doubles every day, if the pond is half full on day 47, that means it was empty on day 46.

So, it took 47 days for the pond to become half full of lillies.

ANSWER: 47 days.

Now with the fully offloaded CUDA backend, slots set to 2, and prompt cache ON, the first prompt's output is:

Let's work backwards from day 48 to find out when the pond was half full of lillies.

On day 48, the pond is full of lillies.

On day 47, half of the lillies from day 48 would fill the pond, so the pond is half full.

Now, we need to find out how many days it took for the pond to become half full starting from the first day.

Since the number of lillies doubles every day, if the pond is half full on day 47, that means it was a quarter full on day 46, an eighth full on day 45, and so on.

To find the day when the pond was exactly half full, we need to find the smallest multiple of 2 that is greater than or equal to 48, because if the pond was half full on that day, it would take one more day for it to become completely full.

The smallest multiple of 2 that is greater than or equal to 48 is 56 (48 + 8).

So, it took 56 days for the pond to become half full of lillies.

ANSWER: 56

Now keep sending the same prompt; eventually it will change its response to this:

Let's work backwards from day 48 to find out when the pond was half full of lillies.

On day 48, the pond is full of lillies.

On day 47, half of the lillies from day 48 would fill the pond, so the pond is half full.

Now, we need to find out how many days it took for the pond to become half full starting from the first day.

Since the number of lillies doubles every day, if the pond is half full on day 47, it means that the pond was empty on day 46 (because doubling 0 gives you 0).

So, it took 47 - 46 = 1 day for the pond to become half full of lillies.

ANSWER: 1 day

Occasionally, if you send a different prompt and then the same prompt again, it will come up with the correct answer:

Let's work backwards from day 48 to find out when the pond was half full of lillies.

On day 48, the pond is full of lillies.

On day 47, half of the lillies on day 48 would have come from the doubling on day 47. So, the pond was half full on day 47.

To summarize:

  1. On day 48, the pond is full.
  2. On day 47, the pond is half full.

So, the answer is:

ANSWER: 47 days.

Now change to 1 slot with prompt cache on. Two consecutive identical prompts:

lm pond.txt
Let's work backwards from day 48 to find out when the pond was half full of lillies.

On day 48, the pond is full of lillies.

On day 47, half of the lillies from day 48 would fill the pond, so the pond is half full.

Now, we need to find out how many days it took for the pond to become half full starting from the first day.

Since the number of lillies doubles every day, if the pond is half full on day 47, that means it was a quarter full on day 46, an eighth full on day 45, and so on.

To find the day when the pond was exactly half full, we need to find the smallest multiple of 2 that is greater than or equal to 48, because if the pond was half full on that day, it would take one more day for it to become completely full.

The smallest multiple of 2 that is greater than or equal to 48 is 56 (48 + 8).

So, it took 56 days for the pond to become half full of lillies.

ANSWER: 56
bash-5.1$
bash-5.1$
bash-5.1$ lm pond.txt
Let's work backwards from day 48 to find out when the pond was half full of lillies.

On day 48, the pond is full of lillies.

On day 47, half of the lillies from day 48 would fill the pond, so the pond is half full.

Now, we need to find out how many days it took for the pond to become half full starting from the first day.

Since the number of lillies doubles every day, if the pond is half full on day 47, it means that the pond was empty on day 46 (because doubling 0 gives you 0).

So, it took 47 - 46 = 1 day for the pond to become half full of lillies.

ANSWER: 1 day

Now shut the prompt cache off and set slots to 1:

lm pond.txt
Let's work backwards from day 48 to find out when the pond was half full of lillies.

On day 48, the pond is full of lillies.

On day 47, half of the lillies from day 48 would fill the pond, so the pond is half full.

Now, we need to find out how many days it took for the pond to become half full starting from the first day.

Since the number of lillies doubles every day, if the pond is half full on day 47, that means it was a quarter full on day 46, an eighth full on day 45, and so on.

To find the day when the pond was exactly half full, we need to find the smallest multiple of 2 that is greater than or equal to 48, because if the pond was half full on that day, it would take one more day for it to become completely full.

The smallest multiple of 2 that is greater than or equal to 48 is 56 (48 + 8).

So, it took 56 days for the pond to become half full of lillies.

ANSWER: 56

It will continue to reliably give this same answer even while running the 100-question benchmark in the background.

So not only is the CUDA backend not repeatable, it also comes up with the wrong answer on this test prompt. The Vulkan backend is both repeatable and came up with the right answer (though the reasoning was bad). I also tested Kompute (I had to make several patches to enable Q8_0 to run in the latest version); it also gave repeatable results when fully offloaded, but it is unusably slow. Vulkan is only about 2x slower than CUDA as of today, so it is very usable, but it would be nice to get the speed of the CUDA backend while also having it give correct and repeatable results as the Vulkan backend does.

the only way to get repeatable result out of CUDA backend is to configure one slot and turn prompt cache off.

This is a known issue, see #7052 and #6950 .

The unfortunate reality is that neural networks are very sensitive to small differences in rounding error in their computations. You only get bit-for-bit identical results if you do the exact same operations in the exact same order, and the optimal order for those operations is simply different for different batch sizes. So in the context of the server, getting deterministic results would mean always running the idle slots with dummy data so that the batch size stays constant.
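
A minimal illustration of that order sensitivity (plain double-precision arithmetic via awk, not llama.cpp code): regrouping the same three additions changes the result, and the same effect inside the large reductions of a matrix multiplication is what shifts the logits when the batch size changes.

# (a+b)+c vs a+(b+c) with IEEE doubles: the grouping alone changes the answer.
awk 'BEGIN { a = 1e16; b = -1e16; c = 0.5;
             printf "left-to-right: %.1f   regrouped: %.1f\n", (a+b)+c, a+(b+c) }'
# prints: left-to-right: 0.5   regrouped: 0.0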


It's both repeatable and gives the correct answer with the Vulkan backend under the same test conditions, across different slot counts, different prompt orders, and even while running simultaneously with another test bench.

Yes, and the reason you'd rather use the CUDA backend despite that is that it is much faster, and the speed increases further if you add more slots (unlike Vulkan, which gets slower as you add more slots). One of the reasons for the speed difference is optimizations that cause the logits to not be bit-for-bit identical when you vary the batch size. This does not matter for reproducibility using e.g. main, only for the server, where the number of parallel completions and therefore the batch size can vary.

I'm not saying it is fundamentally impossible to write the code in a way that gets you reproducible results for >1 slots, but this would require additional work that so far simply no one has done.


No, I would rather not use the CUDA backend when it works like this. I noticed the issue a month or two ago, along with a lot of others who have posted about inference suddenly getting flaky, most likely related to this problem. I finally had enough of getting three different answers for the same prompt and have switched to the Vulkan backend, since repeatability is more important to me than a 2x slowdown. I have a hunch about which updates might have caused this (I think the CUDA backend used to be repeatable 2-3 months ago), and if I get a chance I will go back and revert some of them to see if I can identify the culprit behind the sudden inconsistencies.

You do you but realistically you'll get better performance if you just use CUDA with a single slot and no prompt cache.


I would do that, but it also gives the wrong answer on my test prompt while Vulkan gives the right answer; that's the straw that broke the CUDA backend for me.

Comparing the answers for individual prompts and seeds is not conclusive regarding bugs, since there is a lot of randomness involved in whether LLMs answer these things correctly. The rounding error, and therefore the results, differ between CUDA and Vulkan, but to my knowledge neither backend is better on average.


I ran a longer test (LAMBADA) and this does appear to be the case; it was just a coincidence that Vulkan tripped onto the right answer on my test prompt.

Repeatability is now looking good with CUDA at NP=1 CACHE=0, with identical results to Vulkan and twice the speed:

(no prompt)
Q8 hermes pro b2848 VULKAN NP=2 CACHE=1
3394 1759 5153 0 0 .658

(no prompt)
Q8 hermes pro b2848 CUDA NP=1 CACHE=0
3394 1759 5153 0 0 .658

In another run with a prompt prepended to the text, CUDA got 6 more words right than Vulkan over 5153 prompts.

(no prompt)
Q8 hermes pro b2848 VULKAN NP=2 CACHE=1
3253 1900 5153 0 0 .631

(prompted)
Q8 hermes pro b2848 CUDA NP=1 CACHE=0
3259 1894 5153 0 0 .632

I also tried rolling back to b2032, prior to a bunch of CUDA updates, but I found even worse repeatability on that release, with no way to get repeatable results using CUDA even with NP=1 CACHE=0 (possibly due to other unrelated bugs).

So I guess running with NP=1 CACHE=0 to get repeatability with greedy sampling on CUDA wins as of today.
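
For completeness, a hedged sketch of what that settled configuration amounts to, using the same assumed flags as the sketch above and a placeholder model path; requests then use temperature 0 and "cache_prompt": false as in the earlier request example.

# Hedged sketch of the repeatable setup: a single server slot (-np 1).
# Requests should then use temperature 0 (greedy) and "cache_prompt": false.
./server -m models/hermes-2-pro-mistral-Q8_0.gguf -ngl 33 -np 1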

Thanks for your comments.