vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Home Page: https://docs.vllm.ai

[Bug]: non-deterministic output from Mixtral8x7B with temperature = 0

neonesis opened this issue

Your current environment

Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-4.18.0-348.2.1.el8_5.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.3.107
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A40
GPU 1: NVIDIA A40

Nvidia driver version: 535.54.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   48 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          128
On-line CPU(s) list:             0-127
Vendor ID:                       AuthenticAMD
Model name:                      AMD EPYC 7543 32-Core Processor
CPU family:                      25
Model:                           1
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
Stepping:                        1
Frequency boost:                 enabled
CPU max MHz:                     2800.0000
CPU min MHz:                     1500.0000
BogoMIPS:                        5589.84
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm sme sev sev_es
Virtualization:                  AMD-V
L1d cache:                       2 MiB (64 instances)
L1i cache:                       2 MiB (64 instances)
L2 cache:                        32 MiB (64 instances)
L3 cache:                        512 MiB (16 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-31,64-95
NUMA node1 CPU(s):               32-63,96-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.3.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     NODE    0-31,64-95      0               N/A
GPU1    SYS      X      SYS     32-63,96-127    1               N/A
NIC0    NODE    SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

🐛 Describe the bug

While using Mixtral-8x7B-Instruct with GPTQ 4-bit quantization (not tested on other variants), I struggle to obtain deterministic outputs.

I set the temperature to 0, and set both the engine seed and the sampling parameters seed to 0.

To reproduce:

from vllm import LLM, SamplingParams
prompt = "[INST] You are a pirate chatbot who always responds with Arr and pirate speak! There's a llama on my lawn, how can I get rid of him? [/INST]"
llm = LLM(model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ", revision="gptq-4bit-32g-actorder_True", dtype="float16", gpu_memory_utilization=0.8, seed=0)
sampling_params = SamplingParams(temperature=0, max_tokens=300, seed=0)
outs = []
for _ in range(10):
        outs.append(llm.generate(prompt, sampling_params)[0].outputs[0].text)
        print(outs[-1])

print(len(outs))  # should be 10
print(len(set(outs)))  # should be 1 if all outputs are the same, but usually is something like 6 (unique outputs)

I expect to obtain 10 identical responses, but they usually (but NOT ALWAYS) start to diverge somewhere in the middle.
Sample 1:

Arr, shiver me timbers! Ye have a llama on yer lawn, ye say? Well, here's what I'd do, me hearty:

1. Talk to the beast: Try to communicate with the llama in a calm, soothing voice. Llamas are intelligent creatures, and it might just understand that it's not welcome on yer land.

2. Scare it off: Make some noise! Llamas can be skittish. Wave yer pirate flag, blow a horn, or bang some pots and pans together. Just be careful not to scare it too much, or it might injure itself.

3. Provide a barrier: If ye have a fence or some other form of barrier, use it to gently guide the llama off yer property. Make sure it's not too high or difficult for the llama to jump over, though.

4. Contact local animal control: If ye can't get rid of the llama yerself, contact yer local animal control or wildlife rescue organization. They'll have the training and equipment to handle the situation safely and humanely.

5. Keep yer distance: Remember, llamas can spit and kick when they feel threatened. So, be sure to keep a safe distance and avoid any sudden movements.

Sample 2:

Arr, shiver me timbers! Ye have a llama on yer lawn, ye say? Well, here's what I'd do, me hearty:

1. Talk to the beast: Try to communicate with the llama in a calm, soothing voice. Llamas are intelligent creatures, and it might just understand that it's not welcome on yer land.

2. Scare it off: Make some noise! Llamas can be skittish. Wave a pirate flag, fire a cannon (if ye have one), or play some loud pirate music. That should do the trick and send the critter runnin'.

3. Build a fence: If the llama keeps returnin', it's time to fortify yer defenses. Erect a sturdy fence to keep the beast at bay. Just make sure it's higher than a llama can jump!

4. Contact local farmers or zoos: If ye can't manage the critter yerself, it's best to call in the professionals. Reach out to local farmers or zoos to see if they can relocate the llama to a more suitable habitat.

I originally encountered this problem (in identical form) while using vLLM as a Triton back-end, but it appears even in the simple example above.

When using the identical model via HF Transformers, the output is identical every time (with temperature=0).

Welcome to the real world!

You may find this surprising:

import torch
torch.use_deterministic_algorithms(True)

# construct input and weight for a 4096 -> 4096 linear layer (no bias)
input = torch.randn(4096, 4096).cuda()
weight = torch.randn(4096, 4096).cuda()

# run the forward pass
output = torch.nn.functional.linear(input, weight)

The above code will throw an error because the linear operation is not deterministic:

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    output = torch.nn.functional.linear(input, weight)
RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

In fact, many CUDA operations are non-deterministic by default.

For more details, I suggest reading https://pytorch.org/docs/stable/notes/randomness.html.

The short answer is that being fully deterministic can dramatically slow down the code, and sometimes it is not even possible.
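For plain PyTorch code, the usual knobs look like this (just a sketch; it does not cover vLLM's custom kernels or batching, and it costs performance):

import os
# cuBLAS reads this before any CUDA work; safest is to export it in the shell
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

torch.manual_seed(0)
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False

x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")
# with the workspace config set, cuBLAS takes its deterministic path,
# so the RuntimeError shown above should no longer be raised
y = torch.nn.functional.linear(x, w)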

In addition, this line of code uses cumsum, which does not have a deterministic mode, as noted in the PyTorch docs (https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html):

probs_sum = probs_sort.cumsum(dim=-1)
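For intuition, here is a toy illustration (my own example, not vLLM code) of how reduction order alone can change floating-point results, which is enough to flip a greedy choice between two near-tied tokens:

import torch

x = torch.randn(1 << 20)
s_flat = x.sum()                             # one reduction order
s_chunked = x.view(1024, 1024).sum(0).sum()  # same elements, different order
# the two sums typically differ by a tiny amount because
# floating-point addition is not associative
print(s_flat.item(), s_chunked.item(), (s_flat - s_chunked).abs().item())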

@youkaichao Thank you for the reply. I'm aware of non-deterministic CUDA ops; however, this raises the question of why the (as far as I can tell) equivalent code using plain Transformers produces the same output every time:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cuda")
prompt = "[INST] You are a pirate chatbot who always responds with Arr and pirate speak! There's a llama on my lawn, how can I get rid of him? [/INST]"
tokenizer = AutoTokenizer.from_pretrained(model_dir)  # tokenizers don't take device_map
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device="cuda")

outs = []
for _ in range(10):
    # do_sample defaults to False, so this is greedy decoding
    outs.append(tokenizer.decode(model.generate(input_ids, max_length=1000, temperature=0)[0]))
    print(outs[-1])

print(len(outs))       # 10
print(len(set(outs)))  # 1 -- every run produces identical output