nlpxucan / WizardLM

LLMs built upon Evol Instruct: WizardLM, WizardCoder, WizardMath


How can I use multiple GPUs for inference?

WangxuP opened this issue · comments

Here is my GPU info:

GPU info: H800 * 8
CUDA: 11.8
nvidia-smi
Mon Sep  4 10:39:54 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Graphics...  Off  | 00000000:0F:00.0 Off |                    0 |
| N/A   30C    P2    65W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA Graphics...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   29C    P2    67W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA Graphics...  Off  | 00000000:48:00.0 Off |                    0 |
| N/A   30C    P2    67W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |

Here is my inference code:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# create tokenizer
base_model = "/home/WizardCoder-15B-V1.0/"
tokenizer = AutoTokenizer.from_pretrained(base_model)

# base model, sharded across the visible GPUs by device_map="auto"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# LoRA PEFT adapters
adapter_model = "/home/adapter_model"

model = PeftModel.from_pretrained(
    model,
    adapter_model,
    # torch_dtype=torch.float16,
)
model.eval()


# prompt: "Write a SQL statement that uses the dual table to show the current time"
prompt = "请写一个sql, 使用dual表查看当前时间"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to("cuda")

# Generate
generate_ids = model.generate(input_ids=input_ids, max_new_tokens=30)
print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

Here is how I run it, and the result:

(base) [root@localhost WizardCoder]# CUDA_VISIBLE_DEVICES=6,7 /home/wxp/python/pythonwizard/bin/python3 DEMCoder_test_v2.py
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
请写一个sql, 使用dual表查看当前时间戚Fartherthanrrayrraypadder殊 Provide
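As an aside, the two warnings in the log can usually be avoided by passing the attention mask and a pad token id to generate() explicitly; a minimal sketch reusing the tokenizer, model, and prompt defined above (it silences the warnings, though it may not by itself explain the garbled output):

# Pass the attention mask and pad token id explicitly to generate()
inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(
    input_ids=inputs["input_ids"].to("cuda"),
    attention_mask=inputs["attention_mask"].to("cuda"),
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=30,
)
print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0])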

Why does this happen? How can I use multiple GPUs for inference? Please help, thanks!
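For reference, device_map="auto" already shards the model across whatever GPUs CUDA_VISIBLE_DEVICES exposes; if you want to control the split, from_pretrained also accepts an explicit max_memory map. A minimal sketch, where the 40GiB caps are hypothetical values to adjust for your cards:

import torch
from transformers import AutoModelForCausalLM

base_model = "/home/WizardCoder-15B-V1.0/"

# device_map="auto" lets accelerate place layers on the visible GPUs;
# max_memory caps how much each visible device (0, 1, ...) may receive.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "40GiB", 1: "40GiB"},  # hypothetical caps, not measured values
)

# hf_device_map records which device each module was placed on
print(model.hf_device_map)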

When I used the inference code you provided, there was still a problem, shown below:

CUDA_VISIBLE_DEVICES=6,7 python src/inference_wizardcoder.py \
    --base_model "WizardCoder-15B-V1.0" \
    --input_data_path "data.jsonl" \
    --output_data_path "result.jsonl"

The contents of input_data_path are shown below:

{"idx": 11, "Instruction": "Write a Python code to count 1 to 10."}

I get an erroneous result:

{"id": 0, "instruction": "Write a Python code to count 1 to 10.", "wizardcoder": "```pythonrrays =rrayrrayss = ArraysWithsWithoutrraysss = =   ityityrrayrrayss = ArraysWithsWithsWithout including including includingCOMMCOMMCOCOapodsapodsuppeuppeanoanoanoanoanoanoanoanoanorrayrrayrrayrrayrrays =ViewDatailsabout howabout how howaboutaboutrrayrrayrrayrrayss = = =cutcutcutrrayCOUNTCOUNTCOUNTetcetcetc"}

Why does this happen?

CUDA_VISIBLE_DEVICES=6,7 python src/inference_wizardcoder.py \
    --base_model "WizardCoder-15B-V1.0" \
    --input_data_path "data.jsonl" \
    --output_data_path "result.jsonl"

This works fine on our machine.
Which version of transformers do you use?
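For completeness, a quick way to report the relevant versions from the active environment (a small sketch; `pip show transformers` works just as well):

import torch
import transformers
import accelerate

# Versions that matter most for sharded loading and generation
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("CUDA available:", torch.cuda.is_available(), "| visible devices:", torch.cuda.device_count())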

@ChiYeungLaw


Here is my package info:

Package                  Version
------------------------ ----------
accelerate               0.20.3
aiofiles                 23.2.1
aiohttp                  3.8.5
aiosignal                1.3.1
annotated-types          0.5.0
async-timeout            4.0.3
attrs                    23.1.0
black                    23.3.0
certifi                  2023.5.7
charset-normalizer       3.1.0
cmake                    3.26.4
dataclasses-json         0.5.14
filelock                 3.12.2
fire                     0.5.0
flake8                   6.0.0
frozenlist               1.4.0
fsspec                   2023.6.0
greenlet                 2.0.2
h11                      0.9.0
html5tagger              1.3.0
httpcore                 0.11.1
httptools                0.6.0
httpx                    0.15.4
huggingface-hub          0.15.1
idna                     3.4
Jinja2                   3.1.2
langchain                0.0.271
langsmith                0.0.26
lit                      16.0.6
MarkupSafe               2.1.3
marshmallow              3.20.1
mccabe                   0.7.0
mpmath                   1.3.0
multidict                5.2.0
mypy-extensions          1.0.0
networkx                 3.1
numexpr                  2.8.5
numpy                    1.25.0
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-cupti-cu11   11.7.101
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.2.10.91
nvidia-cusolver-cu11     11.4.0.1
nvidia-cusparse-cu11     11.7.4.91
nvidia-nccl-cu11         2.14.3
nvidia-nvtx-cu11         11.7.91
packaging                23.1
pathspec                 0.11.1
pip                      23.1.2
psutil                   5.9.5
pydantic                 2.2.1
pydantic_core            2.6.1
pyflakes                 3.0.1
PyYAML                   6.0
regex                    2023.6.3
requests                 2.31.0
rfc3986                  1.5.0
safetensors              0.3.1
sanic                    20.12.6
sanic-routing            23.6.0
setuptools               58.1.0
six                      1.16.0
sniffio                  1.3.0
SQLAlchemy               2.0.20
sympy                    1.12
tenacity                 8.2.3
termcolor                2.3.0
tokenizers               0.13.3
torch                    2.0.1
torch-tb-profiler        0.4.1
tqdm                     4.65.0
tracerite                1.1.0
transformers             4.29.0
triton                   2.0.0
typing_extensions        4.6.3
typing-inspect           0.9.0
ujson                    5.8.0
urllib3                  1.26.7
utils                    1.0.1
uvloop                   0.17.0
websockets               9.1
wheel                    0.40.0
yarl                     1.9.2

@ChiYeungLaw
Can you tell me about the machine you are testing on? For example, the GPU, CUDA, and Python versions? Thanks!

torch==2.0.1
transformers==4.29.2
2xV100 32GiB 
python==3.10
cuda==11.4

@ChiYeungLaw
My guess is that the issue may be caused by a compatibility problem between CUDA 11.8 and the H800 GPUs. Since the rented server has already expired, I cannot verify the transformers question for now.
Another question: we know the WizardCoder-15B-V1.0 model is about 32 GB. After loading it in single-machine multi-GPU (2 x V100) mode, will the memory consumption on each card be around 32 GB / 2 = 16 GB? Could you help verify this, and post the GPU memory usage for a single-GPU run and for a 2-GPU run? Thank you!
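One way to check this guess empirically is to print the per-GPU allocation right after loading; a minimal sketch, assuming `model` is the object returned by from_pretrained with device_map="auto" as in the code above:

import torch

# Report how much memory each visible GPU actually holds after loading
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"GPU {i}: allocated {allocated:.1f} GiB, reserved {reserved:.1f} GiB")

# Shows which layers landed on which device
print(model.hf_device_map)

With float16 weights the ~15B parameters come to roughly 30 GiB, so something around 15-16 GiB per card is the expected ballpark for an even split, though device_map="auto" does not always divide the layers perfectly evenly.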