tensorflow / serving

A flexible, high-performance serving system for machine learning models

Home Page: https://www.tensorflow.org/serving


Can't use a single GPU when my computer has 2 GPUs

yuxiazff opened this issue

Bug Report

I can't use a single GPU when my computer has 2 GPUs.

System information

  • CentOS Linux release 7.9.2009 (Core)
  • tensorflow/serving:2.5.3-gpu

Describe the problem

I installed the NVIDIA Container Toolkit following this document:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#installing-on-centos-7-8
[screenshot]

My computer has 2 GPUs, but I only want to use one, so I ran the following command:
docker run --gpus device=0 --rm --name my_model --privileged=true -p 8209:8501 -v /home/dell/test/models:/models tensorflow/serving:2.5.3-gpu --model_name=dog_nose_quality --model_base_path=/models/my_model

but it still uses both GPUs. The nvidia-smi output is as follows:
[screenshot: nvidia-smi output]

How can I use only one GPU?

Exact Steps to Reproduce

(1) Install the NVIDIA Container Toolkit following the document:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#installing-on-centos-7-8
(2) Start tensorflow/serving:2.5.3-gpu

Source code / logs

When I run:
docker run --gpus device=0 --rm --name my_model --privileged=true -p 8209:8501 -v /home/dell/test/models:/models tensorflow/serving:2.5.3-gpu --model_name=dog_nose_quality --model_base_path=/models/my_model
the output is:
2022-10-18 12:27:19.128636: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-10-18 12:27:19.175523: I tensorflow_serving/model_servers/server.cc:89] Building single TensorFlow model file config: model_name: dog_nose_quality model_base_path: /models/my_model
2022-10-18 12:27:19.175807: I tensorflow_serving/model_servers/server_core.cc:465] Adding/updating models.
2022-10-18 12:27:19.175823: I tensorflow_serving/model_servers/server_core.cc:591] (Re-)adding model: dog_nose_quality
2022-10-18 12:27:19.276589: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: dog_nose_quality version: 1}
2022-10-18 12:27:19.276631: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: dog_nose_quality version: 1}
2022-10-18 12:27:19.276653: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: dog_nose_quality version: 1}
2022-10-18 12:27:19.276764: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:38] Reading SavedModel from: /models/my_model/1
2022-10-18 12:27:19.355522: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:90] Reading meta graph with tags { serve }
2022-10-18 12:27:19.355561: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:132] Reading SavedModel debug info (if present) from: /models/my_model/1
2022-10-18 12:27:19.355710: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-18 12:27:19.360819: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-10-18 12:27:20.082965: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:3b:00.0 name: Tesla T4 computeCapability: 7.5 coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.56GiB deviceMemoryBandwidth: 298.08GiB/s
2022-10-18 12:27:20.084562: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: pciBusID: 0000:af:00.0 name: Tesla T4 computeCapability: 7.5 coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.56GiB deviceMemoryBandwidth: 298.08GiB/s
2022-10-18 12:27:20.084584: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-10-18 12:27:20.088378: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2022-10-18 12:27:20.088414: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2022-10-18 12:27:20.089626: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-10-18 12:27:20.089869: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-10-18 12:27:20.090880: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2022-10-18 12:27:20.091814: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2022-10-18 12:27:20.091921: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-10-18 12:27:20.097816: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
2022-10-18 12:27:20.931434: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-10-18 12:27:20.931467: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
2022-10-18 12:27:20.931475: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y
2022-10-18 12:27:20.931479: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N
2022-10-18 12:27:20.939362: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 12270 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:3b:00.0, compute capability: 7.5)
2022-10-18 12:27:20.941383: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 13606 MB memory) -> physical GPU (device: 1, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)
2022-10-18 12:27:21.204513: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring SavedModel bundle.
2022-10-18 12:27:21.232633: I external/org_tensorflow/tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2700000000 Hz
2022-10-18 12:27:22.278517: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running initialization op on SavedModel bundle at path: /models/my_model/1
2022-10-18 12:27:22.478408: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: success: OK. Took 3201643 microseconds.
2022-10-18 12:27:22.515179: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /models/my_model/1/assets.extra/tf_serving_warmup_requests
2022-10-18 12:27:22.552538: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: dog_nose_quality version: 1}
2022-10-18 12:27:22.584041: I tensorflow_serving/model_servers/server_core.cc:486] Finished adding/updating models
2022-10-18 12:27:22.584203: I tensorflow_serving/model_servers/server.cc:367] Profiler service is enabled
2022-10-18 12:27:22.586540: I tensorflow_serving/model_servers/server.cc:393] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2022-10-18 12:27:22.603727: I tensorflow_serving/model_servers/server.cc:414] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 245] NET_LOG: Entering the event loop ...

@yuxiazff,

Could you please try passing the device parameter as described in the NVIDIA Docker user guide?

The device parameter should be wrapped in single quotes, with double quotes inside around the list of devices you want exposed to the container. For example, '"device=2,3"' will expose GPUs 2 and 3 to the container.
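The quoting matters because the shell strips the outer single quotes before Docker sees the argument; only the nested form delivers the double quotes that Docker's --gpus parser needs for a comma-separated device list. A minimal sketch of what each quoting style actually passes through (using printf as a stand-in for Docker's argument parsing):

```shell
# What the program on the receiving end sees under each quoting style.
# The shell removes one layer of quotes; only the last form preserves the
# inner double quotes around the device list.
printf '%s\n' device=2,3      # unquoted        -> device=2,3
printf '%s\n' "device=2,3"    # double quotes   -> device=2,3
printf '%s\n' '"device=2,3"'  # nested quoting  -> "device=2,3"
```

With a single device and no comma, a bare device=0 often parses too, which is why the simpler form appears earlier in this thread.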

You can also query the GPU UUID using nvidia-smi and pass that to the device parameter, as described in the Docker docs.

Query to fetch the UUID: nvidia-smi -i 3 --query-gpu=uuid --format=csv

Thank you!

@singhniraj08
Thank you for your answer. I ran the command; the outputs are as follows:

(base) [root@znjz-yxr /]# nvidia-smi -i 0 --query-gpu=uuid --format=csv
uuid
GPU-039d7c8f-a03d-7e7e-ed77-6c1d9f46ef10
(base) [root@znjz-yxr /]# nvidia-smi -i 1 --query-gpu=uuid --format=csv
uuid
GPU-919f9c59-c3c1-8bd1-dd3a-9d5b6d0d3caf

[screenshot]

@yuxiazff,

Please use the UUID of the specific GPU you want in the device parameter and let us know if it works. Please refer to the Docker docs for reference.

Example command: docker run -it --rm --gpus device=GPU-3a23c669-1f69-c64e-cf85-44e9b07e7a2a ubuntu nvidia-smi

Thank you!

@singhniraj08
The UUID of the specific GPU does not work. I ran the following command:
docker run --gpus device=GPU-039d7c8f-a03d-7e7e-ed77-6c1d9f46ef10 --rm --name my_model --privileged=true -p 8209:8501 -v /home/dell/test/models:/models tensorflow/serving:2.5.3-gpu --model_name=dog_nose_quality --model_base_path=/models/my_model

The nvidia-smi output before and after running the command is as follows:
[screenshot: nvidia-smi before/after]

@yuxiazff,

It looks like this issue is with the docker command; could you please report it to Docker support?

Meanwhile, please try the command below and let us know if it works. Thank you!

docker run --gpus '"device=0"' --rm --name my_model --privileged=true -p 8209:8501 -v /home/dell/test/models:/models tensorflow/serving:2.5.3-gpu --model_name=dog_nose_quality --model_base_path=/models/my_model

@singhniraj08,
The command does not work:
docker run --gpus '"device=0"' --rm --name my_model --privileged=true -p 8209:8501 -v /home/dell/test/models:/models tensorflow/serving:2.5.3-gpu --model_name=my_model --model_base_path=/models/my_model

The nvidia-smi output is:
[screenshot: nvidia-smi output]

The docker run log is:
(tf2.5.0-20220726) [root@znjz-yxr ai_dog_kingdom]# docker run --gpus '"device=0"' --rm --name my_model --privileged=true -p 8209:8501 -v /home/dell/test/models:/models tensorflow/serving:2.5.3-gpu --model_name=my_model --model_base_path=/models/my_model
WARNING: IPv4 forwarding is disabled. Networking will not work.
2022-10-30 01:32:40.528614: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-10-30 01:32:40.571220: I tensorflow_serving/model_servers/server.cc:89] Building single TensorFlow model file config: model_name: my_model model_base_path: /models/my_model
2022-10-30 01:32:40.571486: I tensorflow_serving/model_servers/server_core.cc:465] Adding/updating models.
2022-10-30 01:32:40.571502: I tensorflow_serving/model_servers/server_core.cc:591] (Re-)adding model: my_model
2022-10-30 01:32:40.672308: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: my_model version: 1}
2022-10-30 01:32:40.672358: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: my_model version: 1}
2022-10-30 01:32:40.672379: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: my_model version: 1}
2022-10-30 01:32:40.672459: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:38] Reading SavedModel from: /models/my_model/1
2022-10-30 01:32:40.752994: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:90] Reading meta graph with tags { serve }
2022-10-30 01:32:40.753042: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:132] Reading SavedModel debug info (if present) from: /models/my_model/1
2022-10-30 01:32:40.753155: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-30 01:32:40.758703: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-10-30 01:32:41.494009: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:3b:00.0 name: Tesla T4 computeCapability: 7.5 coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.56GiB deviceMemoryBandwidth: 298.08GiB/s
2022-10-30 01:32:41.495619: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: pciBusID: 0000:af:00.0 name: Tesla T4 computeCapability: 7.5 coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.56GiB deviceMemoryBandwidth: 298.08GiB/s
2022-10-30 01:32:41.495637: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-10-30 01:32:41.499320: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2022-10-30 01:32:41.499352: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2022-10-30 01:32:41.500574: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-10-30 01:32:41.500815: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-10-30 01:32:41.501838: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2022-10-30 01:32:41.502763: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2022-10-30 01:32:41.502865: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-10-30 01:32:41.508887: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
2022-10-30 01:32:42.443347: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-10-30 01:32:42.443383: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
2022-10-30 01:32:42.443392: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y
2022-10-30 01:32:42.443396: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N
2022-10-30 01:32:42.451485: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 12270 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:3b:00.0, compute capability: 7.5)
2022-10-30 01:32:42.453508: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 13606 MB memory) -> physical GPU (device: 1, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)
2022-10-30 01:32:42.705575: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring SavedModel bundle.
2022-10-30 01:32:42.751727: I external/org_tensorflow/tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2700000000 Hz
2022-10-30 01:32:43.855474: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running initialization op on SavedModel bundle at path: /models/my_model/1
2022-10-30 01:32:44.082625: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: success: OK. Took 3410168 microseconds.
2022-10-30 01:32:44.107563: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /models/my_model/1/assets.extra/tf_serving_warmup_requests
2022-10-30 01:32:44.143523: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: my_model version: 1}
2022-10-30 01:32:44.178765: I tensorflow_serving/model_servers/server_core.cc:486] Finished adding/updating models
2022-10-30 01:32:44.178963: I tensorflow_serving/model_servers/server.cc:367] Profiler service is enabled
2022-10-30 01:32:44.181207: I tensorflow_serving/model_servers/server.cc:393] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2022-10-30 01:32:44.199480: I tensorflow_serving/model_servers/server.cc:414] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 245] NET_LOG: Entering the event loop ...

@singhniraj08,
Should I report the problem to Docker or nvidia-docker?

@yuxiazff, Please post this issue on nvidia-docker. Thank you!

You need to remove --privileged. That flag gives the container access to all devices on the machine (including all GPUs), regardless of your --gpus setting.
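For reference, this would be the serving command from the thread with --privileged=true dropped, so the NVIDIA runtime only injects the device named by --gpus. A sketch, reusing the paths, ports, and model name from the messages above; running it requires Docker with the NVIDIA Container Toolkit installed:

```shell
# Same invocation as before, minus --privileged=true (which exposed every
# host device to the container and defeated the --gpus restriction).
docker run --gpus '"device=0"' --rm --name my_model \
  -p 8209:8501 \
  -v /home/dell/test/models:/models \
  tensorflow/serving:2.5.3-gpu \
  --model_name=my_model --model_base_path=/models/my_model
```

After starting it, nvidia-smi on the host should show the serving process on GPU 0 only.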

@klueska
It works! Thank you!

@yuxiazff,

Kindly let us know if this issue can be closed, since the resolution is provided above.

@singhniraj08
Yes, this issue can be closed. Thank you for your help too!

Thank you for your confirmation.