tensorflow / serving

A flexible, high-performance serving system for machine learning models

Home Page: https://www.tensorflow.org/serving


Can't use a single GPU when my computer has 2 GPUs

yuxiazff opened this issue

Bug Report

I can't use a single GPU when my computer has 2 GPUs.

System information

  • CentOS Linux release 7.9.2009 (Core)
  • tensorflow/serving:2.5.3-gpu

Describe the problem

I installed the NVIDIA Container Toolkit following this document:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#installing-on-centos-7-8
[screenshot]

My computer has 2 GPUs, but I only want to use one, so I ran the following command:
docker run --gpus device=0 --rm --name my_model --privileged=true -p 8209:8501 -v /home/dell/test/models:/models tensorflow/serving:2.5.3-gpu --model_name=dog_nose_quality --model_base_path=/models/my_model

but it still uses both GPUs. The nvidia-smi output is as follows:
[screenshot: nvidia-smi output]

How can I use only one GPU?

Exact Steps to Reproduce

(1) Install the NVIDIA Container Toolkit following the document:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#installing-on-centos-7-8
(2) Start tensorflow/serving:2.5.3-gpu

Source code / logs

When I run:
docker run --gpus device=0 --rm --name my_model --privileged=true -p 8209:8501 -v /home/dell/test/models:/models tensorflow/serving:2.5.3-gpu --model_name=dog_nose_quality --model_base_path=/models/my_model
the output is:
2022-10-18 12:27:19.128636: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-10-18 12:27:19.175523: I tensorflow_serving/model_servers/server.cc:89] Building single TensorFlow model file config: model_name: dog_nose_quality model_base_path: /models/my_model
2022-10-18 12:27:19.175807: I tensorflow_serving/model_servers/server_core.cc:465] Adding/updating models.
2022-10-18 12:27:19.175823: I tensorflow_serving/model_servers/server_core.cc:591] (Re-)adding model: dog_nose_quality
2022-10-18 12:27:19.276589: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: dog_nose_quality version: 1}
2022-10-18 12:27:19.276631: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: dog_nose_quality version: 1}
2022-10-18 12:27:19.276653: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: dog_nose_quality version: 1}
2022-10-18 12:27:19.276764: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:38] Reading SavedModel from: /models/my_model/1
2022-10-18 12:27:19.355522: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:90] Reading meta graph with tags { serve }
2022-10-18 12:27:19.355561: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:132] Reading SavedModel debug info (if present) from: /models/my_model/1
2022-10-18 12:27:19.355710: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-18 12:27:19.360819: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-10-18 12:27:20.082965: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:3b:00.0 name: Tesla T4 computeCapability: 7.5 coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.56GiB deviceMemoryBandwidth: 298.08GiB/s
2022-10-18 12:27:20.084562: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: pciBusID: 0000:af:00.0 name: Tesla T4 computeCapability: 7.5 coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.56GiB deviceMemoryBandwidth: 298.08GiB/s
2022-10-18 12:27:20.084584: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-10-18 12:27:20.088378: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2022-10-18 12:27:20.088414: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2022-10-18 12:27:20.089626: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-10-18 12:27:20.089869: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-10-18 12:27:20.090880: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2022-10-18 12:27:20.091814: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2022-10-18 12:27:20.091921: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-10-18 12:27:20.097816: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
2022-10-18 12:27:20.931434: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-10-18 12:27:20.931467: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
2022-10-18 12:27:20.931475: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y
2022-10-18 12:27:20.931479: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N
2022-10-18 12:27:20.939362: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 12270 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:3b:00.0, compute capability: 7.5)
2022-10-18 12:27:20.941383: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 13606 MB memory) -> physical GPU (device: 1, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)
2022-10-18 12:27:21.204513: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring SavedModel bundle.
2022-10-18 12:27:21.232633: I external/org_tensorflow/tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2700000000 Hz
2022-10-18 12:27:22.278517: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running initialization op on SavedModel bundle at path: /models/my_model/1
2022-10-18 12:27:22.478408: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: success: OK. Took 3201643 microseconds.
2022-10-18 12:27:22.515179: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /models/my_model/1/assets.extra/tf_serving_warmup_requests
2022-10-18 12:27:22.552538: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: dog_nose_quality version: 1}
2022-10-18 12:27:22.584041: I tensorflow_serving/model_servers/server_core.cc:486] Finished adding/updating models
2022-10-18 12:27:22.584203: I tensorflow_serving/model_servers/server.cc:367] Profiler service is enabled
2022-10-18 12:27:22.586540: I tensorflow_serving/model_servers/server.cc:393] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2022-10-18 12:27:22.603727: I tensorflow_serving/model_servers/server.cc:414] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 245] NET_LOG: Entering the event loop ...

@yuxiazff,

Could you please try passing the device parameter as described in the NVIDIA Docker user guide?

The device parameter should be wrapped in single quotes, with double quotes inside around the list of devices you want exposed to the container. For example, '"device=2,3"' will expose GPUs 2 and 3 to the container.
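The quoting matters because the shell strips the outer single quotes before Docker sees the argument; only the nested form delivers the double quotes that Docker's --gpus parser needs for a comma-separated device list. A minimal sketch of what each quoting style actually passes through (using printf as a stand-in for Docker's argument parsing):

```shell
# What the program on the receiving end sees under each quoting style.
# The shell removes one layer of quotes; only the last form preserves the
# inner double quotes around the device list.
printf '%s\n' device=2,3      # unquoted        -> device=2,3
printf '%s\n' "device=2,3"    # double quotes   -> device=2,3
printf '%s\n' '"device=2,3"'  # nested quoting  -> "device=2,3"
```

With a single device and no comma, a bare device=0 often parses too, which is why the simpler form appears earlier in this thread.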

You can also query the GPU UUID using nvidia-smi and pass that to the device parameter, as described in the Docker docs.

Query to fetch the UUID: nvidia-smi -i 3 --query-gpu=uuid --format=csv

Thank you!

@singhniraj08
Thank you for your answer. I ran the command; the outputs are as follows:

(base) [root@znjz-yxr /]# nvidia-smi -i 0 --query-gpu=uuid --format=csv
uuid
GPU-039d7c8f-a03d-7e7e-ed77-6c1d9f46ef10
(base) [root@znjz-yxr /]# nvidia-smi -i 1 --query-gpu=uuid --format=csv
uuid
GPU-919f9c59-c3c1-8bd1-dd3a-9d5b6d0d3caf

[screenshot]

@yuxiazff,

Please use the UUID of the specific GPU you want in the device parameter and let us know if it works. Please refer to the Docker docs for reference.

Example command: docker run -it --rm --gpus device=GPU-3a23c669-1f69-c64e-cf85-44e9b07e7a2a ubuntu nvidia-smi

Thank you!

@singhniraj08
The UUID of the specific GPU does not work. I ran the following command:
docker run --gpus device=GPU-039d7c8f-a03d-7e7e-ed77-6c1d9f46ef10 --rm --name my_model --privileged=true -p 8209:8501 -v /home/dell/test/models:/models tensorflow/serving:2.5.3-gpu --model_name=dog_nose_quality --model_base_path=/models/my_model

The nvidia-smi output before and after running the command is as follows:
[screenshot: nvidia-smi before/after]

@yuxiazff,

It looks like this issue is with the docker command; could you please report it to Docker support?

Meanwhile, please try the command below and let us know if it works. Thank you!

docker run --gpus '"device=0"' --rm --name my_model --privileged=true -p 8209:8501 -v /home/dell/test/models:/models tensorflow/serving:2.5.3-gpu --model_name=dog_nose_quality --model_base_path=/models/my_model

@singhniraj08,
The command does not work:
docker run --gpus '"device=0"' --rm --name my_model --privileged=true -p 8209:8501 -v /home/dell/test/models:/models tensorflow/serving:2.5.3-gpu --model_name=my_model --model_base_path=/models/my_model

The nvidia-smi output is:
[screenshot: nvidia-smi output]

The docker run log is:
(tf2.5.0-20220726) [root@znjz-yxr ai_dog_kingdom]# docker run --gpus '"device=0"' --rm --name my_model --privileged=true -p 8209:8501 -v /home/dell/test/models:/models tensorflow/serving:2.5.3-gpu --model_name=my_model --model_base_path=/models/my_model
WARNING: IPv4 forwarding is disabled. Networking will not work.
2022-10-30 01:32:40.528614: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-10-30 01:32:40.571220: I tensorflow_serving/model_servers/server.cc:89] Building single TensorFlow model file config: model_name: my_model model_base_path: /models/my_model
2022-10-30 01:32:40.571486: I tensorflow_serving/model_servers/server_core.cc:465] Adding/updating models.
2022-10-30 01:32:40.571502: I tensorflow_serving/model_servers/server_core.cc:591] (Re-)adding model: my_model
2022-10-30 01:32:40.672308: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: my_model version: 1}
2022-10-30 01:32:40.672358: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: my_model version: 1}
2022-10-30 01:32:40.672379: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: my_model version: 1}
2022-10-30 01:32:40.672459: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:38] Reading SavedModel from: /models/my_model/1
2022-10-30 01:32:40.752994: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:90] Reading meta graph with tags { serve }
2022-10-30 01:32:40.753042: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:132] Reading SavedModel debug info (if present) from: /models/my_model/1
2022-10-30 01:32:40.753155: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-30 01:32:40.758703: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-10-30 01:32:41.494009: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:3b:00.0 name: Tesla T4 computeCapability: 7.5 coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.56GiB deviceMemoryBandwidth: 298.08GiB/s
2022-10-30 01:32:41.495619: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: pciBusID: 0000:af:00.0 name: Tesla T4 computeCapability: 7.5 coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.56GiB deviceMemoryBandwidth: 298.08GiB/s
2022-10-30 01:32:41.495637: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-10-30 01:32:41.499320: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2022-10-30 01:32:41.499352: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2022-10-30 01:32:41.500574: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-10-30 01:32:41.500815: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-10-30 01:32:41.501838: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2022-10-30 01:32:41.502763: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2022-10-30 01:32:41.502865: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-10-30 01:32:41.508887: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
2022-10-30 01:32:42.443347: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-10-30 01:32:42.443383: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
2022-10-30 01:32:42.443392: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y
2022-10-30 01:32:42.443396: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N
2022-10-30 01:32:42.451485: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 12270 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:3b:00.0, compute capability: 7.5)
2022-10-30 01:32:42.453508: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 13606 MB memory) -> physical GPU (device: 1, name: Tesla T4, pci bus id: 0000:af:00.0, compute capability: 7.5)
2022-10-30 01:32:42.705575: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring SavedModel bundle.
2022-10-30 01:32:42.751727: I external/org_tensorflow/tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2700000000 Hz
2022-10-30 01:32:43.855474: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running initialization op on SavedModel bundle at path: /models/my_model/1
2022-10-30 01:32:44.082625: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: success: OK. Took 3410168 microseconds.
2022-10-30 01:32:44.107563: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /models/my_model/1/assets.extra/tf_serving_warmup_requests
2022-10-30 01:32:44.143523: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: my_model version: 1}
2022-10-30 01:32:44.178765: I tensorflow_serving/model_servers/server_core.cc:486] Finished adding/updating models
2022-10-30 01:32:44.178963: I tensorflow_serving/model_servers/server.cc:367] Profiler service is enabled
2022-10-30 01:32:44.181207: I tensorflow_serving/model_servers/server.cc:393] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2022-10-30 01:32:44.199480: I tensorflow_serving/model_servers/server.cc:414] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 245] NET_LOG: Entering the event loop ...

@singhniraj08,
Should I report the problem to Docker or nvidia-docker?

@yuxiazff, Please post this issue on nvidia-docker. Thank you!

You need to remove --privileged. That flag gives the container access to all devices on the machine (including all GPUs), regardless of your --gpus setting.
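For reference, this would be the serving command from the thread with --privileged=true dropped, so the NVIDIA runtime only injects the device named by --gpus. A sketch, reusing the paths, ports, and model name from the messages above; running it requires Docker with the NVIDIA Container Toolkit installed:

```shell
# Same invocation as before, minus --privileged=true (which exposed every
# host device to the container and defeated the --gpus restriction).
docker run --gpus '"device=0"' --rm --name my_model \
  -p 8209:8501 \
  -v /home/dell/test/models:/models \
  tensorflow/serving:2.5.3-gpu \
  --model_name=my_model --model_base_path=/models/my_model
```

After starting it, nvidia-smi on the host should show the serving process on GPU 0 only.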

@klueska
It works! Thank you!

@yuxiazff,

Kindly let us know if this issue can be closed, since the resolution is provided above.

@singhniraj08
Yes, this issue can be closed. Thank you for your help too!

Thank you for your confirmation.