utkuozdemir / nvidia_gpu_exporter

Nvidia GPU exporter for prometheus using nvidia-smi binary

Access remote metrics

DrFatalis opened this issue

I am running this exporter with docker-compose on a machine that does not have any Nvidia GPU. The command override works when I run it directly from my local host: I receive the nvidia-smi output from the remote host.

From within the container, however, the remote ssh command fails because ssh is not found.
Maybe have a look here: you will probably need to add openssh-client to the image and provide the host's SSH key so the container can connect to the remote host.

docker-compose.yml

nvidia-truenas-exporter:
  image: utkuozdemir/nvidia_gpu_exporter:1.1.0
  container_name: nvidia-truenas-exporter
  hostname: nvidia-truenas-exporter
  environment:
    - web.listen-address=":9835"
    - web.telemetry-path="/metrics"
    - nvidia-smi-command="ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null user@192.168.x.x -p yyy nvidia-smi"
    - query-field-names="AUTO"
    - log.level=info
  ports:
    - 9835:9835

Container logs:
ts=2023-03-19T10:19:49.184Z caller=exporter.go:130 level=warn msg="Failed to auto-determine query field names, falling back to the built-in list" error="error running command: exec: "nvidia-smi": executable file not found in $PATH: command failed. code: -1 | command: nvidia-smi --help-query-gpu | stdout: | stderr: "
ts=2023-03-19T10:19:49.186Z caller=main.go:84 level=info msg="Listening on address" address=:9835
ts=2023-03-19T10:19:49.186Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
ts=2023-03-19T10:12:04.680Z caller=exporter.go:184 level=error error="error running command: exec: "nvidia-smi": executable file not found in $PATH: command failed. code: -1 | command: nvidia-smi --query-gpu=driver_version,temperature.gpu,clocks.max.sm,fan.speed,memory.total,ecc.errors.uncorrected.volatile.device_memory,enforced.power.limit,persistence_mode,ecc.errors.corrected.aggregate.dram,power.default_limit,pci.domain,inforom.ecc,power.management,ecc.errors.corrected.volatile.register_file,ecc.errors.uncorrected.aggregate.l1_cache,serial,gom.pending,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.texture_memory,clocks.current.sm,clocks_throttle_reasons.active,ecc.errors.corrected.volatile.sram,power.draw,clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sync_boost,ecc.mode.pending,ecc.errors.corrected.volatile.l1_cache,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.texture_memory,memory.used,ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.cbu,temperature.memory,pci.device_id,inforom.pwr,pcie.link.width.max,utilization.memory,encoder.stats.averageLatency,ecc.errors.uncorrected.volatile.l1_cache,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.sw_thermal_slowdown,ecc.errors.corrected.volatile.dram,ecc.errors.uncorrected.aggregate.total,clocks.max.memory,uuid,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.sram,retired_pages.single_bit_ecc.count,pci.bus,vbios_version,ecc.errors.corrected.aggregate.device_memory,ecc.errors.uncorrected.volatile.register_file,retired_pages.double_bit.count,mig.mode.current,utilization.gpu,clocks.current.graphics,clocks.applications.graphics,pci.device,accounting.buffer_size,ecc.mode.current,power.limit,count,pcie.link.gen.current,ecc.errors.corrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.aggregate.device_memory,mig.mode.pending,clocks_throttle_reasons.hw_thermal_slowdown,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.dram,memory.free,encoder.stats.sessionCount,encoder.stats.averageFps,ecc.errors.corrected.aggregate.l1_cache,ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.cbu,compute_mode,driver_model.pending,ecc.errors.corrected.volatile.device_memory,ecc.errors.uncorrected.volatile.cbu,power.max_limit,clocks.default_applications.graphics,clocks.default_applications.memory,pci.bus_id,name,pcie.link.gen.max,display_mode,clocks_throttle_reasons.hw_slowdown,power.min_limit,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,retired_pages.pending,clocks.max.graphics,driver_model.current,gom.current,ecc.errors.corrected.aggregate.sram,ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.register_file,clocks.applications.memory,pcie.link.width.current,inforom.img,inforom.oem,clocks_throttle_reasons.supported,clocks.current.memory,clocks.current.video,index,display_active,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.cbu,ecc.errors.uncorrected.aggregate.sram,timestamp,accounting.mode,pci.sub_device_id,pstate --format=csv | stdout: | stderr: "

It seems you did not pass --nvidia-smi-command the right way - the error complains about the nvidia-smi binary not being present in the container, not about the ssh binary:

error running command: exec: "nvidia-smi"
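
For reference, a rough sketch of how the options could be passed as command-line flags through a compose command override instead of environment variables. This assumes the image's entrypoint is the exporter binary, so the command: entries are appended as flags; the SSH target and port are the placeholders from your example, and it still requires an ssh client inside the container - see below:

    nvidia-truenas-exporter:
      image: utkuozdemir/nvidia_gpu_exporter:1.1.0
      command:
        - "--web.listen-address=:9835"
        - "--web.telemetry-path=/metrics"
        - "--nvidia-smi-command=ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -p yyy user@192.168.x.x nvidia-smi"
        - "--query-field-names=AUTO"
        - "--log.level=info"
      ports:
        - 9835:9835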

Please see "Running in Docker" section here - you need to mount nvidia-smi binary from the host into the container, along with some other shared libraries: https://github.com/utkuozdemir/nvidia_gpu_exporter/blob/master/INSTALL.md#running-in-docker

Also, the ssh client is not included in the image, as you pointed out. But that is beyond the scope of this project - you can always write your own Dockerfile with utkuozdemir/nvidia_gpu_exporter as the base image and install the packages you need inside, building a custom image. You might need ssh; someone else might require another package/binary.
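
As a minimal sketch (the package manager call is an assumption - adjust it to whatever base OS the image actually uses):

    FROM utkuozdemir/nvidia_gpu_exporter:1.1.0
    # USER root is only needed if the base image switches to a non-root user.
    USER root
    # Assumption: a Debian/Ubuntu-based image; on an Alpine base this would be
    # "apk add --no-cache openssh-client" instead.
    RUN apt-get update && \
        apt-get install -y --no-install-recommends openssh-client && \
        rm -rf /var/lib/apt/lists/*

Build it with something like "docker build -t my-nvidia-gpu-exporter-ssh ." and point your compose file at that image instead.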

Regarding the SSH keys, again, you need to create them outside, and mount them into the container in a secure way.
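
For example (hypothetical host paths; this assumes the exporter process runs as root inside the container, so ssh looks for keys under /root/.ssh):

    nvidia-truenas-exporter:
      volumes:
        - /path/on/host/id_ed25519:/root/.ssh/id_ed25519:ro
        - /path/on/host/known_hosts:/root/.ssh/known_hosts:ro

Mounting the key read-only keeps it out of the image layers; keep its file permissions restrictive on the host, since ssh refuses to use a private key that is readable by others.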