utkuozdemir / nvidia_gpu_exporter

Nvidia GPU exporter for prometheus using nvidia-smi binary

Running Exporter Causes Stuttering in All Games

laetha opened this issue · comments

I've been using this amazing exporter for a month or two now, but I noticed that in almost all games (and even some videos) there'd be some pretty constant stuttering. This would occur once every 30 seconds or so and would essentially look like someone pressed pause and then resume really quickly.

I troubleshooted everything under the sun. I ran DDU, did a full Windows Reinstall, disabled/uninstalled any overlays I had running. It turns out the culprit was this exporter. Disabling the exporter made the problem go away immediately.

I also run the Prometheus Windows Exporter (https://github.com/prometheus-community/windows_exporter) and it doesn't seem to cause the same issue.

Unfortunately I don't really have any other info to share with you about this, and maybe there's nothing that can be done, but I thought I'd mention it in case there is a possible solution.

My Main Specs:
AMD Ryzen 3900x
EVGA 3080 Ultra
1TB NVME
64GB DDR4 @3600

Thanks!

Thanks, I think the issue might be caused by the nvidia-smi command running under the hood.

You can try running the following command while the game is still running (in windowed mode, or by running the command on a schedule or remotely):

nvidia-smi --query-gpu="timestamp,driver_version,count,name,serial,uuid,pci.bus_id,pci.domain,pci.bus,pci.device,pci.device_id,pci.sub_device_id,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max,index,display_mode,display_active,persistence_mode,accounting.mode,accounting.buffer_size,driver_model.current,driver_model.pending,vbios_version,inforom.img,inforom.oem,inforom.ecc,inforom.pwr,gom.current,gom.pending,fan.speed,pstate,clocks_throttle_reasons.supported,clocks_throttle_reasons.active,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,clocks_throttle_reasons.sync_boost,memory.total,memory.used,memory.free,compute_mode,utilization.gpu,utilization.memory,encoder.stats.sessionCount,encoder.stats.averageFps,encoder.stats.averageLatency,ecc.mode.current,ecc.mode.pending,ecc.errors.corrected.volatile.device_memory,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,ecc.errors.corrected.volatile.l1_cache,ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.texture_memory,ecc.errors.corrected.volatile.cbu,ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.device_memory,ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.register_file,ecc.errors.corrected.aggregate.l1_cache,ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.corrected.aggregate.cbu,ecc.errors.corrected.aggregate.sram,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.device_memory,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l1_cache,ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.cbu,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.device_memory,ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.l1_cache,ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.cbu,ecc.errors.uncorrected.aggregate.sram,ecc.errors.uncorrected.aggregate.total,retired_pages.single_bit_ecc.count,retired_pages.double_bit.count,retired_pages.pending,temperature.gpu,temperature.memory,power.management,power.draw,power.limit,enforced.power.limit,power.default_limit,power.min_limit,power.max_limit,clocks.current.graphics,clocks.current.sm,clocks.current.memory,clocks.current.video,clocks.applications.graphics,clocks.applications.memory,clocks.default_applications.graphics,clocks.default_applications.memory,clocks.max.graphics,clocks.max.sm,clocks.max.memory,mig.mode.current,mig.mode.pending" --format=csv

And see if the stutter happens.

If it happens, it is not the exporter's fault but nvidia-smi's.

In that case, you could try to query fewer fields at a time and see if you find a field that is the culprit.
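One way to narrow this down, sketched here in Python, is to split the field list into small groups and run nvidia-smi once per group with a pause in between, so that a visible stutter can be matched to a group. The helper names (`chunk_fields`, `build_command`, `time_groups`), the group size and the pause length are all made up for illustration; only the `--query-gpu` and `--format=csv` options come from the command above.

```python
import subprocess
import time


def chunk_fields(fields, size):
    """Split a list of field names into consecutive groups of at most `size`."""
    return [fields[i:i + size] for i in range(0, len(fields), size)]


def build_command(fields, binary="nvidia-smi"):
    """Build the nvidia-smi argument list that queries only `fields`."""
    return [binary, "--query-gpu=" + ",".join(fields), "--format=csv"]


def time_groups(fields, size=10, pause=5.0):
    """Run each group separately and print how long the call took.

    The pause between calls (an arbitrary choice) leaves time to notice
    whether the game stuttered when a particular group was queried.
    """
    for group in chunk_fields(fields, size):
        start = time.perf_counter()
        subprocess.run(build_command(group), capture_output=True, check=False)
        elapsed = time.perf_counter() - start
        print(f"{elapsed:.3f}s  {','.join(group)}")
        time.sleep(pause)


# Example call (requires nvidia-smi on PATH); the field list is a small
# illustrative subset of the full query shown earlier:
# time_groups(["temperature.gpu", "power.draw", "fan.speed", "pstate"], size=2)
```

Once a group reproduces the stutter, the same function can be called again on just that group with `size=1` to isolate a single field. The printed wall-clock time is only a hint; what matters is whether the game hitches.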

If you succeed with that, then you can specify the query field names explicitly, omitting the field with the issue: https://github.com/utkuozdemir/nvidia_gpu_exporter/blob/master/CONFIGURE.md#command-line-reference.

Also, you can consider trying different driver versions - it might be a driver issue.

I appreciate the feedback. I tested this while running 3DMark TimeSpy just so I could have a consistent GPU load to compare.

Without your GPU Exporter running, smooth. With it running, hitches every 5-10 seconds. Most notably, they seemed to happen at the same spots every time. I did try running the command you gave me while the benchmark was running. It's hard to pin down, but it LOOKED like there may have been a stutter every time I executed that command as well, in addition to the usual intermittent ones.

I might go through field by field and try to find a culprit, but for now I thought I'd let you know the result, thanks!

Nice. The stutters you mention that happen every 5-10 seconds are consistent with the stutter you see when you run the nvidia-smi command manually. The way Prometheus works is that it hits the exporter's HTTP endpoint every 15 seconds or so (or whatever scrape_interval it is configured with), and every hit to the metrics endpoint runs nvidia-smi under the hood.
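To check this explanation without the exporter or Prometheus in the picture, one could stop the exporter and call nvidia-smi at a fixed interval, mimicking the scrape cadence. The following is a rough sketch only: the `poll` helper is hypothetical, the 15-second interval is the "or so" figure mentioned above, and the `run` and `sleep` parameters exist purely so the loop can be exercised without a GPU.

```python
import subprocess
import time


def poll(command, interval, count, run=subprocess.run, sleep=time.sleep):
    """Run `command` `count` times, waiting `interval` seconds after each run.

    Returns the wall-clock duration of every run, in seconds.
    """
    durations = []
    for _ in range(count):
        start = time.perf_counter()
        run(command, capture_output=True, check=False)
        durations.append(time.perf_counter() - start)
        sleep(interval)
    return durations


# Example call (requires nvidia-smi on PATH): five queries, 15 seconds apart.
# poll(["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv"], 15, 5)
```

If the hitches line up with each call while the exporter is stopped, that would support the idea that nvidia-smi itself, not the exporter, triggers them.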

This indicates that the issue is not caused by this exporter.

You can still try to find out if excluding some fields from the query will help or not, and based on that can explicitly configure the query fields for the exporter. If you find something, please share it here.
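If a problematic field does turn up, the explicit field list can be generated rather than edited by hand. The sketch below assumes the exporter takes the list through a flag named `--query-field-names`; verify the exact name against the command-line reference linked earlier in this thread before relying on it. The helper names and the sample fields are illustrative only.

```python
def fields_without(all_fields, culprits):
    """Return `all_fields` minus `culprits`, keeping the original order."""
    excluded = set(culprits)
    return [field for field in all_fields if field not in excluded]


def exporter_argument(fields, flag="--query-field-names"):
    """Format the field list as one command-line argument.

    The default flag name is an assumption; check the exporter's
    command-line reference for the real one.
    """
    return f"{flag}={','.join(fields)}"


# Illustrative subset of the full query; "pstate" stands in for whatever
# field turns out to be the culprit.
sample = ["fan.speed", "pstate", "power.draw"]
print(exporter_argument(fields_without(sample, ["pstate"])))
```

The printed argument can then be appended to the exporter's command line or service definition.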

I'll close this one, as I don't think much can be done on the exporter's side here. But if you have some findings, please feel free to share.