utkuozdemir / nvidia_gpu_exporter

Nvidia GPU exporter for prometheus using nvidia-smi binary

[Discussion] Offering CPU and Memory Monitoring Support

wyzhangyuhan opened this issue · comments

commented

Firstly, I'd like to thank you for providing this repository. It has been instrumental in helping us set up our cluster monitoring.

In the course of utilizing your tool, I've added a file to support CPU and memory monitoring specifically for Linux systems. While this addition outputs CPU monitoring in a manner akin to a plugin, I'm uncertain if it aligns perfectly with the direction of your contributions.

I'd be happy to contribute this addition to the community. If you find this functionality valuable, I'm more than willing to refine my code further and submit a PR.

Thanks once again for your invaluable contributions and hard work!

Addition:
I also adjusted the dashboard to show both cluster-level and single-node information. 😎
[image: dashboard screenshot]

  • CPU&Memory metrics
# HELP basic_cpu_sy system process cost
# TYPE basic_cpu_sy gauge
basic_cpu_sy{uuid="123"} 1.3
# HELP basic_cpu_tot cpu cost percetage
# TYPE basic_cpu_tot gauge
basic_cpu_tot{uuid="123"} 1.6
# HELP basic_cpu_us user process cost
# TYPE basic_cpu_us gauge
basic_cpu_us{uuid="123"} 0.3
# HELP basic_info_command_exit_code Exit code of the last scrape command
# TYPE basic_info_command_exit_code gauge
basic_info_command_exit_code 0
# HELP basic_mem_free memory free
# TYPE basic_mem_free gauge
basic_mem_free{uuid="123"} 3.35515268e+08
# HELP basic_mem_tot memory total
# TYPE basic_mem_tot gauge
basic_mem_tot{uuid="123"} 5.93782332e+08
# HELP basic_mem_used memory used
# TYPE basic_mem_used gauge
basic_mem_used{uuid="123"} 2.5704732e+07
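
For readers wondering how memory gauges like the ones above could be produced on Linux, here is a minimal sketch. It is not the file mentioned in this issue: it assumes the values come from `/proc/meminfo`, and it assumes `used` means `MemTotal - MemFree`, which may differ from the definition behind the numbers above. The function names `parse_meminfo`, `render_gauge`, and `memory_metrics` are hypothetical.

```python
from typing import Dict, Optional


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse the content of /proc/meminfo into a dict of integer values.

    Entries reported in kB are converted to bytes; unit-less entries
    (for example HugePages_Total) are kept as plain counts.
    """
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not fields:
            continue  # skip lines that are not "Key: value [unit]"
        amount = int(fields[0])
        if len(fields) > 1 and fields[1] == "kB":
            amount *= 1024
        values[key.strip()] = amount
    return values


def render_gauge(name: str, help_text: str, value: float,
                 labels: Optional[Dict[str, str]] = None) -> str:
    """Render one gauge in the Prometheus text exposition format.

    Label values are assumed not to need escaping in this sketch.
    """
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (f"# HELP {name} {help_text}\n"
            f"# TYPE {name} gauge\n"
            f"{name}{label_str} {value}\n")


def memory_metrics(meminfo_text: str, uuid: str) -> str:
    """Build the three memory gauges from /proc/meminfo content."""
    mem = parse_meminfo(meminfo_text)
    total = mem["MemTotal"]
    free = mem["MemFree"]
    used = total - free  # assumption: this also counts buffers and page cache
    labels = {"uuid": uuid}
    return (render_gauge("basic_mem_free", "memory free", free, labels)
            + render_gauge("basic_mem_tot", "memory total", total, labels)
            + render_gauge("basic_mem_used", "memory used", used, labels))


# Small inline sample so the sketch runs on any OS.
SAMPLE = "MemTotal:       1000 kB\nMemFree:         400 kB\n"
print(memory_metrics(SAMPLE, uuid="123"), end="")
```

On a real Linux host, the sample string would be replaced by the content of `/proc/meminfo`, read with a plain `open("/proc/meminfo").read()`.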

Hi,

Thank you very much for your feedback.

Your dashboard really looks great. However, it is out of scope for this exporter, which is focused solely on gathering metrics from Nvidia GPUs.

I think your metrics belong in their own exporter. Prometheus can scrape many exporters without problems, so that would be the best way forward. As you can see, exporters are pretty atomic with regard to their responsibilities (i.e. each one does only one thing): https://github.com/prometheus/prometheus/wiki/Default-port-allocations
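
To illustrate the one-exporter-per-concern setup, here is a rough sketch that prints a Prometheus `scrape_configs` section with one job per exporter. The host name `gpu-node-1`, the job names, and the helper `scrape_config` are made up for the example; 9835 and 9100 are the default ports of nvidia_gpu_exporter and node_exporter, so adjust them if your deployment differs.

```python
from typing import Dict, List


def scrape_config(jobs: Dict[str, List[str]]) -> str:
    """Render a minimal Prometheus scrape_configs section as YAML text.

    `jobs` maps a job name to the list of "host:port" targets for that job.
    """
    lines = ["scrape_configs:"]
    for job_name, targets in jobs.items():
        lines.append(f"  - job_name: {job_name}")
        lines.append("    static_configs:")
        quoted = ", ".join(f'"{t}"' for t in targets)
        lines.append(f"      - targets: [{quoted}]")
    return "\n".join(lines) + "\n"


# Hypothetical host; one job per exporter, both running on the same node.
JOBS = {
    "nvidia_gpu": ["gpu-node-1:9835"],  # GPU metrics from this exporter
    "node": ["gpu-node-1:9100"],        # CPU and memory metrics from node_exporter
}
print(scrape_config(JOBS), end="")
```

The printed section can be pasted into `prometheus.yml`; each exporter stays independent and Prometheus simply scrapes both endpoints.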

By the way, have you checked https://github.com/prometheus/node_exporter and this dashboard? It might be a good idea to simply use them instead of building your own.

commented

Thanks for the reply. The node_exporter and the dashboard might be just what I need. I also get your point about keeping exporters focused.

Big thanks again for all the work you've put into this library. Since you've cleared up my questions so well, I'll go ahead and close out this issue.