How to detect when a GPU becomes unresponsive?

  • joe-synthetaic
    Junior Member
    • Jun 2021
    • 1

    #1

    How to detect when a GPU becomes unresponsive?

    My company has a server with 8 Nvidia RTX 8000 GPUs in it (for AI training). One of the GPUs has been problematic: it randomly crashes and then no longer shows up. I want to create a trigger in Zabbix for when that happens so I can get alerted. Right now we are using the Nvidia GPU template, which provides items for GPU power, fan speed, utilization, etc.

    I created a disaster trigger for when power consumption hits 0 ({lambda-server:gpu.power[0].last()}=0), figuring that would work. But since the GPU just goes undetected, Zabbix never gets a chance to read a power value of 0, so the trigger never fires and I don't get alerted. Does anyone know how to create a trigger that just pings the GPU and knows when it's non-responsive?

    I'm an amateur at Zabbix so any help would be appreciated. Thank you!
  • Markku
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
    • Sep 2018
    • 1781

    #2
    Hi, how about trying a .nodata(x) trigger?

    Markku
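
    For illustration, a nodata trigger on the power item from the first post might look like the sketch below. It reuses the host and item key from that post (lambda-server, gpu.power[0]); the 5m period is an assumed value, not something from the thread, and should be comfortably longer than the item's update interval. nodata() returns 1 when the item has received no values within the period, so this fires when the GPU stops reporting instead of waiting for a reading of 0.

    ```
    {lambda-server:gpu.power[0].nodata(5m)}=1
    ```

    On Zabbix 5.4 and later, which use the newer expression syntax, the equivalent would be nodata(/lambda-server/gpu.power[0],5m)=1.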
