My company has a server with 8 Nvidia RTX 8000 GPUs (for AI training). One of the GPUs has been problematic: it randomly crashes and then no longer shows up at all. I want to create a trigger in Zabbix for when that happens so I get alerted. Right now we are using the Nvidia GPU template, which provides items for GPU power, fan speed, utilization, etc.
I created a disaster trigger for when power consumption hits 0 ({lambda-server:gpu.power[0].last()}=0), figuring that would work. But since the GPU simply stops being detected, Zabbix never gets a chance to read a power value of 0. The item just stops returning data, so the last recorded value never reaches 0, the trigger never fires, and I don't get alerted. Does anyone know how to create a trigger that just pings the GPU and fires when it's non-responsive?
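To show what I think is happening, here is a rough Python sketch. It is not Zabbix code, only my mental model: the sample values, the 300-second window, and both function names are made up for illustration. The first check mimics my current last()=0 trigger, and the second is the kind of "no data for N seconds" check I think I need (Zabbix's nodata() trigger function looks like it might do this, but I have not tried it).

```python
def last_equals_zero(samples):
    """Mimic a last()=0 trigger: look only at the most recent stored value."""
    return bool(samples) and samples[-1][1] == 0


def no_data(samples, now, window):
    """True when no sample arrived within `window` seconds before `now`."""
    return not samples or now - samples[-1][0] > window


# Hypothetical history for gpu.power[0] as (timestamp in seconds, watts).
# The GPU drops off after t=120, so no further values are ever stored.
samples = [(0, 61.0), (60, 63.5), (120, 62.8)]
now = 600

print(last_equals_zero(samples))    # last stored value is still 62.8
print(no_data(samples, now, 300))   # nothing received for 480 seconds
```

The first check stays false forever because the last stored value is still a normal wattage, while the second one turns true once the item has been silent for longer than the window.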
I'm an amateur at Zabbix so any help would be appreciated. Thank you!