How to detect when a GPU becomes unresponsive?

joe-synthetaic

Junior Member

Joined: Jun 2021

Posts: 1
#1


15-06-2021, 22:03

My company has a server with 8 Nvidia RTX 8000 GPUs in it (for AI training). One of the GPUs has been problematic: it randomly crashes and then no longer shows up. I want to create a trigger in Zabbix for when that happens so I get alerted. Right now we are using the Nvidia GPU template, which provides items for GPU power, fan speed, utilization, etc.

I created a disaster trigger for when power consumption hits 0 ({lambda-server:gpu.power[0].last()}=0), figuring that would work. But since the GPU simply stops being detected, Zabbix never gets a chance to read a power value of 0, so the trigger condition is never met and I don't get alerted. Does anyone know how to create a trigger that just pings the GPU and knows when it's non-responsive?

I'm an amateur at Zabbix so any help would be appreciated. Thank you!
Tags: gpu, nvidia, ping, trigger
Markku

Senior Member

Joined: Sep 2018

Posts: 1784
#2

16-06-2021, 10:22

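Hi, how about trying a trigger with the .nodata(x) function? It checks whether the item has received any data during the last x period, so it can catch a GPU that stops reporting altogether rather than reporting 0.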
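
As a rough sketch, reusing the host and item key from the trigger in the first post, the expression could look like the one below. The 5m period is only an assumption for illustration; it should be comfortably longer than the item's update interval so that a single missed poll doesn't fire the alert. nodata() returns 1 when no data was received during the period.

```
{lambda-server:gpu.power[0].nodata(5m)}=1
```

That is the older expression syntax, matching the trigger quoted above. On Zabbix 5.4 and later, which use the new expression syntax, the equivalent should be:

```
nodata(/lambda-server/gpu.power[0],5m)=1
```

Whether the item really stops receiving values when the GPU drops out depends on how the template collects the data, so it is worth checking the item's latest data after a crash before relying on this.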

Markku