I'm still trying to figure out what the best practice is for handling userparameters that occasionally hit very transient failure conditions.
The best example I'm dealing with right now: I have a userparameter that reports the client's time offset from its currently selected NTP peer. I believe I picked it up from another thread here:
Code:
UserParameter=ntp.client.offset,/usr/sbin/ntpq -pn | /usr/bin/awk 'BEGIN { offset=9999 } $1 ~ /\*/ { offset=$9 } END { print offset }'
This works fine, but there are times, such as when a server is rebooted or NTP is restarted, when no peer has been selected yet (no time server is marked with '*'), although one will be within the next few seconds. To keep the agent from returning the item as not supported if it happens to check during that narrow window, I set an arbitrarily high default offset. That way Zabbix always gets a valid value, and it is still obvious in triggers and charts that something is off if the condition persists. (If I'm reading the documentation and release notes right, non-text items will still go unsupported if no data is returned.)
The problem is that this horribly skews charts when it occurs, usually for just one sample, and it doesn't reflect the actual data. My preference would be to simply have gaps in the charts when no data is returned. But I also don't want Zabbix to mark the item as not supported and then not check it again for an extended delay, when the item will almost certainly be working again within 30 seconds. And I certainly don't want to tell Zabbix globally to recheck all unsupported items frequently.
So... what is the best approach for these kinds of statistics? How can I gather data regularly and allow for brief periods of no data, without items staying unsupported for far longer than necessary? Or am I misunderstanding how items go unsupported? One thing that occurs to me: it might be nice to be able to optionally specify a per-item "retry count", meaning how many failed tries are allowed before the item is marked unsupported.
I'm still coming up to speed on how Zabbix works under the hood, so it's very possible I'm misinterpreting something here from the beginning. I've been looking through the docs but haven't found much detail on exactly how and when an item goes into the "Not Supported" state: whether it happens immediately upon a single ZBX_NOTSUPPORTED response, or whether there is already an internal retry count or some other logic involved.
(I'm working with Zabbix 2.0+, just for clarity.)
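For anyone who wants to see what the awk filter in the userparameter above does without a running ntpd, here is a small sketch. The function name extract_offset and the sample lines are made up for illustration; the sample lines just follow the column order that "ntpq -pn" prints (remote, refid, st, t, when, poll, reach, delay, offset, jitter).

```shell
#!/bin/sh
# Same awk filter as in the userparameter, but reading ntpq-style output
# from stdin so it can be tried without ntpd.
# (extract_offset is a hypothetical helper name, not part of Zabbix or ntp.)
extract_offset() {
    awk 'BEGIN { offset=9999 } $1 ~ /\*/ { offset=$9 } END { print offset }'
}

# Made-up sample data: one selected peer ('*') and one candidate ('+').
with_peer='*192.0.2.10  198.51.100.1  2 u  33  64  377  0.512  -1.234  0.087
+192.0.2.11  198.51.100.2  2 u  12  64  377  0.640   0.410  0.120'

# Made-up sample data: ntpd just restarted, no peer selected yet.
without_peer=' 192.0.2.10  .INIT.  16 u  -  64  0  0.000  0.000  0.000'

printf '%s\n' "$with_peer" | extract_offset      # prints -1.234
printf '%s\n' "$without_peer" | extract_offset   # prints 9999
```

The second call is the case that causes the chart spike: no line has '*' in the first field, so the default of 9999 is what gets reported.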
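On the "retry count" idea: one thing I could try on the script side is a wrapper that retries briefly and prints nothing if no peer is selected. This is only a rough sketch, not something I know to be the recommended approach. The names ntp_offset_with_retry, NTPQ_CMD, TRIES and PAUSE are all made up, and the retries would have to fit inside the agent's Timeout setting (a few seconds by default, as far as I know), so this only covers very short gaps. It also does not answer what Zabbix does once the output is empty.

```shell
#!/bin/sh
# Hypothetical wrapper: try a few times, print the offset of the selected
# peer if there is one, and print nothing otherwise (no 9999 sentinel).
# NTPQ_CMD, TRIES and PAUSE are assumed names for this sketch only.
NTPQ_CMD=${NTPQ_CMD:-"/usr/sbin/ntpq -pn"}
TRIES=${TRIES:-2}
PAUSE=${PAUSE:-1}

ntp_offset_with_retry() {
    i=0
    while [ "$i" -lt "$TRIES" ]; do
        # Offset of the '*' peer, or an empty string if none is selected.
        offset=$($NTPQ_CMD 2>/dev/null | awk '$1 ~ /\*/ { print $9; exit }')
        if [ -n "$offset" ]; then
            printf '%s\n' "$offset"
            return 0
        fi
        i=$((i + 1))
        # Pause only if another attempt follows.
        if [ "$i" -lt "$TRIES" ]; then
            sleep "$PAUSE"
        fi
    done
    # No output at all; as I understand it the item would then be
    # treated as not supported for that check.
    return 1
}
```

Saved as a script with a final line that calls ntp_offset_with_retry, the userparameter would point at that script instead of the ntpq/awk pipeline. It trades the bogus 9999 sample for a possible unsupported state, which is exactly the part I'm unsure about.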