Hi everyone,
We monitor a large number of devices for a company; at the moment this amounts to about 374,000 items. A couple of days ago we received an email about a problem: some of the data was being delayed. We saw that the queue had grown, so we increased the number of pollers in our Zabbix deployment. All seemed well until the problem returned, but this time the queue is practically empty, and even when it is not empty the expected queue time is never higher than 30-60 seconds.
The delays range from 1 to 13 minutes. The 1-minute delays may simply be due to the polling interval (1m); 13 minutes is the maximum I have seen since the problem occurred. What is weird is that the delay is not consistent: sometimes there is no delay, other times we are looking at an average of 8 minutes.
I have been looking for causes but cannot find any with direct proof. Zabbix is not the only application running on that server. We have noticed that the server sometimes spikes to 100% CPU for a couple of hours (this is not fine, we know this is an issue), but even with the CPU at 100% I have seen periods without delays on the Zabbix items. What I did notice is that when the Zabbix user count rises, so does the delay on these items (all users are super admins). The front end is also slow compared to one of our test environments (this could be linked to the high CPU usage, but even at 50% CPU the front end still seems slow).
Here is some extra metric information about our Zabbix deployment. We run this server with 8 vCPUs (Azure) and 70 GB RAM.
I have been looking for a possible cause but can't find one that I can link this problem to. If anyone has run into a problem similar to the one I have described, please feel free to leave tips.
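To put numbers on the delay instead of eyeballing it, something like the following Python sketch could help. It is only an illustration: the `ItemSample` class, its field names, and the sample values are all made up for this post, and the timestamps would have to come from your own export of each item's last-value time. It does not call the Zabbix API. It counts an item as late only by the time beyond its own update interval, so a 1m item that is 50 seconds old is not counted.

```python
from dataclasses import dataclass


@dataclass
class ItemSample:
    # Hypothetical record; these field names are assumptions, not Zabbix API fields.
    name: str
    interval_s: int  # configured update interval, in seconds
    lastclock: int   # Unix timestamp of the most recent value received


def delay_seconds(item, now):
    """Seconds the item is overdue beyond its own interval (0 if on time)."""
    return max(0, now - item.lastclock - item.interval_s)


def summarize(items, now):
    """Count late items and report the maximum and average delay among them."""
    delays = [delay_seconds(i, now) for i in items]
    late = [d for d in delays if d > 0]
    return {
        "late_count": len(late),
        "max_delay_s": max(delays, default=0),
        "avg_late_delay_s": sum(late) / len(late) if late else 0.0,
    }


if __name__ == "__main__":
    now = 1_700_000_000  # fixed "current time" so the example is reproducible
    sample = [
        ItemSample("on-time item", 60, now - 30),
        ItemSample("8 min late", 60, now - 60 - 480),
        ItemSample("13 min late", 60, now - 60 - 780),
    ]
    print(summarize(sample, now))
```

Running this at regular intervals and comparing the summary with CPU load and the number of logged-in users at the same moments would show whether the delay really follows the user count.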
No worries about the housekeeper not keeping house... all the big tables are truncated automatically...