Zabbix queue empty but Items are delayed by minutes

  • lars ds
    Junior Member
    • Oct 2022
    • 18

    #1

    Zabbix queue empty but Items are delayed by minutes

    Hi everyone,

    We monitor a large number of devices for one company, currently about 374,000 items. A couple of days ago we received an email about a problem: some of the data was being delayed. We saw an increase in the queue at the time, so we raised the number of pollers in our Zabbix deployment.

    All seemed well until the problem returned, but this time the queue is nearly empty, and even when it is not empty the expected queue time is never higher than 30-60 seconds. The delays range from 1 to 13 minutes; the 1-minute delays may simply be the polling interval (1m), and 13 minutes is the maximum I have seen since the problem started. What is odd is that the delay is not consistent: sometimes there is none, other times we see an average of 8 minutes.

    I have been looking for causes but cannot find any with direct proof. Zabbix is not the only application running on that server. We have noticed that the server sometimes spikes to 100% CPU for a couple of hours (we know this is an issue), but I have also seen periods without delays on the Zabbix items while the CPU was at 100%. What I did notice is that when the number of Zabbix users rises, so does the delay on these items (all users are super admins). The front end is also slow compared to one of our test environments; this could be linked to the high CPU usage, but it still seems slow at 50% CPU.

    Here is some extra metric information about our Zabbix deployment. We run this server with 8 vCPUs (Azure) and 70 GB of RAM.
    [Image: image.png]

    I have been looking for a possible cause but can't find one I can link this problem to. If anyone has run into a problem similar to the one I have described, please feel free to leave tips.
  • LenR
    Senior Member
    • Sep 2009
    • 1005

    #2
    Look at DB performance and how busy the history syncers are. How are you cleaning history: DB partitions or housekeeping? If housekeeping, does the problem occur while the housekeeper is busy? Are the items gathered by proxies or by the main server?


    • lars ds
      Junior Member
      • Oct 2022
      • 18

      #3
      DB performance was increased earlier this week. This is an Azure MySQL DB, and we doubled our IOPS because we were reaching the limit. The CPU problem has been resolved as well. I have added the 'Zabbix server health' monitoring template, and it shows the history syncers busy at about 100%. I have increased the DB syncers from 2 to 10 and the syncer buffer from 128 to 512 MB. Items are gathered by the main server only. Zabbix runs in a Kubernetes environment using the Docker image for version 5.4.9. Maybe SNMP performance is better in later versions? I am still waiting for more information about the housekeeping: I need the % utilization of the housekeepers (from the health template), and enough data to compare against. Thanks already for the good insights.
      Last edited by lars ds; 28-02-2023, 10:25.
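
The busy percentages mentioned in this post come from Zabbix internal items. As a rough sketch of how those values could be pulled outside the frontend, the snippet below builds a JSON-RPC `item.get` request body filtered on the internal process keys. It assumes the 5.4-era API style where the token travels in the `auth` field of the body, and the token value and the function name `build_item_get_request` are placeholders invented for illustration. The body would still have to be POSTed to the frontend's `api_jsonrpc.php`, which is not shown here.

```python
import json

# Internal item keys in the zabbix[process,<type>,<mode>,<state>] form.
SYNCER_BUSY_KEY = "zabbix[process,history syncer,avg,busy]"
HOUSEKEEPER_BUSY_KEY = "zabbix[process,housekeeper,avg,busy]"


def build_item_get_request(auth_token, keys, request_id=1):
    """Build a JSON-RPC 2.0 body for item.get, filtered on exact item keys.

    Hypothetical helper: it only assembles the request, it does not send it.
    """
    return {
        "jsonrpc": "2.0",
        "method": "item.get",
        "params": {
            "output": ["itemid", "key_", "lastvalue", "lastclock"],
            "filter": {"key_": list(keys)},
        },
        "auth": auth_token,  # assumption: token passed in the body (5.4-style)
        "id": request_id,
    }


if __name__ == "__main__":
    body = build_item_get_request(
        "<api-token>", [SYNCER_BUSY_KEY, HOUSEKEEPER_BUSY_KEY]
    )
    print(json.dumps(body, indent=2))
```

Comparing `lastvalue` for the syncer key before and after a change to the number of DB syncers is one way to see whether the change had any effect.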


      • lars ds
        Junior Member
        • Oct 2022
        • 18

        #4
        Our housekeeping frequency is set to "1" (run every hour) and it should clean up 5000 items per run. In my opinion these 5K items won't be enough: we gather 2.5K items every minute, meaning our database will keep growing. Sadly, it seems the housekeeper can't even finish this task.

        [Image: image.png]

        Waiting nearly 90 minutes for only 5K items to be removed does not seem right. CPU usage is still relatively high, but the Zabbix housekeeper process does not seem to take advantage of the free CPU capacity. It sits idle at 0% usage.

        [Image: image.png]

        Because the housekeeper is not really doing any work during this test, I cannot tell whether the delay is caused by the housekeeper. What I can say is that after reconfiguring the DB syncers no delays were found (but, as I said, the housekeeper is not really working, so keep that in mind). During this time 3 users were online, so the delay is probably not linked to the user count (for now).
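
The concern in this post about the database growing can be written out as simple arithmetic. This is only a sketch using the figures quoted above. It assumes every gathered value becomes one stored row and that the 5000 figure caps the total rows removed per housekeeper run, which is the poster's reading at this point in the thread and is questioned again in post #7. The function name is made up for the example.

```python
# Figures quoted in the post above.
VALUES_GATHERED_PER_MINUTE = 2_500
MAX_DELETED_PER_RUN = 5_000  # assumption: hard cap on rows removed per run
RUNS_PER_HOUR = 1            # housekeeping frequency of one hour


def hourly_net_growth(gathered_per_minute, deleted_per_run, runs_per_hour):
    """Rows added minus rows removed over one hour."""
    gathered = gathered_per_minute * 60
    deleted = deleted_per_run * runs_per_hour
    return gathered - deleted


if __name__ == "__main__":
    growth = hourly_net_growth(
        VALUES_GATHERED_PER_MINUTE, MAX_DELETED_PER_RUN, RUNS_PER_HOUR
    )
    print(f"net growth per hour: {growth} rows")
```

Under those assumptions the inflow (150,000 rows per hour) is thirty times the assumed deletion cap, which is why the cap looked far too low.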


        • cyber
          Senior Member
          Zabbix Certified Specialist, Zabbix Certified Professional
          • Dec 2006
          • 4807

          #5
          I guess you should see housekeeper-related lines in the server log. How much does it actually remove?
          For example:
          Code:
          zabbix_server.log: 9353:20230228:124549.472 executing housekeeper
          zabbix_server.log: 9353:20230228:124554.087 housekeeper [deleted 0 hist/trends, 0 items/triggers, 1329 events, 486 problems, 20 sessions, 0 alarms, 0 audit, 0 records in 4.568834 sec, idle for 1 hour(s)]
          ​
          A partitioned DB is good: no worries about the housekeeper not keeping house, since all the big tables are truncated automatically.
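
To track how much the housekeeper removes per run, the summary line quoted above can be parsed mechanically. The following is a minimal sketch: the regular expression is written against exactly the field order in that sample line, so it is an assumption that other server versions log the same fields, and `parse_housekeeper_line` is a hypothetical helper name.

```python
import re

# Pattern for the housekeeper summary line, in the format quoted above.
HOUSEKEEPER_RE = re.compile(
    r"housekeeper \[deleted (?P<hist_trends>\d+) hist/trends, "
    r"(?P<items_triggers>\d+) items/triggers, (?P<events>\d+) events, "
    r"(?P<problems>\d+) problems, (?P<sessions>\d+) sessions, "
    r"(?P<alarms>\d+) alarms, (?P<audit>\d+) audit, "
    r"(?P<records>\d+) records in (?P<seconds>[\d.]+) sec"
)


def parse_housekeeper_line(line):
    """Return deleted counts and duration from a summary line, or None."""
    match = HOUSEKEEPER_RE.search(line)
    if match is None:
        return None  # e.g. the "executing housekeeper" start line
    fields = match.groupdict()
    seconds = float(fields.pop("seconds"))
    result = {name: int(value) for name, value in fields.items()}
    result["seconds"] = seconds
    return result


if __name__ == "__main__":
    sample = (
        "9353:20230228:124554.087 housekeeper [deleted 0 hist/trends, "
        "0 items/triggers, 1329 events, 486 problems, 20 sessions, 0 alarms, "
        "0 audit, 0 records in 4.568834 sec, idle for 1 hour(s)]"
    )
    print(parse_housekeeper_line(sample))
```

A start line with no matching summary line after it, as described later in this thread, would show up here as a run for which the parser never returns a result.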


          • lars ds
            Junior Member
            • Oct 2022
            • 18

            #6
            Hi Cyber, I just restarted the deployment after some changes; I hope I can provide you with these logs by tomorrow. Earlier today I reviewed the logs, but the only output I saw was the 'Executing housekeeper' line, meaning it started. Since there was no second line, I guess it never finished. The following output should indicate that the housekeeper did start with the correct settings. You are right that it would be better to move to a partitioned DB, but that means I will need to find a way to migrate from housekeeping to partitioning without downtime or data loss.

            [Image: image.png]

            [Image: image.png]

            • lars ds
              Junior Member
              • Oct 2022
              • 18

              #7
              It has been 12-14 hours since I last restarted the deployment. The log file reached the limit but I managed to capture 2 housekeeper outputs.


              It is good to see that the number of deleted items is this high; it indicates that the housekeeper really does work during these 2:30 hours of housekeeping. What I don't understand is why it does not respect MaxHousekeeperDelete: I was expecting only 5K items to be deleted, but as we can see that is not the case. Now that we know the housekeeper is actually deleting items, we can check whether the delays occur during housekeeping. I have checked this and I don't see any delays at this time. I will reply to this thread again if I notice delays, but I think the issue could have been related to the low number of DB syncers that was configured. Thanks Cyber and LenR.
