Active agent items delayed - false alerts (Host unreachable)

  • semiraue
    Junior Member
    • Oct 2018
    • 11

    #1

    Hi All

    I'm using Zabbix to monitor servers in multiple countries and time zones. Our existing setup uses only passive checks, with the Zabbix server sitting inside our network. It has had no issues, and alerts arrive almost instantly. Recently we decided to move Zabbix outside our network. I chose a popular and reliable cloud provider and deployed a new Zabbix server 4.1. I then exported a few hosts from the old Zabbix server and imported them into the new one, created a separate template for active monitoring (I confirmed there is not a single passive item in those templates), and configured the hosts with active agents. Below is my active agent config on the CentOS hosts.

    PidFile=/var/run/zabbix/zabbix_agentd.pid
    LogFile=/var/log/zabbix/zabbix_agentd.log
    LogFileSize=0
    DebugLevel=3
    Server=x.x.x.x
    StartAgents=0
    ServerActive=x.x.x.x
    #Hostname=
    RefreshActiveChecks=60
    Include=/etc/zabbix/zabbix_agentd.d/*.conf

    The first two remote servers had no issues. Then I exported a set of about 20 servers, all running CentOS but in different regions, and confirmed that data was arriving on the Latest data page. But after 5 minutes, alerts started firing saying the host had been unreachable for 5 minutes. When I check the Zabbix agent's active agent ping data, the last check time is always more than 5 minutes behind the server time (some even 12 minutes). That is why the alerts fire. I then allowed the Zabbix passive port on our firewalls and configured one of those hosts with a passive check; with that, the agent ping last check is instant. There is no heavy latency on the network. Even though I'm getting host unreachable alerts, graphs and other data are working flawlessly.

    When I check the Zabbix server queue, there are a lot of delayed items from those hosts, some by more than 10 minutes. I have no idea what the queue window is showing. Does it mean a database bottleneck? That is highly unlikely, since I'm using a high-performance cloud SQL instance. One main reason we switched to active checks is that they make it easy to monitor hosts behind NAT routers, of which we have many. But with this issue the new deployment is becoming useless, since it produces so many false alerts. At the moment I have more than 30 hosts with an unreachable alert, but those hosts are up and graphing works perfectly.

    Is there any way to fix this? I tried setting the active agent buffer to its minimum size and some other options, but none of those worked for me.
    Last edited by semiraue; 22-11-2018, 06:05.
  • mrogers-9898
    Member
    • Sep 2008
    • 68

    #2
    I have got a similar scenario. I upgraded from 3.4 to 4.0 and I'm getting a lot of nodata alerts on my active agents.

    I'm really confused by what I've found so far, but it looks like the agent nodata alerts are being calculated from the time on the agent, not the server. I had 4 hosts whose nodata alerts I could not get working, and I found that their clocks were far enough out of alignment that the data was historically invalid for the nodata alert. Are the system times on your agents in exact alignment with your server?

    That aside, even with that weird time thing, I'm getting a lot of nodata alerts even on systems with well-aligned times. I've raised the debug level on the agents, and I can see they're all sending data fine. I'd have thought I had a database processing issue, but all of my Zabbix performance values are near idle. Netdata on the Zabbix host itself shows the host's disks working fine, not overworked.

    I'm at a loss where to turn to next as well.

    • gjacko197
      Junior Member
      • Jan 2017
      • 7

      #3
      Got the same issue with a few hosts. Some had the time issue, but I've had to disable some hosts because, no matter what, they are not processed quickly enough and just show as down when they are not.

      The majority of hosts are processed within 30 seconds, up to a max of 1 minute, but these problem hosts seem to either take longer than the 10-minute trigger period I have set or not be processed at all.

      All was fine before upgrading to version 4.0

      • dimir
        Zabbix developer
        • Apr 2011
        • 1080

        #4
        Since 4.0, the user is responsible for keeping agent/proxy/server time in sync. The adjustment of timestamps according to the time difference between server and proxy was removed in 4.0:



        Issue where it happened: https://support.zabbix.com/browse/ZBX-12957
        Last edited by dimir; 22-11-2018, 16:51.

        • semiraue
          Junior Member
          • Oct 2018
          • 11

          #5
          Confirmed! The issue is with the new Zabbix version 4.1. I downgraded to Zabbix server version 3.4 and the active checks are working perfectly now. There are no delayed hosts on the queue tab anymore and no false alerts.

          dimir, it's not clear what you mean by "user is responsible for keeping agent/proxy/server time in sync". Does it mean every Zabbix agent and the server should be in the same time zone and have the same time? Or do I have to manually create an item to get the host time and compare the host-unreachable trigger with it? Please explain.

          • dimir
            Zabbix developer
            • Apr 2011
            • 1080

            #6
            Quoting the upgrade notes of 4.0:
            Timestamp correction

            Zabbix server will no longer correct timestamps in cases when Zabbix proxy time differs from Zabbix server time.
            Before 4.0, server/proxy adjusted the value timestamps by the difference in time between client and server (agent-proxy, agent-server, proxy-server, whatever). This was done by comparing the timestamp from the packet with the current timestamp on the receiving side. I suggest enabling DebugLevel=4 for server/proxy and checking the log for the following string:
            Code:
            "delta time from json"
            You can increase the log level temporarily for pollers in the case of passive checks and for trappers in the case of active checks, e.g.
            Code:
            zabbix_server -Rlog_level_increase=poller
            ...check the log file...
            zabbix_server -Rlog_level_decrease=poller
            I'm only guessing that this could be causing your issues.
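
            For illustration, the pre-4.0 correction described above can be sketched roughly like this. This is not Zabbix source code: the `Value` class, the `correct_timestamps` function, and the sample timestamps are all made up for the example, and the real implementation may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class Value:
    """One collected value with the timestamp assigned by the sender."""
    key: str
    clock: int  # Unix time in seconds, as stamped by the agent or proxy


def correct_timestamps(values, packet_clock, receiver_clock):
    """Shift every value by the sender/receiver clock difference.

    Mimics the pre-4.0 behaviour described in the post; hypothetical code.
    """
    # Positive when the sender's clock is behind the receiver's clock.
    delta = receiver_clock - packet_clock
    return [Value(v.key, v.clock + delta) for v in values]


# Hypothetical packet: the agent clock is 120 s behind the server clock.
values = [Value("agent.ping", 1545088400), Value("system.uptime", 1545088410)]
corrected = correct_timestamps(
    values, packet_clock=1545088420, receiver_clock=1545088540
)
for v in corrected:
    print(v.key, v.clock)
```

            Without this shift, which is the situation from 4.0 onwards, the values would be stored with the agent's own timestamps, so a lagging or drifting clock shows up directly in the "last check" time.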

            • mrogers-9898
              Member
              • Sep 2008
              • 68

              #7
              This seems to be the root of my problems. I'm getting some odd results for the "delta time from json" test.

              some massive times, some negative times

              timestamp from json 1545088520 seconds and 83703658 nanosecond, delta time from json 109 seconds and 344292014 nanosecond
              timestamp from json 1545080585 seconds and 97322844 nanosecond, delta time from json 8044 seconds and 356013425 nanosecond
              timestamp from json 1545088584 seconds and 245522331 nanosecond, delta time from json 45 seconds and 210932387 nanosecond
              timestamp from json 1545088632 seconds and 347218905 nanosecond, delta time from json -2 seconds and -890282385 nanosecond
              timestamp from json 1545088579 seconds and 708274800 nanosecond, delta time from json 49 seconds and 764608651 nanosecond
              timestamp from json 1545088626 seconds and 726048142 nanosecond, delta time from json 2 seconds and 779601511 nanosecond
              timestamp from json 1545088665 seconds and 826560300 nanosecond, delta time from json -36 seconds and -309845226 nanosecond
              timestamp from json 1545088623 seconds and 329288575 nanosecond, delta time from json 6 seconds and 321831017 nanosecond
              timestamp from json 1545088818 seconds and 554978700 nanosecond, delta time from json -188 seconds and -663221994 nanosecond
              timestamp from json 1545088438 seconds and 444875024 nanosecond, delta time from json 191 seconds and 608670246 nanosecond
              timestamp from json 1545088629 seconds and 847397923 nanosecond, delta time from json 0 seconds and 203353093 nanosecond
              timestamp from json 1545088438 seconds and 42507158 nanosecond, delta time from json 192 seconds and 20089770 nanosecond
              timestamp from json 1545088632 seconds and 315308900 nanosecond, delta time from json -2 seconds and -200238105 nanosecond
              timestamp from json 1545088616 seconds and 184540200 nanosecond, delta time from json 13 seconds and 949654597 nanosecond
              timestamp from json 1545088535 seconds and 961306674 nanosecond, delta time from json 94 seconds and 190971346 nanosecond
              timestamp from json 1545088626 seconds and 110769595 nanosecond, delta time from json 4 seconds and 68602452 nanosecond
              I've checked my agents, and their times are solid, they're not out of sync with the Zabbix server. Can you suggest where I can dig next?
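
              A small script can help sift through lines like these. The sketch below assumes the log format is exactly as quoted above; the function names are invented for the example, and the 300 s window is an assumption taken from the 5-minute "host unreachable" alert in the first post.

```python
import re

# Matches the debug-level log fragment quoted in the post.
DELTA_RE = re.compile(
    r"delta time from json (-?\d+) seconds and (-?\d+) nanosecond"
)


def parse_delta(line):
    """Return the delta in seconds as a float, or None if the line does not match."""
    m = DELTA_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)) + int(m.group(2)) / 1e9


def exceeding(lines, window_seconds):
    """Return the deltas whose absolute value is larger than the window."""
    deltas = [d for d in map(parse_delta, lines) if d is not None]
    return [d for d in deltas if abs(d) > window_seconds]


# Four of the lines from the post, copied verbatim.
sample = [
    "timestamp from json 1545088520 seconds and 83703658 nanosecond, delta time from json 109 seconds and 344292014 nanosecond",
    "timestamp from json 1545080585 seconds and 97322844 nanosecond, delta time from json 8044 seconds and 356013425 nanosecond",
    "timestamp from json 1545088632 seconds and 347218905 nanosecond, delta time from json -2 seconds and -890282385 nanosecond",
    "timestamp from json 1545088818 seconds and 554978700 nanosecond, delta time from json -188 seconds and -663221994 nanosecond",
]

# Assumed window: 300 s, i.e. a 5-minute trigger period.
print(exceeding(sample, window_seconds=300))
```

              Of these four lines, only the 8044-second delta exceeds a 5-minute window; the 109 s and -188 s deltas would still matter for shorter trigger periods.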

              • mrogers-9898
                Member
                • Sep 2008
                • 68

                #8
                Even weirder.

                I have an item that checks the local time on the agent every 30 seconds. The value itself, the returned time, is spot on (+/- 30 s is to be expected), but the item's "Last checked time" in Zabbix is 5 minutes out.

                Attached Files

                • dimir
                  Zabbix developer
                  • Apr 2011
                  • 1080

                  #9
                  Simple solution, set up time synchronization (NTPD) on every host involved: server, proxies, agents.
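
                  What a time-sync client works out for each host is a clock offset. As a rough sketch of the standard NTP offset/delay arithmetic (the function name and all timestamps are made up for the example; this does not query a real NTP server):

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """Clock offset and round-trip delay from the four NTP timestamps.

    t1: client send time, t2: server receive time,
    t3: server transmit time, t4: client receive time (all in seconds).
    A positive offset means the client clock is behind the server clock.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


# Made-up numbers: client clock 190 s behind, 0.04 s network time each way,
# 0.01 s processing time on the server.
offset, delay = ntp_offset_and_delay(t1=1000.00, t2=1190.04, t3=1190.05, t4=1000.09)
print(round(offset, 3), round(delay, 3))
```

                  An offset of a few minutes like this one would, on its own, be enough to make values look several minutes old to a server that no longer corrects timestamps.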

                  • mrogers-9898
                    Member
                    • Sep 2008
                    • 68

                    #10
                    The suggestion here is that the time is out of sync? I've solved that problem - server and agents are in sync.

                    I must have some other, additional, issue here.

                    If my items are being processed slowly on my server, would that cause these kinds of troubles?

                    I see I have a big queue, 6k items. If I pick an agent out of the queue and check it out, nothing jumps out as a problem. The agent is sending data promptly. My server seems to not be having any trouble with the load - what causes items to get stuck in the queue if the agent is (seemingly) sending them on time?
                    Attached Files

                    • dimir
                      Zabbix developer
                      • Apr 2011
                      • 1080

                      #11
                      From your screenshot it's visible that there is an issue with the Housekeeper. For example, in the first spike it hit 100% busy and stayed that way for longer than an hour. During that time you will have issues, and that's probably why lots of items are queued. If you have lots of data, database partitioning is suggested; otherwise it's not possible to fix issues like that.
                      There are some articles on zabbix.org and
                      The Zabbix Team has collected all official Zabbix monitoring templates and integrations.

                      Also it was already discussed here, e. g.


                      • mrogers-9898
                        Member
                        • Sep 2008
                        • 68

                        #12
                        Yep, there was definitely a big blip there. That was very likely down to me shortening a huge number of item durations and retentions in order to try to reduce load, to see if that helped the queue.

                        This 7-day graph shows a much calmer housekeeper.

                        I don't think my install is a particularly big one, and my hardware is in the "ok" range. My DB is 60 GB, and the disks are 15k SAS in RAID 10.

                        Or do I have my wires crossed and it's still a housekeeper problem?
                        Attached Files

                        • dimir
                          Zabbix developer
                          • Apr 2011
                          • 1080

                          #13
                          Can you tell what types of items stay in the queue for a longer time?

                          • mrogers-9898
                            Member
                            • Sep 2008
                            • 68

                            #14
                            Hi Dimir,

                            They look to be all types, string, ints. Data on services, data on disks.

                            Attached Files

                            • mrogers-9898
                              Member
                              • Sep 2008
                              • 68

                              #15
                              I may have my context off there - these are all active items.
