Many Servers are unreachable for more than 5 minutes

  • moneynut
    Member
    • Mar 2014
    • 37

    #46
    Originally posted by tchjts1
    It might look bad, but the good news is that now you know what needs to be adjusted. On your Zabbix server in zabbix_server.conf:

    Increase:
    StartPollers=xx (Increase by 10 or 15 at a time until stable)
    StartPollersUnreachable=xx (try 10 or 15)
    StartDiscoverers=2
    TrendCacheSize=64M

    Leave StartDBSyncers=4 (Do not change)

    Make sure none of the above lines (if changed from the default) is preceded by a comment #, or the default setting will be used. Restart your Zabbix server process. Give it 10 or 15 minutes, then check your graphs again. Adjust the settings further as necessary.

    When I adjust any of those settings, I prefer to leave the default line in place and put my new value onto a new line such as this:

    ### Option: StartPollers
    # Number of pre-forked instances of pollers.
    #
    # Mandatory: no
    # Range: 0-1000
    # Default:
    # StartPollers=5
    StartPollers=350
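To double-check which values are actually active after editing the file this way, something like the following could be used. It is only a minimal sketch: the function name `effective_settings` is made up for illustration, and keeping the last value when a key appears twice is an assumption of this sketch, not documented Zabbix behaviour.

```python
def effective_settings(conf_text):
    """Return the key=value pairs that are active, i.e. not commented out."""
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Blank lines and lines starting with '#' contribute nothing.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            # Assumption of this sketch: a repeated key keeps its last value.
            settings[key.strip()] = value.strip()
    return settings


sample = """\
### Option: StartPollers
# Default:
# StartPollers=5
StartPollers=350
# StartPollersUnreachable=1
"""

active = effective_settings(sample)
print(active)
```

Any option that is missing from the result is still commented out, so the server would fall back to its default for it.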
    Thank you so much. I'm not getting any more false alerts, except for 3 servers.
    Those 3 servers run Windows and are on the latest version of the Zabbix agent.
    I checked the network graphs, and their incoming and outgoing traffic graphs are badly broken. It looks like packet loss or something similar. But when I run a continuous ping from the Zabbix server to those 3 Windows servers, it does not lose any packets.

    In the server logs, I always see "first network error" for all the network-related items (net.if.in, net.if.out, vfs.fs.size, etc.), and only for those 3 Windows hosts. But the connection is restored immediately. As for the items, none of them are "Not supported", and there are no errors in Configuration -> Hosts -> Items.


    Server Logs:
    27022:20140327:102035.583 resuming Zabbix agent checks on host "HOST-PROD-011": connection restored
    27031:20140327:102040.145 resuming Zabbix agent checks on host "HOST-PROD-020": connection restored
    27028:20140327:102043.022 resuming Zabbix agent checks on host "HOST-PROD-010": connection restored
    26956:20140327:102102.883 Zabbix agent item "net.if.out[Broadcom BCM5709S NetXtreme II GigE (NDIS VBD Client)-WFP LightWeight Filter-0000]" on host "HOST-PROD-010" failed: first network error, wait for 15 seconds
    26959:20140327:102103.451 Zabbix agent item "net.if.out[WAN Miniport (Network Monitor)-QoS Packet Scheduler-0000]" on host "HOST-PROD-011" failed: first network error, wait for 15 seconds
    27006:20140327:102118.775 Zabbix agent item "vfs.fs.size[C:,used]" on host "HOST-PROD-020" failed: first network error, wait for 15 seconds
    27025:20140327:102118.775 resuming Zabbix agent checks on host "HOST-PROD-011": connection restored
    27030:20140327:102127.351 Zabbix agent item "vfs.fs.size[C:,free]" on host "HOST-PROD-010" failed: another network error, wait for 15 seconds
    27033:20140327:102128.071 resuming Zabbix agent checks on host "HOST-PROD-010": connection restored
    27032:20140327:102133.075 resuming Zabbix agent checks on host "HOST-PROD-020": connection restored
    27008:20140327:102139.318 Zabbix agent item "vfs.fs.size[C:,free]" on host "HOST-PROD-011" failed: first network error, wait for 15 seconds
    26996:20140327:102142.990 Zabbix agent item "net.if.in[Broadcom BCM5709S NetXtreme II GigE (NDIS VBD Client)-WFP LightWeight Filter-0000]" on host "HOST-PROD-010" failed: first network error, wait for 15 seconds
    26966:20140327:102153.678 Zabbix agent item "net.if.in[Microsoft ISATAP Adapter]" on host "HOST-PROD-020" failed: first network error, wait for 15 seconds
    27032:20140327:102154.336 resuming Zabbix agent checks on host "HOST-PROD-011": connection restored
    27017:20140327:102157.454 resuming Zabbix agent checks on host "HOST-PROD-010": connection restored
    27000:20140327:102207.780 Zabbix agent item "net.if.out[RAS Async Adapter]" on host "HOST-PROD-010" failed: first network error, wait for 15 seconds
    27018:20140327:102208.206 resuming Zabbix agent checks on host "HOST-PROD-020": connection restored
    26976:20140327:102217.880 Zabbix agent item "net.if.out[WAN Miniport (L2TP)]" on host "HOST-PROD-011" failed: first network error, wait for 15 seconds
    27024:20140327:102222.160 resuming Zabbix agent checks on host "HOST-PROD-010": connection restored
    26982:20140327:102226.483 Zabbix agent item "system.cpu.load[,avg1]" on host "HOST-PROD-020" failed: first network error, wait for 15 seconds
    27037:20140327:102232.101 resuming Zabbix agent checks on host "HOST-PROD-011": connection restored
    26954:20140327:102232.217 Zabbix agent item "system.cpu.load[,avg5]" on host "HOST-PROD-010" failed: first network error, wait for 15 seconds
    27013:20140327:102241.183 resuming Zabbix agent checks on host "HOST-PROD-020": connection restored
    27010:20140327:102247.161 resuming Zabbix agent checks on host "HOST-PROD-010": connection restored
    26999:20140327:102302.950 Zabbix agent item "net.if.out[Broadcom BCM5709S NetXtreme II GigE (NDIS VBD Client)-QoS Packet Scheduler-0000]" on host "HOST-PROD-020" failed: first network error, wait for 15 seconds
    26976:20140327:102309.493 Zabbix agent item "net.if.out[WAN Miniport (IKEv2)]" on host "HOST-PROD-010" failed: first network error, wait for 15 seconds
    27000:20140327:102309.588 Zabbix agent item "net.if.out[Broadcom BCM5709S NetXtreme II GigE (NDIS VBD Client)-WFP LightWeight Filter-0000]" on host "HOST-PROD-011" failed: first network error, wait for 15 seconds
    27036:20140327:102317.046 resuming Zabbix agent checks on host "HOST-PROD-020": connection restored
    27012:20140327:102324.380 resuming Zabbix agent checks on host "HOST-PROD-010": connection restored
    27018:20140327:102324.383 resuming Zabbix agent checks on host "HOST-PROD-011": connection restored
    26960:20140327:102427.685 Zabbix agent item "perf_counter[\2\18]" on host "HOST-PROD-010" failed: first network error, wait for 15 seconds
    27005:20140327:102429.780 Zabbix agent item "net.if.in[Broadcom BCM5709S NetXtreme II GigE (NDIS VBD Client) #2-WFP LightWeight Filter-0000]" on host "HOST-PROD-011" failed: first network error, wait for 15 seconds
    27021:20140327:102442.607 resuming Zabbix agent checks on host "HOST-PROD-010": connection restored
    27029:20140327:102444.439 resuming Zabbix agent checks on host "HOST-PROD-011": connection restored
    26987:20140327:102452.732 Zabbix agent item "net.if.in[Microsoft ISATAP Adapter]" on host "HOST-PROD-010" failed: first network error, wait for 15 seconds
    26998:20140327:102454.287 Zabbix agent item "net.if.in[Microsoft ISATAP Adapter #2]" on host "HOST-PROD-020" failed: first network error, wait for 15 seconds
    26977:20140327:102501.415 Zabbix agent item "net.if.out[WAN Miniport (IPv6)]" on host "HOST-PROD-011" failed: first network error, wait for 15 seconds
    27026:20140327:102507.538 resuming Zabbix agent checks on host "HOST-PROD-010": connection restored
    27018:20140327:102509.479 resuming Zabbix agent checks on host "HOST-PROD-020": connection restored
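To see whether the failures cluster on particular hosts or item keys, the log lines above could be tallied with a short script. This is a rough sketch based only on the line format visible in this excerpt (`PID:YYYYMMDD:HHMMSS.mmm message`); the function name `count_failures` and the regular expressions are illustrative and may need adjusting for other Zabbix versions.

```python
import re
from collections import Counter

# Line prefix as seen in the excerpt: PID:YYYYMMDD:HHMMSS.mmm <message>
LINE_RE = re.compile(r'^\s*(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6}\.\d{3}) (?P<msg>.*)$')
FAIL_RE = re.compile(
    r'Zabbix agent item "(?P<key>.+)" on host "(?P<host>[^"]+)" failed: '
    r'(?P<kind>first|another) network error'
)


def count_failures(log_text):
    """Count 'network error' lines per host and per item key (parameters stripped)."""
    per_host = Counter()
    per_key = Counter()
    for raw in log_text.splitlines():
        line = LINE_RE.match(raw)
        if not line:
            continue
        fail = FAIL_RE.search(line.group("msg"))
        if fail:
            per_host[fail.group("host")] += 1
            # "vfs.fs.size[C:,used]" -> "vfs.fs.size"
            per_key[fail.group("key").split("[", 1)[0]] += 1
    return per_host, per_key


sample = """\
27022:20140327:102035.583 resuming Zabbix agent checks on host "HOST-PROD-011": connection restored
26956:20140327:102102.883 Zabbix agent item "net.if.out[Broadcom BCM5709S NetXtreme II GigE (NDIS VBD Client)-WFP LightWeight Filter-0000]" on host "HOST-PROD-010" failed: first network error, wait for 15 seconds
27006:20140327:102118.775 Zabbix agent item "vfs.fs.size[C:,used]" on host "HOST-PROD-020" failed: first network error, wait for 15 seconds
27030:20140327:102127.351 Zabbix agent item "vfs.fs.size[C:,free]" on host "HOST-PROD-010" failed: another network error, wait for 15 seconds
"""

hosts, keys = count_failures(sample)
print(hosts.most_common())
print(keys.most_common())
```

If the counts turn out to be spread evenly across unrelated keys (network, disk, CPU), that would point at the agent connection itself rather than at any single item.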

    Comment

    • aib
      Senior Member
      • Jan 2014
      • 1615

      #47
      Originally posted by moneynut
      Thank you so much. I'm not getting any more false alerts, except for 3 servers.
      Those 3 servers run Windows and are on the latest version of the Zabbix agent.
      I checked the network graphs, and their incoming and outgoing traffic graphs are badly broken. It looks like packet loss or something similar. But when I run a continuous ping from the Zabbix server to those 3 Windows servers, it does not lose any packets.
      Please check the Timeout setting on those 3 servers and increase it if it is still at the default value of 3 seconds.
      Code:
      # find "Timeout" "C:\Program Files(x86)\Zabbix Agent\Zabbix_agent.win.conf"
      ---------- C:\Program Files(x86)\Zabbix Agent\Zabbix_agent.win.conf
      ### Option: Timeout
      #       Spend no more than Timeout seconds on processing
      # Timeout=3
      Timeout=10
      You can also increase Timeout in the Zabbix server configuration file.
      And don't forget to restart the Zabbix agent service on the Windows machine, or the Zabbix server, after any change to the .conf file.
      Sincerely yours,
      Aleksey

      Comment

      • moneynut
        Member
        • Mar 2014
        • 37

        #48
        Originally posted by aib
        Please check the Timeout setting on those 3 servers and increase it if it is still at the default value of 3 seconds.
        Code:
        # find "Timeout" "C:\Program Files(x86)\Zabbix Agent\Zabbix_agent.win.conf"
        ---------- C:\Program Files(x86)\Zabbix Agent\Zabbix_agent.win.conf
        ### Option: Timeout
        #       Spend no more than Timeout seconds on processing
        # Timeout=3
        Timeout=10
        You can also increase Timeout in the Zabbix server configuration file.
        And don't forget to restart the Zabbix agent service on the Windows machine, or the Zabbix server, after any change to the .conf file.
        Timeout was already set to 10 and the service was restarted. I've even tested with a Timeout of 15.

        Comment

        • moneynut
          Member
          • Mar 2014
          • 37

          #49
          Originally posted by moneynut
          Timeout was already set to 10 and the service was restarted. I've even tested with a Timeout of 15.
          I changed Timeout to 20 (on both server and agent, and restarted both), but I still get the same errors :-(

          Comment

          • aib
            Senior Member
            • Jan 2014
            • 1615

            #50
            Check the connectivity manually.
            From the server side, run this command:
            Code:
            # ping {IP_of_WINDOWS_PC} -s 1500 -c 100
            Check the last two lines:
            Code:
            100 packets transmitted, 100 received, 0% packet loss, time 90071ms
            rtt min/avg/max/mdev = 0.346/0.415/0.564/0.065 ms
            If anything looks wrong, it is better to fix it before dealing with Zabbix.
            Sincerely yours,
            Aleksey
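When comparing several hosts, the two summary lines can be pulled apart programmatically so that loss and round-trip times (which ping reports in milliseconds) are easy to line up. The sketch below assumes the Linux iputils summary format shown above; `parse_ping_summary` is a made-up helper name.

```python
import re


def parse_ping_summary(text):
    """Extract packet loss and rtt values (in ms) from a Linux ping summary."""
    loss = re.search(
        r'(\d+) packets transmitted, (\d+) received, ([\d.]+)% packet loss', text
    )
    rtt = re.search(
        r'rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms', text
    )
    if not loss or not rtt:
        raise ValueError("unrecognized ping summary format")
    return {
        "sent": int(loss.group(1)),
        "received": int(loss.group(2)),
        "loss_percent": float(loss.group(3)),
        "rtt_min_ms": float(rtt.group(1)),
        "rtt_avg_ms": float(rtt.group(2)),
        "rtt_max_ms": float(rtt.group(3)),
        "rtt_mdev_ms": float(rtt.group(4)),
    }


summary = """\
100 packets transmitted, 100 received, 0% packet loss, time 90071ms
rtt min/avg/max/mdev = 0.346/0.415/0.564/0.065 ms
"""

result = parse_ping_summary(summary)
print(result)
```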

            Comment

            • moneynut
              Member
              • Mar 2014
              • 37

              #51
              Originally posted by aib
              Check the connectivity manually.
              From the server side, run this command:
              Code:
              # ping {IP_of_WINDOWS_PC} -s 1500 -c 100
              Check the last two lines:
              Code:
              100 packets transmitted, 100 received, 0% packet loss, time 90071ms
              rtt min/avg/max/mdev = 0.346/0.415/0.564/0.065 ms
              If anything looks wrong, it is better to fix it before dealing with Zabbix.
              Like I mentioned before, there is no packet loss. I had already made sure there are no network issues.
              Anyway, here you go:

              100 packets transmitted, 100 received, 0% packet loss, time 99232ms
              rtt min/avg/max/mdev = 1.318/1.538/4.732/0.405 ms
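A clean ping only exercises ICMP, whereas passive agent checks go over TCP (port 10050 by default), so it may also be worth timing a plain TCP connection to the agent port. The following is a minimal sketch using only the Python standard library; `tcp_connect_ms` is a hypothetical helper, and it only tests that the port accepts connections, not that the agent answers an item request.

```python
import socket
import time


def tcp_connect_ms(host, port=10050, timeout=3.0):
    """Return the time in milliseconds needed to open a TCP connection,
    or None if the connection fails or times out."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None
```

Calling it in a loop against one of the three Windows hosts, for example `tcp_connect_ms("HOST-PROD-010")`, and watching for `None` or unusually large values might show whether connections to the agent are being dropped or delayed even though ICMP looks healthy.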

              Comment

              • aib
                Senior Member
                • Jan 2014
                • 1615

                #52
                Originally posted by moneynut
                Like I mentioned before, there is no packet loss. I had already made sure there are no network issues.
                Anyway, here you go:

                100 packets transmitted, 100 received, 0% packet loss, time 99232ms
                rtt min/avg/max/mdev = 1.318/1.538/4.732/0.405 ms
                Wow! Such a long response time - almost 5 seconds.
                That could be the issue, and I'm sorry, I don't know any other way to fix your problem. You have already tried increasing Timeout almost to the maximum value (30 sec), and it doesn't help much.
                Sincerely yours,
                Aleksey

                Comment

                • moneynut
                  Member
                  • Mar 2014
                  • 37

                  #53
                  Originally posted by aib
                  Wow! Such a long response time - almost 5 seconds.
                  That could be the issue, and I'm sorry, I don't know any other way to fix your problem. You have already tried increasing Timeout almost to the maximum value (30 sec), and it doesn't help much.
                  There are lots of other servers in our network whose response times are worse than those of these Windows servers, yet they don't have any problem with Zabbix :-(

                  Deleting and re-adding the host does not help either.

                  Comment

                  • tchjts1
                    Senior Member
                    • May 2008
                    • 1605

                    #54
                    I think there is one other approach you may try, and that is to use Zabbix agent (active) items instead of passive items. For this to work properly, whatever you have on your monitored host for Hostname= must exactly match (including case) the name you have for the host in the Zabbix frontend.

                    You then must also have ServerActive= defined on your Zabbix agents in zabbix_agentd.conf (that would be the IP or DNS name of your Zabbix server).

                    Once you have set those up (and restarted the Zabbix agent service), one way to test it would be to change an item that is failing on these 3 servers to Zabbix agent (active) and see if that item then starts reporting in. It will take some time for the configuration to take effect, and that also depends on your polling interval.
                    Last edited by tchjts1; 27-03-2014, 18:15.
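The two prerequisites described above (ServerActive set, and Hostname matching the frontend name exactly, including case) could be sanity-checked with a few lines of script before switching any items. This is just a sketch: `check_active_config` is a made-up helper, the address 192.0.2.10 is a placeholder for the Zabbix server, and treating a missing Hostname= line as a problem is this sketch's simplification.

```python
def check_active_config(agent_conf_text, frontend_host_name):
    """Return a list of problems that would likely break active checks."""
    settings = {}
    for raw in agent_conf_text.splitlines():
        line = raw.strip()
        # Only uncommented key=value lines count.
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()

    problems = []
    if not settings.get("ServerActive"):
        problems.append("ServerActive is not set")
    hostname = settings.get("Hostname")
    if hostname is None:
        # Simplification: this sketch expects an explicit Hostname= line.
        problems.append("Hostname is not set explicitly")
    elif hostname != frontend_host_name:
        # The comparison is deliberately case-sensitive.
        problems.append(
            "Hostname %r does not match frontend name %r"
            % (hostname, frontend_host_name)
        )
    return problems


sample_conf = """\
# ServerActive=
ServerActive=192.0.2.10
Hostname=HOST-PROD-010
"""

print(check_active_config(sample_conf, "HOST-PROD-010"))
print(check_active_config(sample_conf, "host-prod-010"))
```

An empty list means both prerequisites look fine; the second call shows how a name differing only in case is reported.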

                    Comment

                    • aib
                      Senior Member
                      • Jan 2014
                      • 1615

                      #55
                      Originally posted by tchjts1
                      You then must also have ServerActive= defined on your Zabbix server in zabbix_server.conf (that would be the IP or DNS name of your Zabbix server)
                      Easy, man!
                      Don't mislead him!!
                      The ServerActive keyword has to be defined in the zabbix_agent.conf file on the Windows client!

                      You cannot configure ServerActive on Zabbix server in zabbix_server.conf
                      Sincerely yours,
                      Aleksey

                      Comment

                      • moneynut
                        Member
                        • Mar 2014
                        • 37

                        #56
                        Originally posted by aib
                        Easy, man!
                        Don't mislead him!!
                        The ServerActive keyword has to be defined in the zabbix_agent.conf file on the Windows client!

                        You cannot configure ServerActive on Zabbix server in zabbix_server.conf
                        Thanks. And our zabbix only does passive checks. We did not enable active checks.

                        Comment

                        • moneynut
                          Member
                          • Mar 2014
                          • 37

                          #57
                          I'll try active checks soon. Right now, I have set Timeout to 30 seconds and I am not getting emails about the servers being unreachable, but in the logs I still see "failed: another network error, wait for 15 seconds". I'll wait a while before assuming that raising Timeout to the maximum really stopped the alerts.

                          Comment

                          • tchjts1
                            Senior Member
                            • May 2008
                            • 1605

                            #58
                            Originally posted by aib

                            You cannot configure ServerActive on Zabbix server in zabbix_server.conf
                            Yeah, I was multitasking when I typed that out. I fixed it now.

                            Comment
