Many Servers are unreachable for more than 5 minutes

  • moneynut
    Member
    • Mar 2014
    • 37

    #1

    Many Servers are unreachable for more than 5 minutes

    Everything was working fine until two days ago. Now many Windows servers are reported as unreachable for more than 5 minutes. They recover after an hour or two, and then about 2 hours later they are reported as unreachable again. We did not make any changes.

    I can telnet to and ping the hosts without any issue while Zabbix says they are unreachable.
    I tried restarting the Zabbix agent and the Zabbix server as well.
    When I restart the Zabbix agent, sometimes the host recovers (it still fails again later) and sometimes it does not.

    Looks like something is wrong as per the Server logs:

    bc@zabbix_server:/etc/zabbix$tail -f /var/log/zabbix/zabbix_server.log
    snmp_build: unknown failuresnmp_build: unknown failure 994:20140313:053931.839 enabling SNMP checks on host [HOST-092]: host became available
    snmp_build: unknown failuresnmp_build: unknown failure 993:20140313:054235.345 resuming Zabbix agent checks on host [HOST-002]: connection restored
    986:20140313:054347.246 Zabbix agent item [net.if.out[WAN Miniport (PPPOE)]] on host [HOST003] failed: first network error, wait for 15 seconds
    984:20140313:054501.894 Zabbix agent item [perf_counter[\2\18]] on host [HOST-192] failed: first network error, wait for 15 seconds
    988:20140313:054517.968 Zabbix agent item [system.cpu.load[,avg5]] on host [HOST-040] failed: first network error, wait for 15 seconds
    994:20140313:055214.461 resuming Zabbix agent checks on host [HOST003]: connection restored
    993:20140313:055220.382 resuming Zabbix agent checks on host [HOST-040]: connection restored
    991:20140313:055239.542 Zabbix agent item [vfs.fs.size[C:,pfree]] on host [HOST-125] failed: first network error, wait for 15 seconds
    985:20140313:055258.236 Zabbix agent item [net.if.out[WAN Miniport (PPTP)]] on host [HOST-002] failed: first network error, wait for 15 seconds
    snmp_build: unknown failure 985:20140313:055722.742 SNMP item [sysContact] on host [HOST-092] failed: first network error, wait for 15 seconds
    993:20140313:055942.402 resuming Zabbix agent checks on host [HOST-192]: connection restored
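
To see how long each host actually stayed in the "network error" state, the server log lines above can be paired up per host. This is a minimal sketch, assuming Python 3 is available on the Zabbix server. The function name `outages` is hypothetical, and the regular expressions only cover the message formats visible in this excerpt.

```python
import re
from datetime import datetime

# "PID:YYYYMMDD:HHMMSS.mmm message"; search() skips stray "snmp_build" prefixes.
LINE_RE = re.compile(r"(\d+):(\d{8}:\d{6}\.\d{3}) (.*)")
FAIL_RE = re.compile(
    r"item \[(.+)\] on host \[([^\]]+)\] failed: first network error")
RESTORE_RE = re.compile(
    r"resuming Zabbix agent checks on host \[([^\]]+)\]: connection restored")


def outages(lines):
    """Return (host, seconds) for each first failure followed by a restore."""
    first_failure = {}
    result = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        when = datetime.strptime(match.group(2), "%Y%m%d:%H%M%S.%f")
        message = match.group(3)
        failed = FAIL_RE.search(message)
        if failed:
            # Remember only the earliest failure per host.
            first_failure.setdefault(failed.group(2), when)
            continue
        restored = RESTORE_RE.search(message)
        if restored and restored.group(1) in first_failure:
            start = first_failure.pop(restored.group(1))
            result.append((restored.group(1), (when - start).total_seconds()))
    return result
```

Feeding it the whole file, for example `outages(open("/var/log/zabbix/zabbix_server.log"))`, gives one entry per completed outage. Hosts whose outage is still open are not listed.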

    And here is the Agent LOG:

    This agent log is from HOST-192. As you can see, I restarted the agent hours ago and it was working fine, but now the host is unreachable again.

    1448:20140122:111621.540 Starting Zabbix Agent [HOST-192]. Zabbix 2.0.6 (revision 35155).
    2064:20140122:111621.587 agent #0 started [collector]
    2068:20140122:111621.587 agent #1 started[listener]
    2072:20140122:111621.587 agent #2 started[listener]
    2080:20140122:111621.587 agent #4 started [active checks]
    2076:20140122:111621.587 agent #3 started[listener]
    2080:20140122:111642.632 active check configuration update from [10.48.12.163:10051] started to fail (cannot connect to [[10.48.12.163]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
    1600:20140312:221226.684 Zabbix Agent shutdown requested
    2080:20140312:221227.043 zabbix_agentd active check stopped
    2064:20140312:221227.636 zabbix_agentd collector stopped
    1600:20140312:221227.714 Zabbix Agent stopped. Zabbix 2.0.6 (revision 35155).
    2864:20140312:221232.191 Starting Zabbix Agent [HOST-192]. Zabbix 2.0.6 (revision 35155).
    5548:20140312:221232.191 agent #0 started [collector]
    5252:20140312:221232.191 agent #1 started[listener]
    752:20140312:221232.191 agent #2 started[listener]
    3928:20140312:221232.191 agent #3 started[listener]
    3932:20140312:221232.191 agent #4 started [active checks]
    3932:20140312:221253.202 active check configuration update from [10.48.12.163:10051] started to fail (cannot connect to [[10.48.12.163]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)



    This agent log is from HOST-002, where I did not restart the agent. It keeps recovering by itself and then becomes unreachable again.

    1924:20140122:114556.752 Starting Zabbix Agent [HOST-002]. Zabbix 2.0.6 (revision 35155).
    1928:20140122:114556.767 agent #0 started [collector]
    1932:20140122:114556.767 agent #1 started[listener]
    1936:20140122:114556.767 agent #2 started[listener]
    1940:20140122:114556.767 agent #3 started[listener]
    1944:20140122:114556.767 agent #4 started [active checks]
    1944:20140122:114617.780 active check configuration update from [10.48.12.163:10051] started to fail (cannot connect to [[10.48.12.163]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
    1900:20140122:115400.850 Zabbix Agent shutdown requested
    1944:20140122:115401.209 zabbix_agentd active check stopped
    1928:20140122:115401.240 zabbix_agentd collector stopped
    1900:20140122:115401.864 Zabbix Agent stopped. Zabbix 2.0.6 (revision 35155).
    1908:20140122:121024.890 Starting Zabbix Agent [HOST-002]. Zabbix 2.0.6 (revision 35155).
    1912:20140122:121024.906 agent #0 started [collector]
    1916:20140122:121024.906 agent #1 started[listener]
    1920:20140122:121024.906 agent #2 started[listener]
    1924:20140122:121024.906 agent #3 started[listener]
    1928:20140122:121024.906 agent #4 started [active checks]
    1928:20140122:121045.950 active check configuration update from [10.48.12.163:10051] started to fail (cannot connect to [[10.48.12.163]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
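
The `[0x0000274C]` in these agent logs is a Windows Sockets error code written in hexadecimal. A small sketch for decoding it follows. The lookup table is deliberately partial and the helper name `decode_winsock` is hypothetical. Note that these messages concern active checks (agent connecting to the server on port 10051), which is a different direction from the passive checks that fail in the server log.

```python
# Partial table of Windows Sockets error codes (decimal value -> name).
WINSOCK_ERRORS = {
    10060: "WSAETIMEDOUT (connection timed out)",
    10061: "WSAECONNREFUSED (connection refused)",
    10065: "WSAEHOSTUNREACH (no route to host)",
}


def decode_winsock(hex_code):
    """Convert a hex code such as '0x0000274C' to (decimal, description)."""
    code = int(hex_code, 16)
    return code, WINSOCK_ERRORS.get(code, "not in this partial table")


print(decode_winsock("0x0000274C"))
```

For the code in the logs this yields 10060, i.e. the connection attempt to 10.48.12.163:10051 timed out rather than being refused.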
  • moneynut
    Member
    • Mar 2014
    • 37

    #2
    Bump! Can somebody please assist me? I've been getting around 100 emails every few minutes.


    • moneynut
      Member
      • Mar 2014
      • 37

      #3
      The trigger expression I'm using is:
      {Template App Zabix Agent:agent.ping.nodata(5m)}=1

      Someone needs to help us. I know many people have had this issue, and changing the timeout worked for some. My Timeout is already 10 in the Zabbix config file.


      • moneynut
        Member
        • Mar 2014
        • 37

        #4
        Come on, guys... So many certified people and some experts without certification, and yet I don't get help? I wish there were enough online video tutorials for me to learn enough to help others. Yes, I'm active on other forums and I'm helpful to others. Take a minute and share your knowledge even if you're wrong. I just want opinions, advice, or a solution.


        • steveboyson
          Senior Member
          • Jul 2013
          • 582

          #5
          Try increasing the timeouts on both the server and the agent. It seems that the agent does not respond within the given period.
          Also note that Windows perfmon counters are quite time-sensitive and only get executed when the Windows box decides that it is OK (roughly speaking).

          You could try to set the failing items to "passive" if that is possible and watch what is happening then.

          Oh, and please do not think that nobody is helping just because we don't want to. Perhaps nobody simply knows what is failing in your environment?


          • kloczek
            Senior Member
            • Jun 2006
            • 1771

            #6
            Originally posted by moneynut
            Come on, guys... So many certified people and some experts without certification, and yet I don't get help? I wish there were enough online video tutorials for me to learn enough to help others. Yes, I'm active on other forums and I'm helpful to others. Take a minute and share your knowledge even if you're wrong. I just want opinions, advice, or a solution.
            Don't get me wrong, but Zabbix is a monitoring tool and nothing more.
            Zabbix cannot answer questions like "why do I have active trigger A?"
            You are not asking to share knowledge but for support.

            There can be many causes of what you are observing, and in most cases they have nothing to do with Zabbix.
            If you see the log message "cannot connect to [[10.48.12.163]:10051]", first try to confirm it using the telnet command (telnet 10.48.12.163 10051). After this you can ask yourself "why is it not possible to establish this connection?".

            Without these steps, changing anything in the Zabbix setup is pointless.
            Why? Again: because in this case, with almost 100% certainty, what you see has nothing to do with Zabbix and it is a network layer problem (wrong routing, wrong FW policy, dodgy cable, faulty network device/part ...)
            http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
            https://kloczek.wordpress.com/
            zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
            My zabbix templates https://github.com/kloczek/zabbix-templates
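
Because the problem in this thread is intermittent, a single manual telnet test can easily land in a good period. A hedged sketch of a repeated TCP connection probe follows, assuming Python 3 on the host that initiates the connection. The function names `probe` and `probe_repeatedly` are hypothetical, and the address and port in the usage note are the ones from the agent log.

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Attempt one TCP connection; return (ok, seconds_taken, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts, refusals and routing errors
        return False, time.monotonic() - start, str(exc)


def probe_repeatedly(host, port, attempts=5, pause=1.0, timeout=3.0):
    """Run several probes in a row so intermittent failures become visible."""
    results = []
    for attempt in range(attempts):
        results.append(probe(host, port, timeout))
        if attempt < attempts - 1:
            time.sleep(pause)
    return results
```

Running something like `probe_repeatedly("10.48.12.163", 10051, attempts=720, pause=10.0)` from one of the affected Windows hosts would cover roughly two hours, which matches the recovery cycle described in the first post.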


            • moneynut
              Member
              • Mar 2014
              • 37

              #7
              Originally posted by kloczek
              Don't get me wrong, but Zabbix is a monitoring tool and nothing more.
              Zabbix cannot answer questions like "why do I have active trigger A?"
              You are not asking to share knowledge but for support.

              There can be many causes of what you are observing, and in most cases they have nothing to do with Zabbix.
              If you see the log message "cannot connect to [[10.48.12.163]:10051]", first try to confirm it using the telnet command (telnet 10.48.12.163 10051). After this you can ask yourself "why is it not possible to establish this connection?".

              Without these steps, changing anything in the Zabbix setup is pointless.
              Why? Again: because in this case, with almost 100% certainty, what you see has nothing to do with Zabbix and it is a network layer problem (wrong routing, wrong FW policy, dodgy cable, faulty network device/part ...)
              Well, I've already tested telnet and whatnot. I see no network issues either. I wonder what is causing Zabbix to send false alerts.


              • kloczek
                Senior Member
                • Jun 2006
                • 1771

                #8
                Originally posted by moneynut
                Well, I've already tested telnet and whatnot. I see no network issues either. I wonder what is causing Zabbix to send false alerts.
                Could you please show what exactly you did and what exactly was displayed by telnet command test?

                • moneynut
                  Member
                  • Mar 2014
                  • 37

                  #9
                  Originally posted by kloczek
                  Could you please show what exactly you did and what exactly was displayed by telnet command test?
                  telnet zabbix_server_ip 10051
                  Connected to zabbix_server_ip 10051
                  Escape character is '^]'.

                  Like I said, there is no connection issue. The Zabbix server can connect to the agent host on port 10050 as well.
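
A successful telnet only shows that the TCP handshake completes. It does not show that the agent answers a check within the timeout. The sketch below goes one step further and requests `agent.ping` the way a passive check does. It assumes the plain-text passive check request (item key followed by a newline) and the `ZBXD` response header with an 8-byte little-endian length. The function names are hypothetical. It also assumes it is run from a machine listed in the agent's `Server=` parameter, since the agent is expected to drop connections from other addresses.

```python
import socket
import struct

ZBX_HEADER = b"ZBXD\x01"


def parse_response(raw):
    """Strip the ZBXD header and length field if present; return the text."""
    if raw.startswith(ZBX_HEADER):
        (length,) = struct.unpack("<Q", raw[5:13])  # 8-byte little-endian
        return raw[13:13 + length].decode("utf-8", "replace")
    return raw.decode("utf-8", "replace").rstrip("\n")


def query_agent(host, key, port=10050, timeout=3.0):
    """Send one passive check request and return the agent's answer."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.settimeout(timeout)
        conn.sendall(key.encode("utf-8") + b"\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the agent closes the connection after replying
                break
            chunks.append(data)
    return parse_response(b"".join(chunks))
```

Calling `query_agent("HOST-192", "agent.ping")` from the Zabbix server during an "unreachable" period and timing it would show whether the agent is slow to answer, even though the port itself is open.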


                  • moneynut
                    Member
                    • Mar 2014
                    • 37

                    #10
                    Can someone at least give me a link to a pre-compiled Zabbix 2.2 server package for Ubuntu Server 12.04, x86-64?
                    I'm planning to upgrade from 2.0 to 2.2.


                    • steveboyson
                      Senior Member
                      • Jul 2013
                      • 582

                      #11
                      For all those people who find it more convenient to bother you with their question rather than to Google it for themselves.


                      • moneynut
                        Member
                        • Mar 2014
                        • 37

                        #12
                        Originally posted by steveboyson
                        I was searching in the Zabbix repo directory.

                        But anyway, I already got the link from the documentation.


                        • tchjts1
                          Senior Member
                          • May 2008
                          • 1605

                          #13
                          Did you also follow Steve's advice on increasing the Timeout= value in your Zabbix server config (zabbix_server.conf) as well as on your agents (zabbix_agentd.conf)?

                          The default value of 3 seems to cause this issue quite a bit. Maybe try 10 or 15 for this value. If you modify conf values, you need to restart the Zabbix processes.
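
To confirm which Timeout value a given config file actually sets, as opposed to a commented-out line, a small parser can help. This is a sketch only: the function name `read_timeout` is hypothetical, and taking the last uncommented `Timeout=` line is an assumption about how duplicates should be treated.

```python
def read_timeout(conf_text, default=3):
    """Return the Timeout set in a Zabbix config text, or the default of 3."""
    value = default
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments such as "# Timeout=3"
        name, sep, raw = line.partition("=")
        if sep and name.strip() == "Timeout":
            value = int(raw.strip())
    return value


example = """
# Timeout=3
LogFile=/var/log/zabbix/zabbix_server.log
Timeout=10
"""
print(read_timeout(example))
```

The same function can be pointed at both files, for example `read_timeout(open("/etc/zabbix/zabbix_server.conf").read())`, to check that the server and the agents agree.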


                          • moneynut
                            Member
                            • Mar 2014
                            • 37

                            #14
                            The Timeout value has been 10 since the very first day it was installed. I've tried it with 15 as well. Same issue.


                            • tchjts1
                              Senior Member
                              • May 2008
                              • 1605

                              #15
                              The next thing to look at then is how your Zabbix internal processes are allocated and being used.

                              Look at this post, the final paragraph and the graphs that follow. Then take a look at what your setup is doing. This will tell a lot about whether you need to tune other settings besides the Timeout= value. It is possible you simply do not have enough pollers/trappers/cache configured to handle the workload.
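
As a very rough way to reason about whether the number of pollers is enough, the required count can be estimated from the check rate and the average time a check occupies a poller. This is a back-of-the-envelope sketch, not a formula from the Zabbix documentation. The function name and all input figures are hypothetical.

```python
import math


def estimate_pollers(passive_items, avg_interval_s, avg_check_s,
                     target_busy=0.5):
    """Estimate pollers needed to keep each one busy at most target_busy."""
    checks_per_second = passive_items / avg_interval_s
    busy_pollers = checks_per_second * avg_check_s  # pollers occupied on average
    return math.ceil(busy_pollers / target_busy)


# Assumed workload: 6000 passive items polled every 60 s, 0.25 s per check.
print(estimate_pollers(6000, 60, 0.25))
```

The average check time matters a lot here: with Timeout=10, one unresponsive agent can hold a poller for up to 10 seconds, so a handful of slow Windows hosts may be enough to delay checks for healthy ones.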
