Many "network errors" in log with 1.1beta8

  • phpkid
    Junior Member
    • Sep 2004
    • 14

    #1

    Many "network errors" in log with 1.1beta8

    I recently upgraded my Zabbix server and agents from beta4 to beta8. Since then I get a lot of "network errors" in my Zabbix server log, and I get false "unreachable" pages. This only started after the upgrade. I still have a few old agents (beta4) left to upgrade, but I get false unreachable pages for them as well. I have done many traceroutes, pings, and tcpdumps with no evidence of any problems. The time to reach the servers is 6 ms in traceroutes. Here is a snippet of my log:

    ---------
    006445:20060414:100730 Timeout while connecting to [Db2]
    006445:20060414:100730 Started network errors for [Db2]
    006445:20060414:100730 Host [Db2]: another network error, wait for 11 seconds
    006452:20060414:101532 Timeout while connecting to [Db2]
    006452:20060414:101532 Host [Db2] will be checked after [60] seconds
    006437:20060414:101600 Enabling host [Db2]
    006452:20060414:101600 Enabling host [Db2]
    006435:20060414:101603 Enabling host [Db2]
    006453:20060414:102332 Timeout while connecting to [Web5]
    006453:20060414:102332 Started network errors for [Web5]
    006453:20060414:102332 Host [Web5]: another network error, wait for 11 seconds
    -----------

    I changed my server.conf to increase the number of suckers and trappers; see below.

    -------
    Server=10
    StartSuckers=24
    StartTrappers=16
    -------

    I have approx. 12 servers I am monitoring, with approx. 15 items each.

    Has anyone else seen this, or does anyone know a direction I can take?

    TIA,

    Phpkid
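
To see whether the timeouts cluster around particular hosts or times, the server log can be tallied with a short script. This is only a sketch: the `PID:YYYYMMDD:HHMMSS message` layout is inferred from the snippet in the post above, and the function names are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line layout, inferred only from the log snippet in the post:
#   PID:YYYYMMDD:HHMMSS message
LINE_RE = re.compile(r"^(\d+):(\d{8}):(\d{6}) (.*)$")
TIMEOUT_RE = re.compile(r"^Timeout while connecting to \[(.+)\]$")


def parse_timeouts(lines):
    """Return a list of (timestamp, host) for every timeout line."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip separators and anything in another format
        _pid, date, time_of_day, message = m.groups()
        t = TIMEOUT_RE.match(message)
        if t:
            stamp = datetime.strptime(date + time_of_day, "%Y%m%d%H%M%S")
            events.append((stamp, t.group(1)))
    return events


def timeouts_per_host(lines):
    """Count timeout lines per host name."""
    return Counter(host for _stamp, host in parse_timeouts(lines))


# A few lines copied from the post, used as sample input.
SAMPLE = """\
006445:20060414:100730 Timeout while connecting to [Db2]
006445:20060414:100730 Started network errors for [Db2]
006452:20060414:101532 Timeout while connecting to [Db2]
006453:20060414:102332 Timeout while connecting to [Web5]
""".splitlines()

if __name__ == "__main__":
    for host, count in timeouts_per_host(SAMPLE).most_common():
        print(host, count)
```

Feeding the full log file through `parse_timeouts` and comparing the timestamps against firewall or switch logs may help show whether the errors line up with anything on the network side.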
  • elkor
    Senior Member
    • Jul 2005
    • 299

    #2
    Zabbix opens a LOT of TCP sockets to target machines. You could be running out of connections, but you shouldn't be with 15 items each on 12 servers, unless they are updating every second or something.
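
A rough back-of-the-envelope version of this estimate, assuming (as a simplification, not something confirmed in the thread) that every item check opens its own TCP connection and that all items share one update interval:

```python
def checks_per_second(hosts, items_per_host, interval_seconds):
    """Upper bound on new connections per second.

    Assumes one connection per item check and one shared update
    interval for all items; both are simplifying assumptions.
    """
    return hosts * items_per_host / interval_seconds


if __name__ == "__main__":
    # 12 hosts and 15 items each come from the original post;
    # the intervals are illustrative.
    for interval in (1, 20, 60):
        print(interval, checks_per_second(12, 15, interval))
```

With 1-second intervals this gives 180 connections per second; at 20 seconds (the shortest interval the original poster mentions later in the thread) it is 9 per second.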

    • Alexei
      Founder, CEO
      Zabbix Certified Trainer
      Zabbix Certified Specialist
      Zabbix Certified Professional
      • Sep 2004
      • 5654

      #3
      This may happen if you use the NoTimeWait option. Disable it on both sides (server and agent).
      Alexei Vladishev
      Creator of Zabbix, Product manager
      New York | Tokyo | Riga

      • phpkid
        Junior Member
        • Sep 2004
        • 14

        #4
        Thanks, Alexei. I looked at that, but I am not using it. I am running the server part on RHEL3 and monitoring a mix of RHEL3 and RHEL4 servers. It happens on any of them, with no specific times or settings that I can pin down. Granted, the Zabbix server is not in the same facility as the monitored servers, but the connections are fast (6 ms). I am not sure where to go from here, as it started only after upgrading the Zabbix server from beta4. Is there an easy way to downgrade to test?

        TIA,

        Phpkid

        • phpkid
          Junior Member
          • Sep 2004
          • 14

          #5
          Just a quick note: I have noticed that "host status" no longer works for any of the monitored servers. Is this no longer supported? Was it replaced by another item? Should I just use the ping item with the agent?


          Alexei...I think we need some clarification on this.

          Thanks,

          Phpkid

          • phpkid
            Junior Member
            • Sep 2004
            • 14

            #6
            Well, to update my last post: it looks like "host status" is working, as I have received notifications of servers being unreachable (false ones, though). But looking at "latest data" for any of the servers, I see that the "host status check" has, according to that page, not been done in several days. Does it now display the last time an event happened? Or does it no longer display the last check time? Bug?


            Phpkid.

            • phpkid
              Junior Member
              • Sep 2004
              • 14

              #7
              Update

              Another update. I have been monitoring the Zabbix server and agents closely. I continue to get false pages (due to network errors reported in the Zabbix log). All agents are version 1.1beta8 (as is the server). There are no network errors on either side that I can see. When the Zabbix log reports the error, there is still connectivity to the servers. I see no gaps or data loss in the graphs, so it appears the agents are still talking. I have increased the zabbix_server setting "UnavailablePeriod" to 90 to help alleviate this, but it still happens. I am going to increase this setting again to 120 and see what happens. Any ideas would be greatly appreciated; I am at a loss as to the source of the problem.

              TIA,

              Phpkid

              • phpkid
                Junior Member
                • Sep 2004
                • 14

                #8
                Update

                Well, I set the Timeout value to 120 and I am still getting "network errors" in the Zabbix log and false pages. How does the Zabbix poller determine an error? I even saw "no route to host" errors, which were totally false, as I was pinging the same server during this time without issue. Also, it does not seem to be taking the 2 min (120 sec) setting into account, as the servers are totally reachable during this time. Can anyone suggest anything?

                Thanks,

                Phpkid

                • elkor
                  Senior Member
                  • Jul 2005
                  • 299

                  #9
                  I honestly don't know, kid. Something sounds screwy, but I'll be damned if I know what. I have beta8 installed and am not having this issue, so it must be something with your environment.

                  • Alexei
                    Founder, CEO
                    Zabbix Certified Trainer
                    Zabbix Certified Specialist
                    Zabbix Certified Professional
                    • Sep 2004
                    • 5654

                    #10
                    "Timeout while connecting to [Db2]" obviously means that the ZABBIX server is
                    unable to establish a connection to Db2 because of a timeout.

                    Try "telnet db2 10050" to see what it really means.
                    Alexei Vladishev
                    Creator of Zabbix, Product manager
                    New York | Tokyo | Riga
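
Because the errors are intermittent, a single manual telnet may well succeed. A scripted variant of the same check can repeat the connection attempt and record how long each one takes. This is a minimal sketch, assuming the agent listens on TCP port 10050 as in the telnet command above; the function names and the 3-second default timeout are illustrative and are not Zabbix settings.

```python
import socket
import time


def probe(host, port=10050, timeout=3.0):
    """Try one TCP connect; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts and
        # name-resolution failures) if the connection cannot be made.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        return False, time.monotonic() - start, str(exc)


def probe_repeatedly(host, attempts=5, pause=1.0, **kwargs):
    """Run probe() several times, pausing between attempts."""
    results = []
    for i in range(attempts):
        results.append(probe(host, **kwargs))
        if i + 1 < attempts:
            time.sleep(pause)
    return results
```

Running something like `probe_repeatedly("db2", attempts=600, pause=1.0)` from the Zabbix server itself, and comparing failed attempts with the timestamps in the server log, may show whether the TCP connects themselves are timing out.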

                    • phpkid
                      Junior Member
                      • Sep 2004
                      • 14

                      #11
                      Update

                      Hehe, I see what you are getting at, Alexei. However, during the periods when it says it cannot talk, it is still getting data from the hosts, as I have no gaps in any graphs for items. The lowest interval of any item I am checking is 20 s. Am I wrong in thinking the hosts are alive if there are no gaps?

                      Thanks,

                      Phpkid
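
One way to make the "no gaps" observation less dependent on eyeballing graphs is to check the stored sample timestamps directly. The sketch below assumes the timestamps of one item's values have already been exported somehow (the export step is not shown, and `find_gaps` is a made-up helper name). Note that a single failed connect may not leave a visible gap if a retry succeeds before the next sample is due.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, interval_seconds, slack=2.0):
    """Return (previous, current) pairs that are further apart than
    slack * interval_seconds, i.e. where at least one sample is missing."""
    limit = timedelta(seconds=interval_seconds * slack)
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


if __name__ == "__main__":
    # Synthetic data: one value every 20 s, with two samples dropped.
    start = datetime(2006, 4, 14, 10, 0, 0)
    samples = [start + timedelta(seconds=20 * i) for i in range(10)]
    samples = [t for i, t in enumerate(samples) if i not in (4, 5)]
    for before, after in find_gaps(samples, interval_seconds=20):
        print("gap between", before, "and", after)
```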

                      • phpkid
                        Junior Member
                        • Sep 2004
                        • 14

                        #12
                        Update again

                        Another update: still no go. I still get timeouts listed in the Zabbix server debug log when the server can clearly get to the host. It happens on any of the hosts. I have had network engineers at both facilities run checks, and all pass; no network issues. BTW, I am running beta9 also. I have looked through the source code to see how Zabbix determines the errors. It's been difficult and I am still confused (I do not write code, but do understand some of it). Again, this problem did not arise until I upgraded from beta4 to beta8. Is there a way to turn this functionality off so I can limp along without false pages? Or can someone describe how Zabbix decides that a host is unreachable, has network errors, etc.?

                        Any help would be greatly appreciated.......

                        Thanks,

                        Phpkid
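
For readers trying to follow the log messages, here is a purely hypothetical model of the kind of state tracking they suggest. It is NOT taken from the Zabbix source: the class, its fields, and its thresholds are assumptions reverse-engineered from the "Started network errors", "wait for 11 seconds", "will be checked after [60] seconds" and "Enabling host" lines quoted in the first post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostState:
    """Hypothetical per-host error tracking; not actual Zabbix logic."""
    retry_delay: int = 11         # seconds; the value seen in the log
    unavailable_period: int = 60  # seconds; assumed threshold
    recheck_delay: int = 60       # seconds; the value seen in the log
    errors_from: Optional[int] = None  # time of the first failure in a run
    available: bool = True

    def on_failure(self, now):
        """Record a failed check; return seconds until the next attempt."""
        if self.errors_from is None:
            self.errors_from = now  # first failure of a new run
            return self.retry_delay
        if now - self.errors_from >= self.unavailable_period:
            self.available = False  # failures lasted too long
            return self.recheck_delay
        return self.retry_delay

    def on_success(self):
        """A successful check clears the error run and re-enables the host."""
        self.errors_from = None
        self.available = True
```

Under this model, raising the threshold only delays the page; it does not prevent one if the connects keep failing for longer than the threshold. Whether the real server behaves this way would need to be confirmed against the source or by a developer.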

                        • elkor
                          Senior Member
                          • Jul 2005
                          • 299

                          #13
                          I don't know, man. I tend to blame intermittent connectivity issues on TCP/IP: bad ARP tables, flaky switches, spanning tree convergence, etc.

                          Sometimes the problems can be so fleeting that you can actually connect a millisecond or two later.

                          I'm really not sure, as I'm not seeing this behavior on this end.

                          Try forcing active=0 in the agent's config file. I've found that even though that's supposed to be the default, the agents still try to check if you leave it undefined. Perhaps your firewalls are momentarily blocking communications between server and client when this happens.

                          It's a real super long shot, but hey, nothing else seems to be working for you.

                          • phpkid
                            Junior Member
                            • Sep 2004
                            • 14

                            #14
                            Elkor,
                            Did you mean "DisableActive=1", or is there another parameter named "Active=0"?

                            Thanks,

                            Phpkid

                            • elkor
                              Senior Member
                              • Jul 2005
                              • 299

                              #15
                              Yeah, that's the one. I didn't have the config in front of me.

                              DisableActive=1 to shut it off.
