Getting a high amount of unreachable hosts and Network errors in local Network

  • usbpc
    Junior Member
    • Jan 2020
    • 9

    #1

    I have a medium-sized Zabbix setup: one central Zabbix server and multiple Zabbix proxies, one at each site I'm monitoring. All of them are set up with the official Docker containers. The main server uses:

    * postgres:11-alpine
    * zabbix/zabbix-web-nginx-pgsql:alpine-4.0-latest
    * zabbix/zabbix-snmptraps:alpine-4.0-latest
    * zabbix/zabbix-server-pgsql:alpine-4.0-latest

    The Proxies are all just a single Docker image:
    * zabbix/zabbix-proxy-sqlite3:ubuntu-4.0-latest

    The proxies mostly monitor other VMs in the same VMware vCenter.

    The problem is that in the proxy logs I see a very high number of network errors that all look somewhat like this:

    Zabbix agent item "some.item" on host "SOME HOST" failed: first network error, wait for 15 seconds

    This results in a high number of false-positive problems in Zabbix, mostly "Zabbix agent on SOME HOST is unreachable for 5 minutes", but sometimes also other problems that are triggered by .nodata().

    A lot of item data is also missing, since hosts with network errors are considered "offline" for a while and none of their items are checked.

    I've also tried to investigate it a bit and found the source code that produces this error: https://github.com/zabbix/zabbix/blo.../poller.c#L302

    Unfortunately the same message seems to be triggered in 3 different failure cases: https://github.com/zabbix/zabbix/blo.../poller.c#L749

    Therefore I couldn't really find out anything that way. Of course I also looked at CPU, RAM, disk and network usage on the proxies and couldn't find anything that looked out of the ordinary to me.

    How should I proceed to find out the cause of these errors? Has anyone else had this happen to them?
    Last edited by usbpc; 03-02-2020, 16:39.
  • Markku
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
    • Sep 2018
    • 1782

    #2
    I would recommend monitoring the Zabbix server and Zabbix proxies with the official Zabbix templates, they provide statistics for Zabbix internal metrics like how much CPU each poller type consumes. Looking at those statistics may provide you some hints for the error reason.

    For example, in one of my environments today, a non-redundant WLC (wireless LAN controller) was down, which caused the discovered 30+ access point hosts to be not responding to SNMP. That caused the Zabbix unreachable poller to take 100% CPU for the duration of the WLC outage (with the default StartPollersUnreachable=1 setting). Initially also the normal poller went to 100% CPU, which caused the few passive agents in the system to go "unreachable" for a moment, and network errors were shown in the logs. Based on these Zabbix internal statistics I could conclude that there was actually no network problem at all for those passive hosts, but the messages were caused by the poller being too heavily loaded for a moment.

    Markku


    • usbpc
      Junior Member
      • Jan 2020
      • 9

      #3
      Thank you very much for the detailed response!

      Unfortunately I've already done that and adjusted the settings accordingly. When I started looking into it, overloaded pollers were part of the problem, but I've increased the number of data pollers so that they are around 55% busy in normal operation. I've also increased the number of unreachable pollers.

      This is also happening while all devices are working completely normally. There are no real problems; it's just that Zabbix sometimes has connection problems.

      Let me try to describe the Problem again in a bit more detail:

      The Zabbix server itself doesn't connect to any agents; it is only connected to proxies. The proxies do all the connecting to agents.
      All the items that I'm monitoring on all hosts work and give me back values when using zabbix_get. They also work ~95% of the time when the proxies try to get the items. Sometimes, though, a proxy fails to get an item once, resulting in the error:


      Zabbix agent item "some.item" on host "SOME HOST" failed: first network error, wait for 15 seconds

      With debug-level logging turned on, it also produces:

      Get value from agent failed: cannot connect to [[{IP of SOME HOST}]:10050]: [4] Interrupted system call


      This happens quite regularly, resulting in the problems described above. It works well enough that Zabbix is still useful, but not perfectly.
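For anyone reading along: the number in square brackets at the end of that debug line is the OS errno of the failed connect. Below is a minimal Python sketch for pulling it out of such lines and mapping it to its symbolic name. The function name and the example IP (192.0.2.10, a documentation address) are made up for illustration, and the name lookup depends on the platform the script runs on.

```python
import errno
import re

# Matches the trailing "[<number>] <text>" part of the debug message.
ERRNO_RE = re.compile(r"\[(\d+)\]\s+(.+)$")


def decode_connect_error(line):
    """Return (code, symbolic name or None, message text), or None if no errno is found."""
    match = ERRNO_RE.search(line.strip())
    if match is None:
        return None
    code = int(match.group(1))
    # errno.errorcode maps numeric codes to names for the platform running this script.
    return code, errno.errorcode.get(code), match.group(2)


# Hypothetical sample line in the format quoted above.
sample = ("Get value from agent failed: cannot connect to "
          "[[192.0.2.10]:10050]: [4] Interrupted system call")
print(decode_connect_error(sample))
```

Code 4 is EINTR, meaning a blocking call was interrupted by a signal before it completed. On its own that does not say whether the remote side was at fault.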




      • usbpc
        Junior Member
        • Jan 2020
        • 9

        #4
        Just to give an idea of what I have done so far:
        1. Saw a lot of "Zabbix agent item "some.item" on host "SOME HOST" failed: first network error, wait for 15 seconds" messages in the proxy logs and got "Zabbix agent on {HOST.NAME} is unreachable for 5 minutes" warnings from the "Template App Zabbix Agent".
        2. Checked CPU/RAM/disk/network of the machine where the proxy is running.
        3. Used the Zabbix templates to monitor proxy/server health.
          1. Interesting to note here is that the missing items are reported on the server by the templates, although the server is not directly connecting to any clients. Also, in the "Administration > Queue" overview, the items are correctly shown on the proxies that are monitoring them.
        4. Got more detailed logs using "zabbix_server --runtime-control log_level_increase" to figure out more closely what is going wrong.
          1. This got me the more detailed error message "Get value from agent failed: cannot connect to [[{IP of SOME HOST}]:10050]: [4] Interrupted system call".
        Those are all the troubleshooting steps I've taken so far. I've also changed some items from passive checks to active agent checks, and that has for now reduced the errors to the point where the unreachable problem messages are meaningful again.

        But this still leaves the root problem unsolved. I'm still seeing the "Zabbix agent item "some.item" on host "SOME HOST" failed: first network error, wait for 15 seconds" messages in my logs. The items it's trying to get are of all kinds; I'll list some of them here to give you an idea:
        • agent.ping
        • system.users.num
        • vfs.fs.size[/,pfree]
        • system.cpu.util[,system]
        • net.if.in[eth0]
        • vm.memory.size[available]
        • system.localtime
        And more. It is important to note that it's not only one host: many different hosts with many different items are failing. Even checks from the Docker container where the proxy is running to the agent on the Linux host where Docker is installed get the same errors!
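To check whether the failures cluster on particular hosts or items, the proxy log can be tallied. This is a minimal sketch: it assumes only the "first network error" message format quoted above, and the host names in the sample are hypothetical.

```python
import re
from collections import Counter

# Matches the proxy log message quoted in this thread.
LINE_RE = re.compile(
    r'item "(?P<item>.+?)" on host "(?P<host>.+?)" failed: first network error'
)


def tally_network_errors(lines):
    """Count 'first network error' messages per host and per item key."""
    by_host = Counter()
    by_item = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            by_host[match.group("host")] += 1
            by_item[match.group("item")] += 1
    return by_host, by_item


# Hypothetical sample lines in the format from the proxy log.
sample = [
    'Zabbix agent item "agent.ping" on host "web01" failed: first network error, wait for 15 seconds',
    'Zabbix agent item "system.users.num" on host "web01" failed: first network error, wait for 15 seconds',
    'Zabbix agent item "agent.ping" on host "db01" failed: first network error, wait for 15 seconds',
    'some unrelated log line',
]
hosts, items = tally_network_errors(sample)
print(hosts.most_common())
print(items.most_common())
```

An even spread across hosts and items would point at the proxy side or at something common to all agents. A few hosts dominating the counts would point at those hosts.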


        • Markku
          Senior Member
          Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
          • Sep 2018
          • 1782

          #5
          I have to say that in my environments I've never had more than a couple of hosts with passive agents, I only use active agents, so I don't have similar experiences. In my systems the visual Queue list stays practically zero all the time. I do have lots of ping and SNMP polling on the server (no proxy), no problems with them. About 250 NVPS is the biggest system I have.

          How many NVPS does the proxy have?

          Hopefully someone else can share some experiences with more passive agents, what are the practical limits. The recommendation always is to use active agents if at all possible.

          Markku


          • usbpc
            Junior Member
            • Jan 2020
            • 9

            #6
            Is it noted somewhere that active agents are better? All my agents are already set up with the settings needed for active items, so I can switch them over without much work.

            I'm completely self taught in regards to zabbix, and I assumed, that the pre-made templates shipped with zabbix follow best practice, most of those are passive afaik. But I don't think that this many errors are normal either way.

            My VPS value under Administration > Proxies is just about 175 for my biggest proxy. But I also had it happen with a brand new proxy with just one agent: an agent on the same host, but outside of Docker, with just the proxy template applied.
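As a rough cross-check of a figure like that, the expected values per second can be estimated from the item update intervals. This is a sketch under a simplifying assumption, not how Zabbix itself computes the number: every item is taken to deliver exactly one value per update interval, and the host and item counts are hypothetical.

```python
def estimated_nvps(update_intervals_seconds):
    """Sum of 1/interval over all items: expected new values per second."""
    return sum(1.0 / s for s in update_intervals_seconds if s > 0)


# Hypothetical example: 50 hosts, each with 100 items polled every 60 s
# and 20 items polled every 300 s.
intervals = [60] * (50 * 100) + [300] * (50 * 20)
print(round(estimated_nvps(intervals), 1))
```

If passive checks are fetched one value per TCP connection, as the discussion in this thread suggests, the same figure is also roughly the number of connections per second the pollers have to open.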

            I'm getting the feeling that the problem is with my proxies being in Docker containers.


            • Markku
              Senior Member
              Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
              • Sep 2018
              • 1782

              #7
              I would recommend reading https://www.packtpub.com/networking-...-third-edition. It is a commonly mentioned principle that having Zabbix schedule the polling of agents is not as good as having the agents send the data by themselves (= active agents).

              About the bundled templates, they are usually really just examples, they are not necessarily tuned for large environments. Your monitoring requirements can be very different from those configured in the templates. Which items you collect from your agents and how often, that should be determined by your business (or technical) needs.

              I believe the new templates in https://git.zabbix.com/projects/ZBX/...owse/templates are better than usually included in packages, but I haven't yet checked them.

              I'll just say that 175 NVPS per proxy is already quite big, just to get someone else to correct me and to tell more about agent usage recommendations

              Markku


              • Markku
                Senior Member
                Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                • Sep 2018
                • 1782

                #8
                And yeah I haven't run Zabbix in containers so cannot comment on those.

                Markku


                • usbpc
                  Junior Member
                  • Jan 2020
                  • 9

                  #9
                  What does that book offer in addition to the official documentation?

                  I ask because I've read through the complete documentation to try to understand Zabbix. I've also written a loadable module to monitor one of our systems, and have already been using the JSON-RPC API to combine data in ways the Zabbix web interface doesn't easily allow.

                  I feel like with all that I have a pretty good grasp on zabbix, I'm just lacking the experience to know where my performance problems are coming from.


                  • Markku
                    Senior Member
                    Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                    • Sep 2018
                    • 1782

                    #10
                    I feel you usually get some extra from the books (compared to the online docs), like comments and solutions based on real-life experiences from the writers. Btw there is also a book about Zabbix performance tuning. The downside for the books is that they are not necessarily written for the latest available Zabbix version.

                    If you are able to easily get the errors with proxy in container with even only one agent, testing with a normal VM would be beneficial in finding the cause for the problems. But anyway, active agent connections will also transfer more item data in each connection, thus lowering the load getting the item data in.

                    Markku


                    • usbpc
                      Junior Member
                      • Jan 2020
                      • 9

                      #11
                      Markku wrote: "But anyway, active agent connections will also transfer more item data in each connection, thus lowering the load getting the item data in."

                      Oh, that's interesting. Looking through the docs for 4.4 it seems that Agent 2.0 will also be more efficient: https://www.zabbix.com/documentation...oncepts/agent2

                      Yea, I should probably test around without containers, it was kinda a situation where I learned about containers and how easy it is to manage the configurations and dependencies.... and then I just applied it to everything else I did.

                      Because using docker makes setting up new proxies super easy, and it will be something I'll do quite often for my company.


                      • usbpc
                        Junior Member
                        • Jan 2020
                        • 9

                        #12
                        Just looking around a bit I have found something that might be the problem: https://tech.xing.com/a-reason-for-u...r-abd041cf7e02

                        I'll use wireshark and take some network dumps to see if that could be the culprit.


                        • usbpc
                          Junior Member
                          • Jan 2020
                          • 9

                          #13
                          Now I finally got some time to go check it out. But the Problem is different from what I thought and also different from when I last looked at it... maybe I also just looked at it wrong the last time, idk.

                          What I have found is that the error I'm getting now looks the same until I enable more debugging. But instead of
                          Code:
                          Get value from agent failed: cannot connect to [[{IP of SOME HOST}]:10050]: [4] Interrupted system call
                          I now get:

                          Code:
                          Get value from agent failed: cannot connect to [[{IP of SOME HOST}]:10050]: [111] Connection refused
                          Digging a bit into a network dump I clearly see, that all communication has been working as intended, but the Agent decided it randomly didn't want the connection:

                          Code:
                          SYN ->
                          RST, ACK <-
                          I see those packets on both the Docker interface and the host interface, so NAT doesn't seem to be my problem. But somehow some of the agents, or the hosts the agents are installed on, refuse the connection. Not always, just sometimes.
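A SYN answered by RST, ACK is what a client sees when nothing is listening on the target port at that moment. The sketch below reproduces the same outcome outside Zabbix with a plain TCP connect. The function name and the local demonstration are illustrative only; against a real agent the target would be the agent host and port 10050.

```python
import errno
import socket


def probe(host, port, timeout=3.0):
    """Try one TCP connect; return 'ok', 'timeout', or the errno name of the failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc))


# Local demonstration on an ephemeral port.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
print(probe("127.0.0.1", port))  # a listener is present
server.close()
print(probe("127.0.0.1", port))  # nothing listens on the port any more
```

Running such a probe in a loop against an affected agent, with timestamps, would show whether the refusals come in short bursts. Bursts would suggest the agent process is briefly gone rather than the network dropping packets.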


                          • usbpc
                            Junior Member
                            • Jan 2020
                            • 9

                            #14
                            I found the problem that I had now.

                            It was my own stupidity: I had configured the agents wrong. Systemd expected a PID file at
                            Code:
                            /run/zabbix/zabbix_agentd.pid
                            I, however, had a config file copied from somewhere that wrote the PID file to
                            Code:
                            /tmp/zabbix_agentd.pid

                            That resulted in systemd killing the agent processes after some timeout and then restarting them. If the proxy tried to get a value in the short time window where the Zabbix agent was not actually running, it would get a network error, which gave me the log message "[111] Connection refused".
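A mismatch like this can be checked mechanically by comparing the PIDFile= setting of the systemd unit with the PidFile= parameter of the agent config. The sketch below assumes both files use simple key=value lines. The file excerpts are hypothetical and reuse only the two paths mentioned above.

```python
def read_setting(text, key):
    """Return the value of the last 'key=value' line, skipping blank lines and comments."""
    value = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith(";"):
            continue
        name, sep, rest = line.partition("=")
        if sep and name.strip() == key:
            value = rest.strip()
    return value


def pid_files_match(unit_text, agent_conf_text):
    """Compare systemd's PIDFile= with the agent's PidFile=; return (match, unit, agent)."""
    unit_pid = read_setting(unit_text, "PIDFile")
    agent_pid = read_setting(agent_conf_text, "PidFile")
    match = unit_pid is not None and unit_pid == agent_pid
    return match, unit_pid, agent_pid


# Hypothetical excerpts; only the two paths come from this thread.
unit_text = """
[Service]
Type=forking
PIDFile=/run/zabbix/zabbix_agentd.pid
"""
agent_conf_text = """
# PID file location
PidFile=/tmp/zabbix_agentd.pid
"""
print(pid_files_match(unit_text, agent_conf_text))
```

On a real host the two texts would be read from the installed unit file and from zabbix_agentd.conf. Their locations vary by distribution and package, so they are left out here.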

                            I have no idea where the "[4] Interrupted system call" error message came from, or why I don't get it anymore. If it comes back I'll investigate further.
