Zabbix agent closes connection before answering

  • moses.moore
    Junior Member
    • Dec 2014
    • 24

    #1

    Zabbix agent closes connection before answering

    This is an intermittent problem, but it's gotten worse lately, and it's happening on multiple servers.

    In the zabbix_server.log file, I'm seeing many of these messages:
    > Zabbix agent item "system.cpu.load[percpu,avg15]" on host "wings.XXXXXX" failed: first network error, wait for 15 seconds
    > resuming Zabbix agent checks on host "wings.XXXXXX": connection restored

    It fails often, many times each minute, but not every time. And this is happening to many hosts.

    While this was going on, I went to the agent and set DebugLevel=4 so I could watch, and I sent some baloney requests so I could observe responses, maybe see an error that the zabbix server isn't reporting. This is what I saw:

    root@meter:~/tastetest# (echo "goofball"; sleep 1) | telnet wings.XXXXXX 6982
    Trying 70.32.115.64...
    Connected to wings.XXXXXX.
    Escape character is '^]'.
    ZBXD&ZBX_NOTSUPPORTEDUnsupported item key.Connection closed by foreign host.
    root@meter:~/tastetest# (echo "goofball"; sleep 1) | telnet wings.XXXXXX 6982
    Trying 70.32.115.64...
    Connected to wings.XXXXXX.
    Escape character is '^]'.
    ZBXD&ZBX_NOTSUPPORTEDUnsupported item key.Connection closed by foreign host.
    root@meter:~/tastetest# (echo "goofball"; sleep 1) | telnet wings.XXXXXX 6982
    Trying 70.32.115.64...
    Connected to wings.XXXXXX.
    Escape character is '^]'.
    Connection closed by foreign host.

    That third time, the connection was closed before I got a response from the agent. I checked the zabbix_agent.log, and I saw that it received all three of the "goofball" requests, so I know it's not the zabbix server failing to deliver the request, but the agent is failing to send a response before terminating the connection.
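The telnet test above can be repeated programmatically so that an empty reply is told apart from a well-formed one. The following is a minimal sketch, not the agent's own client: it assumes the agent accepts a plain-text key terminated by a newline (as the telnet session did) and that the reply starts with a 13-byte header made of `ZBXD`, one flag byte, and an 8-byte little-endian length, which matches the `ZBXD&` prefix seen above (`&` is byte 38, the length of the error text). The function names `classify_reply` and `probe` are hypothetical.

```python
import socket
import struct

ZBX_MAGIC = b"ZBXD"
HEADER_LEN = 13  # assumed layout: 4-byte magic + 1 flag byte + 8-byte little-endian length


def classify_reply(raw):
    """Label the bytes received from the agent before the socket closed."""
    if not raw:
        return ("closed_without_reply", "")
    if len(raw) < HEADER_LEN or not raw.startswith(ZBX_MAGIC):
        return ("malformed", raw.decode("utf-8", "replace"))
    _magic, _flags, length = struct.unpack("<4sBQ", raw[:HEADER_LEN])
    body = raw[HEADER_LEN:]
    if len(body) < length:
        return ("truncated", body.decode("utf-8", "replace"))
    # The error reply separates "ZBX_NOTSUPPORTED" and the message with a NUL byte.
    text = body[:length].replace(b"\0", b" ").decode("utf-8", "replace")
    return ("ok", text)


def probe(host, port, key, timeout=3.0):
    """Send one plain-text key and read until the peer closes the connection."""
    chunks = []
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(key.encode("utf-8") + b"\n")
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
    except socket.timeout:
        return ("timeout", "")
    except ConnectionResetError:
        return ("reset_by_peer", "")
    return classify_reply(b"".join(chunks))

# Example (host and port taken from the session above):
#   print(probe("wings.XXXXXX", 6982, "goofball"))
```

A result of `closed_without_reply` corresponds to the third telnet attempt: the agent accepted the connection and closed it cleanly without writing anything, which is different from a timeout or a TCP reset.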

    31259:20150518:154535.885 listener #1 [processing request]
    31259:20150518:154535.885 Requested [goofball]
    31259:20150518:154535.885 listener #1 [waiting for connection]
    31257:20150518:154536.792 collector [processing data]
    31257:20150518:154536.792 In update_cpustats()
    31257:20150518:154536.792 End of update_cpustats()
    31257:20150518:154536.793 collector [idle 1 sec]
    31260:20150518:154537.245 listener #2 [processing request]
    31260:20150518:154537.245 Requested [goofball]
    31260:20150518:154537.245 listener #2 [waiting for connection]
    31257:20150518:154537.793 collector [processing data]
    31257:20150518:154537.793 In update_cpustats()
    31257:20150518:154537.793 End of update_cpustats()
    31257:20150518:154537.793 collector [idle 1 sec]
    31261:20150518:154538.460 listener #3 [processing request]
    31261:20150518:154538.462 Requested [goofball]
    31261:20150518:154538.463 listener #3 [waiting for connection]
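To compare the three requests in the log above, it may help to pull out how long each listener spent between `[processing request]` and `[waiting for connection]`. This is a rough sketch that assumes the `PID:YYYYMMDD:HHMMSS.mmm message` line layout shown in the excerpt; the function names are hypothetical, and the log does not record whether a reply was actually written, so the durations are only a hint.

```python
import re
from datetime import datetime

# PID:YYYYMMDD:HHMMSS.mmm message  (layout assumed from the excerpt above)
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6}\.\d{3})\s+(.*)$")


def parse_line(line):
    """Return (pid, timestamp, message) or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    pid, date, clock, msg = m.groups()
    ts = datetime.strptime(date + clock, "%Y%m%d%H%M%S.%f")
    return int(pid), ts, msg.strip()


def request_durations(lines):
    """Return (pid, key, milliseconds) for each request a listener handled."""
    open_requests = {}  # pid -> [start timestamp, requested key]
    results = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        pid, ts, msg = parsed
        if msg.endswith("[processing request]"):
            open_requests[pid] = [ts, None]
        elif msg.startswith("Requested [") and pid in open_requests:
            open_requests[pid][1] = msg[len("Requested ["):-1]
        elif msg.endswith("[waiting for connection]") and pid in open_requests:
            start, key = open_requests.pop(pid)
            millis = round((ts - start).total_seconds() * 1000)
            results.append((pid, key, millis))
    return results
```

Fed the excerpt above, listener #3 (the request that got no answer) shows about 3 ms between the two markers, while listeners #1 and #2 show 0 ms.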

    How can I get the zabbix agent to always send an answer to requests it receives, instead of this behaviour where it seems to terminate the connection before sending an answer?

    EDIT: I couldn't post this because "Too many live links/images found in your post content." despite the fact that I had zero URLs in this text. Removed any mention of the string dot-cee-oh-em.
  • moses.moore
    Junior Member
    • Dec 2014
    • 24

    #2
    I've already tried increasing the number of StartAgents= in the zabbix_agent.conf. And I'm using a bogus key for requests, so I know increasing Timeout= won't help: the bogus key doesn't do any work, so the agent should just respond immediately with a 'not supported' error message.

    This problem of getting no answer is happening with bogus keys and with genuine keys for requests; I'm just using the bogus keys because they're easier to find in the logfiles when I'm diagnosing.


    • Atsushi
      Senior Member
      • Aug 2013
      • 2028

      #3
      If you want to get a value from the Zabbix agent, please use the zabbix_get command.

      For example:
      $ zabbix_get -s <host ip or name> -p <port no> -k agent.version


      • moses.moore
        Junior Member
        • Dec 2014
        • 24

        #4
        > use zabbix_get

        Yeah, I already know that, and I already tried that. I was getting no response when I used zabbix_get, not even an error message. So I resorted to using telnet so that I could see the error message, or whether I was getting bad data as an answer. zabbix_get won't tell me if I get bad data as an answer, or if the connection was reset by my side, or reset by the far side.

        zabbix_get was not helpful in diagnosing this problem; I had to resort to using something lower-level that wouldn't discard the information I need to find out what's going on.


        • Atsushi
          Senior Member
          • Aug 2013
          • 2028

          #5
          The Zabbix agent and the Zabbix server communicate using a proprietary protocol.
          You can test it with telnet, but that only tests the TCP/IP level.

          Many of the causes of first network error, is the network configuration issues or timeout for get value.

          Please use the zabbix_get command to check how long the response takes to come back.
          The default timeout between the Zabbix server and the agent is 3 seconds.
          If you want a longer timeout, you can change Timeout in zabbix_server.conf and zabbix_agentd.conf.
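One possible way to measure how long zabbix_get takes is to wrap it in a small timing helper. This is only a sketch: it uses the `-s`, `-p` and `-k` flags from the example earlier in the thread, the helper names `zabbix_get_cmd` and `timed_run` are hypothetical, and it assumes zabbix_get is on the PATH of the machine running the script.

```python
import subprocess
import time


def zabbix_get_cmd(host, port, key):
    """Build the zabbix_get command line (flags as used earlier in the thread)."""
    return ["zabbix_get", "-s", host, "-p", str(port), "-k", key]


def timed_run(cmd, timeout=10.0):
    """Run a command and report elapsed time, exit code, and both output streams."""
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    elapsed = time.monotonic() - start
    return {
        "elapsed": elapsed,
        "returncode": proc.returncode,
        "stdout": proc.stdout.strip(),
        "stderr": proc.stderr.strip(),
    }

# Example:
#   result = timed_run(zabbix_get_cmd("wings.XXXXXX", 6982, "agent.version"))
#   print(result)
```

If the elapsed time is well under the 3-second default when the call fails, a timeout is unlikely to be the cause.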


          • moses.moore
            Junior Member
            • Dec 2014
            • 24

            #6
            > using a proprietary protocol

            TCP/IP is not a proprietary protocol.

            > is the network configuration issues or timeout for get value

            I've already proved it is not network configuration by communicating using alternate tools for identical sessions. And the problems are intermittent, despite no changes in network configuration.

            There should be no timeout for bad keys; the zabbix agent should respond with an error message immediately. When error responses arrive, they arrive before 3 seconds elapse, and when connections are dropped, it happens before 3 seconds elapse.

            Before you ask: This problem happens with correct keys and with incorrect keys. I'm using incorrect keys for testing to make sure that timeout **isn't** part of the problem.

            EDIT: before someone asks, two of the machines I'm having this problem with have load averages <0.1, so it's not as if the zabbix agent doesn't have enough resources to do a key lookup.

            What can I do to investigate the problem further? Should I resort to running zabbix_agent inside strace and monitor every system call? Is there a way to get more debugging information than DebugLevel=4 ?
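Because the failures are intermittent, one low-effort step before reaching for strace might be to repeat a single check many times and count the outcomes, so the failure rate can be compared across hosts or before and after a change. A minimal sketch follows; `tally_outcomes` and its `check_once` argument are hypothetical names, and `check_once` stands for any function that performs one request against the agent and returns a short label.

```python
import time
from collections import Counter


def tally_outcomes(check_once, attempts=100, pause=0.5):
    """Call check_once() repeatedly and count the labels it returns.

    Network-level exceptions are counted under the exception class name
    instead of aborting the run.
    """
    counts = Counter()
    for _ in range(attempts):
        try:
            label = check_once()
        except OSError as exc:
            label = type(exc).__name__
        counts[label] += 1
        time.sleep(pause)
    return counts

# Example with a hypothetical one-shot check function:
#   print(tally_outcomes(lambda: my_check("wings.XXXXXX", 6982, "goofball")))
```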
            Last edited by moses.moore; 19-05-2015, 17:38.


            • kloczek
              Senior Member
              • Jun 2006
              • 1771

              #7
              Originally posted by moses.moore
              I've already tried increasing the number of StartAgents= in the zabbix_agent.conf.
              From the comment in the configuration file:
              Code:
              ### Option: StartAgents
              #       Number of pre-forked instances of zabbix_agentd that process passive checks.
              #       If set to 0, disables passive checks and the agent will not listen on any TCP port.
              Do you see the word "passive" above?
              If you switch to active monitoring, you can set this variable to
              StartAgents=0 (I use that setting in my mid-scale environment with about 150k items and 2.7k NVPS).
              An active agent setup with active proxies and active items is the only way to get the best possible scalability out of the whole Zabbix monitoring infrastructure once you reach the point of considering changes to the agent's StartAgents setting.

              Passive monitoring with passive items ("zabbix agent" instead of "zabbix agent (active)") works only up to a relatively small scale. Beyond that, fiddling with things like StartAgents only delays unavoidable changes. Sooner or later the cost of avoiding the switch to active monitoring becomes unacceptable. How soon you are forced to make those (very easy) changes depends only on how quickly your list of monitored items/hosts grows.
              Remember that each proxy or server thread keeps its own connection to the database backend.
              In my case I have only 75 constantly active connections to the main database. Why? Because I don't use passive items at all. The number of such connections would be even lower if I were not forced to use passive connections to a few proxies.
              The more threads waiting in the run queue, the higher the probability that a context switch to the next thread drops everything that was in the CPU caches, forcing the CPU to wait a few hundred cycles for pages to be delivered from RAM. In such cases you will see a strange effect: effective CPU usage is relatively low, yet the Zabbix server or proxy is slow.
              Last edited by kloczek; 19-05-2015, 18:55.
              http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
              https://kloczek.wordpress.com/
              zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
              My zabbix templates https://github.com/kloczek/zabbix-templates


              • moses.moore
                Junior Member
                • Dec 2014
                • 24

                #8
                > if you are thinking of increasing the StartAgents= setting, you should seriously think about using active zabbix_agents instead of passive zabbix_agents

                I think you are right, and I will do that eventually. But the problem I describe is with the zabbix_agent software cutting off the connection before sending an answer to a request, not a problem with the zabbix server taking too long to fetch all the information from remote agents.


                • kloczek
                  Senior Member
                  • Jun 2006
                  • 1771

                  #9
                  Originally posted by moses.moore
                  > if you are thinking of increasing the StartAgents= setting, you should seriously think about using active zabbix_agents instead of passive zabbix_agents

                  I think you are right, and I will do that eventually. But the problem I describe is with the zabbix_agent software cutting off the connection before sending an answer to a request, not a problem with the zabbix server taking too long to fetch all the information from remote agents.
                  Think one more time .. try to notice that you are talking about connectivity initiated from zabbix server/proxy side-> Ergo: it is passive connection :P

                  With an active agent setup, every agent asks for its configuration and sends collected data in batches. This strategy makes sense especially as the number of monitored items grows on the host where the agent is running.
                  In an extreme passive monitoring scenario, sending back monitored data fails only because proxy/server is spending so much time on context switches between threads that some sessions will be dropped on passing timeout.
                  With an active agent setup, no matter how many items are monitored through an agent, you are guaranteed that there is only one connection between that agent and the server/proxy, and that this connection is initiated from the agent side. The rate of such requests arriving at the server/proxy from all agents will be so low that the server/proxy will have only a few active connections even with a few hundred agents.

                  Of course, there is a hidden cost to the above: the average latency of collecting metrics and testing them against trigger definitions increases. The default proxy configuration uses
                  DataSenderFrequency=1, which means the proxy pushes all data collected from agents to the server every second. For example, on active proxies I use DataSenderFrequency=10, which means the total latency from the moment an item is sampled is DataSenderFrequency (=10s on the proxy) + BufferSend (default =5s, and I use the default settings on the agent side) = 15s.
                  Such a delay is unacceptable only in some real-time monitoring, and it is always possible to separate the hosts that need lower-latency alarming and put them on a proxy with DataSenderFrequency=1 and agents with BufferSend=1.
                  I'm 100% sure that in 99.9% of Zabbix environments even a 30s delay in testing monitoring data against alarm/trigger definitions is fully acceptable.
                  Most of the NOC guys sitting in front of Zabbix panels are not able to react within seconds .. a few minutes is the normal "latency" generated on the "human layer" :P
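The latency arithmetic in this post can be written out as a tiny worked example. It simply adds the two buffering intervals the way the post does and makes no claim about other delays in the pipeline; the function name is hypothetical.

```python
def worst_case_latency(data_sender_frequency, buffer_send):
    """Upper bound in seconds, following the post: proxy interval + agent buffer interval."""
    return data_sender_frequency + buffer_send


# Settings described in the post: proxy pushes every 10s, agent buffers up to 5s.
print(worst_case_latency(10, 5))  # 15
# Low-latency variant suggested in the post: both set to 1.
print(worst_case_latency(1, 1))   # 2
```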

                  Forming bigger batches of monitoring data means that, on accepting the data, the Zabbix server can push everything to the main database backend in bigger batches using fewer insert queries per second. As a consequence, more data can be committed to the database using fewer physical IO/s (bandwidth in bytes/s will be almost the same, but the IOs will be longer and sequential).

                  Everything is a matter of keeping the whole monitoring setup in some kind of good balance. However, to spot where this GoodBalance(tm) point is, you must be aware of how everything works .. and trust me, most of the details are not obvious, which means that a typical person responsible for keeping Zabbix monitoring running smoothly must digest some batches of knowledge about this "beast", which must be somehow harnessed.
                  Last edited by kloczek; 19-05-2015, 19:53.


                  • moses.moore
                    Junior Member
                    • Dec 2014
                    • 24

                    #10
                    Originally posted by kloczek
                    Think one more time .. try to notice that you are talking about connectivity initiated from zabbix server/proxy side-> Ergo: it is passive connection :P
                    I'm not arguing that point; you seem to think this is an extreme passive monitoring situation, but it is not. I am examining one zabbix_agent at a time, each of these zabbix_agents has only 50-60 items, and even when I ask for a single key that is not any of these items (thus requiring no processing time) I still have the problem.

                    > proxy/server is spending so much time on context switches between threads that some sessions will be dropped on passing timeout.
                    How many times do I have to say this: The connections are dropped before the timeout expires. And I am not talking about zabbix_server; I'm talking about zabbix_agent.

                    Your advice is sagacious, but it is irrelevant to the problem I am having.


                    • kloczek
                      Senior Member
                      • Jun 2006
                      • 1771

                      #11
                      Originally posted by moses.moore
                      How many times do I have to say this: The connections are dropped before the timeout expires. And I am not talking about zabbix_server; I'm talking about zabbix_agent.
                      Sorry, I'm giving you a kind of shorthand version of what I'm trying to tell you, while at the same time presenting some other facts that give a bigger/wider picture.
                      Again .. my fault.

                      All I can tell you is that if you switch to an active agent setup and active items, testing the agent by using telnet on the agent port will not make any sense, because that connection will not be used to pass any monitoring or configuration data or to initiate monitoring of a given item.
                      The only case in which this communication may still be used (with an active setup) is monitoring the agent by a simple check using the agent.ping[] key.

                      Again .. you are trying to fix passive monitoring issues, and your diagnostic data really does show that there is some issue here, but .. this is not your real/main problem (!!!)
                      Your main problem is that you are using passive communication between Zabbix server/proxy and agents, and passive items.

                      Stop worrying about passive agent communication issues and start thinking about switching ASAP to active monitoring.

                      To introduce this, as a first step you must transform all your passive Zabbix items ("zabbix agent" type) into active ones ("zabbix agent (active)").
                      The second step is switching the agent setup from using Server=<your_srv_or_prx> to ServerActive=<your_srv_or_prx>.
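The agent-side part of the switch described here could be sketched as a small rewrite of the agent configuration text. This is an illustration only: the helper names `set_option` and `to_active_only` are hypothetical, it sets `ServerActive` and `StartAgents=0` as suggested in this thread, and it deliberately leaves any existing `Server=` line untouched. Note that, per the configuration comment quoted earlier, `StartAgents=0` stops the agent from listening on any TCP port, so passive checks and telnet tests no longer work afterwards.

```python
def set_option(lines, name, value):
    """Replace an uncommented `name=` line, or append one if none exists."""
    out = []
    found = False
    for line in lines:
        if line.strip().startswith(name + "="):
            out.append(f"{name}={value}")
            found = True
        else:
            out.append(line)
    if not found:
        out.append(f"{name}={value}")
    return out


def to_active_only(conf_text, server):
    """Return config text with ServerActive set and passive listeners disabled."""
    lines = conf_text.splitlines()
    lines = set_option(lines, "ServerActive", server)
    lines = set_option(lines, "StartAgents", "0")
    return "\n".join(lines) + "\n"

# Example:
#   with open("zabbix_agentd.conf") as fh:
#       print(to_active_only(fh.read(), "<your_srv_or_prx>"))
```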
