Every agent failing at the same time

  • skibumatbu
    Junior Member
    • Oct 2014
    • 5

    #1

    Folks, I have a weird problem I don't know how to resolve. It has happened to me twice so far: the first time on 4/30 and the second time just now. The only action I've taken so far is to enable debug logging (level 4) on the server, which I did just now.

    I have my actions configured to send me emails when things fail. All of a sudden I get swamped with emails from every one of my agents at the same time. It looks like all my agent.ping items and SSH items fail all at once and then go back to normal a few minutes later.

    Couple of other symptoms...
    * 30+ hosts are reported as having issues, but I only see a few errors in the zabbix_server.log file.
    * I see a bunch of these:
    1534:20150507:132208.184 item "rdmdxinfra03.mdx.med:ssh.run[uptime]" became not supported: Cannot request a shell
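To see how many of these entries really landed in the same minute, and on which hosts, a rough Python sketch like the one below could tally them from zabbix_server.log. The regular expression is inferred only from the single log line quoted above, and the function name `summarize_not_supported` is made up for illustration, so treat it as a starting point rather than a complete parser for the log format.

```python
import re
from collections import Counter

# Pattern inferred from the one sample line in this post:
#   PID:YYYYMMDD:HHMMSS.mmm item "host:key" became not supported: reason
LINE_RE = re.compile(
    r'^\s*(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6})\.\d{3}\s+'
    r'item "(?P<host>[^:"]+):(?P<key>[^"]+)" became not supported: (?P<reason>.*)$'
)


def summarize_not_supported(lines):
    """Count 'became not supported' entries per (minute, reason) and collect hosts."""
    per_minute = Counter()
    hosts = set()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a "became not supported" line
        hhmmss = match["time"]
        minute = f"{match['date']} {hhmmss[:2]}:{hhmmss[2:4]}"
        per_minute[(minute, match["reason"].strip())] += 1
        hosts.add(match["host"])
    return per_minute, hosts


if __name__ == "__main__":
    sample = [
        '  1534:20150507:132208.184 item "rdmdxinfra03.mdx.med:ssh.run[uptime]" '
        'became not supported: Cannot request a shell',
    ]
    counts, affected = summarize_not_supported(sample)
    for (minute, reason), n in sorted(counts.items()):
        print(minute, reason, n)
    print(sorted(affected))
```

Feeding it the real file (for example `summarize_not_supported(open(path))`, with `path` pointing at your own log) would show whether the log really covers fewer hosts than the emails did.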

    Maybe it's a limits problem? Is there a ulimit recommendation for Zabbix?
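One way to check the limits theory: on Linux, the limits that actually apply to a running process are listed in /proc/<pid>/limits, which can differ from what `ulimit` prints in an interactive shell. The sketch below parses that file's text into a dictionary. The helper name `parse_proc_limits` and the sample text are made up for illustration; the numbers are not from my server.

```python
import re


def parse_proc_limits(text):
    """Parse the text of /proc/<pid>/limits into {limit name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3:
            continue  # blank or unexpected line
        name, soft, hard = fields[0], fields[1], fields[2]
        limits[name] = (soft, hard)
    return limits


# Illustrative sample only; real values come from the running process.
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max open files            1024                 4096                 files
Max processes             31456                31456                processes
"""

if __name__ == "__main__":
    for name, (soft, hard) in parse_proc_limits(SAMPLE).items():
        print(f"{name}: soft={soft} hard={hard}")
```

To use it against the real server, read the file for one of the zabbix_server process IDs (for example the 1534 that appears at the start of the log line above) and pass its contents to the function.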

    I'm ruling out the hypervisor that this is running on and the network it is connecting to. If we were having issues there all of our applications would be complaining as well.

    Any other thoughts on what it could be?
  • dthacker
    Member
    • Feb 2014
    • 42

    #2
    Take a look at this thread: "Unstable Responses from SSH agent". Do you have a lot of SSH agents running at short intervals?

    Dave

    • timbo
      Member
      Zabbix Certified Specialist, Zabbix Certified Professional
      • Sep 2013
      • 50

      #3
      Hi skibumatbu,

      Do all the Zabbix Agents perform a check against one remote host? Or do all of the Agents perform the exact same check as another Agent?

      Perhaps you could set up an Agent with only the three basic Zabbix Agent checks, to see if it fails with all the other agents?

      Anyway, I'm thinking you may have an Agent "Timeout" issue:


      The default "Timeout" setting for Zabbix Agents is 3 seconds (which apparently should not be extended).

      If your Agents are all attempting to perform a check on the same server/service and that server/service stalls (forcing the Agents to take longer than 3 seconds to process), the Agents will not have enough time to process all requests, and thus will not send some data to the Zabbix Server.

      Check out http://www.zabbix.org/wiki/Troubleshooting (search for "Timeout")
      (The website is down at the moment, but Google has it cached)

      Also note that if you set Agent debugging to level 4, it will help identify Timeout issues.

      Hope this helps!

      -Timbo

      • skibumatbu
        Junior Member
        • Oct 2014
        • 5

        #4
        Thanks to both of you for the replies... It sounds like you need a bit more detail about the size of my environment and the checks that were running from the server when they failed yesterday.

        First, I have 52 servers total. A roughly even Linux / Windows split... This will grow to about 600 servers (95% Linux) as I grow the Zabbix installation (away from Nagios which I have currently) and get through the usual growing pains.

        Second, each server is configured with just basic monitoring for right now... Anything that can be done from the agent directly is done from the agent (active mode). None of those alerted yesterday. Linux boxes get an ssh.run item (run from the Zabbix server) which runs the uptime command on the host. Both Linux and Windows get an agent.ping item. Both of those are used to identify whether the host is up and running and not somehow braindead. All are configured to be tested every minute. There are a total of 504 items. What happened yesterday affected both the ssh.run and the agent.ping items, all at the same time.

        So, given the above... To answer Timbo's thought... Is the issue a result of the agent trying to return a result? Or is it the server trying to run commands itself and failing? I can probably increase the timeout, just not sure if it will help in this case.

        Thanks.

        • timbo
          Member
          Zabbix Certified Specialist, Zabbix Certified Professional
          • Sep 2013
          • 50

          #5
          Hi skibumatbu,

          I wouldn't increase the timers just yet, best to check the agent logs to see if they were in fact timing out. If you increase the timers on the Agents, they may tie up the pollers on the server for longer which could introduce more problems.
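To put rough numbers on the poller concern, here is a back-of-the-envelope Python sketch. It uses the 504 items at one-minute intervals you mentioned as an upper bound (your active items are not polled by the server, so the real figure is lower). The poller count of 5 and the per-check durations are assumptions for illustration only, and `poller_busy_fraction` is a made-up helper name; check StartPollers in your own zabbix_server.conf.

```python
def poller_busy_fraction(items, interval_s, avg_check_s, pollers):
    """Estimate the fraction of total poller time consumed by checks.

    A value near or above 1.0 means the pollers cannot keep up and
    checks start queueing.
    """
    checks_per_second = items / interval_s
    return checks_per_second * avg_check_s / pollers


if __name__ == "__main__":
    items = 504        # from this thread (upper bound, includes active items)
    interval_s = 60    # every minute, from this thread
    pollers = 5        # assumption: verify StartPollers on your server

    # Assumed durations: a quick check versus one that runs into a 3 s timeout.
    for avg_check_s in (0.5, 3.0):
        busy = poller_busy_fraction(items, interval_s, avg_check_s, pollers)
        print(f"avg check {avg_check_s:.1f}s -> pollers {busy:.0%} busy")
```

With these assumed inputs, checks that normally finish quickly fit within the available poller time, but if every check suddenly waits out a multi-second timeout the same pollers are oversubscribed several times over, and raising the timeout makes that worse.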

          Speaking of pollers, have you checked the Zabbix Internal Checks Items?

          The most common internal checks are included in the "Template App Zabbix Server" template, which (I think) is added to your "Zabbix server" host on install.

          Anyway, the template has a number of Items, Triggers and Graphs. Check out the following graphs:
          Zabbix cache usage, % free
          Zabbix data gathering process busy %
          Zabbix internal process busy %
          Zabbix server performance

          Hope this helps!

          -Timbo

          • skibumatbu
            Junior Member
            • Oct 2014
            • 5

            #6
            Thanks Timbo... I wasn't using the template before, but I turned it on this morning. If this does happen again (and it should since we haven't really changed anything other than turning on debug logging on the server) I'll look at the graphs...

            So far the graphs aren't showing any issues. Nor are any of the other items being monitored in the template.
