**dthacker** · 08-05-2015, 00:01

Take a look at this: "Unstable Responses from SSH agent". Do you have a lot of SSH agents running at short intervals?

Dave

**timbo** · 08-05-2015, 04:29

Hi skibumatbu,

Do all the Zabbix Agents run a check against a single remote host? Or does every Agent run exactly the same check as the others?

Perhaps you could set up an Agent with only the three basic Zabbix Agent checks, to see whether it fails along with all the other agents?

Anyway, I'm thinking you may have an Agent "Timeout" issue:

3 Zabbix agent (UNIX)

https://www.zabbix.com/documentation/2.4/manual/appendix/config/zabbix_agentd

The default "Timeout" setting for Zabbix Agents is 3 seconds (which apparently should not be extended).

If your Agents are all attempting to perform a check on the same server/service, and that server/service stalls (forcing the Agents to take longer than 3 seconds to respond), the Agents will not have enough time to process all requests and so will not send some data to the Zabbix Server.

Check out http://www.zabbix.org/wiki/Troubleshooting (search for "Timeout")
(The website is down at the moment, but Google has it cached)

Also note that if you set Agent debugging to level 4, it will help identify Timeout issues.

Hope this helps!

-Timbo

**skibumatbu** · 08-05-2015, 19:24

Thanks both of you for the replies... It sounds like you need a bit more detail about the size of my environment and the checks I have running from the server that failed yesterday.

First, I have 52 servers total. A roughly even Linux / Windows split... This will grow to about 600 servers (95% Linux) as I grow the Zabbix installation (away from Nagios which I have currently) and get through the usual growing pains.

Second, each server is configured with just basic monitoring for right now... Anything that can be done from the agent directly is done from the agent (active mode). None of those alerted yesterday. Linux boxes get an ssh.run item (from the Zabbix server) which runs the uptime command on the host. Both Linux and Windows get an agent.ping item. Both of those are used to identify whether the host is up and running and not somehow braindead. All are configured to be tested every minute. There are a total of 504 items. What happened yesterday affected both the ssh.run and the agent.ping items. All at the same time.
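
For a sense of scale, the figures above can be turned into a check rate. This is a rough sketch in Python, assuming the 504 items are spread evenly over the one-minute interval. The projection to 600 servers assumes the current items-per-server ratio holds, which is only a guess.

```python
# Back-of-the-envelope check rate from the numbers in this post.
ITEMS = 504            # total items today (from the post)
SERVERS = 52           # hosts monitored today (from the post)
INTERVAL_S = 60        # every item is checked once a minute (from the post)
TARGET_SERVERS = 600   # planned size (from the post)


def checks_per_second(items: float, interval_s: float) -> float:
    """Average number of checks per second, assuming an even spread."""
    return items / interval_s


current_rate = checks_per_second(ITEMS, INTERVAL_S)

# Assumption: items per server stays the same as the estate grows.
projected_items = ITEMS / SERVERS * TARGET_SERVERS
projected_rate = checks_per_second(projected_items, INTERVAL_S)

print(f"now:       {current_rate:.1f} checks/s")
print(f"projected: {projected_rate:.1f} checks/s")
```

That comes to about 8.4 checks per second today, which is a small load, so the number of items alone probably does not explain everything failing at once.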

So, given the above... To answer Timbo's thought... Is the issue a result of the agent trying to return a result? Or is it the server trying to run commands itself and failing? I can probably increase the timeout, just not sure if it will help in this case.

Thanks.

**timbo** · 11-05-2015, 03:02

Hi skibumatbu,

I wouldn't increase the timeouts just yet; it's best to check the agent logs first to see whether they were in fact timing out. If you increase the timeouts on the Agents, they may tie up the pollers on the server for longer, which could introduce more problems.
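
To put a rough number on the poller concern, here is a minimal sketch in Python. It assumes a simplified model in which each server-side check occupies one poller for its whole duration, and the worst case in which every check runs until the timeout. The count of 52 server-polled checks is a hypothetical figure for illustration (one per host), not something measured in this thread.

```python
def pollers_busy(checks: int, seconds_per_check: float, interval_s: float) -> float:
    """Average number of pollers kept busy.

    Simplified model (an assumption): each check ties up one poller
    for its full duration, and checks repeat once per interval.
    """
    return checks * seconds_per_check / interval_s


CHECKS = 52        # hypothetical: one server-polled check per host
INTERVAL_S = 60    # items are checked once a minute (from the thread)

# Worst case: every check blocks until the timeout expires.
for timeout_s in (3, 10, 30):
    busy = pollers_busy(CHECKS, timeout_s, INTERVAL_S)
    print(f"Timeout={timeout_s:>2}s -> about {busy:.1f} pollers busy")
```

In this model the load scales linearly with the timeout, so a stall on the monitored hosts ties up ten times as many pollers at 30 seconds as at 3.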

Speaking of pollers, have you checked the Zabbix Internal Checks Items?

The most common internal checks are included in the "Template App Zabbix Server" template, which (I think) is added to your "Zabbix server" host on install.

Anyway, the template has a number of Items, Triggers and Graphs. Check out the following graphs:
Zabbix cache usage, % free
Zabbix data gathering process busy %
Zabbix internal process busy %
Zabbix server performance

Hope this helps!

-Timbo

**skibumatbu** · 11-05-2015, 16:04

Thanks Timbo... I wasn't using the template before, but I turned it on this morning. If this does happen again (and it should since we haven't really changed anything other than turning on debug logging on the server) I'll look at the graphs...

So far the graphs aren't showing any issues. Nor are any of the other items being monitored in the template.

Every agent failing at the same time
