
**moses.moore** · 18-05-2015, 22:15

I've already tried increasing the number of StartAgents= in the zabbix_agent.conf. And I'm using a bogus key for requests, so I know increasing Timeout= won't help: the bogus key doesn't do any work, so the agent should just respond immediately with a 'not supported' error message.

This problem of getting no answer is happening with bogus keys and with genuine keys for requests; I'm just using the bogus keys because they're easier to find in the logfiles when I'm diagnosing.

**Atsushi** · 19-05-2015, 03:44

If you want to get a value from the Zabbix agent, please use the zabbix_get command.

ex.
$ zabbix_get -s <host ip or name> -p <port no> -k agent.version

**moses.moore** · 19-05-2015, 07:10

> use zabbix_get

Yeah, I already know that, and I already tried that. I was getting no response when I used zabbix_get, not even an error message. So I resorted to using telnet so that I could see the error message, or whether I was getting bad data as an answer. zabbix_get won't tell me if I get bad data as an answer, or if the connection was reset by my side, or reset by the far side.

zabbix_get was not helpful in diagnosing this problem; I had to resort to using something lower-level that wouldn't discard the information I need to find out what's going on.
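For anyone who wants to repeat this kind of low-level test without telnet, here is a minimal sketch in Python. It assumes the passive-check wire format as I understand it for 2.x agents: the request is the plain key followed by a newline, and the reply is `ZBXD\x01`, an 8-byte little-endian length, then the payload. The names `probe` and `decode_reply` are made up for this example, and the format should be checked against the protocol documentation for your agent version.

```python
import socket
import struct

ZBX_HEADER = b"ZBXD\x01"  # assumed signature plus version byte of an agent reply


def decode_reply(raw: bytes) -> str:
    """Classify the raw bytes read from the agent socket."""
    if not raw:
        return "closed without data"
    if not raw.startswith(ZBX_HEADER):
        return "unexpected data: %r" % raw[:64]
    if len(raw) < len(ZBX_HEADER) + 8:
        return "truncated header"
    # 8-byte little-endian payload length follows the 5-byte header
    (length,) = struct.unpack("<Q", raw[5:13])
    payload = raw[13:13 + length]
    if len(payload) < length:
        return "truncated payload: %r" % payload
    return "reply: " + payload.decode("utf-8", "replace")


def probe(host, port=10050, key="agent.version", timeout=3.0):
    """Send one passive-check request and report exactly what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(key.encode() + b"\n")
            chunks = []
            while True:
                chunk = sock.recv(4096)
                if not chunk:  # far side closed the connection cleanly
                    break
                chunks.append(chunk)
    except ConnectionResetError:
        return "connection reset by peer"
    except socket.timeout:
        return "timed out"
    except OSError as exc:
        return "socket error: %s" % exc
    return decode_reply(b"".join(chunks))
```

Unlike zabbix_get, this keeps "closed without data", "connection reset by peer" and "unexpected data" apart, which is the distinction that matters for this problem.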

**Atsushi** · 19-05-2015, 08:11

The Zabbix agent and the Zabbix server communicate using a proprietary protocol.
You can test it with telnet, but then you are only testing at the TCP/IP level.

Most of the time, the cause of this kind of network error is network configuration issues or a timeout when getting the value.

Please use the zabbix_get command to check how long the response takes to come back.
The default timeout between Zabbix server and agent is 3 seconds.
If you want a longer timeout, you can change Timeout in zabbix_server.conf and zabbix_agentd.conf.

**moses.moore** · 19-05-2015, 17:09

> using a proprietary protocol

TCP/IP is not a proprietary protocol.

> is the network configuration issues or timeout for get value

I've already proved it is not network configuration by communicating using alternate tools for identical sessions. And the problems are intermittent, despite no changes in network configuration.

There should be no timeout for bad keys; the zabbix agent should respond with an error message immediately. When error responses arrive, they arrive before 3 seconds elapse, and when connections are dropped, it happens before 3 seconds elapse.

Before you ask: This problem happens with correct keys and with incorrect keys. I'm using incorrect keys for testing to make sure that timeout **isn't** part of the problem.
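To back up the "before 3 seconds" claim with numbers, the test can be repeated and each attempt timed. This is only a sketch: `run_trials` and `summarize` are hypothetical helper names, and `probe_fn` stands for whatever zero-argument function performs one request and returns a short outcome string.

```python
import time
from collections import Counter


def run_trials(probe_fn, attempts=20):
    """Call probe_fn repeatedly, recording (outcome, elapsed seconds) per attempt."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        outcome = probe_fn()
        results.append((outcome, time.monotonic() - start))
    return results


def summarize(results, timeout=3.0):
    """Count outcomes, split by whether they finished before the timeout."""
    counts = Counter()
    for outcome, elapsed in results:
        bucket = "before timeout" if elapsed < timeout else "at/after timeout"
        counts[(outcome, bucket)] += 1
    return counts
```

If the dropped connections all land in the "before timeout" bucket, the Timeout= setting cannot be what is cutting them off.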

EDIT: before someone asks, two of the machines I'm having this problem with have load averages <0.1, so it's not as if the zabbix agent doesn't have enough resources to do a key lookup.

What can I do to investigate the problem further? Should I resort to running zabbix_agent inside strace and monitor every system call? Is there a way to get more debugging information than DebugLevel=4 ?

**kloczek** · 19-05-2015, 18:43

Originally posted by moses.moore

I've already tried increasing the number of StartAgents= in the zabbix_agent.conf.

From comment in configuration file:

Code:

### Option: StartAgents
#       Number of pre-forked instances of zabbix_agentd that process passive checks.
#       If set to 0, disables passive checks and the agent will not listen on any TCP port.

Do you see the word "passive" above?
If you switch to active monitoring you can set this variable to
StartAgents=0 (I'm using that setting in my mid-scale environment with about 150k items and 2.7k NVPS).
Using an active agent setup, active proxies and active items is the only way to get the best possible scalability of the whole zabbix monitoring infrastructure once you are past the point where someone is considering changing the StartAgents agent setting.

Passive monitoring with passive items ("zabbix agent" instead of "zabbix agent (active)") works only up to a relatively small scale. After that, fiddling with things like StartAgents only delays some unavoidable changes. Sooner or later the cost of avoiding the switch to active monitoring becomes completely unacceptable. It is only a matter of time, and of how quickly the list of monitored items/hosts grows, before you are forced to make those (very easy) changes.
Remember that each proxy or server thread keeps its own connection to the database backend.
In my case I have only 75 constantly active connections to the main database. Why? Because I'm not using passive items at all. The number of such connections would be even lower if I were not forced to use passive connections to a few proxies.
More threads waiting in the run queue -> a higher probability that on a context switch to the next thread everything that was in the CPU caches has to be dropped, and the CPU is forced to wait a few hundred cycles for pages to be delivered from RAM. In such cases you will see the strange effect that effective CPU usage is relatively low but at the same time the zabbix server or proxy is slow.

**moses.moore** · 19-05-2015, 19:05

> if you are thinking of increasing the StartAgents= setting, you should seriously think about using active zabbix_agents instead of passive zabbix_agents

I think you are right, and I will do that eventually. But the problem I describe is with the zabbix_agent software cutting off the connection before sending an answer to a request, not a problem with the zabbix server taking too long to fetch all the information from remote agents.

**kloczek** · 19-05-2015, 19:47

Originally posted by moses.moore

> if you are thinking of increasing the StartAgents= setting, you should seriously think about using active zabbix_agents instead of passive zabbix_agents

I think you are right, and I will do that eventually. But the problem I describe is with the zabbix_agent software cutting off the connection before sending an answer to a request, not a problem with the zabbix server taking too long to fetch all the information from remote agents.

Think one more time .. try to notice that you are talking about connectivity initiated from the zabbix server/proxy side -> Ergo: it is a passive connection :P

In an active agent setup every agent asks for its configuration and sends collected data in batches. This strategy makes sense especially as the number of monitored items grows on the host where the agent is running.
In an extreme passive monitoring scenario, sending back monitored data fails only because the proxy/server is spending so much time on context switches between threads that some sessions will be dropped on passing the timeout.
In an active agent setup, no matter how many items are monitored through the agent, you are guaranteed that there will be only one connection between a given agent and the server/proxy, and that this connection is initiated from the agent side. The rate per second of such requests coming from all agents to the server/proxy will be so low that the server/proxy will have only a few active connections even with a few hundred agents.

Of course there is a hidden cost to the above: the average latency of collecting metrics and testing them against trigger definitions increases. The default proxy configuration uses
DataSenderFrequency=1, which means that the proxy pushes all data collected from agents to the server every second. On my active proxies I'm using, for example, DataSenderFrequency=10, which means that the total latency from the moment an item is sampled is DataSenderFrequency (=10s on the proxy) + BufferSend (default =5s, and I'm using default settings on the agent side) = 15s.
Only in some real-time monitoring is such a delay not acceptable, and it is always possible to separate the hosts which need alarming with lower latency and put them on a proxy with DataSenderFrequency=1 and agents with BufferSend=1.
I'm 100% sure that in 99.9% of zabbix environments even a 30s delay in testing monitoring data against alarm/trigger definitions is fully acceptable.
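The latency arithmetic above can be written out as a tiny sketch. It follows the simple additive model from this post (agent buffer delay plus proxy send interval) and ignores network and processing time; `worst_case_latency` is just an illustrative name.

```python
def worst_case_latency(data_sender_frequency: int, buffer_send: int) -> int:
    """Upper bound in seconds between sampling an item on an active agent
    and the proxy forwarding it, under the additive model described above."""
    return data_sender_frequency + buffer_send


# Settings quoted in the post: proxy DataSenderFrequency=10, agent BufferSend=5
slow_path = worst_case_latency(10, 5)
# Low-latency variant: proxy DataSenderFrequency=1, agent BufferSend=1
fast_path = worst_case_latency(1, 1)
print(slow_path, fast_path)
```

With the quoted settings this gives 15 seconds for the batched path and 2 seconds for the low-latency path.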

Most of the NOC guys sitting in front of zabbix panels are not able to react within seconds .. a few minutes is the normal "latency" generated on the "human layer" :P

Forming bigger batches of monitoring data means that when accepting the data the zabbix server can push everything to the main database backend in bigger batches, using fewer insert queries per second. As a consequence, more data can be committed to the database using fewer physical IO/s (bandwidth in bytes/s will be almost the same, but the IOs will be longer and more sequential).

Everything is a matter of keeping the whole monitoring setup in some kind of good balance. However, to spot where this GoodBalance(tm) point is, you must be aware of how everything works .. and trust me, most of the details are not obvious, which means that the typical person responsible for keeping zabbix monitoring running smoothly must digest some batches of knowledge about this "beast" which must somehow be harnessed.

**moses.moore** · 19-05-2015, 20:28

Originally posted by kloczek

Think one more time .. try to notice that you are talking about connectivity initiated from zabbix server/proxy side-> Ergo: it is passive connection :P

I'm not arguing that point; you seem to think this is an extreme passive monitoring situation, but it is not. I am examining one zabbix_agent at a time, and each of these zabbix_agents has only 50-60 items, and when I ask for only one key that is not any of these items (thus requiring no processing time) I still have the problem.

> proxy/server is spending so much time on context switches between threads that some sessions will be dropped on passing timeout.

How many times do I have to say this: The connections are dropped before the timeout expires. And I am not talking about zabbix_server, I'm talking about zabbix_agent.

Your advice is sagacious, but it is irrelevant to the problem I am having.

**kloczek** · 19-05-2015, 21:08

Originally posted by moses.moore

How many times do I have to say this: The connections are dropped before the timeout expires. and I am not talking about zabbix_server I'm talking about zabbix_agent.

Sorry, I'm giving you a kind of shorthand version of what I'm trying to tell you, while at the same time presenting some other facts that show the bigger/wider picture.

Again .. my fault.

All I can tell you is that if you switch to an active agent setup and active items, testing the agent with telnet on the agent port will no longer make any sense, because that connection will not be used to pass any monitoring or configuration data or to initiate monitoring of a given item.
The only case where this communication may still be used (with an active setup) is monitoring the agent by simple check using the agent.ping[] key.

Again .. you are trying to fix passive monitoring issues, and your diagnostic data really does show that there is some issue here, but .. this is not your real/main problem (!!!)
Your main problem is that you are using passive communication between the zabbix server/proxy and agents, and passive items.

Stop worrying about agent passive communication issues and start thinking about switching ASAP to active monitoring.

To introduce this, as a first step you must convert all your passive zabbix items ("zabbix agent" type) to active ones ("zabbix agent (active)").
The second step is switching the agent setup from using Server=<your_srv_or_prx> to ServerActive=<your_srv_or_prx>.

Zabbix agent closes connection before answering
