**Markku** · 16-01-2020, 18:42

I would recommend monitoring the Zabbix server and Zabbix proxies with the official Zabbix templates. They provide statistics for Zabbix internal metrics, like how much CPU each poller type consumes. Looking at those statistics may give you some hints about the cause of the errors.

For example, in one of my environments today, a non-redundant WLC (wireless LAN controller) was down, which caused the 30+ discovered access point hosts to stop responding to SNMP. That caused the Zabbix unreachable poller to take 100% CPU for the duration of the WLC outage (with the default StartPollersUnreachable=1 setting). Initially the normal poller also went to 100% CPU, which caused the few passive agents in the system to go "unreachable" for a moment, and network errors were shown in the logs. Based on these Zabbix internal statistics I could conclude that there was actually no network problem at all for those passive hosts; the messages were caused by the poller being too heavily loaded for a moment.

Markku

**usbpc** · 16-01-2020, 23:41

Thank you very much for the detailed response!

Unfortunately I've already done that and adjusted the settings accordingly. When I started looking into it, that was part of the problem, but I've increased the number of pollers so the data pollers are around 55% busy in normal operation. I've also increased the number of Unreachable Pollers.

This is also happening while all devices are working completely normally. There are no problems; it's just that Zabbix has connection problems, sometimes.

Let me try to describe the Problem again in a bit more detail:

The Zabbix Server itself doesn't connect to any agents; it is only connected to proxies. The Proxies are doing all the connecting to agents.
All the items that I'm monitoring on all hosts work and return values when I use zabbix_get. They also work ~95% of the time when the proxies try to get the items. Sometimes, however, a Proxy has a problem getting an item once, resulting in the error:

Zabbix agent item "some.item" on host "SOME HOST" failed: first network error, wait for 15 seconds

With debug-level logging turned on, it also produces:

Get value from agent failed: cannot connect to [[{IP of SOME HOST}]:10050]: [4] Interrupted system call

This happens quite regularly, resulting in the Problems described above. It works well enough that Zabbix is still useful, but not perfectly.
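For reference, the number in square brackets in that debug line is the operating system's errno value. Here is a minimal Python sketch to translate such a number into its symbolic name, assuming it is run on the same kind of OS as the proxy (errno numbering is platform-specific). `describe_errno` is a hypothetical helper name, not something Zabbix provides.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for a numeric errno such as the [4] in the log."""
    # errno.errorcode maps numbers to symbolic names on the current platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# 4 is the value from the proxy debug log quoted above.
print(describe_errno(4))
```

On Linux, 4 resolves to EINTR, which generally means a signal arrived while the system call was in progress.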

**usbpc** · 17-01-2020, 10:04

Just to give an idea of what I have done so far:

1. Saw a lot of "Zabbix agent item "some.item" on host "SOME HOST" failed: first network error, wait for 15 seconds" messages in the Proxy logs and got "Zabbix agent on {HOST.NAME} is unreachable for 5 minutes" warnings from the "Template App Zabbix Agent".
2. Checked CPU/RAM/Disk/Network of the machine where the proxy is running.
3. Used the Zabbix templates to monitor Proxy/Server health.
   1. Interesting to note here is that the missing items are reported on the Server by the templates, although the Server is not directly connecting to any clients. Also, in the "Administration > Queue" overview, the items are correctly shown on the Proxies that are monitoring them.
4. Got more detailed logs using "zabbix_server --runtime-control log_level_increase" to figure out more closely what is going wrong.
   1. This got me the more detailed error message "Get value from agent failed: cannot connect to [[{IP of SOME HOST}]:10050]: [4] Interrupted system call".

Those are all the troubleshooting steps I've taken so far. I've also changed some items from being passive checks to active agent checks, and that has for now reduced the errors to the point where the unreachable Problem messages are meaningful again.

But this still leaves the root problem unsolved. I'm still seeing the "Zabbix agent item "some.item" on host "SOME HOST" failed: first network error, wait for 15 seconds" messages in my logs, and the items it's trying to get are of all different kinds. I'm going to list some of them here so you get an idea:

agent.ping
system.users.num
vfs.fs.size[/,pfree]
system.cpu.util[,system]
net.if.in[eth0]
vm.memory.size[available]
system.localtime

And more. It is important to note that it's not only one host: many different hosts with many different items are failing. Even the connection from the Docker container where the Proxy is running to the agent on the Linux host where Docker is installed gets the same errors!

**Markku** · 17-01-2020, 19:04

I have to say that in my environments I've never had more than a couple of hosts with passive agents, I only use active agents, so I don't have similar experiences. In my systems the visual Queue list stays practically zero all the time. I do have lots of ping and SNMP polling on the server (no proxy), no problems with them. About 250 NVPS is the biggest system I have.

How many NVPS does the proxy have?

Hopefully someone else can share some experiences with more passive agents, what are the practical limits. The recommendation always is to use active agents if at all possible.

Markku

**usbpc** · 17-01-2020, 21:07

Is it noted somewhere that active agents are better? Because all my agents are set up with the settings ready for Active Items, I can just switch them over without much work.

I'm completely self-taught in regard to zabbix, and I assumed that the pre-made templates shipped with zabbix follow best practice; most of those are passive afaik. But I don't think that this many errors are normal either way.

My VPS value under Administration > Proxies is just about 175 for my biggest Proxy. But I also had it happen with a brand new Proxy with just one agent, an agent on the same host but outside of docker, with just the Proxy template applied.
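For anyone wanting to sanity-check such a figure: the required rate of new values per second is roughly the sum of 1/interval over all enabled items. A minimal sketch of that arithmetic follows. The item counts and the 60-second interval are made-up numbers chosen only to land near 175, not my real inventory, and `estimate_nvps` is a hypothetical helper.

```python
def estimate_nvps(intervals_seconds):
    """Estimate new values per second as the sum of 1/interval over all items."""
    return sum(1.0 / seconds for seconds in intervals_seconds if seconds > 0)


# Hypothetical inventory: 100 hosts with 105 items each, all polled every 60 s.
intervals = [60] * (100 * 105)
print(round(estimate_nvps(intervals), 1))  # about 175
```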

I'm getting the feeling that the problem is with my Proxies being in Docker Containers.

**Markku** · 17-01-2020, 21:27

I would recommend reading https://www.packtpub.com/networking-...-third-edition. It is a commonly mentioned principle that having Zabbix schedule the polling for agents is not as good as having agents send the data by themselves (= active agents).

About the bundled templates, they are usually really just examples, they are not necessarily tuned for large environments. Your monitoring requirements can be very different from those configured in the templates. Which items you collect from your agents and how often, that should be determined by your business (or technical) needs.

I believe the new templates in https://git.zabbix.com/projects/ZBX/...owse/templates are better than usually included in packages, but I haven't yet checked them.

I'll just say that 175 NVPS per proxy is already quite big, just to get someone else to correct me and to tell more about agent usage recommendations

Markku

**Markku** · 17-01-2020, 21:28

And yeah I haven't run Zabbix in containers so cannot comment on those.

Markku

**usbpc** · 17-01-2020, 22:23

What does that book offer in addition to the official documentation?

Because I've read through the complete documentation to try and understand zabbix. I've also written a loadable module to monitor one of our systems, and have already been using the jsonrpc api in order to combine data in ways the zabbix web interface doesn't easily allow.

I feel like with all that I have a pretty good grasp on zabbix, I'm just lacking the experience to know where my performance problems are coming from.

**Markku** · 17-01-2020, 22:47

I feel you usually get some extra from the books (compared to the online docs), like comments and solutions based on real-life experiences from the writers. Btw there is also a book about Zabbix performance tuning. The downside for the books is that they are not necessarily written for the latest available Zabbix version.

If you can easily reproduce the errors with a proxy in a container and only one agent, testing with a normal VM would help in finding the cause of the problems. But anyway, active agent connections will also transfer more item data in each connection, thus lowering the load getting the item data in.
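A rough back-of-the-envelope sketch of that difference, under stated assumptions: with passive checks the poller opens one TCP connection per collected value, while an active agent batches values and sends them when its buffer fills or when its send interval elapses. The defaults used here (5 seconds, 100 values) are what I understand the agent's BufferSend and BufferSize defaults to be, so check your own config. The function names and the 100-agent figure are hypothetical, and the model ignores the periodic refresh of the active check list.

```python
def passive_connections_per_second(nvps):
    """Passive checks: assume one TCP connection per collected value."""
    return float(nvps)


def active_connections_per_second(agents, nvps, buffer_send=5, buffer_size=100):
    """Active checks: each agent sends a batch when the timer or the buffer says so."""
    per_agent_values = nvps / agents
    # Send at least every buffer_send seconds, or sooner if the buffer fills up.
    per_agent_sends = max(1.0 / buffer_send, per_agent_values / buffer_size)
    # An agent cannot send more batches than it has values.
    return agents * min(per_agent_values, per_agent_sends)


# Hypothetical example: 175 NVPS spread evenly over 100 agents.
print(passive_connections_per_second(175))      # 175.0
print(active_connections_per_second(100, 175))  # roughly 20
```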

Markku

**usbpc** · 17-01-2020, 23:01

> But anyway, active agent connections will also transfer more item data in each connection, thus lowering the load getting the item data in.

Oh, that's interesting. Looking through the docs for 4.4 it seems that Agent 2.0 will also be more efficient: https://www.zabbix.com/documentation...oncepts/agent2

Yea, I should probably test around without containers, it was kinda a situation where I learned about containers and how easy it is to manage the configurations and dependencies.... and then I just applied it to everything else I did.

Because using docker makes setting up new proxies super easy, and it will be something I'll do quite often for my company.

**usbpc** · 18-01-2020, 23:23

Just looking around a bit I have found something that might be the problem: https://tech.xing.com/a-reason-for-u...r-abd041cf7e02

I'll use wireshark and take some network dumps to see if that could be the culprit.

**usbpc** · 03-02-2020, 11:14

Now I finally got some time to go check it out. But the Problem is different from what I thought and also different from when I last looked at it... maybe I also just looked at it wrong the last time, idk.

What I have found is that the error I'm getting now looks the same until I enable more debugging. But instead of

Code:

Get value from agent failed: cannot connect to [[{IP of SOME HOST}]:10050]: [4] Interrupted system call

I now get:

Code:

Get value from agent failed: cannot connect to [[{IP of SOME HOST}]:10050]: [111] Connection refused

Digging a bit into a network dump I clearly see, that all communication has been working as intended, but the Agent decided it randomly didn't want the connection:

Code:

SYN ->
RST, ACK <-

I see those properly on both the Docker interface and the host interface, so NAT doesn't seem to be my problem. But somehow some of the Agents, or the hosts where the Agents are installed, refuse the connection. But not always, just sometimes.
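To catch an intermittent refusal like this without waiting for the proxy to hit it, a small probe can attempt a plain TCP connect at a fixed interval and log the result. This is only a sketch: `probe` and `watch` are hypothetical helper names, the address in the usage comment is a placeholder, and it only tests whether the port accepts a connection, not the Zabbix protocol itself.

```python
import errno
import socket
import time


def probe(host, port, timeout=3.0):
    """Try one TCP connect; return 'ok', 'timeout', or the symbolic errno name."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # A RST in reply to the SYN shows up here as ECONNREFUSED.
        return errno.errorcode.get(exc.errno, str(exc))


def watch(host, port, count=60, delay=1.0):
    """Probe repeatedly and print a timestamped result for each attempt."""
    for _ in range(count):
        print(time.strftime("%H:%M:%S"), probe(host, port))
        time.sleep(delay)


# Usage (placeholder address): watch("192.0.2.10", 10050)
```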

**usbpc** · 03-02-2020, 15:23

I found the problem that I had now.

It was my stupidity, configuring the Agents wrong. Systemd expected a pid file at /run/zabbix/zabbix_agentd.pid. I however had a config file copied from somewhere that wrote the pid file to /tmp/zabbix_agentd.pid.

That resulted in systemd killing the Agent processes after some timeout and then restarting them. If the proxy tried to get a value in that short time window where the zabbix agent was not actually running, it would get a network error, thus giving me the log message "[111] Connection refused".

I have no idea where the "[4] Interrupted system call" error message came from, and why I don't have it anymore. If it comes back I'll go investigate more.
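In case it helps someone else, a mismatch like this can be checked by comparing the PIDFile= line of the systemd unit with the PidFile= line of the agent config. A minimal sketch, assuming both are simple key=value files: the sample unit and config text below are hypothetical and only reuse the two paths from my case, and `find_setting` and `check_pidfile` are made-up helper names. It does not handle the case where PidFile is left unset and the agent falls back to its default.

```python
def find_setting(text, key):
    """Return the value of the last uncommented 'key=value' line, or None."""
    value = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        name, _, rest = line.partition("=")
        if name.strip() == key:
            value = rest.strip()
    return value


def check_pidfile(unit_text, agent_conf_text):
    """Return (systemd path, agent path, whether they agree)."""
    expected = find_setting(unit_text, "PIDFile")
    actual = find_setting(agent_conf_text, "PidFile")
    return expected, actual, expected is not None and expected == actual


# Hypothetical file contents using the two paths from this thread.
UNIT = "[Service]\nType=forking\nPIDFile=/run/zabbix/zabbix_agentd.pid\n"
CONF = "# PidFile=/run/zabbix/zabbix_agentd.pid\nPidFile=/tmp/zabbix_agentd.pid\n"
print(check_pidfile(UNIT, CONF))
```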

Getting a high amount of unreachable hosts and Network errors in local Network
