**delija91p** · 29-07-2015, 21:25

I was having the same exact issue this whole week and last week. Something that seemed to help me out was increasing the Timeout parameter in zabbix_server.conf from the default 3 to 20. Ever since then, I haven't had any false triggers like that. Give that a shot!

**tchjts1** · 30-07-2015, 19:46

As you are adding new hosts and monitoring more and more items, are you also adjusting your settings in zabbix_server.conf to optimize allocated resources? This post can help you out: https://www.zabbix.com/forum/showthread.php?t=47781

Additionally, I personally would not set my "Agent unreachable" alert for such a small value as 2 minutes. It may work a little smoother for you if you go with 5 minutes.

**phpclub** · 02-08-2015, 13:27

Thanks for the answers

1. I haven't tried changing the Timeout setting; it didn't seem to make sense to me to change it, but I will try it out.

2. I am currently alerting after 3 minutes plus 90 seconds of escalation time, so it's not that.
BUT, I do want alerts after 2 minutes, otherwise Zabbix is of no use.
I need alerts in real time, not after the client calls.

3. The number of hosts/NVPS is currently steady and is not expected to grow soon.
The settings in the Zabbix config are optimized as far as I know and can see. I am not using pollers, only trappers (99% active agents); the trapper count is set to 50 and they are about 15-30% busy on average, as someone suggested. I have also tried values of 100, 150 and even 200, and it did not make any difference.

BUT, I can say that the frequency of these failures has decreased: it used to happen every 1-2 days, and today's failure was the first in 4 days.
For now I attribute this either to pure luck or to loosening the iptables rules.

Some more information from further tests:

1. I thought it might have something to do with networking, so I've set iptables to allow full communication between both VPSs. It doesn't seem to change anything.

2. I have been running ping tests from 3 locations to the VPS. From time to time a ping fails, but only a single ping, which should not cause problems given the 4.5-minute delay before an alert.

I just encountered another failure, and ping stayed the same throughout it, so this rules out network connectivity problems.

3. From previous tests I can tell there is no change in server load.
I can access the web interface and SSH with no problems during the failure, and there are no metric spikes.

4. I managed to isolate a Zabbix server metric that indicates when this problem occurs.
New values per second is normally around 120-200+.
When the problem happens, it drops below 100 (30, 50, 80).

So it's not just that the load doesn't change; one could say the load goes DOWN, because new values per second drops below 100.

To avoid huge amounts of email, I've set the trigger to also check the "new values per second" value and not fire if it is below 100.

The "down" time is about 10-20 minutes, and then everything goes back to normal without my touching anything anywhere.

Thanks again.

Zabbix agent logs:

10604:20150802:113600.767 active check data upload to [server.xxx:10051] started to fail ([connect] cannot connect to [[server.xxx]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
10604:20150802:113612.214 active check data upload to [server.xxx:10051] is working again
10604:20150802:113657.677 active check data upload to [server.xxx:10051] started to fail ([connect] cannot connect to [[server.xxx]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
10604:20150802:113732.850 active check data upload to [server.xxx:10051] is working again
10604:20150802:113754.546 active check data upload to [server.xxx:10051] started to fail ([connect] cannot connect to [[server.xxx]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
10604:20150802:114148.630 active check data upload to [server.xxx:10051] is working again
10604:20150802:114210.331 active check data upload to [server.xxx:10051] started to fail ([connect] cannot connect to [[server.xxx]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
10604:20150802:114254.149 active check data upload to [server.xxx:10051] is working again
10604:20150802:114326.034 active check data upload to [server.xxx:10051] started to fail ([connect] cannot connect to [[server.xxx]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
10604:20150802:114331.369 active check data upload to [server.xxx:10051] is working again
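
To see how long each interruption actually lasts, the "started to fail" / "is working again" pairs in an agent log like the one above can be paired up and timed. This is only a rough sketch: it assumes the `PID:YYYYMMDD:HHMMSS.mmm message` line layout seen in this excerpt, and the function name `parse_outages` is made up for illustration, not part of Zabbix.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the excerpt: PID:YYYYMMDD:HHMMSS.mmm message
LINE_RE = re.compile(r"^\s*\d+:(\d{8}):(\d{6})\.(\d{3})\s+(.*)$")


def parse_outages(lines):
    """Pair each 'started to fail' line with the next 'is working again' line.

    Returns a list of (start, end, duration_in_seconds) tuples.
    """
    outages = []
    fail_start = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not follow the assumed layout
        date, clock, millis, message = match.groups()
        ts = datetime.strptime(f"{date} {clock}.{millis}", "%Y%m%d %H%M%S.%f")
        if "started to fail" in message:
            # Keep the earliest failure if several arrive before a recovery
            if fail_start is None:
                fail_start = ts
        elif "is working again" in message and fail_start is not None:
            outages.append((fail_start, ts, (ts - fail_start).total_seconds()))
            fail_start = None
    return outages
```

Feeding it the lines of the agent log file (for example `parse_outages(open(path))`, with `path` pointing at wherever the agent writes its log) gives one tuple per interruption, which makes it easy to compare the outage lengths against the trigger delay.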

Server throws Zabbix agent unreachable at random
