Server throws Zabbix agent unreachable at random

  • phpclub
    Junior Member
    • Jul 2015
    • 7

    #1

    Server throws Zabbix agent unreachable at random

    Hello,

    I am having a problem that, despite all my research and reading, I can't find an answer to.

    A little bit of History
    ---------
    I have been using Zabbix for more than a year, studying it day by day, learning new things, optimizing and tweaking.

    Most of that time I was running a small setup, about 50 NVPS on a VPS. I had problems, tweaked and changed things, and got it to work.

    I recently expanded my setup to about 700 hosts / 250 NVPS. I knew I had to change the VPS, and it is currently running on 2 SSD VPSs: 1 for the Zabbix server, 1 for the Percona MySQL DB.


    The problem
    ------------------
    Active Agents - "Zabbix Agent is unreachable for 2 minutes"

    As I said, I have had a lot of problems: iowait, MySQL settings not optimized, regular unpartitioned tables, Zabbix server settings, etc.

    The symptom of them all is the above message, which is emailed to me hundreds of times or more for every host I have. Sometimes it flaps between Problem/OK states and I get thousands of emails until I reboot the server.


    Currently the DB is optimized and partitioned, and the same goes for the Zabbix server's various caches (most of them are 50-90% free).

    Most of the time things work great, but sometimes I encounter the above problem. All metrics for both VPSs look normal, the Zabbix server looks OK with no spikes in values that I saw, and the same goes for the DB (I am monitoring it with the Percona MySQL templates).

    I can't figure out where the problem is. This system is in production and should not behave like this. I have been struggling with this for a very long time; every time I think, "hey, you got this to work", I get another problem.

    Any help is much appreciated.
    Thanks in advance.
  • delija91p
    Junior Member
    • Nov 2014
    • 24

    #2
    I was having the same exact issue this whole week and last week. Something that seemed to help me out was increasing the Timeout parameter in zabbix_server.conf from the default 3 to 20. Ever since then, I haven't had any false triggers like that. Give that a shot!
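Before changing the value, it can help to confirm what Timeout the server is actually configured with. A minimal sketch, assuming the usual `Parameter=value` layout of zabbix_server.conf with `#` comment lines; the function names and the sample text are made up for illustration, and the fallback of 3 is the default mentioned above:

```python
def parse_zabbix_conf(text):
    """Parse 'Parameter=value' lines, skipping blanks and '#' comments.

    Assumption: if a parameter appears more than once, the last one wins.
    """
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a Parameter=value line
        params[key.strip()] = value.strip()
    return params


def effective_timeout(text, default=3):
    """Return the configured Timeout in seconds, or the default if unset."""
    return int(parse_zabbix_conf(text).get("Timeout", default))


# Hypothetical config excerpt, not taken from anyone's real server.
SAMPLE = """
# Timeout=3
StartTrappers=50
Timeout=20
"""

print(effective_timeout(SAMPLE))
```

To use it on a real server, read the file contents yourself (the path depends on your install) and pass the text to `effective_timeout`.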

    • tchjts1
      Senior Member
      • May 2008
      • 1605

      #3
      As you are adding new hosts and monitoring more and more items, are you also adjusting your settings in zabbix_server.conf to optimize allocated resources? This post can help you out: https://www.zabbix.com/forum/showthread.php?t=47781

      Additionally, I personally would not set my "Agent unreachable" alert for such a small value as 2 minutes. It may work a little smoother for you if you go with 5 minutes.

      • phpclub
        Junior Member
        • Jul 2015
        • 7

        #4
        Thanks for the answers

        1. I haven't tried changing the Timeout setting; for some reason changing it doesn't make sense to me. I will try it out.

        2.
        I am currently setting alerts for 3 minutes plus 90 seconds of escalation time, so it's not that.
        BUT, I do want alerts after 2 minutes, otherwise Zabbix is of no use.
        I need alerts in real time, not after the client calls.

        3. The number of hosts/NVPS is currently steady and is not expected to grow soon.
        Settings in the Zabbix config are optimized as far as I know and can see. I am not using pollers, only trappers (99% active agents). The trapper count is set to 50 and they are about 15-30% busy on average, as someone suggested. It was set to 100, 150 and even 200 at times, which did not make any difference.

        BUT, I can tell that the frequency of these failures has decreased: it used to happen every 1-2 days, and today it happened for the first time in 4 days.
        I currently attribute this to pure luck, OR to loosening the iptables rules.

        Some more information from further tests:

        1. I thought it might have something to do with networking, so I've set iptables to allow full communication between both VPSs. It doesn't seem to change anything.

        2. I have been doing ping tests from 3 locations to the VPS. From time to time a ping fails, but only a single ping, which should not cause problems given the 4.5 minute delay.

        I just encountered another failure, during which ping remained the same, so this rules out network connectivity problems.

        3. From previous tests I can tell there is no change in server load.
        I can access the web interface and ssh with no problems (during the failure), and there are no metric spikes.

        4. I managed to isolate a Zabbix server metric that indicates when this problem occurs.
        New values per second is normally around 120-200+.
        When this happens, it drops below 100 (30, 50, 80).

        So not only is there no increase, one could say that the load goes DOWN, because new values per second drops below 100.

        To avoid huge amounts of email, I've set the trigger to also check the "new values per second" value and not fire if it is below 100.

        The "down" time is about 10-20 minutes, and then everything goes back to normal without my touching anything anywhere.
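The gating rule from point 4 can be sketched as plain logic. This is only an illustration of the condition, not Zabbix trigger syntax; the function and parameter names are hypothetical, and the floor of 100 is the threshold described above:

```python
def should_alert(agent_unreachable, server_nvps, nvps_floor=100.0):
    """Fire the per-host alert only when the server itself looks healthy.

    agent_unreachable: True if the host's "agent unreachable" condition holds.
    server_nvps: the server's current "new values per second" reading.
    nvps_floor: below this, the server is assumed to be the one in trouble,
                so per-host alerts are suppressed.
    """
    return agent_unreachable and server_nvps >= nvps_floor


# Normal server throughput, one host really unreachable: alert.
print(should_alert(True, 180))
# Server-side dip (NVPS at 50): every host looks unreachable, so stay quiet.
print(should_alert(True, 50))
```

Note that this only hides the email storm. It says nothing about why the server stops accepting data, so a separate single alert on the NVPS drop itself would still be worth having.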

        Thanks again.

        Zabbix agent logs:

        10604:20150802:113600.767 active check data upload to [server.xxx:10051] started to fail ([connect] cannot connect to [[server.xxx]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
        10604:20150802:113612.214 active check data upload to [server.xxx:10051] is working again
        10604:20150802:113657.677 active check data upload to [server.xxx:10051] started to fail ([connect] cannot connect to [[server.xxx]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
        10604:20150802:113732.850 active check data upload to [server.xxx:10051] is working again
        10604:20150802:113754.546 active check data upload to [server.xxx:10051] started to fail ([connect] cannot connect to [[server.xxx]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
        10604:20150802:114148.630 active check data upload to [server.xxx:10051] is working again
        10604:20150802:114210.331 active check data upload to [server.xxx:10051] started to fail ([connect] cannot connect to [[server.xxx]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
        10604:20150802:114254.149 active check data upload to [server.xxx:10051] is working again
        10604:20150802:114326.034 active check data upload to [server.xxx:10051] started to fail ([connect] cannot connect to [[server.xxx]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
        10604:20150802:114331.369 active check data upload to [server.xxx:10051] is working again
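To see how long each interruption in a log like the one above lasted, the "started to fail" / "is working again" pairs can be matched up. A minimal sketch, assuming the `PID:YYYYMMDD:HHMMSS.mmm` prefix seen in these lines; the regex and function name are my own and only cover these two message types:

```python
import re
from datetime import datetime

# Matches the two agent messages shown above and captures their timestamp.
LINE_RE = re.compile(
    r"^\s*(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6}\.\d{3})\s+"
    r"active check data upload to \[(?P<target>[^\]]+)\] "
    r"(?P<event>started to fail|is working again)"
)


def parse_outages(lines):
    """Return (start, end, seconds) for each fail/recover pair, in order."""
    outages = []
    fail_start = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # unrelated log line
        ts = datetime.strptime(m["date"] + m["time"], "%Y%m%d%H%M%S.%f")
        if m["event"] == "started to fail":
            if fail_start is None:  # keep the first failure of a run
                fail_start = ts
        elif fail_start is not None:
            outages.append((fail_start, ts, (ts - fail_start).total_seconds()))
            fail_start = None
    return outages


# The first and third pairs from the log above, with the error text shortened.
SAMPLE_LOG = [
    "10604:20150802:113600.767 active check data upload to [server.xxx:10051] started to fail (...)",
    "10604:20150802:113612.214 active check data upload to [server.xxx:10051] is working again",
    "10604:20150802:113754.546 active check data upload to [server.xxx:10051] started to fail (...)",
    "10604:20150802:114148.630 active check data upload to [server.xxx:10051] is working again",
]

for start, end, seconds in parse_outages(SAMPLE_LOG):
    print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}  {seconds:.3f}s")
```

Run over the full log, this makes it easy to compare the interruption lengths (a few seconds up to several minutes here) against the trigger's 2 minute window.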
