Zabbix Proxies issues - Queue overflow / performance tuning?

  • bartaadam87
    Junior Member
    • Nov 2019
    • 9

    #1

    Zabbix Proxies issues - Queue overflow / performance tuning?

    Hello all,

    We are facing issues with our Zabbix server / proxies. The main Zabbix server is an EC2 instance in AWS (Ubuntu 20.04 LTS, instance type t3.large, Zabbix 4.4.8), and we have 6 active proxies (3 in AWS and 3 hosted in data centers = MH).
    The main problem is with the 2 MH proxies - they handle most of the hosts in our environment (VA3 Proxy has 83 hosts / 10264 items, RUH Proxy has 23 hosts / 2467 items).
    The AWS <-> MH connection goes through an AWS Load Balancer with a URL alias (acting as the Zabbix server access point on ports 10050 and 10051) and a Citrix LB mapping public IPs to the private IPs of the MH proxies (again with ports 10050 and 10051 open). Both sides have access rules allowing only the specified connections, so our problem is not caused from the outside (e.g. a DDoS attack).
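
    To illustrate the proxy side of this setup, an active proxy pointing at the LB alias is configured roughly like the sketch below (the hostnames and values here are placeholders for illustration, not our real settings - the full config files are linked further down):

    # zabbix_proxy.conf (active proxy, Zabbix 4.4) - illustrative values only
    ProxyMode=0                  # 0 = active proxy, connects out to the server
    Server=zbx-lb.example.com    # LB alias used as the Zabbix server access point
    ServerPort=10051             # Zabbix trapper port on the server side
    Hostname=VA3-Proxy           # must match the proxy name configured in the frontend
    ConfigFrequency=3600         # how often the proxy pulls its configuration (seconds)
    DataSenderFrequency=1        # how often collected values are pushed to the server
    HeartbeatFrequency=60        # keep-alive that updates the proxy "Last seen (age)"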

    The Zabbix server started to become unstable when we switched to active clients (we need this for auto-registration of clients, IIS log monitoring etc.). I did some performance tuning, which helped with cache issues and poller warnings, but the main issue persists:
    Two of our MH proxies randomly disconnect and cause monitoring to go crazy (all clients show as unreachable). When this happens, the proxy's Last seen (age) time grows, but only for one of the two affected proxies at a time - age grows for one (the other seems to have higher latency but stays visible), then they 'switch' and the other one disappears, and so on. While this is happening I can still telnet from the proxy to the server without issues - at first I suspected an underlying network issue, but our network team confirmed there are no connection drops.
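
    For reference, the Last seen (age) value corresponds to the standard internal item below and can be graphed or alerted on (the host name "Zabbix server" and the proxy name "VA3 Proxy" are just examples here, not necessarily our exact names):

    zabbix[proxy,VA3 Proxy,lastaccess]                                     # unix timestamp of the last contact from that proxy
    {Zabbix server:zabbix[proxy,VA3 Proxy,lastaccess].fuzzytime(300)}=0    # trigger: no contact from the proxy for more than 5 minutes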

    [Attachment: Screenshot_795.png]

    There are recurring patterns in the performance graphs - when a proxy drops, the Zabbix queue goes up and the number of processed values per second drops rapidly:

    [Attachment: Screenshot_794.png]
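
    (Assuming these graphs come from the stock server internal items, the keys behind them are roughly the following:)

    zabbix[queue]             # number of monitored items with delayed values
    zabbix[queue,10m]         # items whose values are delayed by 10 minutes or more
    zabbix[wcache,values]     # values processed by the server; with "change per second" this gives the NVPS curve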


    Another recurring anomaly is that the utilization of the trapper data collector processes goes up:

    [Attachment: Screenshot_793.png]
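
    (Trapper utilization here means the internal "busy %" metric for the trapper processes; the number of those processes is set by StartTrappers in zabbix_server.conf, which is not among the changes listed below, so it is presumably still at its default of 5. The key, for reference:)

    zabbix[process,trapper,avg,busy]     # average % of time the trapper processes spent busy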

    The rest of the server performance metrics look fine:

    [Attachment: Screenshot_791.png]

    The last 4 similar patterns happened after the performance tuning - the server survives the first spike of history data sync and the load stays constant... then 'something' happens and the connection goes down.
    The only solution we have found (apart from disabling all proxy clients and re-enabling them slowly) is to restart the server and the proxies.

    Cache usage graph:

    [Attachment: Screenshot_792.png]
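
    (The cache graph maps onto the tuned cache parameters via the standard internal items - listed here just to make the numbers easier to relate to the config changes below:)

    zabbix[rcache,buffer,pused]      # configuration cache usage, sized by CacheSize
    zabbix[wcache,history,pused]     # history write cache usage, sized by HistoryCacheSize
    zabbix[wcache,index,pused]       # history index cache usage, sized by HistoryIndexCacheSize
    zabbix[vcache,buffer,pused]      # value cache usage, sized by ValueCacheSize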

    Server + Proxies config files: https://drive.google.com/open?id=1U9...jxU9MX6uz4d3o1

    Changes in server config:
    StartPollers=10
    StartPollersUnreachable=15
    StartDiscoverers=3
    StartHTTPPollers=3
    CacheSize=1000M
    StartDBSyncers=5
    HistoryCacheSize=800M
    ValueCacheSize=1G
    LogSlowQueries=3000

    VA3 proxy config:
    StartPollers=10
    StartPreprocessors=5
    StartPollersUnreachable=15
    Timeout=5


    I will be grateful for any suggestions.

    Regards,
    Adam