6k Hosts, 50VPS, 8 Items per Host, 4 Triggers - Proxy queue never catches up

  • supa_marty
    Junior Member
    • Oct 2019
    • 1

    #1

    6k Hosts, 50VPS, 8 Items per Host, 4 Triggers - Proxy queue never catches up

    Hi all,

    I am seeking opinions and advice on deploying a performant active-mode proxy that needs to sustain roughly 50 VPS, monitoring 6,000 hosts via SNMP (8 items per host, 15-minute interval, 4 triggers per host).

    I have run a proxy for 3 weeks in the above scenario and attempted to tweak its performance every 2 or 3 days by changing config and DB parameters.

    After 21 days of troubleshooting I cannot find a solution. A queue builds, polled results drift further and further beyond their expected 15-minute interval, and a "no data" trigger (no data after 1000 s ≈ 16.7 minutes) litters the dashboard.
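
    As a sanity check on the 50 VPS figure, the raw numbers work out as follows (simple arithmetic, shown as a quick shell calculation):
    Code:
    # 6000 hosts x 8 items, each polled every 15 minutes (900 s)
    echo "scale=1; 6000 * 8 / 900" | bc   # => 53.3 new values per second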

    What's Happening?
    On average, once every 24 hours, the proxy's VPS drops to around 25 (50% of the required rate) and a queue appears. Once a queue appears, the proxy is never able to catch up, no matter how long I leave it alone to see whether it can sort itself out.

    Zabbix Version (On Proxy and Server)
    3.0
    Proxy Config
    ProxyMode=0
    ServerPort=10051
    HostnameItem=system.hostname
    LogType=console
    LogFile=/var/log/zabbix/zabbix_proxy.log
    LogFileSize=10
    DebugLevel=3
    ProxyOfflineBuffer=48
    HeartbeatFrequency=60
    ConfigFrequency=1800
    DataSenderFrequency=1
    StartPollers=10
    StartPollersUnreachable=10
    StartTrappers=5
    StartPingers=1
    StartDiscoverers=2
    StartHTTPPollers=2
    SNMPTrapperFile=/var/log/snmptt/snmptrap.log
    StartSNMPTrapper=1
    HousekeepingFrequency=1
    CacheSize=512M
    HistoryCacheSize=512M
    HistoryIndexCacheSize=512M
    Timeout=10
    TrapperTimeout=20
    UnreachablePeriod=45
    UnavailableDelay=20
    UnreachableDelay=5
    ExternalScripts=/usr/lib/zabbix/externalscripts
    LogSlowQueries=3000


    Proxy DB Config
    [mysqld]
    skip-host-cache
    skip-name-resolve
    max_allowed_packet = 32M
    table_open_cache = 1024
    wait_timeout = 86400
    innodb_buffer_pool_size = 8G
    max_connections = 500
    innodb_io_capacity = 8000
    innodb_io_capacity_max = 12000


    Zabbix-DB Config
    [mysqld]
    pid-file = /var/run/mysqld/mysqld.pid
    socket = /var/run/mysqld/mysqld.sock
    port = 3306
    basedir = /usr
    datadir = /var/lib/mysql
    tmpdir = /tmp
    lc_messages_dir = /usr/share/mysql
    lc_messages = en_US
    skip-external-locking
    connect_timeout = 5
    wait_timeout = 86400
    max_allowed_packet = 32M
    thread_cache_size = 128
    sort_buffer_size = 4M
    bulk_insert_buffer_size = 16M
    tmp_table_size = 32M
    max_heap_table_size = 32M
    key_buffer_size = 128M
    table_open_cache = 400
    myisam_sort_buffer_size = 512M
    concurrent_insert = 2
    read_buffer_size = 2M
    read_rnd_buffer_size = 1M
    query_cache_limit = 128K
    query_cache_size = 64M
    slow_query_log_file = /var/log/mysql/mariadb-slow.log
    long_query_time = 10
    expire_logs_days = 10
    max_binlog_size = 100M

    innodb_buffer_pool_size = 8G
    innodb_log_buffer_size = 32M
    innodb_file_per_table = 1
    innodb_open_files = 400
    innodb_io_capacity = 8000
    innodb_io_capacity_max = 12000
    innodb_flush_method = O_DIRECT

    skip-host-cache
    skip-name-resolve


    Zabbix-Server Config
    ListenPort=10051
    LogType=console
    LogFile=/var/log/zabbix/zabbix_server.log
    LogFileSize=5
    PidFile=/var/run/zabbix/zabbix_server.pid
    StartPollers=20
    StartTrappers=10
    SNMPTrapperFile=/var/log/snmptt/snmptt.log
    HousekeepingFrequency=1
    SenderFrequency=30
    CacheSize=512M
    CacheUpdateFrequency=60
    HistoryCacheSize=128M
    HistoryIndexCacheSize=64M
    TrendCacheSize=512M
    ValueCacheSize=512M
    Timeout=5
    AlertScriptsPath=/usr/lib/zabbix/alertscripts
    ExternalScripts=/usr/lib/zabbix/externalscripts
    LogSlowQueries=3000
    StartProxyPollers=8
    ProxyConfigFrequency=300

    Proxy DB queries during poor performance
    Executed manually about once per second to watch the proxy's unsent-value backlog

    MariaDB [zabbix]> select max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) from proxy_history;
    +-------------------------------------------------------------------------------+
    | max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) |
    +-------------------------------------------------------------------------------+
    |                                                                             0 |
    +-------------------------------------------------------------------------------+

    MariaDB [zabbix]> select max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) from proxy_history;
    +-------------------------------------------------------------------------------+
    | max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) |
    +-------------------------------------------------------------------------------+
    |                                                                            32 |
    +-------------------------------------------------------------------------------+
    1 row in set (0.00 sec)

    MariaDB [zabbix]> select max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) from proxy_history;
    +-------------------------------------------------------------------------------+
    | max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) |
    +-------------------------------------------------------------------------------+
    |                                                                            28 |
    +-------------------------------------------------------------------------------+
    1 row in set (0.00 sec)

    MariaDB [zabbix]> select max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) from proxy_history;
    +-------------------------------------------------------------------------------+
    | max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) |
    +-------------------------------------------------------------------------------+
    |                                                                             0 |
    +-------------------------------------------------------------------------------+
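
    A convenient way to repeat that check (a sketch; it assumes local socket auth to the proxy's zabbix database):
    Code:
    # re-run the backlog query once per second until interrupted
    while true; do
        mysql zabbix -N -e "SELECT MAX(id) - (SELECT nextid FROM ids WHERE table_name = 'proxy_history' LIMIT 1) FROM proxy_history;"
        sleep 1
    done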

    Proxy process during poor performance

    zabbix_proxy -f
    zabbix_proxy: configuration syncer [synced config 15762640 bytes in 1.900318 sec, idle 1800 sec]
    zabbix_proxy: heartbeat sender [sending heartbeat message success in 0.002096 sec, idle 60 sec]
    zabbix_proxy: data sender [sent 48 values in 0.003604 sec, idle 1 sec]
    zabbix_proxy: poller #1 [got 5 values in 0.354294 sec, idle 1 sec]
    zabbix_proxy: poller #2 [got 4 values in 0.307196 sec, idle 1 sec]
    zabbix_proxy: poller #3 [got 6 values in 0.387001 sec, idle 1 sec]
    zabbix_proxy: poller #4 [got 4 values in 0.301262 sec, idle 1 sec]
    zabbix_proxy: poller #5 [got 5 values in 0.347205 sec, idle 1 sec]
    zabbix_proxy: poller #6 [got 5 values in 0.331892 sec, idle 1 sec]
    zabbix_proxy: poller #7 [got 5 values in 0.339897 sec, idle 1 sec]
    zabbix_proxy: poller #8 [got 5 values in 0.368512 sec, idle 1 sec]
    zabbix_proxy: poller #9 [got 4 values in 0.281030 sec, idle 1 sec]
    zabbix_proxy: poller #10 [got 5 values in 0.359561 sec, idle 1 sec]
    zabbix_proxy: unreachable poller #1 [got 1 values in 20.029659 sec, getting values]
    zabbix_proxy: unreachable poller #2 [got 1 values in 20.033272 sec, getting values]
    zabbix_proxy: unreachable poller #3 [got 1 values in 20.021567 sec, getting values]
    zabbix_proxy: unreachable poller #4 [got 1 values in 20.037575 sec, getting values]
    zabbix_proxy: unreachable poller #5 [got 2 values in 20.142452 sec, getting values]
    zabbix_proxy: unreachable poller #6 [got 1 values in 20.029762 sec, getting values]
    zabbix_proxy: unreachable poller #7 [got 1 values in 20.038905 sec, getting values]
    zabbix_proxy: unreachable poller #8 [got 1 values in 20.028888 sec, getting values]
    zabbix_proxy: unreachable poller #9 [got 1 values in 20.029567 sec, getting values]
    zabbix_proxy: unreachable poller #10 [got 1 values in 20.038540 sec, getting values]
    zabbix_proxy: trapper #1 [processed data in 0.000000 sec, waiting for connection]
    zabbix_proxy: trapper #2 [processed data in 0.000000 sec, waiting for connection]
    zabbix_proxy: trapper #3 [processed data in 0.000000 sec, waiting for connection]
    zabbix_proxy: trapper #4 [processed data in 0.000000 sec, waiting for connection]
    zabbix_proxy: trapper #5 [processed data in 0.000000 sec, waiting for connection]
    zabbix_proxy: icmp pinger #1 [got 0 values in 0.000004 sec, idle 5 sec]
    zabbix_proxy: housekeeper [deleted 186839 records in 0.504918 sec, idle for 1 hour(s)]


    Graphs

    [Attached: graphs of the proxy's VPS and queue size over several days]
    Please view the attached graph of recent Zabbix proxy performance:
    1. The first values received are from when I enabled the agent. I restarted the proxy, which dumped the previous queue (notice the red line initially diving down). Performance is optimal at this point: the proxy sits at 50 VPS and no queue is seen for around 6 hours.
    2. VPS dives down and the queue begins creeping up.
    3. After a second restart the queue disappears and VPS returns to its average of 50. Around 13 hours pass without a queue or a VPS hit.
    4. 13 hours later the queue begins to rise and VPS begins to dive. This time I decided to leave the proxy untouched, without a restart, for more than 48 hours to see whether it could fix itself.
    5. Dropping the proxy 48+ hours later brought everything back to 50 VPS immediately. Before dropping it I ran the DB queries and process monitor shown above to see how many values were coming in and being sent. The proxy_history table held only ~28K rows and the gap to nextid was at most ~30, i.e. the backlog wasn't that big?
    6. A clear repeating pattern.


    What would I like to happen?
    I'd like to be able to leave the proxy untouched and have it perform without needing a process restart whenever a queue begins to form. I was hoping 50 VPS wouldn't be too much to ask (and it hasn't been, for roughly 24 hours at a time!).

    I would really like to isolate where the bottleneck exists. I'm hoping the above graphs, description, config and database parameters are enough for an expert to see where the problem may lie.
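
    One thing I'm considering for isolating it (a sketch; these are standard Zabbix internal item keys, attached to a host monitored by the proxy so that the proxy reports on its own processes):
    Code:
    # items of type "Zabbix internal" on a proxy-monitored host
    zabbix[process,poller,avg,busy]              # % of time the pollers are busy
    zabbix[process,unreachable poller,avg,busy]  # % busy for unreachable pollers
    zabbix[process,data sender,avg,busy]         # % busy for the data sender
    zabbix[wcache,history,pfree]                 # % free history write cache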

    Am I correct in presuming that once a queue begins, it sets off a cascade effect across processes that further impairs the ability to poll devices? So, in effect, a positive feedback loop?

    My troubleshooting cannot explain why restarting the process immediately fixes the problem: the proxy_history table appears to hold a similar number of values during perfect operation and during a big queue (~180k values housekept every hour).
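
    Incidentally, that housekeeping figure is consistent with the target rate (simple arithmetic, as a shell one-liner):
    Code:
    # one hour of history at 50 new values per second
    echo $((50 * 3600))   # => 180000, matching the ~180k rows housekept hourly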

    If you would like any more information or config please let me know.

    Network?
    When the proxy has trouble polling target devices during a high-queue scenario, I jump on the CLI and run bulk snmpgets to determine manually whether there is a network problem. In every such instance, the devices have responded without any problem. This makes me believe the problem lies in the SNMP poller processes, OR in the data sender process that ships results back to the zabbix-server. Does anyone agree?
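
    For reference, the manual check looks roughly like this (a sketch; hosts.txt and the 'public' community string are placeholders):
    Code:
    # poll sysUpTime.0 on each target to confirm it answers promptly
    while read -r host; do
        snmpget -v2c -c public -t 2 -r 1 "$host" 1.3.6.1.2.1.1.3.0 \
            || echo "FAILED: $host"
    done < hosts.txt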

    Can I start multiple data senders to ensure the data sender process isn't a bottleneck? Just a thought.

    Desired outcome from you guys
    I would love to hear from anyone who has had a similar experience, or who can highlight a glaring problem.

    Problem at Zabbix Server receiver process?
    Problem at Zabbix Server DB?
    Problem at Zabbix Proxy poller process?
    Problem at Zabbix Proxy data sender process?
    Problem at Zabbix Proxy database?

    Thank you kindly to all who have read this far.
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    1) Move to Zabbix >= 4.0.
    2) query_cache_limit = 128K, query_cache_size = 64M: the query cache cannot speed up Zabbix's DB queries, because almost all of them operate on constantly moving windows of data. Decrease those values to the minimum MySQL allows so that memory isn't wasted.
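
    For instance (a sketch of the my.cnf change; setting both to 0 disables the query cache outright):
    Code:
    [mysqld]
    query_cache_type = 0    # disable the query cache
    query_cache_size = 0    # and release its memory
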
    3) What is the ratio between write and read IOs on the main DB backend? You should see at least 20:1 writes to reads; anything less means you are not caching enough data in memory -> increase the InnoDB buffer pool.
    Low-latency write queries (inserts and updates) depend on low-latency read IOs. If a shortage of memory forces the engine to read data from storage (even fast NVMe storage) to satisfy selects, your DB backend will always be slow at ingesting new incoming data.
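
    One way to check that ratio (a sketch; these counters are cumulative since server start, so sample twice and diff for a rate):
    Code:
    # InnoDB data file reads vs writes since the server started
    mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Innodb_data_reads','Innodb_data_writes');"
    # or watch the block device directly (reads/s vs writes/s)
    iostat -x 5
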
    4) You are using SNMP, so the fact that the proxy is active is not really relevant here: SNMP is a type of passive monitoring.
    5) How many of these proxies do you have? If all 6k hosts are monitored through a single proxy, it will never work correctly, because that proxy is the bottleneck. You must increase the proxy's pollers, and if that is not enough, you must spread the monitoring of those hosts over more proxies.
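
    As an illustration of the first step only (example values, not a sizing recommendation for this exact box):
    Code:
    # zabbix_proxy.conf - more concurrent pollers, then restart the proxy
    StartPollers=50
    StartPollersUnreachable=20
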
    6) If the rate of your SNMP item queries is not uniform, so that at peak you get a burst of SNMP queries and a flood of data that must be delivered to the server, and that is what causes the issue, you must compile your own proxy with a change like the one below (it is only an example, which raises the hardcoded 1k-records-per-exchange limit to 50k; you may use a different value):
    Code:
    $ cat zabbix-default-proxy_ZBX_MAX_HRECORDS_50000.patch 
    --- a/include/proxy.h~
    +++ b/include/proxy.h
    @@ -27,8 +27,8 @@
     #define ZBX_PROXYMODE_ACTIVE    0
     #define ZBX_PROXYMODE_PASSIVE    1
    
    -#define ZBX_MAX_HRECORDS    1000
    -#define ZBX_MAX_HRECORDS_TOTAL    10000
    +#define ZBX_MAX_HRECORDS    50000
    +#define ZBX_MAX_HRECORDS_TOTAL    100000
    
     #define ZBX_PROXY_DATA_DONE    0
     #define ZBX_PROXY_DATA_MORE    1
    ZBX_MAX_HRECORDS throttles the bandwidth of the proxy-to-server flow (your server and DB backend must be strong enough to process the incoming data and flush the Zabbix server write cache to the DB backend faster than data arrives).
    In many large-scale monitoring stacks, the default 1k points per srv<>prx exchange creates a bottleneck.
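
    Applying the patch looks roughly like this (a sketch; the configure options depend on your environment):
    Code:
    $ cd zabbix-source            # your unpacked Zabbix source tree (placeholder path)
    $ patch -p1 < zabbix-default-proxy_ZBX_MAX_HRECORDS_50000.patch
    $ ./configure --enable-proxy --with-mysql --with-net-snmp
    $ make && make install
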
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates
