Hi all,
I am seeking opinions and advice on deploying a performant proxy that needs to sustain roughly 50 VPS, monitoring 6K hosts via SNMP (8 items per host, 15-minute interval, 4 triggers) in active mode.
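For context, the 50 VPS figure follows directly from the item count and interval: 6,000 hosts × 8 items = 48,000 items, each polled every 900 s, which works out to roughly 48,000 / 900 ≈ 53 new values per second.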
I have run a proxy in the above scenario for 3 weeks, attempting to tweak its performance every 2 or 3 days by changing config and DB parameters.
After 21 days of troubleshooting I still cannot find a solution. A queue builds and polled results keep drifting beyond their expected 15-minute interval, with "no data" triggers (no data after 1000 s, i.e. ~16.7 minutes, so only ~100 s of slack beyond the 900 s polling interval) littering the dashboard.
What's Happening?
Roughly every 24 hours, VPS on the proxy drops to an average of 25 (half the required rate) and a queue appears. Once a queue appears, the proxy is never able to catch up, no matter how long I leave it alone to sort itself out.
Zabbix Version (On Proxy and Server)
3.0
Proxy Config
ProxyMode=0
ServerPort=10051
HostnameItem=system.hostname
LogType=console
LogFile=/var/log/zabbix/zabbix_proxy.log
LogFileSize=10
DebugLevel=3
ProxyOfflineBuffer=48
HeartbeatFrequency=60
ConfigFrequency=1800
DataSenderFrequency=1
StartPollers=10
StartPollersUnreachable=10
StartTrappers=5
StartPingers=1
StartDiscoverers=2
StartHTTPPollers=2
SNMPTrapperFile=/var/log/snmptt/snmptrap.log
StartSNMPTrapper=1
HousekeepingFrequency=1
CacheSize=512M
HistoryCacheSize=512M
HistoryIndexCacheSize=512M
Timeout=10
TrapperTimeout=20
UnreachablePeriod=45
UnavailableDelay=20
UnreachableDelay=5
ExternalScripts=/usr/lib/zabbix/externalscripts
LogSlowQueries=3000
Proxy DB Config
[mysqld]
skip-host-cache
skip-name-resolve
max_allowed_packet = 32M
table_open_cache = 1024
wait_timeout = 86400
innodb_buffer_pool_size = 8G
max_connections = 500
innodb_io_capacity = 8000
innodb_io_capacity_max = 12000
Zabbix-DB Config
[mysqld]
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
lc_messages_dir = /usr/share/mysql
lc_messages = en_US
skip-external-locking
connect_timeout = 5
wait_timeout = 86400
max_allowed_packet = 32M
thread_cache_size = 128
sort_buffer_size = 4M
bulk_insert_buffer_size = 16M
tmp_table_size = 32M
max_heap_table_size = 32M
key_buffer_size = 128M
table_open_cache = 400
myisam_sort_buffer_size = 512M
concurrent_insert = 2
read_buffer_size = 2M
read_rnd_buffer_size = 1M
query_cache_limit = 128K
query_cache_size = 64M
slow_query_log_file = /var/log/mysql/mariadb-slow.log
long_query_time = 10
expire_logs_days = 10
max_binlog_size = 100M
innodb_buffer_pool_size = 8G
innodb_log_buffer_size = 32M
innodb_file_per_table = 1
innodb_open_files = 400
innodb_io_capacity = 8000
innodb_io_capacity_max = 12000
innodb_flush_method = O_DIRECT
skip-host-cache
skip-name-resolve
Zabbix-Server Config
ListenPort=10051
LogType=console
LogFile=/var/log/zabbix/zabbix_server.log
LogFileSize=5
PidFile=/var/run/zabbix/zabbix_server.pid
StartPollers=20
StartTrappers=10
SNMPTrapperFile=/var/log/snmptt/snmptt.log
HousekeepingFrequency=1
SenderFrequency=30
CacheSize=512M
CacheUpdateFrequency=60
HistoryCacheSize=128M
HistoryIndexCacheSize=64M
TrendCacheSize=512M
ValueCacheSize=512M
Timeout=5
AlertScriptsPath=/usr/lib/zabbix/alertscripts
ExternalScripts=/usr/lib/zabbix/externalscripts
LogSlowQueries=3000
StartProxyPollers=8
ProxyConfigFrequency=300
Proxy DB queries during poor performance
Executed manually every second to watch the proxy queue (a variant that also shows the age of the backlog follows the output):
MariaDB [zabbix]> select max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) from proxy_history;
+-------------------------------------------------------------------------------+
| max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) |
+-------------------------------------------------------------------------------+
| 0 |
+-------------------------------------------------------------------------------+
MariaDB [zabbix]> select max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) from proxy_history;
+-------------------------------------------------------------------------------+
| max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) |
+-------------------------------------------------------------------------------+
| 32 |
+-------------------------------------------------------------------------------+
1 row in set (0.00 sec)
MariaDB [zabbix]> select max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) from proxy_history;
+-------------------------------------------------------------------------------+
| max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) |
+-------------------------------------------------------------------------------+
| 28 |
+-------------------------------------------------------------------------------+
1 row in set (0.00 sec)
MariaDB [zabbix]> select max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) from proxy_history;
+-------------------------------------------------------------------------------+
| max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) |
+-------------------------------------------------------------------------------+
| 0 |
+-------------------------------------------------------------------------------+
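For completeness, a variant of the same check that also shows how old the oldest unsent value is; I'm assuming the standard 3.0 proxy schema here, where ids.nextid tracks the last id already sent to the server:
MariaDB [zabbix]> SELECT COUNT(*) AS unsent, FROM_UNIXTIME(MIN(clock)) AS oldest_unsent FROM proxy_history WHERE id > (SELECT nextid FROM ids WHERE table_name = 'proxy_history');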
Proxy process during poor performance
zabbix_proxy -f
zabbix_proxy: configuration syncer [synced config 15762640 bytes in 1.900318 sec, idle 1800
zabbix_proxy: heartbeat sender [sending heartbeat message success in 0.002096 sec, idle 60
zabbix_proxy: data sender [sent 48 values in 0.003604 sec, idle 1 sec]
zabbix_proxy: poller #1 [got 5 values in 0.354294 sec, idle 1 sec]
zabbix_proxy: poller #2 [got 4 values in 0.307196 sec, idle 1 sec]
zabbix_proxy: poller #3 [got 6 values in 0.387001 sec, idle 1 sec]
zabbix_proxy: poller #4 [got 4 values in 0.301262 sec, idle 1 sec]
zabbix_proxy: poller #5 [got 5 values in 0.347205 sec, idle 1 sec]
zabbix_proxy: poller #6 [got 5 values in 0.331892 sec, idle 1 sec]
zabbix_proxy: poller #7 [got 5 values in 0.339897 sec, idle 1 sec]
zabbix_proxy: poller #8 [got 5 values in 0.368512 sec, idle 1 sec]
zabbix_proxy: poller #9 [got 4 values in 0.281030 sec, idle 1 sec]
zabbix_proxy: poller #10 [got 5 values in 0.359561 sec, idle 1 sec]
zabbix_proxy: unreachable poller #1 [got 1 values in 20.029659 sec, getting values]
zabbix_proxy: unreachable poller #2 [got 1 values in 20.033272 sec, getting values]
zabbix_proxy: unreachable poller #3 [got 1 values in 20.021567 sec, getting values]
zabbix_proxy: unreachable poller #4 [got 1 values in 20.037575 sec, getting values]
zabbix_proxy: unreachable poller #5 [got 2 values in 20.142452 sec, getting values]
zabbix_proxy: unreachable poller #6 [got 1 values in 20.029762 sec, getting values]
zabbix_proxy: unreachable poller #7 [got 1 values in 20.038905 sec, getting values]
zabbix_proxy: unreachable poller #8 [got 1 values in 20.028888 sec, getting values]
zabbix_proxy: unreachable poller #9 [got 1 values in 20.029567 sec, getting values]
zabbix_proxy: unreachable poller #10 [got 1 values in 20.038540 sec, getting values]
zabbix_proxy: trapper #1 [processed data in 0.000000 sec, waiting for connection]
zabbix_proxy: trapper #2 [processed data in 0.000000 sec, waiting for connection]
zabbix_proxy: trapper #3 [processed data in 0.000000 sec, waiting for connection]
zabbix_proxy: trapper #4 [processed data in 0.000000 sec, waiting for connection]
zabbix_proxy: trapper #5 [processed data in 0.000000 sec, waiting for connection]
zabbix_proxy: icmp pinger #1 [got 0 values in 0.000004 sec, idle 5 sec]
zabbix_proxy: housekeeper [deleted 186839 records in 0.504918 sec, idle for 1 hour(s)]
Graphs

Please view the attached graph showing recent Zabbix proxy performance:
1. The first values received are from when I enabled the agent. I restarted the proxy, which dumped the previous queue (note the red line initially dropping). Performance is optimal at this point: the proxy sits at 50 VPS and no queue is seen for around 6 hours.
2. VPS dives and the queue begins creeping up.
3. I restart the proxy a second time; the queue disappears and VPS returns to its average of 50. We then get around 13 hours without a queue or a VPS hit.
4. 13 hours later, the queue begins to rise and VPS begins to dive. This time I decided to leave the proxy untouched, without a restart, for more than 48 hours to see if it could fix itself.
5. Restarting the proxy 48+ hours later brought everything back to 50 VPS immediately. Before restarting it I ran the DB queries and process monitor shown above to see how many values were coming in and being sent. The proxy_history table only held ~28K rows and the difference between max(id) and nextid was at most ~30, i.e. the backlog wasn't that big?
6. A clear repeating pattern.
What would I like to happen?
I'd like to be able to leave the proxy untouched and have it perform without needing the process restarted whenever a queue begins. I was hoping 50 VPS wouldn't be too much of an ask (and it hasn't been, for roughly 24 hours at a time!).
I would really like to isolate where the bottleneck is. I'm hoping the graphs, description, config and database parameters above are enough for an expert to see where the problem may lie.
Am I correct in presuming that once a queue begins, it triggers a cascade of processes that further impairs the ability to poll devices, in effect a positive feedback loop?
My troubleshooting cannot explain why restarting the process immediately fixes the problem, because the proxy_history table appears to hold a similar number of values during perfect operation and during a big queue (~180K values housekept every hour).
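For reference, that check is just a row count on the proxy DB:
MariaDB [zabbix]> SELECT COUNT(*) FROM proxy_history;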
If you would like any more information or config please let me know.
Network?
When the proxy has trouble polling target devices during a high-queue scenario, I jump on the CLI and run bulk snmpgets to determine manually whether there is a network problem. Every time this has happened, the devices have responded without issue. This makes me believe the problem lies in the SNMP poller process, or in the data sender process back to the zabbix-server. Does anyone agree?
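For clarity, the manual check is along these lines, repeated across a batch of devices, with the timeout set to roughly mirror the proxy's Timeout=10 (the community, address and OID below are placeholders):
time snmpget -v2c -c <community> -t 10 -r 1 192.0.2.10 IF-MIB::ifOperStatus.1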
Can I start multiple data senders to ensure the data sender process isn't a bottleneck? Just a thought.
Desired outcome from you guys
I would love to hear from anyone who has had a similar experience or who can highlight a glaring problem.
Problem at Zabbix Server receiver process?
Problem at Zabbix Server DB?
Problem at Zabbix Proxy poller process?
Problem at Zabbix Proxy data sender process?
Problem at Zabbix Proxy database?
Thank you kindly to all who have read this far