Good Morning
We have started seeing a very high number of unreachable triggers in our system. This started after a network device issue disconnected 6 of the 42 proxies we run. We fixed the network device and the 6 proxies reconnected. About 6 hours later, the number of unreachable triggers jumped very high.
I have run some queries to get example counts (a sketch of the kind of query follows the numbers):
July 8th to 9th we had 525
Sept 5th to Sept 6th we had 30840
We disabled all hosts (~3,700), rebooted the Zabbix server, and increased it from 4 to 6 cores on Sept 7th
Sept 7th to Sept 8th we had ~21024
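Roughly, the queries were along these lines (a sketch only: the LIKE filter on the event name and the year are illustrative, and it assumes counting trigger problem events in the events table, where source=0 is trigger events and value=1 is PROBLEM):

-- Sketch: count trigger PROBLEM events in a date range (year is a placeholder)
SELECT COUNT(*)
FROM events
WHERE source = 0
  AND value = 1
  AND name LIKE '%unreachable%'
  AND clock BETWEEN UNIX_TIMESTAMP('2020-09-05 00:00:00')
                AND UNIX_TIMESTAMP('2020-09-06 00:00:00');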
Screenshots are attached as queue.doc.
OK, let me back up a bit and describe my system. We are running Zabbix 4.4.10 on three systems: a web server, a Zabbix server, and a database server. Before the event we had ~3,700 hosts across 42 proxies, in physical data centers and cloud systems across the US.
The web server is a 4-core, 16 GB VM; the Zabbix server is a 6-core, 20 GB VM; and the database server is a 16-core, 128 GB VM. The database is large: 778 GB.
When I look at the host systems, I notice the Zabbix server has something I did not expect: a lot of disk IO (attached as disk IO.doc).
Also attached are the system stats as stats.doc.
My Zabbix server config is:
# This is a configuration file for Zabbix server daemon
# To get more information about Zabbix, visit http://www.zabbix.com
############ GENERAL PARAMETERS #################
LogFile=/var/log/zabbix/zabbix_server.log
LogFileSize=16
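# (Note: DebugLevel 4 is debug-level logging; the Zabbix default is 3)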
DebugLevel=4
PidFile=/var/run/zabbix/zabbix_server.pid
SocketDir=/var/run/zabbix
DBHost=10.96.110.44
DBName=zabbix
DBUser=zabbix
DBPassword=Z@bB1x123456
DBPort=3306
############ ADVANCED PARAMETERS ################
StartPollers=30
# StartIPMIPollers=0
StartPreprocessors=8
StartPollersUnreachable=2
StartTrappers=160
StartPingers=8
StartDiscoverers=36
StartHTTPPollers=8
StartTimers=6
# StartEscalators=1
StartAlerters=18
# HousekeepingFrequency=1
# MaxHousekeeperDelete=10000
CacheSize=2048M
CacheUpdateFrequency=120
StartDBSyncers=6
HistoryCacheSize=2G
HistoryIndexCacheSize=1024M
TrendCacheSize=512M
ValueCacheSize=512M
Timeout=30
# TrapperTimeout=300
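# (The three parameters below control host unreachability timing; commented out = Zabbix defaults)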
# UnreachablePeriod=45
# UnavailableDelay=60
# UnreachableDelay=15
AlertScriptsPath=/usr/lib/zabbix/alertscripts
ExternalScripts=/usr/lib/zabbix/externalscripts
# FpingLocation=/usr/sbin/fping
My question is: is it normal to have this much disk IO on the Zabbix server?