Hi,
For the past month, I have been having issues with the Zabbix trapper components (Active). They randomly freeze and stop accepting connections for 6 to 12 hours at a time. Investigating the issue "jars" the server and it starts processing again. This problem does not affect passive polling or VMware polling, only the receiving of Active Zabbix agent and Active proxy item metrics.
I have run an strace on one of the Zabbix trappers that seems to be stuck in the "[processing]" state:
Code:
[gleepwurp@ServerX ~]$ sudo strace -s 256 -p 13518 -tdt
Process 13518 attached - interrupt to quit
 [wait(0x137f) = 13518]
pid 13518 stopped, [SIGSTOP]
 [wait(0x57f) = 13518]
pid 13518 stopped, [SIGTRAP]
23:07:20.669408 read(7,
The trapper seems to be stuck just after that "read(7," line. Changing the log level through the "-R log_level_increase" command seems to unstick the trapper, as it starts processing immediately afterwards.
The Active agents connecting to the Zabbix Server have these error messages:
Code:
15368:20150316:221136.147 active check data upload to [www.xxx.yyy.zzz:10051] is working again
15368:20150316:221139.147 active check data upload to [www.xxx.yyy.zzz:10051] started to fail ([connect] cannot connect to [[www.xxx.yyy.zzz]:10051]: [4] Interrupted system call)
15368:20150316:221206.148 active check data upload to [www.xxx.yyy.zzz:10051] is working again
15368:20150316:221218.149 active check data upload to [www.xxx.yyy.zzz:10051] started to fail ([connect] cannot connect to [[www.xxx.yyy.zzz]:10051]: [4] Interrupted system call)
15368:20150316:221239.150 active check data upload to [www.xxx.yyy.zzz:10051] is working again
15368:20150316:221242.215 active check data upload to [www.xxx.yyy.zzz:10051] started to fail ([connect] cannot connect to [[www.xxx.yyy.zzz]:10051]: [4] Interrupted system call)
15368:20150316:221257.292 active check data upload to [www.xxx.yyy.zzz:10051] is working again
The Zabbix server is at 682 NVPS for 4130 hosts (332057 items), mostly VMware monitoring.
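To put numbers on the flapping in the agent log above, something like the following could summarize how long each upload outage lasts. This is only a rough sketch: the function names (`parse_line`, `failure_windows`) are made up for illustration, and it assumes the `PID:YYYYMMDD:HHMMSS.mmm message` line layout seen in the excerpt and the two message phrases "started to fail" and "is working again".

```python
import re
from datetime import datetime

# Assumed layout, taken from the log excerpt: PID:YYYYMMDD:HHMMSS.mmm message
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\.(\d{3})\s+(.*)$")


def parse_line(line):
    """Return (timestamp, message) for one agent log line, or None if it does not match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    _pid, date, time, millis, message = m.groups()
    ts = datetime.strptime(f"{date} {time}.{millis}", "%Y%m%d %H%M%S.%f")
    return ts, message


def failure_windows(lines):
    """Pair each 'started to fail' line with the next 'is working again' line.

    Returns a list of (start, end, seconds) tuples. A recovery line with no
    preceding failure line (e.g. at the top of an excerpt) is ignored.
    """
    windows = []
    fail_start = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, message = parsed
        if "started to fail" in message:
            # Keep the earliest failure time if several failures are logged in a row.
            if fail_start is None:
                fail_start = ts
        elif "is working again" in message and fail_start is not None:
            windows.append((fail_start, ts, (ts - fail_start).total_seconds()))
            fail_start = None
    return windows
```

Fed the seven lines from the excerpt, this should report three short outages of roughly 27, 21 and 15 seconds, which would suggest the agents see brief repeated connect failures rather than one continuous multi-hour outage. Running it over a full agent log (for example `failure_windows(open(path))`, with `path` being wherever the agent log lives) would show whether that pattern holds for the whole 6 to 12 hour period.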
The problem affects Active items only; Passive and VMware monitoring are not affected.
The Zabbix DB is on a separate server, and both the Zabbix DB and the Zabbix server are pretty much idle (load average ~0.5, even during those "issues").
Server has 8GB RAM, Database has 32GB RAM, no IOWaits on either.
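Since both machines are idle, the trapper in the strace above looks blocked in read() on file descriptor 7 rather than busy. One possible next step is to find out what that descriptor is (a client connection, the database connection, or something else). The sketch below is only an illustration under some assumptions: a Linux /proc filesystem, an IPv4 socket, a little-endian host (as on x86), and enough privileges to read another process's fd directory. The helper names are hypothetical.

```python
import os
import re


def describe_fd(pid, fd, proc_root="/proc"):
    """Return the symlink target of /proc/<pid>/fd/<fd>, e.g. 'socket:[12345]'."""
    return os.readlink(os.path.join(proc_root, str(pid), "fd", str(fd)))


def socket_inode(target):
    """Extract the inode from a 'socket:[N]' link target, or None if it is not a socket."""
    m = re.fullmatch(r"socket:\[(\d+)\]", target)
    return int(m.group(1)) if m else None


def decode_addr(hex_addr):
    """Decode a /proc/net/tcp address such as '0100007F:2743' into '127.0.0.1:10051'.

    Assumes IPv4 on a little-endian host, where the address bytes are stored reversed.
    """
    ip_hex, port_hex = hex_addr.split(":")
    ip = ".".join(str(b) for b in reversed(bytes.fromhex(ip_hex)))
    return f"{ip}:{int(port_hex, 16)}"


def find_tcp_peer(inode, tcp_table_text):
    """Search the text of /proc/net/tcp for a socket inode.

    Returns (local, remote) as 'ip:port' strings, or None if the inode is not listed.
    """
    for line in tcp_table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # The inode is the tenth whitespace-separated column.
        if len(fields) > 9 and fields[9] == str(inode):
            return decode_addr(fields[1]), decode_addr(fields[2])
    return None


def inspect_fd(pid, fd):
    """Combine the helpers: report what a descriptor of a running process points at."""
    target = describe_fd(pid, fd)
    inode = socket_inode(target)
    if inode is None:
        return target, None
    with open("/proc/net/tcp") as f:
        return target, find_tcp_peer(inode, f.read())
```

For the trapper in the strace this would be called as `inspect_fd(13518, 7)`, run as root on the Zabbix server while the process is stuck. If the remote end turned out to be the database server, that would point at the DB connection rather than at the agents; if it were an agent or proxy address on local port 10051, the trapper would be waiting on a client that never sends its data.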
Here are the zabbix_server config values:
Code:
DebugLevel=3
StartPollers=80
StartPollersUnreachable=40
StartTrappers=100
StartPingers=20
StartDiscoverers=10
CacheSize=512M
CacheUpdateFrequency=300
StartDBSyncers=32
HistoryCacheSize=256M
TrendCacheSize=128M
Timeout=20
ProxyConfigFrequency=300
StartVMwareCollectors=20
VMwareFrequency=300
VMwarePerfFrequency=300
VMwareTimeout=30
VMwareCacheSize=512M
ValueCacheSize=512M
Has anyone ever encountered this? Help?
In the meantime, I was able to minimize the issue by switching all 6 of my Zabbix proxies to Passive, but I'd really like to fix that "Active" issue.
Thanks!
Gleepwurp.

