Hello All,
There's a weird issue I've been facing with the agents lately: graphs show disconnected dots instead of continuous lines, and I keep getting "no data received" alerts, even though telnet to the agent port works fine. The Zabbix queue has also piled up badly. I really need help figuring out what's going wrong.
I tried switching to active agent checks, which didn't help, and increasing the number of pollers didn't help either.
I have a mix of items with update intervals ranging from 60 seconds up to 1 day. The Zabbix status and queue are posted below.
Only a few active items are configured; most items are Zabbix agent and SNMP checks. The proxy config is posted below, and the other proxies are configured much the same. According to the data gathering process graph, the data gathering processes are only about 60% busy and everything else looks fine. Values processed are about 125+, and the queue for this proxy averages about 900.
The servers are not heavily utilised either: the proxy servers have enough RAM (only about 60% used) and CPU load is 20-30% on all proxies. The server config, Zabbix server details and MySQL config are also posted below.
I'm even more confused now and am trying to work out the optimal configuration to get this working. I can't restart the Zabbix server and proxies many times because they are in production, but two to four restarts can be tried. When this happened before, increasing the pollers made it go away; now increasing them doesn't help. The logs don't show much except network errors with a retry after 15 seconds, and some "ZBX_TCP_READ() timed out" messages. Meanwhile, continuous nc checks against ports 10050 and 161 (commands below) both work flawlessly. I see no disconnections with continuous nc commands or pings, so the network is definitely not the issue.
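`nc -z` only proves that the TCP handshake completes. It says nothing about whether the agent actually answers in time, and a "ZBX_TCP_READ() timed out" message suggests that step is the one failing. Below is a minimal sketch of a probe that sends `agent.ping` the way a passive check does and then waits for the reply. It assumes the Zabbix header framing used by 3.0 agents: "ZBXD", a 0x01 flags byte, and an 8-byte little-endian length. The function names `frame`, `unframe` and `probe` are hypothetical, and the 3-second timeout is an arbitrary choice. It needs Python 3.

```python
import socket
import struct
import sys

HEADER = b"ZBXD\x01"  # protocol signature "ZBXD" followed by the flags byte 0x01


def frame(payload: bytes) -> bytes:
    """Prefix a payload with the Zabbix header and its 8-byte little-endian length."""
    return HEADER + struct.pack("<Q", len(payload)) + payload


def unframe(packet: bytes) -> bytes:
    """Validate the Zabbix header on a reply and return the payload."""
    if not packet.startswith(HEADER) or len(packet) < 13:
        raise ValueError("reply does not start with a ZBXD header")
    (length,) = struct.unpack("<Q", packet[5:13])
    payload = packet[13:13 + length]
    if len(payload) != length:
        raise ValueError("reply is shorter than its declared length")
    return payload


def probe(host: str, port: int = 10050, key: str = "agent.ping",
          timeout: float = 3.0) -> str:
    """Ask a passive agent for one item and return its answer as text.

    Unlike `nc -z`, this also fails when the agent accepts the TCP
    connection but does not reply within `timeout` seconds.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(frame(key.encode()))
        chunks = []
        while True:
            data = sock.recv(4096)  # raises socket.timeout if the agent stays silent
            if not data:            # the agent closes the connection after replying
                break
            chunks.append(data)
    return unframe(b"".join(chunks)).decode()


if __name__ == "__main__":
    for host in sys.argv[1:]:
        try:
            print(host, probe(host))
        except (OSError, ValueError) as exc:  # refused, unreachable, timed out, bad reply
            print(host, "FAILED:", exc)
```

Run it in a loop against an affected host alongside the nc checks. Any "timed out" failures while nc keeps succeeding would point at the agent or the host rather than the network path.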
This Zabbix installation is about 4-5 years old: I first installed 2.0.6 and it is now on 3.0.13. Everything (the server plus 6 proxies) runs on CentOS 6.3/6.5 with MySQL as the backend.
Number of hosts monitored via the server and the proxies:
Server: ~1500
Proxies together: ~1500
Currently I am stuck on the following points:
1) Disconnected graphs on both SNMP-monitored items and Zabbix agent items; roughly 20+ are affected on the agent side and about 30 on the SNMP side.
2) SNMP traffic data is disconnected and shows less traffic than the router reports. For example, if the router says an interface is running at 300 Mb, Zabbix shows only ~100 Mb. This is a separate issue (the traffic item interval is 5 minutes).
3) "No data received" alerts on agents. I get 1 when I query agent.ping manually, yet Zabbix still raises the alerts, and Latest data also shows missing values (the agent.ping interval is 1 minute).
4) External scripts fail to run. I have some expect/shell/perl scripts that log in to routers and fetch data to show in Zabbix. These worked flawlessly on the previous version (2.2.12), but on 3.0.13 they fail: they either time out or hang when run on the console.
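One possible factor behind point 2 assumes the traffic items poll the 32-bit IF-MIB counters (ifInOctets/ifOutOctets) rather than the 64-bit ifHCInOctets/ifHCOutOctets. At 300 Mb/s, a 32-bit octet counter wraps roughly every 115 seconds. A 5-minute interval can therefore span more than one wrap, and the computed rate comes out far too low. The sketch below only illustrates that arithmetic, and the function names are hypothetical. It assumes the rate is derived from two samples with at most one wrap corrected. It is not a model of exactly how Zabbix 3.0 treats a counter that goes backwards.

```python
COUNTER32_MOD = 2 ** 32  # a Counter32 such as IF-MIB ifInOctets wraps at 2^32
COUNTER64_MOD = 2 ** 64  # a Counter64 such as ifHCInOctets


def wrap_seconds(rate_bps: float, modulus: int = COUNTER32_MOD) -> float:
    """Seconds until an octet counter wraps at a given line rate in bits/s."""
    return modulus / (rate_bps / 8)


def observed_rate_bps(true_rate_bps: float, interval_s: float,
                      modulus: int = COUNTER32_MOD) -> float:
    """Rate computed from two counter samples taken `interval_s` apart.

    The delta is taken modulo the counter size, i.e. at most one wrap is
    corrected; any additional full wraps between samples are lost.
    """
    true_octets = int(true_rate_bps / 8 * interval_s)
    prev = 0                              # first sample
    curr = true_octets % modulus          # what the counter reads one interval later
    delta = (curr - prev) % modulus       # single-wrap correction
    return delta * 8 / interval_s


if __name__ == "__main__":
    rate = 300e6    # 300 Mb/s, the figure quoted in point 2
    interval = 300  # 5-minute polling interval
    print(f"Counter32 wraps every {wrap_seconds(rate):.1f} s")
    print(f"Counter32, 5 min interval: {observed_rate_bps(rate, interval) / 1e6:.1f} Mb/s")
    print(f"Counter64, 5 min interval: "
          f"{observed_rate_bps(rate, interval, COUNTER64_MOD) / 1e6:.1f} Mb/s")
```

The 64-bit ifHC* counters require SNMPv2c or v3, because Counter64 does not exist in SNMPv1. With 32-bit counters, the alternative is a polling interval shorter than the wrap time.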
Any pointers would be greatly appreciated.
Thanks
Status of Zabbix:
Code:
Parameter                                            Value  Details
Number of hosts (enabled/disabled/templates)         5693   3018 / 2289 / 386
Number of items (enabled/disabled/not supported)     24641  22095 / 1513 / 1033
Number of triggers (enabled/disabled [problem/ok])   7277   6618 / 659 [373 / 6245]
Number of users (online)                             73     23
Required server performance, new values per second   84.56
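As I understand it, the "Required server performance" figure above is the sum of 1/interval over the enabled items on monitored hosts. The sketch below shows that arithmetic so it is clearer which intervals drive the load. The item mix in it is made up for illustration and is not this installation's inventory, and `required_nvps` is a hypothetical name.

```python
from typing import Mapping


def required_nvps(items_per_interval: Mapping[int, int]) -> float:
    """Expected new values per second.

    An item with update interval d seconds contributes 1/d values per
    second, so a bucket of n items sharing that interval contributes n/d.
    """
    return sum(count / interval for interval, count in items_per_interval.items())


if __name__ == "__main__":
    # Hypothetical item mix for illustration only, NOT this installation's data.
    mix = {60: 3000, 300: 8000, 3600: 5000, 86400: 1000}
    total_nvps = required_nvps(mix)
    print(f"{sum(mix.values())} items -> {total_nvps:.2f} new values per second")
    # Share of the load coming from each interval bucket.
    for interval, count in sorted(mix.items()):
        share = (count / interval) / total_nvps
        print(f"  {interval:>6} s: {count:>5} items, {share:6.1%} of NVPS")
```

In a mix like this, the 60-second items dominate the total even though they are a minority of the items. That is where shortening or lengthening intervals has the most effect.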
Zabbix queue:
Code:
Items                  5 seconds  10 seconds  30 seconds  1 minute  5 minutes  More than 10 minutes
Zabbix agent                  16          27           8        41         22                    90
Zabbix agent (active)          0           0           3        17         11                   174
Simple check                  82         127          14        24          0                     0
SNMPv1 agent                   0           0           0         0          0                     0
SNMPv2 agent                   8           6           1         0          4                    10
Proxy config:
Code:
Server=XX.XX.XX.XX
Hostname=Zabbix-Proxy
LogFile=/var/log/zabbix/zabbix_proxy.log
LogFileSize=300
DebugLevel=4
PidFile=/var/run/zabbix/zabbix_proxy.pid
DBName=zabbix
DBUser=zabbix
DBPassword=password
ProxyLocalBuffer=3
ProxyOfflineBuffer=4
ConfigFrequency=120
DataSenderFrequency=30
StartPollers=275
StartPollersUnreachable=120
StartTrappers=60
StartPingers=90
StartSNMPTrapper=1
HousekeepingFrequency=3
CacheSize=1G
StartDBSyncers=50
HistoryCacheSize=1G
HistoryIndexCacheSize=1G
Timeout=30
UnreachablePeriod=90
FpingLocation=/usr/local/sbin/fping
LogSlowQueries=300
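With Timeout=30, a check that hangs until the timeout can hold a poller for up to 30 seconds. The back-of-the-envelope sketch below uses Little's law: average busy pollers ≈ checks per second × average seconds each check holds a poller. It shows how a small share of hanging checks can eat up even StartPollers=275. The inputs are assumptions: 125 checks/s is borrowed from the "values processed" figure and treated as if all of it were passive checks. The 50 ms normal latency and the hanging fractions are made up. `busy_pollers` is a hypothetical name.

```python
def busy_pollers(checks_per_s: float, ok_latency_s: float,
                 timeout_s: float, timeout_fraction: float) -> float:
    """Average number of pollers busy at once (Little's law: L = lambda * W).

    A check that answers holds a poller for `ok_latency_s`; one that hangs
    holds it for the full `timeout_s` before the poller can move on.
    """
    mean_hold_s = ((1 - timeout_fraction) * ok_latency_s
                   + timeout_fraction * timeout_s)
    return checks_per_s * mean_hold_s


if __name__ == "__main__":
    rate = 125.0    # assumed passive checks per second on one proxy
    latency = 0.05  # assumed 50 ms for a normal, answered check
    for timeout in (30, 3):             # Timeout=30 as configured vs. a lower value
        for frac in (0.0, 0.02, 0.05):  # assumed share of checks that hang
            print(f"Timeout={timeout:>2}s, {frac:4.0%} hanging -> "
                  f"{busy_pollers(rate, latency, timeout, frac):6.1f} busy pollers")
```

Under these assumptions, the timeout term dominates as soon as a few percent of checks hang. Adding pollers then only buys headroom until the hanging share grows.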
Server config:
Code:
LogFile=/var/log/zabbix/zabbix_server.log
LogFileSize=500
DebugLevel=4
PidFile=/var/run/zabbix/zabbix_server.pid
DBName=zabbix
DBUser=zabbix
DBPassword=password
StartPollers=300
StartIPMIPollers=1
StartPollersUnreachable=150
StartTrappers=130
StartPingers=120
StartDiscoverers=10
StartSNMPTrapper=1
ListenIP=0.0.0.0
HousekeepingFrequency=2
MaxHousekeeperDelete=300
SenderFrequency=360
CacheSize=1G
CacheUpdateFrequency=300
StartDBSyncers=15
HistoryCacheSize=256M
HistoryIndexCacheSize=256M
TrendCacheSize=1G
ValueCacheSize=128M
Timeout=30
TrapperTimeout=180
UnreachablePeriod=600
UnavailableDelay=180
AlertScriptsPath=/etc/zabbix/alert.d/
FpingLocation=/usr/local/sbin/fping
LogSlowQueries=300
StartProxyPollers=2
ProxyDataFrequency=180
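Both configs set Timeout=30. As far as I know, external checks are bounded by that same Timeout: a script that runs longer is killed and the item fails. An expect script that waits for terminal input can also hang when no interactive console is attached. The sketch below runs a script outside Zabbix under a similar hard deadline, with stdin redirected from /dev/null. That redirect is an assumption about the daemon's environment, not something confirmed here. The function name, the file name `check_external.py` and the script path are placeholders. It needs Python 3.

```python
import subprocess
import sys
import time


def run_like_external_check(cmd, timeout_s=30.0):
    """Run `cmd` with a hard deadline and no interactive stdin.

    Returns (status, output, elapsed_seconds) where status is
    "ok", "error" (non-zero exit) or "timeout" (killed at the deadline).
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, stdin=subprocess.DEVNULL,
                              stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        # run() has already killed the child; keep whatever it printed so far.
        partial = (exc.output or b"").decode(errors="replace")
        return "timeout", partial, time.monotonic() - start
    status = "ok" if proc.returncode == 0 else "error"
    return status, proc.stdout.decode(errors="replace"), time.monotonic() - start


if __name__ == "__main__":
    # Usage (hypothetical): python3 check_external.py /path/to/script arg1 arg2 ...
    status, output, elapsed = run_like_external_check(sys.argv[1:], timeout_s=30.0)
    print(f"{status} after {elapsed:.1f}s")
    print(output)
```

If a script only succeeds when run from an interactive shell and times out here, it is likely waiting on a terminal or a prompt. That would be worth checking before looking at the Zabbix side.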
Zabbix server details:
Code:
[root@zbx_server ~]# free -g
             total       used       free     shared    buffers     cached
Mem:           125         86         39          0          0         60
-/+ buffers/cache:         25        100
Swap:            9          0          9
[root@zbx_server ~]#
CPU: 40 cores; MySQL database: ~170 GB
MySQL config:
Code:
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
symbolic-links=0
long_query_time = 10
log-queries-not-using-indexes=YES
innodb_lock_wait_timeout=500
innodb_locks_unsafe_for_binlog=1
innodb_file_per_table
innodb_flush_method=O_DIRECT
innodb_log_file_size=1G
innodb_buffer_pool_size=48G
innodb_file_per_table
max_allowed_packet = 128M
innodb_additional_mem_pool_size = 30M
innodb_thread_concurrency = 8
key_buffer_size = 60M
max_connections=700
table_cache=4096
tmp_table_size = 32M
thread_cache_size = 64
query_cache_limit=64M
thread_cache_size=512
read_buffer_size=2M
log-bin=mysql-bin
binlog-do-db=zabbix
server-id=9
expire_logs_days=3
max_binlog_size=100M

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

[client]
user=root
password=password
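The MySQL config above sets thread_cache_size twice (64, then 512) and innodb_file_per_table twice. MySQL uses the last occurrence of an option, so the effective thread_cache_size is 512. Below is a small sketch that lists such duplicates per section. It assumes a simple key=value option file with no !include handling. It treats "-" and "_" in option names as interchangeable, as MySQL does. `duplicate_options` and the file name are hypothetical.

```python
import sys
from collections import defaultdict


def duplicate_options(text):
    """Return {(section, option): [values...]} for options set more than once.

    Option names are normalised the way MySQL reads them: '-' and '_'
    are interchangeable and whitespace around '=' is ignored.
    """
    seen = defaultdict(list)
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        name, _, value = line.partition("=")  # bare flags get an empty value
        key = name.strip().replace("-", "_")
        seen[(section, key)].append(value.strip())
    return {k: v for k, v in seen.items() if len(v) > 1}


if __name__ == "__main__":
    # Usage (hypothetical): python3 mycnf_dupes.py /etc/my.cnf
    with open(sys.argv[1]) as fh:
        for (section, key), values in duplicate_options(fh.read()).items():
            print(f"[{section}] {key}: {values} -> effective {values[-1]!r}")
```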
nc commands used for the connectivity checks:
Code:
nc -z host_ip 10050
nc -z host_ip 161