Hello,
this topic might be a mix of a few different problems, but at this point I'm not sure how to separate them. About 2 weeks ago I started repeatedly receiving alerts about busy poller processes. Looking at the graph, the pollers are basically 97-100% busy all the time, with occasional drops to 70%. I'm also seeing the unreachable pollers at around 65-80% busy, which I've already brought down from 80-90%. I've already tried upgrading the server (now running 3.4.5, which seems to have a problem with agent connections after restarting the server, as mentioned in another topic here) and modifying some of the server parameters. The server has more resources than it can use, and the database is located on a separate machine. This is how the configuration looks now:
Code:
StartPollers=750
StartIPMIPollers=1
StartPollersUnreachable=80
StartTrappers=20
StartPingers=96
StartDiscoverers=40
StartHTTPPollers=10
StartTimers=40
StartEscalators=1
StartVMwareCollectors=5
VMwareFrequency=600
VMwarePerfFrequency=1800
VMwareCacheSize=16M
VMwareTimeout=10
StartSNMPTrapper=1
SenderFrequency=30
CacheSize=1024M
CacheUpdateFrequency=60
StartDBSyncers=20
HistoryCacheSize=128M
HistoryIndexCacheSize=128M
TrendCacheSize=64M
ValueCacheSize=128M
Timeout=10
TrapperTimeout=300
UnreachablePeriod=120
UnavailableDelay=60
UnreachableDelay=15
StartProxyPollers=5
ProxyConfigFrequency=3600
ProxyDataFrequency=1
I've tried looking through the server log and it's basically full of messages like these:
Code:
12323:20180111:095228.800 SNMP agent item "ifInOctets.[ge-0/0/24]" on host "VC_PPD-1-1" failed: first network error, wait for 15 seconds
12810:20180111:095228.813 resuming SNMP agent checks on host "Switch A": connection restored
12813:20180111:095228.816 resuming SNMP agent checks on host "Switch B": connection restored
12804:20180111:095228.817 resuming SNMP agent checks on host "Switch C": connection restored
12776:20180111:095228.819 resuming SNMP agent checks on host "Switch D": connection restored
12791:20180111:095228.855 resuming SNMP agent checks on host "Switch E": connection restored
12809:20180111:095235.014 resuming SNMP agent checks on host "S3-S4": connection restored
12364:20180111:095243.280 SNMP agent item "1.3.6.1.4.1.2636.3.3.1.1.6.[524]" on host "VC_PPD-1-1" failed: another network error, wait for 15 seconds
12290:20180111:095251.680 SNMP agent item "ifOutQLen[8]" on host "Switch A" failed: first network error, wait for 15 seconds
12786:20180111:095258.026 resuming SNMP agent checks on host "VC_PPD-1-1": connection restored
12541:20180111:095301.495 SNMP agent item "ifOutErrors[XGigabitEthernet0/0/3]" on host "Switch D" failed: first network error, wait for 15 seconds
12168:20180111:095304.997 SNMP agent item "ifOutErrors[GigabitEthernet0/0/16]" on host "Switch E" failed: first network error, wait for 15 seconds
12285:20180111:095305.622 SNMP agent item "ifOutErrors[GigabitEthernet0/0/18]" on host "Switch B" failed: first network error, wait for 15 seconds
12655:20180111:095307.071 SNMP agent item "ifInErrors[NULL0]" on host "Switch C" failed: first network error, wait for 15 seconds
11997:20180111:095315.611 cannot connect to proxy "PaaS proxy": cannot connect to [[185.33.38.196]:10051]: [110] Connection timed out
12857:20180111:095324.800 cannot send list of active checks to "172.18.81.140": host [compute02] not found
12066:20180111:095327.146 SNMP agent item "ifOutOctets.[pimd]" on host "VC_PPD-1-1" failed: first network error, wait for 15 seconds
12818:20180111:095327.222 resuming SNMP agent checks on host "Switch A": connection restored
12777:20180111:095330.514 resuming SNMP agent checks on host "Switch D": connection restored
12782:20180111:095330.521 resuming SNMP agent checks on host "Switch B": connection restored
12805:20180111:095330.521 resuming SNMP agent checks on host "Switch E": connection restored
12793:20180111:095330.526 resuming SNMP agent checks on host "Switch C": connection restored
12862:20180111:095333.203 cannot send list of active checks to "172.18.20.111": host [redmine] not found
12850:20180111:095339.912 cannot send list of active checks to "172.18.81.159": host [network01] not found
12850:20180111:095341.517 cannot send list of active checks to "172.18.81.162": host [lb02] not found
11999:20180111:095341.981 sending configuration data to proxy "C4C proxy" at "185.33.38.74", datalen 146458
12847:20180111:095345.367 cannot send list of active checks to "172.18.81.124": host [sql02] not found
12856:20180111:095345.693 cannot send list of active checks to "172.18.20.112": host [vm06] not found
12861:20180111:095349.260 cannot send list of active checks to "172.18.20.72": host [horizon] not found
12817:20180111:095349.385 resuming SNMP agent checks on host "VC_PPD-1-1": connection restored
What is annoying about those:
1. I don't really have access to the hosts (or, of course, the agents installed on them) named in the "cannot send list of active checks" lines, and those aren't hosts I'm planning to monitor for now.
2. PaaS proxy and all hosts connected through this proxy are disabled, so I don't get why it's still getting checked.
3. Only those switches, and occasionally other hosts from the last hosts page, are having connection problems and, judging by the log file, it happens basically every minute.
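To put numbers on points 1 and 3, something like the following could tally the log entries instead of eyeballing them. This is only a rough sketch: it assumes every entry sits on its own line in the `PID:YYYYMMDD:HHMMSS.mmm message` shape seen in the excerpt above, the function name `summarize` is made up for illustration, and the log path has to be passed on the command line because its location depends on the installation.

```python
import re
import sys
from collections import Counter

# Patterns are modelled only on the message shapes in the log excerpt above;
# other server versions may word these messages differently.
LINE_RE = re.compile(r'^\s*(\d+):(\d{8}):(\d{6})\.\d{3}\s+(.*)$')
SNMP_FAIL_RE = re.compile(
    r'SNMP agent item "[^"]*" on host "([^"]+)" failed: (?:first|another) network error'
)
NOT_FOUND_RE = re.compile(
    r'cannot send list of active checks to "([^"]+)": host \[([^\]]+)\] not found'
)


def summarize(lines):
    """Count SNMP network errors per (host, minute) and unknown active agents."""
    snmp_errors = Counter()    # (host, "HH:MM") -> number of network errors
    unknown_hosts = Counter()  # (host name, source IP) -> number of requests
    for line in lines:
        entry = LINE_RE.match(line)
        if not entry:
            continue  # skip anything that does not look like a log entry
        _pid, _date, time, message = entry.groups()
        minute = f"{time[:2]}:{time[2:4]}"
        failure = SNMP_FAIL_RE.search(message)
        if failure:
            snmp_errors[(failure.group(1), minute)] += 1
            continue
        not_found = NOT_FOUND_RE.search(message)
        if not_found:
            unknown_hosts[(not_found.group(2), not_found.group(1))] += 1
    return snmp_errors, unknown_hosts


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log file path is supplied by the caller, e.g. the server's LogFile setting.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        snmp_errors, unknown_hosts = summarize(log)
    for (host, minute), count in sorted(snmp_errors.items()):
        print(f"{minute}  {host}: {count} SNMP network error(s)")
    for (host, ip), count in unknown_hosts.most_common():
        print(f"{host} ({ip}): {count} active check request(s) for an unknown host")
```

Run against the server log, this should show whether the SNMP errors really hit the same switches every minute, and which unmonitored agents keep asking for active checks.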
I appreciate any advice and help. Thanks in advance.