100% busy poller processes + cannot send list of active checks

  • Libiana
    Junior Member
    • Jan 2018
    • 1

    #1

    Hello,

    This topic might be a mix of a few different problems, but at this point I'm not sure how to separate them. About 2 weeks ago I started receiving repeated alerts about busy poller processes. According to the graph, the pollers are at 97-100% busy basically all the time, with occasional drops to 70%. I'm also seeing the unreachable pollers at around 65-80% busy, which is already down from 80-90%. I've already tried upgrading the server (now on 3.4.5, which seems to have a problem with agent connections after a server restart, as mentioned in another topic here) and modifying some of the server parameters. The server has more resources than it can use, and the database is located on a separate machine. This is how the configuration looks now:

    Code:
    StartPollers=750
    StartIPMIPollers=1
    StartPollersUnreachable=80
    StartTrappers=20
    StartPingers=96
    StartDiscoverers=40
    StartHTTPPollers=10
    StartTimers=40
    StartEscalators=1
    StartVMwareCollectors=5
    VMwareFrequency=600
    VMwarePerfFrequency=1800
    VMwareCacheSize=16M
    VMwareTimeout=10
    StartSNMPTrapper=1
    SenderFrequency=30
    CacheSize=1024M
    CacheUpdateFrequency=60
    StartDBSyncers=20
    HistoryCacheSize=128M
    HistoryIndexCacheSize=128M
    TrendCacheSize=64M
    ValueCacheSize=128M
    Timeout=10
    TrapperTimeout=300
    UnreachablePeriod=120
    UnavailableDelay=60
    UnreachableDelay=15
    StartProxyPollers=5
    ProxyConfigFrequency=3600
    ProxyDataFrequency=1
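To get a feel for how many worker processes a configuration like this asks the server to fork, the `Start*` values can simply be added up. The following is only a rough sketch in Python: the helper names `parse_config` and `start_process_counts` are made up for illustration, and `SAMPLE` is a shortened copy of the parameters above, not the full file.

```python
def parse_config(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def start_process_counts(params):
    """Keep only the Start* parameters whose value is a plain integer."""
    return {
        key: int(value)
        for key, value in params.items()
        if key.startswith("Start") and value.isdigit()
    }


# Shortened copy of the configuration above (illustrative subset only).
SAMPLE = """
StartPollers=750
StartPollersUnreachable=80
StartPingers=96
StartDBSyncers=20
CacheSize=1024M
"""

counts = start_process_counts(parse_config(SAMPLE))
total = sum(counts.values())
print(counts)
print("requested worker processes in this subset:", total)
```

Running the same helpers against the real configuration file instead of `SAMPLE` would give the total for the whole setup.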
    I've looked through the server log and it's basically full of messages like these:
    Code:
     12323:20180111:095228.800 SNMP agent item "ifInOctets.[ge-0/0/24]" on host "VC_PPD-1-1" failed: first network error, wait for 15 seconds
     12810:20180111:095228.813 resuming SNMP agent checks on host "Switch A": connection restored
     12813:20180111:095228.816 resuming SNMP agent checks on host "Switch B": connection restored
     12804:20180111:095228.817 resuming SNMP agent checks on host "Switch C": connection restored
     12776:20180111:095228.819 resuming SNMP agent checks on host "Switch D": connection restored
     12791:20180111:095228.855 resuming SNMP agent checks on host "Switch E": connection restored
     12809:20180111:095235.014 resuming SNMP agent checks on host "S3-S4": connection restored
     12364:20180111:095243.280 SNMP agent item "1.3.6.1.4.1.2636.3.3.1.1.6.[524]" on host "VC_PPD-1-1" failed: another network error, wait for 15 seconds
     12290:20180111:095251.680 SNMP agent item "ifOutQLen[8]" on host "Switch A" failed: first network error, wait for 15 seconds
     12786:20180111:095258.026 resuming SNMP agent checks on host "VC_PPD-1-1": connection restored
     12541:20180111:095301.495 SNMP agent item "ifOutErrors[XGigabitEthernet0/0/3]" on host "Switch D" failed: first network error, wait for 15 seconds
     12168:20180111:095304.997 SNMP agent item "ifOutErrors[GigabitEthernet0/0/16]" on host "Switch E" failed: first network error, wait for 15 seconds
     12285:20180111:095305.622 SNMP agent item "ifOutErrors[GigabitEthernet0/0/18]" on host "Switch B" failed: first network error, wait for 15 seconds
     12655:20180111:095307.071 SNMP agent item "ifInErrors[NULL0]" on host "Switch C" failed: first network error, wait for 15 seconds
     11997:20180111:095315.611 cannot connect to proxy "PaaS proxy": cannot connect to [[185.33.38.196]:10051]: [110] Connection timed out
     12857:20180111:095324.800 cannot send list of active checks to "172.18.81.140": host [compute02] not found
     12066:20180111:095327.146 SNMP agent item "ifOutOctets.[pimd]" on host "VC_PPD-1-1" failed: first network error, wait for 15 seconds
     12818:20180111:095327.222 resuming SNMP agent checks on host "Switch A": connection restored
     12777:20180111:095330.514 resuming SNMP agent checks on host "Switch D": connection restored
     12782:20180111:095330.521 resuming SNMP agent checks on host "Switch B": connection restored
     12805:20180111:095330.521 resuming SNMP agent checks on host "Switch E": connection restored
     12793:20180111:095330.526 resuming SNMP agent checks on host "Switch C": connection restored
     12862:20180111:095333.203 cannot send list of active checks to "172.18.20.111": host [redmine] not found
     12850:20180111:095339.912 cannot send list of active checks to "172.18.81.159": host [network01] not found
     12850:20180111:095341.517 cannot send list of active checks to "172.18.81.162": host [lb02] not found
     11999:20180111:095341.981 sending configuration data to proxy "C4C proxy" at "185.33.38.74", datalen 146458
     12847:20180111:095345.367 cannot send list of active checks to "172.18.81.124": host [sql02] not found
     12856:20180111:095345.693 cannot send list of active checks to "172.18.20.112": host [vm06] not found
     12861:20180111:095349.260 cannot send list of active checks to "172.18.20.72": host [horizon] not found
     12817:20180111:095349.385 resuming SNMP agent checks on host "VC_PPD-1-1": connection restored
    What is annoying about those:
    1. I don't really have access to the hosts (or, of course, to the agents installed on them) named in the "cannot send list of active checks" lines, and those aren't hosts I'm planning to monitor for now.
    2. PaaS proxy and all hosts connected through it are disabled, so I don't get why it's still being checked.
    3. Only those switches, and occasionally other hosts from the last page of the hosts list, are having connection problems and, judging by the log file, it happens basically every minute.
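To separate the different message types listed above, the log can be tallied per host. This is a minimal sketch in Python, assuming the line layout seen in the excerpt (`PID:YYYYMMDD:HHMMSS.mmm message`). The function name `summarize` and the regular expressions are illustrative and only cover the two message types that appear in the excerpt.

```python
import re
from collections import Counter

# PID:date:time followed by the message text, as in the excerpt above.
LINE_RE = re.compile(r'^\s*(\d+):(\d{8}):(\d{6}\.\d{3})\s+(.*)$')
# "first network error" / "another network error" lines, capturing the host.
NET_ERR_RE = re.compile(r'on host "([^"]+)" failed: (?:first|another) network error')
# "cannot send list of active checks" lines, capturing source IP and host name.
NOT_FOUND_RE = re.compile(
    r'cannot send list of active checks to "([^"]+)": host \[([^\]]+)\] not found'
)


def summarize(lines):
    """Count network errors per host and collect unknown agent host names."""
    network_errors = Counter()
    unknown_agents = {}  # host name -> source IP
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        message = match.group(4)
        err = NET_ERR_RE.search(message)
        if err:
            network_errors[err.group(1)] += 1
            continue
        not_found = NOT_FOUND_RE.search(message)
        if not_found:
            unknown_agents[not_found.group(2)] = not_found.group(1)
    return network_errors, unknown_agents


# A few lines copied from the log excerpt above.
SAMPLE_LOG = [
    ' 12323:20180111:095228.800 SNMP agent item "ifInOctets.[ge-0/0/24]" on host "VC_PPD-1-1" failed: first network error, wait for 15 seconds',
    ' 12810:20180111:095228.813 resuming SNMP agent checks on host "Switch A": connection restored',
    ' 12364:20180111:095243.280 SNMP agent item "1.3.6.1.4.1.2636.3.3.1.1.6.[524]" on host "VC_PPD-1-1" failed: another network error, wait for 15 seconds',
    ' 12290:20180111:095251.680 SNMP agent item "ifOutQLen[8]" on host "Switch A" failed: first network error, wait for 15 seconds',
    ' 12857:20180111:095324.800 cannot send list of active checks to "172.18.81.140": host [compute02] not found',
    ' 12862:20180111:095333.203 cannot send list of active checks to "172.18.20.111": host [redmine] not found',
]

errors, unknown = summarize(SAMPLE_LOG)
for host, count in errors.most_common():
    print(host, count)
for name, ip in sorted(unknown.items()):
    print(name, ip)
```

Pointing `summarize` at the lines of the full log file rather than `SAMPLE_LOG` would show which hosts produce most of the network errors and which host names arrive from agents the server does not know.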

    I appreciate any advice and help. Thanks in advance.