icmppingsec / icmpping issue - fake/false alerts generated.

  • zabbixfk
    Senior Member
    • Jun 2013
    • 256

    #1

    Hello,

    I am having a tough time getting icmppingsec/icmpping to work for the hosts I have added. The odd part is that Zabbix reports a host as unreachable (via an alert), yet when I right-click the host on the dashboard and select ping, it is reachable.

    Problem statement: checking device reachability in Zabbix via icmppingsec or icmpping. The reason for using icmppingsec is that it also returns the response time in seconds.

    Utility installed: fping, with all permissions set properly.

    Item:
    Code:
    Key : icmppingsec[,5,,,300,]
    Type of information: float
    Units: s
    Update Interval: 180
    History : 7
    Trends: 180
    This item configuration seems to be working: I get the ping response time with icmppingsec and a value of 1/0 with icmpping.

    Trigger:
    Code:
    {WAN-IP-Ping:icmppingsec[,5,,,300,].last(0)}=0
    The problem: all of a sudden Zabbix says hosts are down, i.e. the icmppingsec or icmpping value is zero. On the next icmppingsec check the value is fine again (non-zero) and the trigger goes from PROBLEM back to OK.
    During the alert, if I check reachability, the device is reachable, and BGP on the device looks fine, so there are no flaps.

    How do I debug this kind of issue? All I see in the Zabbix logs is this:

    Code:
     88010:20180116:162256.645 In add_icmpping_item() addr:'172.X.X.X' count:5 interval:0 size:0 timeout:300
     88010:20180116:162256.646 In add_pinger_host() addr:'172.X.X.X'
     88010:20180116:162256.646     172.X.X.X
     88010:20180116:162301.049 read line [172.X.X.X : - - - - -]
     88010:20180116:162301.050 host [172.X.X.X] cnt=5 rcv=0 min=0.000000 max=0.000000 sum=0.000000
    How do I investigate this kind of issue? According to the logs, Zabbix could not reach the host, but from a WAN reachability perspective the device never went down (no BGP flaps) and there was no problem reaching it via ping.
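
    As a starting point for debugging, the pinger summary lines in the debug log can be collected per host to see when and how often a check ended with zero replies. This is only a sketch: the regular expression is written against the exact log excerpt shown above (DebugLevel=4), and the function name `total_loss_events` is hypothetical, not part of Zabbix.

```python
import re
from collections import defaultdict

# Matches the pinger summary line from the log excerpt above, e.g.
#  88010:20180116:162301.050 host [172.X.X.X] cnt=5 rcv=0 min=... max=... sum=...
# The layout is taken from that excerpt only; other versions may differ.
SUMMARY = re.compile(
    r"^\s*(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6})\.\d+\s+"
    r"host \[(?P<host>[^\]]+)\]\s+cnt=(?P<cnt>\d+)\s+rcv=(?P<rcv>\d+)"
)


def total_loss_events(lines):
    """Return {host: [(date, time, cnt), ...]} for checks where rcv == 0."""
    events = defaultdict(list)
    for line in lines:
        m = SUMMARY.match(line)
        if m and int(m.group("rcv")) == 0:
            events[m.group("host")].append(
                (m.group("date"), m.group("time"), int(m.group("cnt")))
            )
    return dict(events)


sample = [
    " 88010:20180116:162301.049 read line [172.X.X.X : - - - - -]",
    " 88010:20180116:162301.050 host [172.X.X.X] cnt=5 rcv=0 "
    "min=0.000000 max=0.000000 sum=0.000000",
]
print(total_loss_events(sample))
```

    Feeding the real log file to the same function (for example `total_loss_events(open(path))`, with `path` pointing at the server or proxy log) would show whether the zero-reply checks cluster on particular hosts, proxies, or times of day.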

    PS:
    1) Would increasing the packet count from 5 to 10 help?
    2) Would decreasing the interval from 3 minutes to 1 minute help? I know this would choke the network and need tuning on the Zabbix end. I have close to 3.5K devices to poll, with 2 proxies plus one server doing it.
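
    To put rough numbers on the two options, the average ICMP request rate can be estimated from the figures in this post. This is back-of-the-envelope arithmetic only, assuming 3500 hosts with one ping item each and checks spread evenly over the interval; `probe_rate` is a hypothetical helper.

```python
def probe_rate(hosts, packets_per_check, interval_s):
    """Average ICMP echo requests per second across all hosts."""
    return hosts * packets_per_check / interval_s


HOSTS = 3500  # "close to 3.5K devices" from the post (assumed one ping item each)

# (packets per check, update interval in seconds)
for packets, interval in [(5, 180), (10, 180), (5, 60)]:
    rate = probe_rate(HOSTS, packets, interval)
    print(f"{packets} packets every {interval}s: {rate:.1f} requests/s")
```

    Under those assumptions, doubling the packet count doubles the request rate and cutting the interval to 1 minute triples it, in total across the server and both proxies.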

    Server Conf:
    Code:
    LogFile=/var/log/zabbix/zabbix_server.log
    LogFileSize=300
    DebugLevel=4
    PidFile=/var/run/zabbix/zabbix_server.pid
    DBName=zabbix
    DBUser=XXXXX
    DBPassword=XXXXX
    StartPollers=185
    StartIPMIPollers=1
    StartPollersUnreachable=75
    StartTrappers=65
    StartPingers=95
    StartDiscoverers=1
    StartSNMPTrapper=1
    ListenIP=0.0.0.0
    HousekeepingFrequency=2
    MaxHousekeeperDelete=300
    SenderFrequency=360
    CacheSize=1G
    CacheUpdateFrequency=300
    StartDBSyncers=15
    HistoryCacheSize=256M
    HistoryIndexCacheSize=256M
    TrendCacheSize=1G
    ValueCacheSize=128M
    Timeout=30
    TrapperTimeout=180
    UnreachablePeriod=600
    UnavailableDelay=180
    AlertScriptsPath=/etc/zabbix/alert.d/
    FpingLocation=/usr/local/sbin/fping
    LogSlowQueries=300
    StartProxyPollers=2
    ProxyDataFrequency=180
    Proxy1
    Code:
    Server=172.X.X.X
    Hostname=zbx-proxy1
    LogFile=/var/log/zabbix/zabbix_proxy.log
    LogFileSize=325
    DebugLevel=4
    PidFile=/var/run/zabbix/zabbix_proxy.pid
    DBName=zabbix
    DBUser=XXXXXX
    DBPassword=XXXXXXX
    ProxyLocalBuffer=4
    ProxyOfflineBuffer=4
    ConfigFrequency=120
    DataSenderFrequency=10
    StartPollers=200
    StartPollersUnreachable=100
    StartTrappers=45
    StartPingers=85
    StartSNMPTrapper=1
    CacheSize=1G
    StartDBSyncers=40
    HistoryCacheSize=1G
    HistoryIndexCacheSize=128M
    Timeout=30
    UnreachablePeriod=80
    FpingLocation=/usr/local/sbin/fping
    LogSlowQueries=3000
    Proxy2
    Code:
    Server=172.X.X.X
    Hostname=zbx-proxy2
    LogFile=/var/log/zabbix/zabbix_proxy.log
    LogFileSize=300
    DebugLevel=4
    PidFile=/var/run/zabbix/zabbix_proxy.pid
    DBName=zabbix
    DBUser=XXXXX
    DBPassword=XXXXX
    ProxyLocalBuffer=3
    ProxyOfflineBuffer=4
    ConfigFrequency=120
    DataSenderFrequency=30
    StartPollers=215
    StartPollersUnreachable=85
    StartTrappers=40
    StartPingers=80
    StartSNMPTrapper=1
    HousekeepingFrequency=3
    CacheSize=1G
    StartDBSyncers=50
    HistoryCacheSize=1G
    HistoryIndexCacheSize=1G
    Timeout=30
    UnreachablePeriod=90
    FpingLocation=/usr/local/sbin/fping
    LogSlowQueries=300
    Any pointers on how to bring down these kinds of alerts would be greatly appreciated.

    Thanks
  • allexpetrov
    Senior Member
    Zabbix Certified Trainer
    Zabbix Certified Specialist, Zabbix Certified Professional
    • May 2017
    • 361

    #2
    Hi there,

    Which Zabbix server version are you using? Your trigger is quite sensitive: it goes into the PROBLEM state on every last=0, and a single zero value can be perfectly normal in networking.

    Try using the min/max functions over the last 3-5 minutes or the last 3-5 checks to make the trigger less sensitive and, if you are using a Zabbix version below 3.2, think about using hysteresis.

    Regards,
    Aleksejs!

    • zabbixfk
      Senior Member
      • Jun 2013
      • 256

      #3
      Thanks for the reply.

      I am using Zabbix server 3.0.13, and same version of proxies.

      - I have to use last()=0 as the trigger because reachability is critical for the devices I monitor. If a device goes down, I need to know before the site does; that is the idea, and hence the trigger is set to last()=0.

      - The problem is that Zabbix only says the device is down at the moment it is checked, whereas right-clicking on the dashboard shows the device as reachable. In other words, on the next poll the device is reachable again.

      I would need suggestions on how to make these alerts go away:
      - Would increasing the packet count help (8 or 10 packets instead of 5)?
      - Would it help to reduce the polling interval from 3 minutes to 1 minute?
      - What would be the impact on monitoring if the polling interval is reduced to 1 minute?

      Does anyone know how the result is arrived at when icmppingsec / icmpping returns 0? The manual says that icmppingsec returns zero if the timeout is reached. What I don't quite understand is:
      - with 5 packets and a 300 ms timeout, what are the possible ways for the result to be zero?
      - did all 5 packets fail? Did all 5 take longer than the 300 ms timeout? Did only a portion of the packets fail?
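
      One way to reason about this question: the debug log line `read line [172.X.X.X : - - - - -]` looks like fping per-packet output, where each `-` is a probe for which no reply was counted. The sketch below rests on an assumption drawn from the item documentation, not from the Zabbix source: icmpping is 1 when at least one reply is counted, and icmppingsec is the average reply time in seconds, or 0 when no reply is counted. The function name `summarize` is hypothetical.

```python
def summarize(fping_line):
    """Derive cnt/rcv and the assumed item values from one fping -C style line.

    Example input: "172.X.X.X : - - - - -" (times are in milliseconds,
    "-" marks a probe with no reply counted).
    """
    host, _, results = fping_line.partition(" : ")
    tokens = results.split()
    times_ms = [float(t) for t in tokens if t != "-"]
    cnt, rcv = len(tokens), len(times_ms)
    return {
        "host": host.strip(),
        "cnt": cnt,
        "rcv": rcv,
        # Assumption: reachable if at least one reply was counted.
        "icmpping": 1 if rcv > 0 else 0,
        # Assumption: average of counted replies, converted to seconds.
        "icmppingsec": (sum(times_ms) / rcv) / 1000.0 if rcv else 0.0,
    }


print(summarize("172.X.X.X : - - - - -"))
print(summarize("172.X.X.X : 13.0 - 18.0 - -"))
```

      If that assumption holds, a partial loss still yields a non-zero value, so a 0 means none of the 5 probes was counted as answered. Whether a reply slower than 300 ms is treated as missing depends on how fping applies its timeout, which this sketch does not model.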

      - Is there any way that, when the item (icmppingsec/icmpping) returns 0, another poll happens immediately, and the trigger is raised only if that one is also zero? This would apply only to devices that return 0 on the first ping, not to all of them, and without waiting for the next poll from the template (which has a 3-minute polling interval). An email action would then be initiated based on this trigger.

      Any pointers to solve this would be greatly appreciated.
      Thanks.

      • zabbixfk
        Senior Member
        • Jun 2013
        • 256

        #4
        *bump*.

        I am kind of stuck figuring out what is happening for some of the hosts.
        Can someone explain how icmppingsec works and what the best configuration is?

        P.S.: If I want to check that the last 2 values are zero, would max(#2) or min(#2) help, or should I use last(#1)=0 and last(#2)=0?
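
        The three candidates can be compared on a made-up history. The sketch below mimics the trigger functions in plain Python (it is not Zabbix code, the sample values are hypothetical, and it relies on ping values never being negative).

```python
def last(values, n):
    """n-th most recent value (1 = newest), mimicking last(#n)."""
    return values[-n]


def max_n(values, n):
    """Largest of the n most recent values, mimicking max(#n)."""
    return max(values[-n:])


def min_n(values, n):
    """Smallest of the n most recent values, mimicking min(#n)."""
    return min(values[-n:])


# Hypothetical icmppingsec history in seconds, oldest to newest.
history = [0.013, 0.0, 0.018, 0.0, 0.0]

for end in range(2, len(history) + 1):
    window = history[:end]
    both_zero = last(window, 1) == 0 and last(window, 2) == 0
    print(
        window[-2:],
        "last(#1)=0 and last(#2)=0:", both_zero,
        "| max(#2)=0:", max_n(window, 2) == 0,
        "| min(#2)=0:", min_n(window, 2) == 0,
    )
```

        For non-negative values, max(#2)=0 is true exactly when both of the last two values are zero, so it matches the two-last() form. min(#2)=0 is already true when only one of the two values is zero, which would not make the trigger any less sensitive.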

        Thanks

        • zabbixfk
          Senior Member
          • Jun 2013
          • 256

          #5
          I was able to do it with last(#2)=0 and last(0)=0.

          But the new problem is that devices show last values of 0,0,1,1,1 or 0,0,13ms,18ms etc.
          It looks like there is some underlying issue with icmppingsec/icmpping.
          Now that the trigger checks the last two values ( last(#2)=0 and last(0)=0 ), I am seeing the last two values become zero even though the devices are reachable via the command line or the dashboard ping utility.

          I am starting to lose trust in the tool now.

          Thanks.
