SNMP uptime overflow after 497 days

  • Ultrasonic
    Junior Member
    • Sep 2015
    • 8

    #1

    SNMP uptime overflow after 497 days

    Hello.
    From what I have observed, Zabbix has an overflow problem with high uptime values on SNMP devices. Below are the values around the point where the overflow occurs:

    2016-08-16 18:47:14 217
    2016-08-16 18:46:14 157
    2016-08-16 18:45:14 97
    2016-08-16 18:44:14 36
    2016-08-16 18:43:14 42949650
    2016-08-16 18:42:14 42949590
    2016-08-16 18:41:15 42949530

    After 497 days of uptime the counter overflows and Zabbix reports a "device reboot" alarm.

    Is there any workaround? I have many devices where the overflow occurs often (devices with more than 1500 days of uptime), so I get many false alarms.
  • andris
    Zabbix developer
    • Feb 2012
    • 228

    #2
    It seems that somewhere the uptime is stored as a 32-bit number, which is too small for long uptimes. Going to 64 bits could be a solution.
    Does your monitored SNMP device report the correct uptime after 497 days with other tools, or is it only 32-bit aware? Is your Zabbix 64-bit?


    • kloczek
      Senior Member
      • Jun 2006
      • 1771

      #3
      Originally posted by Ultrasonic
      2016-08-16 18:43:14 42949650
      $ echo 42949530; echo "2^32" | bc
      42949530
      4294967296

      From http://www.alvestrand.no/objectid/1.3.6.1.2.1.1.3.html
      Code:
       OID description:
      
      sysUpTime OBJECT-TYPE
                    SYNTAX  TimeTicks
                    ACCESS  read-only
                    STATUS  mandatory
                    DESCRIPTION
                            "The time (in hundredths of a second) since the
                            network management portion of the system was last
                            re-initialized."
                    ::= { system 3 }
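      So it looks like everything is OK: 2^32 hundredths of a second is 42949672.96 seconds, which is exactly where the values above wrap.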
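The arithmetic behind the 497-day figure can be sketched roughly as follows. This assumes sysUpTime is a 32-bit TimeTicks counter in hundredths of a second and that the values in post #1 are already scaled to seconds; the helper name `next_sample` is only illustrative.

```python
# Assumption: sysUpTime is a 32-bit TimeTicks counter (hundredths of a second).
TICKS_PER_SECOND = 100
WRAP_TICKS = 2 ** 32

wrap_seconds = WRAP_TICKS / TICKS_PER_SECOND   # 42949672.96
wrap_days = wrap_seconds / 86400               # roughly 497.1


def next_sample(uptime_seconds: float, poll_interval: float) -> float:
    """Uptime expected at the next poll, wrapping like the 32-bit counter."""
    return (uptime_seconds + poll_interval) % wrap_seconds


print(f"wraps after {wrap_seconds:.2f} s = {wrap_days:.1f} days")
# Last value before the wrap in post #1, polled again 60 s later:
print(f"expected next value: {next_sample(42949650, 60):.2f} s")
```

The expected next value comes out at about 37 seconds, which is close to the 36 actually recorded at 18:44:14, so the samples are consistent with a counter wrap rather than a reboot.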
      http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
      https://kloczek.wordpress.com/
      zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
      My zabbix templates https://github.com/kloczek/zabbix-templates


      • Ultrasonic
        Junior Member
        • Sep 2015
        • 8

        #4
        I don't know whether this Zabbix platform is 64-bit or 32-bit; I don't have administrative privileges on the platform cluster. The Zabbix version is 2.4.7.

        It looks like a 32-bit value problem, but what can we do about it?

        The only value types I see in the Zabbix item configuration are "numeric unsigned", "numeric float", "log" and "text".

        These are mostly Cisco devices.


        • kloczek
          Senior Member
          • Jun 2006
          • 1771

          #5
          It has nothing to do with Zabbix.
          I've quoted the definition of sysUpTime from the SNMPv2 MIB. You cannot read data from an SNMP agent that the agent does not provide.
          If the uptime of another device read over SNMP is not affected by the 497.1-day maximum interval, it means your monitoring is not reading the SNMPv2-MIB::sysUpTime OID for that device. Which one exactly? You can check that in your monitoring.
          IIRC another, Cisco-specific MIB defines an OID that stores the uptime in a 64-bit counter as a number of seconds. Just try to google it.
          Last edited by kloczek; 18-08-2016, 15:32.

          • Ultrasonic
            Junior Member
            • Sep 2015
            • 8

            #6
            You are right, I came to the same conclusion.

            Cisco has another OID that stores the uptime in seconds (instead of hundredths of a second), but it requires the device to support SNMP-FRAMEWORK-MIB (snmpEngineTime at OID .1.3.6.1.6.3.10.2.1.3).

            Unfortunately my devices don't support SNMP-FRAMEWORK-MIB.


            • kloczek
              Senior Member
              • Jun 2006
              • 1771

              #7
              As long as this device has not been rebooted, firmware upgrades have probably not been done either, so you should check the latest firmware versions; maybe they've added support for this MIB.

              BTW: I would be very worried about devices that have not been restarted for so long. After such a long time, the probability that the device will for some reason not boot correctly could be above an acceptable level.
              Sometimes even an almost-faulty cooling fan, once stopped and started again by a power cycle, turns out to be completely faulty.
              It is always better to find such problems during working hours than to be woken up in the middle of the night :P
              If you have full redundancy in the network infrastructure, failing over to standby devices from time to time to perform a full reboot with a power cycle should be part of normal operating procedures. In all my templates I have uptime mapped to inventory records, which gives a single simple view for identifying the systems/devices with the longest uptimes.

              • Ultrasonic
                Junior Member
                • Sep 2015
                • 8

                #8
                I could reboot the routers if they were mine. Unfortunately they are owned by a government agency, and to lift a finger there you need a council meeting, approval, etc.

                In fact, I don't need to track uptime for these devices, but knowing that a restart happened is important information. Currently I am looking for another indicator that exists on every Cisco device and reflects a restart.


                • andris
                  Zabbix developer
                  • Feb 2012
                  • 228

                  #9
                  A device reboot exactly after 497.1 days is extremely unlikely.
                  So, as a workaround, you could compare the two latest uptime values.
                  If the uptime has decreased AND it was less than (497.1 days minus a small margin that depends on how often you poll the device) before the decrease, then a reboot took place.
                  Otherwise there was no reboot, just a counter overflow.
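The comparison described above could be sketched like this. It is only an illustration of the idea, not Zabbix code: the function name, the default 60-second poll interval and the 30-second slack are assumptions, and the wrap point assumes a 32-bit TimeTicks counter scaled to seconds.

```python
WRAP_SECONDS = 2 ** 32 / 100  # assumed wrap point of a 32-bit TimeTicks counter, in seconds


def classify_decrease(prev: float, cur: float,
                      poll_interval: float = 60.0, slack: float = 30.0) -> str:
    """Classify two consecutive uptime samples given in seconds.

    "ok"       - uptime did not decrease
    "overflow" - uptime decreased, but the previous value was within one
                 poll interval (plus slack) of the wrap point
    "reboot"   - uptime decreased well before the wrap point
    """
    if cur >= prev:
        return "ok"
    if prev >= WRAP_SECONDS - (poll_interval + slack):
        return "overflow"
    return "reboot"


print(classify_decrease(42949650, 36))   # samples from post #1
print(classify_decrease(2592000, 36))    # decrease after 30 days of uptime
```

The margin has to grow with the poll interval (and with any missed polls), since the last value seen before a wrap can be that far below the maximum.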


                  • Ultrasonic
                    Junior Member
                    • Sep 2015
                    • 8

                    #10
                    Good idea.
                    I made a small modification to the standard SNMP trigger "{HOST.NAME} has just been restarted":


                    Expression:

                    {hostname_xxx:sysUpTime.change(0)}<0 and {hostname_xxx:sysUpTime.prev()}<42949000


                    42949000 is near the maximum possible uptime value.

                    I hope this works...
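To check the expression on paper, it can be replayed against the samples from post #1. The sketch below assumes that change(0) is the difference between the last and the previous value and that prev() is the previous value; the function name `restarted` is made up for this illustration.

```python
NEAR_MAX = 42949000  # threshold taken from the trigger expression


def restarted(prev: int, last: int) -> bool:
    """Mirror of: change(0) < 0 and prev() < 42949000."""
    change = last - prev
    return change < 0 and prev < NEAR_MAX


# Samples from post #1, oldest first.
history = [42949530, 42949590, 42949650, 36, 97, 157, 217]
fired = [restarted(p, l) for p, l in zip(history, history[1:])]
print(fired)                  # every entry is False: the wrap is not reported
print(restarted(86400, 30))   # True: a drop after one day of uptime is reported
```

Note that 42949000 sits about 672 seconds below the wrap point, so a genuine reboot during those last 11 minutes or so before the overflow would go unreported.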


                  • syntax53
                    Member
                    • Mar 2018
                    • 40

                    #11
                    Stumbled onto this post researching a similar issue with a device. The default template, "Template Module Generic SNMPv2" has a trigger for "{HOST.NAME} has been restarted" with a value of "{Template Module Generic SNMPv2:system.uptime.last()}<10m". So it's not looking at .change, but rather an uptime of less than 10 minutes. I have modified the trigger as follows:

                    Code:
                    {Template Module Generic SNMPv2:system.uptime.last()}<10m and ({Template Module Generic SNMPv2:system.uptime.max(660)}<4294307 or {Template Module Generic SNMPv2:system.uptime.max(660)}>4294997)
                    I believe this will stop the false alerts for 32-bit values but still allow them for 64-bit values. The upper limit of a 32-bit unsigned int is 4,294,967,295. The last three digits of that number (0-999) are used as fractions of a second, so the upper limit in seconds is 4,294,967. Minus 600 seconds (10 minutes) that is 4,294,367. I subtracted an extra 60 seconds for wiggle room, which is where the 4294307 comes from. Likewise, I added 30 seconds to the maximum of 4294967 to get 4294997 for the upper limit. So a reboot would only be missed if the device happens to reboot within that 11 minute and 30 second window.

                    I haven't actually tested this, but it looks good
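The threshold arithmetic from the post can be written out as a quick check. The constant names are invented for this sketch; the numbers are the ones given above, including the assumption that this device wraps at 4294967 seconds.

```python
WRAP_SECONDS = 4294967   # wrap point assumed in the post, in seconds
RECENT = 600             # the "<10m" part of the trigger
MARGIN_LOW = 60          # extra wiggle room below the window
MARGIN_HIGH = 30         # extra room above the wrap point

lower = WRAP_SECONDS - RECENT - MARGIN_LOW   # 4294307
upper = WRAP_SECONDS + MARGIN_HIGH           # 4294997
blind_window = upper - lower                 # 690 s

minutes, seconds = divmod(blind_window, 60)
print(lower, upper, f"{minutes} min {seconds} s")
```

The 690-second result matches the 11 minute and 30 second window mentioned in the post: a reboot is suppressed only when the maximum uptime over the last 660 seconds falls between the two thresholds.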


                    • kloczek
                      Senior Member
                      • Jun 2006
                      • 1771

                      #12
                      More than a year without any firmware upgrades ...

                    • syntax53
                      Member
                      • Mar 2018
                      • 40

                      #13
                      If it's not insecure and it ain't broke, don't fix it.


                      • kloczek
                        Senior Member
                        • Jun 2006
                        • 1771

                        #14
                        Originally posted by syntax53
                        If it's not insecure and it ain't broke, don't fix it.
                        Doesn't matter.
                        Keeping a long uptime is nothing more than asking for trouble.
                        I have seen too many times in the past computers, routers or even switches that were not able to work after a full power cycle (usually because of the final failure of bearings in spinning parts like disks or cooling fans).
                        In many environments people use monitoring to observe uptimes and take action (automatic, semi-automatic or manual) to perform at least a full power cycle, if not a system reinstall, so that whatever is going to fail does so when all ops are around to handle the failure quickly, instead of in the middle of the night or when some people are on holiday.

                        • syntax53
                          Member
                          • Mar 2018
                          • 40

                          #15
                          Originally posted by syntax53
                          Stumbled onto this post researching a similar issue with a device. The default template, "Template Module Generic SNMPv2" has a trigger for "{HOST.NAME} has been restarted" with a value of "{Template Module Generic SNMPv2:system.uptime.last()}<10m". So it's not looking at .change, but rather an uptime of less than 10 minutes. I have modified the trigger as follows:

                          Code:
                          {Template Module Generic SNMPv2:system.uptime.last()}<10m and ({Template Module Generic SNMPv2:system.uptime.max(660)}<4294307 or {Template Module Generic SNMPv2:system.uptime.max(660)}>4294997)
                          I believe this will stop the false alerts for 32-bit values but still allow them for 64-bit values. The upper limit of a 32-bit unsigned int is 4,294,967,295. The last three digits of that number (0-999) are used as fractions of a second (e.g. milliseconds), so the upper limit in seconds is 4,294,967. Minus 600 seconds (10 minutes) that is 4,294,367. I subtracted an extra 60 seconds for wiggle room, which is where the 4294307 comes from. Likewise, I added 30 seconds to the maximum of 4294967 to get 4294997 for the upper limit. So a reboot would only be missed if the device happens to reboot within that 11 minute and 30 second window.

                          I haven't actually tested this, but it looks good
                          I had to modify this trigger because I found one device that seems to use only the last 2 digits of the integer for fractions of a second, so it rolled over at 42949672 instead of 4294967. Modified trigger as follows:
                          Code:
                          {Template Module Generic SNMPv2:system.uptime.last()}<10m
                          and ({Template Module Generic SNMPv2:system.uptime.max(660)}<4294307 or {Template Module Generic SNMPv2:system.uptime.max(660)}>4294967)
                          and ({Template Module Generic SNMPv2:system.uptime.max(660)}<42949012 or {Template Module Generic SNMPv2:system.uptime.max(660)}>42949672)
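For reference, the logic of this last trigger can be modelled outside Zabbix as a small predicate. This is only a sketch: `fires` and `WINDOWS` are names invented here, `last` stands for system.uptime.last() and `max_660` for system.uptime.max(660), both in seconds.

```python
# (lower, upper) exclusion windows copied from the trigger above.
WINDOWS = [(4294307, 4294967), (42949012, 42949672)]


def fires(last: float, max_660: float) -> bool:
    """True when the modelled "has been restarted" trigger would fire."""
    recently_started = last < 600  # "<10m"
    # The recent maximum must lie outside every wrap window.
    outside_all = all(max_660 < lo or max_660 > hi for lo, hi in WINDOWS)
    return recently_started and outside_all


print(fires(36, 42949650))   # wrap at 2^32 hundredths of a second: suppressed
print(fires(36, 4294900))    # wrap at 2^32 thousandths of a second: suppressed
print(fires(36, 1000000))    # low uptime well away from both windows: fires
```

Both windows start 660 seconds below their wrap point, matching the max(660) evaluation period. Compared with the earlier version, the upper bound of the first window here is 4294967 rather than 4294997.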
