Graphing Clock Drift: how to compare system.localtime to server time? (!fuzzytime)

  • gessel
    Junior Member
    • Mar 2017
    • 24

    #1

    I can collect system.localtime[utc] and that's awesome. If I do, I get some very nice data which lets me visually check just how close the clocks are (to a second or so) by comparing the "Last Check" value to the "Last Value" - cool: most are within a second or two (which I hope is true, as we have NTP scripts running quite frequently). But sometimes the clocks still drift (virtualized servers drift horribly), and that breaks inter-system authentication and creates much sadness.

    Within the db, there are two values: "Last Check" and "Last Value". Optimally, I'd be able to subtract the "Last Check" time of system.localtime (the time it was checked according to the Zabbix server) from its "Last Value" as a Calculated Item and store that; I could then track clock drift and correlate various issues to +/- 1 second. Alas, all I've been able to find is to set up a calculated item as last("system.localtime[utc]")-last("NTP.server:system.localtime[utc]"), which does the expected, but since I'm updating/calculating every 30 seconds, the reported value is +/- 15 seconds. Certainly if the clock skewed more than 30 seconds it would cause problems, so I can set an alert for abs()>20 seconds, but I really want to know if the skew is more than 5 seconds, and also what that skew is. This also adds some unnecessary load, as the clock isn't skewing in 30 seconds - checking every hour or two would be more than sufficient monitoring resolution.
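To make the +/- 15 second problem concrete, here is a minimal Python sketch of the arithmetic. It assumes the reference clock is exact and uses made-up poll times; the function names are hypothetical and nothing here is a Zabbix API. It shows that the difference of two last() values carries the gap between the two poll moments, while subtracting an item's own check time does not.

```python
def apparent_skew(true_skew_s, host_poll_s, ref_poll_s):
    """Difference of two last() values that were sampled at different moments.

    host_poll_s and ref_poll_s are the true times at which each item was polled.
    """
    host_value = host_poll_s + true_skew_s  # what the skewed host clock reports
    ref_value = ref_poll_s                  # reference clock assumed exact
    return host_value - ref_value


def skew_from_own_check(host_value_s, host_poll_s):
    """Skew using the item's own check time, so the polling phase cancels out."""
    return host_value_s - host_poll_s


# Hypothetical numbers: the host is 2 s fast and the two items are polled 12 s apart.
true_skew = 2.0
host_poll, ref_poll = 1000.0, 988.0

# Polluted by the 12 s gap between the two polls.
print(apparent_skew(true_skew, host_poll, ref_poll))
# Recovers the 2 s skew.
print(skew_from_own_check(host_poll + true_skew, host_poll))
```

The second form is what the "Last Value" minus "Last Check" idea amounts to: the polling interval no longer limits the resolution, only the one-second granularity of the values does.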

    Fuzzytime is a nice function for a trigger, but it is only binary; for example, I can't graph the skew. I suppose I could create triggers for fuzzytime(1), fuzzytime(2), fuzzytime(4), fuzzytime(8), fuzzytime(16) or something like that, but that makes monitoring a bit cumbersome.

    There wouldn't happen to be either a way to create a collected item with postprocessing of "-last("NTP.server:system.localtime[utc]")" (so that the current time would be subtracted before storing it in the database) or perhaps a way to access the "last check" time value in a calculated item for something like last("system.localtime[utc]")-last("last.check.time[utc]")?

    Thanks!
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    First of all: the precision of sampling time is limited to 1s when sampling time over the zabbix agent. Why?
    Because that is the minimum resolution of agent item history. This holds whatever you do, as long as you are using zabbix agent or zabbix agent (active) items.
    Second thing is that this precision is worse when using passive (zabbix agent) items, because the sampling is initiated on the proxy/server: the prx/srv poller communicates with the agent, samples the data and returns it to the server.
    If you want higher precision when sampling local time you can have it, but you must use trapper items, because with those items it is possible to pass not only the time in seconds but also the sub-second part in nanoseconds.
    As long as you have both the sampled local time and the time sampled over (S)NTP, you can do this subtraction to measure time skew.

    However this approach will still be a bit overcomplicated, because whoever is interested in proper (S)NTP time synchronization will be interested in how much local time is shifted relative to some stratum source.
    Another fact: as long as you have a process running on the monitored system which periodically checks and syncs time, all you really need to monitor is not the time, or the shift of local time relative to some (S)NTP server, but this process .. is it running or not. This is the simplest approach to guarantee that local time stays as close as possible to the time served by the local (S)NTP server.
    In other words you don't need to monitor time but the process.
    If you want to measure this shift you may use an ntp client to read it and save it in a zabbix item.
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates

    • gessel
      Junior Member
      • Mar 2017
      • 24

      #3
      Hi Kloczek, this isn't quite what I'm looking for. A few problems can happen that make tracking the offset helpful:

      - VMs have really, really bad time sync, massive drift. Windows' regular sync rate is far too infrequent. I use a script to check every hour.

      - We're in a location where the outbound network is unreliable and can get routed in very different ways and occasionally is subject to some pretty gnarly flap. This makes sync to remote servers unreliable - we sync to a server on the LAN (as one would) and that syncs to global stratum. However, that server itself isn't perfectly reliable and... in fact... is running virtualized as well. That means it can drift and, if the outside network goes out, drift rather meaningfully. When the NTP server does sync, the correction can be meaningful, and then when the local hosts sync to the now discontinuously corrected local NTP server, they become discontinuous with each other until all can sync.

      - this can then create some pretty random errors in logins. The errors are on the servers that ended up being out of sync with the management server (generally). Knowing that the management server or NTP server had a string of failed syncs would predict coming problems, but not tell me which of the servers are out of sync, or indeed if they actually are.

      - but if I can compare the clock of each host to either Zabbix's clock (including the NTP server) or directly to the NTP server's clock, then I know which one is off and which one needs manual intervention.

      I don't need msec accuracy. A second or two works. But "too" much (poorly defined in the protocol) and problems arise.

      And sure, using "fuzzytime(5)" is more or less sufficient to identify servers with problems, but I can't track drift with it.

      Thing is, the DB has "last check time" and the matching "local time" right there. I can write an external SQL query to return the difference with a little bit of fuss and head scratching, but the data is so tantalizingly at hand it would be lovely if there were a simple command to render it.

      I suspect, given your history, that if there were such a thing you'd know about it and your answer suggests there is not so "system.localtime[utc].fuzzytime(5)=0" will have to suffice.
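For the "data is tantalizingly at hand" idea, here is a minimal Python sketch of the subtraction itself. It assumes history rows shaped like the records the Zabbix API's history.get method returns (string fields "clock" and "value"), and it assumes a passive item, so that "clock" is stamped by the server or proxy rather than by the monitored host. The function name, item ID and sample rows are made up for illustration; fetching the rows is left out.

```python
def skews_from_history(rows):
    """Return (check_time, skew_seconds) pairs, oldest first.

    Each row is a dict with string fields "clock" (when the value was stamped)
    and "value" (the host's system.localtime[utc] reading), both in Unix seconds.
    """
    pairs = [(int(r["clock"]), int(r["value"]) - int(r["clock"])) for r in rows]
    return sorted(pairs)


# Hypothetical rows for one system.localtime[utc] item.
sample_rows = [
    {"itemid": "1001", "clock": "1524672905", "value": "1524672910"},
    {"itemid": "1001", "clock": "1524672875", "value": "1524672877"},
]

for check_time, skew in skews_from_history(sample_rows):
    print(check_time, skew)
```

Since both fields are whole seconds, the result is only good to about a second, which matches the accuracy asked for in this thread.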

      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #4
        Originally posted by gessel
        Hi Kloczek, this isn't quite what I'm looking for. A few problems can happen that make tracking the offset helpful:

        - VMs have really, really bad time sync, massive drift. Windows' regular sync rate is far too infrequent. I use a script to check every hour.
        In such cases it is always possible to forward the guest systems' syscalls for reading the current time to the host system.
        The details of how to do this differ depending on what kind of virtualisation you are using.

        With this you may not need to run ntpd in each guest system.
        On Linux systems with systemd, an NTP client is now integrated into systemd itself.

        - We're in a location where the outbound network is unreliable and can get routed in very different ways and occasionally is subject to some pretty gnarly flap.
        You can set up your own NTP server on the local network and use it as the reference.
        Using GPS signal receivers you can quite cheaply build your own time source, even stratum 3.
        You can buy a GPS USB dongle, which can be used with your local ntp server, really cheaply.

        [..]
        - this can then create some pretty random errors in logins. The errors are on the servers that ended up being out of sync with the management server (generally). Knowing that the management server or NTP server had a string of failed syncs would predict coming problems, but not tell me which of the servers are out of sync, or indeed if they actually are.

        - but if I can compare the clock of each host to the either Zabbix's clock (including the NTP server) or directly to the NTP server's clock, then I know which one is off and which one should be manually intervened with
        Nevertheless your job is not to monitor time offset but to keep local time on all systems as closely synced as possible.
        Delivering/reporting the time offset is not your business task. Having everything synced correctly is what you must care about.
        So again .. you should monitor those bits which are responsible for that synchronization.
        You may additionally monitor time offset to confirm that synchronization generally works as it should on all systems, but it is not necessary.
        Using the straight output of the ntpq/ntpstat command would IMO be easier/more straightforward than sampling time on each system and then using system.localtime[utc].fuzzytime in a trigger.
        Using fuzzytime() assumes that your time source is the zabbix server. That may not always be useful.

        • gessel
          Junior Member
          • Mar 2017
          • 24

          #5
          Hi Kloczek,

          Hmm... not sure it is quite appropriate to say what the job is. Tracking drift helps track down and potentially resolve the cause. This is where data helps a lot. Zabbix is a great tool for collecting system data and this data has been essential in resolving quite a few vexing problems, some hardware, some software, some configuration. In this case, there is a hypothesis about login failures and having accurate clock drift data can help identify and resolve that issue. There's kind of this conceit that experts sometimes develop that they know the One True Path, when perhaps the full story is a little more nuanced.

          Sure, I can write a script to ssh in and return local time on the various OSes in place, compare that directly with the NTP server, call that as an external script, etc. But the zabbix client already grabs that item out of all the OSes we're using and all manner of different OSes can be tracked in a unified manner with very little effort.

          If you search on how to track clock skew between servers, you'll see this is not a unique request and drift on VMs is a well known problem. So while I appreciate the suggestions for mechanisms to verify that the NTP processes are running, or running a GPS time sync (we do have one on the network, BTW, but it is isolated to another time-of-flight correlation task with much, much higher accuracy requirements than +/- a few seconds), I'm asking for advice on how to solve a particular problem: get +/- 1 second (ish) accurate clock skew data stored as an item in the database. I suspect there are others that would find this useful. The usual answer is to use "fuzzytime" which is pretty close. It seems the internal comparison fuzzytime is doing and returning 1/0 is actually sufficient for this task. If fuzzytime could be called as a calculated result and had a units parameter like (seconds,s) where "seconds" is an integer test that returns the current binary output and "s" would return the floating point difference, that'd totally solve my need.

          Again, thanks for the advice. The answer is no different (so far) than others provided: trigger using fuzzytime and get a binary "problem/OK" output to within a specified absolute value delta; remember that internally the comparison is +/- one second. If you need numerical drift data or more accuracy, the following links may be helpful (though having Windows machines, as always, complicates issues).

          https://unix.stackexchange.com/quest...ly-submit-jobs
          https://superuser.com/questions/4087...-linux-servers

          • kloczek
            Senior Member
            • Jun 2006
            • 1771

            #6
            Originally posted by gessel
            Hi Kloczek,

            Hmm... not sure it is quite appropriate to say what the job is. Tracking drift helps track down and potentially resolve the cause.
            Yes. That is obvious. The only issue is that you are doing this using the zabbix server time as reference. Usually systems sync time over (S)NTP, so the local (S)NTP server time should be used as reference.
            As long as your (S)NTP server is not on the same system as the zabbix server, you are adding additional error to the measurements for all those systems.
            Sure, I can write a script to ssh in and return local time on the various OSes in place, compare that directly with the NTP server, call that as an external script, etc. But the zabbix client already grabs that item out of all the OSes we're using and all manner of different OSes can be tracked in a unified manner with very little effort.
            Executing ntpq and extracting the exact part of the output (using sed for example) does not need to be in a script. It can be done in a short oneliner.
            You can use the system.run[] key to execute such a oneliner without writing and spreading any scripts across all systems.
            As long as you already have EnableRemoteCommands=1 in the agent settings, you can set up the whole NTP drift monitoring by changing only the zabbix monitoring configuration.

            Example. A long time ago I was asked to add monitoring of network retransmissions, so I added this to my OS Linux template:
            • Name
              NET::segments retransmitted
            • Type
              Zabbix agent (active)
            • Key
              system.run["/bin/netstat -s|/bin/sed -n 's/\( *\)\(.*\) segments retransmitted*/\2/ p'"]
            • Type of information
              Numeric (float)
            (the type is float because in "processing" this item has "Change per second")
            Last edited by kloczek; 25-04-2018, 18:30.

            • gessel
              Junior Member
              • Mar 2017
              • 24

              #7
              OMG, that's awesome. I hadn't messed with active triggers yet.

              I agree that using the zabbix host as a reference is slightly suboptimal. But I can set a trigger at a higher priority, say, if that host drifts.

              here's the windows command:
              >w32tm /stripchart /dataonly /samples:1 /computer:10.100.50.150 (fill in as appropriate)
              Tracking 10.100.50.150 [10.100.50.150:123].
              Collecting 1 samples.
              The current time is 4/25/2018 4:14:35 PM.
              16:14:35, +05.3312011s
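To turn that w32tm output into a single number an item could store, here is a rough Python sketch. It assumes the sample lines always look like the one above (time, comma, signed seconds, trailing "s") and simply skips anything else; the function name is hypothetical, and the sign is passed through exactly as w32tm prints it.

```python
import re

# One data line of `w32tm /stripchart /dataonly`, e.g. "16:14:35, +05.3312011s".
_W32TM_SAMPLE = re.compile(
    r"^[ \t]*\d{1,2}:\d{2}:\d{2},[ \t]*([+-]\d+(?:\.\d+)?)s[ \t\r]*$",
    re.MULTILINE,
)


def parse_w32tm_offset(output):
    """Return the last reported offset in seconds, or None if no sample line is found."""
    matches = _W32TM_SAMPLE.findall(output)
    return float(matches[-1]) if matches else None


sample = """Tracking 10.100.50.150 [10.100.50.150:123].
Collecting 1 samples.
The current time is 4/25/2018 4:14:35 PM.
16:14:35, +05.3312011s
"""

print(parse_w32tm_offset(sample))
```

Returning None for output without a sample line (for example when the server is unreachable) lets the caller decide whether to report "not supported" or a sentinel value.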

              for Linux
              $ ntpq -p
                   remote           refid      st t when poll reach   delay   offset  jitter
              ==============================================================================
              +pfSense.pmcam   62.201.225.9     3 u  729 1024  377    0.135    2.705   1.356
              *time.iqnet.com  62.201.214.162   2 u  119 1024  357    6.629   -0.041   7.755
              +node62.ia64.org 105.100.222.35   2 u  244 1024  377  185.534    8.559  15.684

              Executing the one-line script on the remote host to extract the drift is cool on Linux/BSD machines, but harder on Windows (especially Win7; Server might have some utilities).
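As a sketch of what "extract the drift" could look like for the ntpq output above, the following Python picks the peer marked with "*" (the one ntpd is currently synced to) and converts its offset column from milliseconds to seconds. The function name is made up, and the parsing assumes the ten-column `ntpq -p` layout shown in this post.

```python
def parse_ntpq_offset_seconds(output):
    """Return the offset of the selected peer (tally '*') in seconds, or None."""
    for line in output.splitlines():
        fields = line.split()
        # Columns: remote refid st t when poll reach delay offset jitter
        if len(fields) == 10 and fields[0].startswith("*"):
            return float(fields[8]) / 1000.0  # ntpq -p prints offset in milliseconds
    return None


sample = """remote refid st t when poll reach delay offset jitter
==============================================================================
+pfSense.pmcam 62.201.225.9 3 u 729 1024 377 0.135 2.705 1.356
*time.iqnet.com 62.201.214.162 2 u 119 1024 357 6.629 -0.041 7.755
+node62.ia64.org 105.100.222.35 2 u 244 1024 377 185.534 8.559 15.684
"""

print(parse_ntpq_offset_seconds(sample))
```

If no peer is selected, the function returns None, which is itself worth alerting on since it means the host is not currently synchronized.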

              For Linux/BSD hosts though, one can $ sudo apt-get install iputils-clockdiff and then
              $ clockdiff 10.100.50.150
              .
              host=10.100.50.150 rtt=750(187)ms/0ms delta=1ms/1ms Wed Apr 25 17:10:35 2018

              from the zabbix host (obviously using the zabbix host as a reference). This doesn't work against windows machines though.
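The clockdiff line can be reduced to a number the same way. This Python sketch only pulls out the first figure after "delta=" and makes no attempt to interpret the second one; treating that first figure as the measured clock difference in milliseconds is an assumption based on the output shown above, and the function name is hypothetical.

```python
import re


def parse_clockdiff_delta_ms(output):
    """Return the first delta figure (milliseconds) from a clockdiff result line, or None."""
    match = re.search(r"\bdelta=(-?\d+)ms", output)
    return int(match.group(1)) if match else None


sample = "host=10.100.50.150 rtt=750(187)ms/0ms delta=1ms/1ms Wed Apr 25 17:10:35 2018"

print(parse_clockdiff_delta_ms(sample))
```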

              Seems doable...

              • kloczek
                Senior Member
                • Jun 2006
                • 1771

                #8
                Originally posted by gessel
                executing the one line script on the remote host to extract the drift is cool on Linux/BSD machines, but harder on windows (especially Win7, server might have some utilities).
                I'm 100% sure that you will find a package in cygwin with ntpq, so even on Windows the same method can be used.
                Last edited by kloczek; 25-04-2018, 18:28.

                • kloczek
                  Senior Member
                  • Jun 2006
                  • 1771

                  #9
                  Originally posted by gessel
                  OMG, that's awesome. I hadn't messed with active triggers yet.
                  The system.run[] key is an agent key ... it doesn't matter whether active or passive.
                  https://www.zabbix.com/documentation...s/zabbix_agent
                  It is only "coincidence" that I'm using an active-agents-only setup.

                  • gessel
                    Junior Member
                    • Mar 2017
                    • 24

                    #10
                    Oh, you're right of course, I was kinda hoping to avoid doing anything special on the clients, but cygwin is a viable solution. Meinberg seems to have some nice tools too. I'll try with the simple fuzzytime check for now. If it is sufficient, I'm OK with it.
