Performance gains by using prime numbers for update interval(?)
  • Markus
    Member
    • Jan 2006
    • 39

    #1

    Performance gains by using prime numbers for update interval(?)

    Hi everybody!

    Today I developed a _theory_ about how to hopefully achieve performance gains by selecting prime numbers for the update interval ('delay') of monitored items. Please let me know what you think as so far this is just a theory! I don't have a large enough Zabbix installation to try it out in practice...

    Background: The update interval determines how often an item is checked by the Zabbix server (ignoring active checks here!). Commonly the update interval value is set to multiples of 30 seconds, e.g. 30, 60, 120 or 300 seconds. This effectively means that frequently many checks have to be executed at the same time. For instance, if you had 5 checks at a 30-second interval and 5 checks at a 60-second interval, all 10 checks are executed every 60 seconds. Furthermore there are 'lulls' of 30 seconds where not much happens. Please correct me if I am wrong here!!!

    My idea is to spread out the checks so that they are still performed at regular intervals but not too many at the same time. In order to achieve this, one has to choose intervals for the individual checks so that their Least Common Multiple (LCM) is as large as possible. In the example above the 10 checks (5 x 30 secs and 5 x 60 secs) share an LCM of 60 (2*30 = 1*60), so every 60 seconds all checks are run. A better choice would be to use the following intervals for the ten checks, each of which is a prime number: 23, 29, 31, 37, 41, 47, 53, 59, 61, 67. Since the LCM of two distinct primes is their product, the earliest that any two of the ten checks coincide is after 667 (= 23*29) seconds; otherwise they are nicely spread out over time.
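
    A quick way to test this without a large installation is to simulate both schemes and count how many checks land on the same second. The sketch below is purely illustrative (plain Python, not Zabbix code) and assumes every check first fires one full interval after t = 0:

    ```python
    from collections import Counter

    def concurrency(intervals, horizon=3600):
        """Count checks per second; return (worst-case concurrency, busy seconds)."""
        hits = Counter()
        for delay in intervals:
            for t in range(delay, horizon, delay):
                hits[t] += 1
        return max(hits.values()), len(hits)

    classic = [30] * 5 + [60] * 5                       # 5 checks @ 30 s, 5 @ 60 s
    primes  = [23, 29, 31, 37, 41, 47, 53, 59, 61, 67]  # one prime interval per check

    print(concurrency(classic))  # every 60 s all 10 checks fire together
    print(concurrency(primes))   # at most 2 checks coincide within the first hour
    ```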

    That's the theory anyway. Am I completely wrong here? I wonder whether anybody is willing to try this out on a real Zabbix server. If you prove me wrong - fine, no problem.
    It would be nice for everybody if I were right, though.

    Markus
  • cameronsto
    Senior Member
    • Oct 2005
    • 148

    #2
    This is an interesting theory. I might be able to try this out later today. This also might explain why one of my hosts is having issues responding to SNMP requests. I see the requests come through in the snmpd logs, but the zabbix_server log says that it times out waiting for a response. I'm wondering if snmpd is able to respond to the first few requests but isn't able to keep up with the frequent requests.

    Anyway, I'll let you know what I find.

    -cameron

    • edeus
      Senior Member
      • Aug 2005
      • 120

      #3
      Now if you could only come up with a solution to break down numbers into their prime components quickly

      • Markus
        Member
        • Jan 2006
        • 39

        #4
        Performance gains by using prime numbers for update interval(?)

        Originally posted by edeus
        Now if you could only come up with a solution to break down numbers into their prime components quickly
        Well, I am not trying to break public-key cryptography here. If you just need a list of prime numbers, look here..

        • elkor
          Senior Member
          • Jul 2005
          • 299

          #5
          Well, essentially you are correct.

          The issue, however, is procedural and arises when you take maintenance of a large environment into account (particularly if you are not the one who maintains the environment after deployment).

          Explaining this to a number of admins or operators and asking them to maintain a rotating prime-number delay scheme is already making me bang my head on a desk. Many people simply don't have that level of attention to detail.

          In addition, this scheme is simple enough to understand with 10 items for a given host. What if each host has 30 items? You would potentially need to reuse primes in order to keep your checks within your tolerance window. AND, chances are there would probably be a lot of concurrency between hosts.

          All that being said, and returning to my original statement, you're right... it would reduce concurrency... just maybe not enough to deal with the complexity of it.

          • Markus
            Member
            • Jan 2006
            • 39

            #6
            Performance gains by using prime numbers for update interval(?)

            You are of course right that one would have to re-use prime numbers and I fully agree that the practical implications are an issue here. For most people it probably is not a real issue anyway since Zabbix seems pretty fast and hardware is relatively cheap nowadays. Maybe one day I will have responsibility for a big enough Zabbix installation so I can try out my theory in practice. On the other hand it might simply be cheaper to spend a few (Australian) dollars on bigger hardware instead.

            Markus

            • edeus
              Senior Member
              • Aug 2005
              • 120

              #7
              I have a Pentium 866 doing my monitoring. This copes with about 25 servers @ 30 items each.

              Would be interesting to see how extensively they use Zabbix. Post your stats!

              • Alexei
                Founder, CEO
                Zabbix Certified Trainer
                Zabbix Certified Specialist
                Zabbix Certified Professional
                • Sep 2004
                • 5654

                #8
                I like that idea in general. However, I do not see how this can be implemented for a large number of items.

                A better approach could be an advanced calculation of the initial timestamp for the next refresh in order to minimize spikes. Currently the next check is guaranteed to satisfy (timestamp mod refresh_time == 0).
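
                In other words, under the current rule every item with the same refresh_time fires at the same timestamps, which is exactly what produces the spikes. A trivial illustration (plain Python, not Zabbix code):

                ```python
                refresh_time = 300
                # All items sharing refresh_time = 300 are scheduled whenever
                # timestamp % refresh_time == 0, so they all fire together.
                spike_times = [t for t in range(1, 1801) if t % refresh_time == 0]
                print(spike_times)  # [300, 600, 900, 1200, 1500, 1800]
                ```
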
                Alexei Vladishev
                Creator of Zabbix, Product manager
                New York | Tokyo | Riga
                My Twitter

                • SAT QPass
                  Member
                  • Oct 2005
                  • 61

                  #9
                  What about introducing a bit of randomness into the timestamp? Rather than being a perfectly fixed interval, it becomes a guideline (say a tolerance of +/-5%) or some other suitable variance, which would allow for better spike management. You could achieve this simply by applying RAND to the check interval with a multiplier. You could even make the tolerance a db/config value so that those who must/want perfect intervals could have them.
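
                  A minimal sketch of that idea, with a configurable tolerance (plain Python, made-up function name, not actual Zabbix code):

                  ```python
                  import random

                  def jittered_delay(delay, tolerance=0.05):
                      """Scale the nominal delay by a random factor within +/- tolerance,
                      e.g. a 300 s interval with 5% tolerance becomes roughly 285..315 s."""
                      factor = 1 + random.uniform(-tolerance, tolerance)
                      return int(round(delay * factor))

                  # Ten items nominally on a 300-second interval no longer line up exactly:
                  print([jittered_delay(300) for _ in range(10)])
                  ```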

                  What do you think?

                  • edeus
                    Senior Member
                    • Aug 2005
                    • 120

                    #10
                    By using these odd values, would this cause a lot of load for graphing?

                    Perhaps not, but I am not sure how the graphing is coded. The variable +/- is a good idea though.

                    • Markus
                      Member
                      • Jan 2006
                      • 39

                      #11
                      This is similar to what I am currently planning to implement for a potentially large scale deployment of Zabbix.

                      In fact it's going to be a mix of the 'randomness' and 'prime numbers' methods suggested in this thread. For example, instead of using a 300-second interval (5 min) for all items I thought of randomly picking values out of a set including 271, 277, 281, 283, 293, 307, 311, 313, 317 (ca. 4:30 min - 5:30 min).
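
                      For illustration, the selection could be as simple as the following sketch (hypothetical item keys, plain Python, not Zabbix code):

                      ```python
                      import random

                      # Primes roughly between 4.5 and 5.3 minutes, as listed above.
                      PRIMES_NEAR_300 = [271, 277, 281, 283, 293, 307, 311, 313, 317]

                      items = ["cpu.load", "disk.free", "net.if.in"]  # hypothetical item keys
                      delays = {item: random.choice(PRIMES_NEAR_300) for item in items}
                      print(delays)
                      ```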

                      Sometime later today I will try to analyse whether the 'random' method mentioned by SAT QPass or the one I described above produces fewer concurrent checks.

                      Edeus might have a valid point though by wondering whether any such scheme might actually have a negative impact on performance because it might clash with the way Zabbix is designed. For instance, how would Zabbix (and the database backend) cope if it had to run a few checks every second instead of lots of them every 30 seconds? Furthermore, maybe we are wasting our time here on something which would bring only small improvements while other changes, e.g. active checks instead of passive ones, might give much better performance improvements.

                      I hope that sometime later in February I will have the chance to apply all this to a real Zabbix installation. I will post the results...

                      Markus

                      • Alexei
                        Founder, CEO
                        Zabbix Certified Trainer
                        Zabbix Certified Specialist
                        Zabbix Certified Professional
                        • Sep 2004
                        • 5654

                        #12
                        Thanks for the ideas. I'm going to implement the following schema:

                        1. When an item is created nextcheck (next time to refresh) will be set to itemid%3600
                        2. Next refresh time will be calculated as, simplified, (itemid%3600+N*delay)

                        This will ensure some randomness and will help to eliminate possible peaks.
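
                        A minimal sketch of that schema exactly as stated above (plain Python; the real Zabbix implementation may of course differ):

                        ```python
                        def nextcheck(itemid, delay, n):
                            """n-th scheduled check time under the simplified schema:
                            the per-item offset itemid % 3600 staggers items over an hour,
                            and each subsequent check lands delay seconds later (n = 0, 1, 2, ...)."""
                            offset = itemid % 3600           # per-item offset in seconds (0..3599)
                            return offset + n * delay

                        # Two items with the same 300-second delay no longer fire together:
                        print([nextcheck(101, 300, n) for n in range(3)])   # [101, 401, 701]
                        print([nextcheck(2045, 300, n) for n in range(3)])  # [2045, 2345, 2645]
                        ```
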
                        Alexei Vladishev
                        Creator of Zabbix, Product manager
                        New York | Tokyo | Riga
                        My Twitter

                        • Markus
                          Member
                          • Jan 2006
                          • 39

                          #13
                          Just a quick follow-up: the 'random' method suggested a few posts earlier in this thread is possibly the way to go. I did some quick analysis and it distributes the tests much better over time than my 'prime numbers' concept.

                          I concede that my initial idea was defeated, but I still win by getting a better Zabbix anyway.

                          Markus

                          • SAT QPass
                            Member
                            • Oct 2005
                            • 61

                            #14
                            Originally posted by Markus
                            In fact it's going to be a mix of the 'randomness' and 'prime numbers' methods suggested in this thread. For example, instead of using a 300-second interval (5 min) for all items I thought of randomly picking values out of a set including 271, 277, 281, 283, 293, 307, 311, 313, 317 (ca. 4:30 min - 5:30 min).
                            I think statistically you will see a more even distribution if you use a mechanism similar to the one I suggested. Ideally it would also be configurable to allow for more or less tolerance, which would probably be directly linked to the size of the monitored pool.

                            Originally posted by Markus
                            Edeus might have a valid point though by wondering whether any such scheme might actually have a negative impact on performance because it might clash with the way Zabbix is designed. For instance, how would Zabbix (and the database backend) cope if it had to run a few checks every second instead of lots of them every 30 seconds? Furthermore, maybe we are wasting our time here on something which would bring only small improvements while other changes, e.g. active checks instead of passive ones, might give much better performance improvements.

                            I will post the results...
                            I think that is a very valid point; how well Zabbix handles continuous tests vs. spikes of tests would definitely require a bit of analysis. Obviously, as you correctly observed, active checks transfer the load to the clients; however, I am personally striving to minimize the impact on the clients as much as possible (and just put a robust machine on the Zabbix side).

                            I myself am running Zabbix against several hundred hosts with nearly 60 tests per node (that is just the baseline). Some hosts have far more.
                            Last edited by SAT QPass; 14-02-2006, 23:52. Reason: Fixed some BB code blocks.

                            • SAT QPass
                              Member
                              • Oct 2005
                              • 61

                              #15
                              Originally posted by Alexei
                              Thanks for the ideas. I'm going to implement the following schema:

                              1. When an item is created nextcheck (next time to refresh) will be set to itemid%3600
                              2. Next refresh time will be calculated as, simplified, (itemid%3600+N*delay)

                              This will ensure some randomness and will help to eliminate possible peaks.
                              I am curious how this will work... I am not a C programmer, so forgive me if I am mistaken, but would not the value of itemid%3600 equal itemid every time unless itemid > 3600? And what is the value of N? And my next question: is this in whole seconds or ms? I currently have itemids of >100 and I am not even done listing out my complete test template.

                              Assuming the above is true, as itemid increases, distribution increases proportionally.

                              Again, I am not a programmer, and I may have this all turned around.
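
                              For reference, a tiny worked example of the modulo arithmetic in question (plain Python): for an itemid below 3600 the offset is indeed the itemid itself; above 3600 it wraps around.

                              ```python
                              for itemid in (100, 3599, 3600, 4000, 100001):
                                  print(itemid, "->", itemid % 3600)
                              # 100 -> 100, 3599 -> 3599, 3600 -> 0, 4000 -> 400, 100001 -> 2801
                              ```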
