Interesting Zabbix puzzle - recursive items

  • dampersand
    Junior Member
    • Apr 2016
    • 16

    #1

    Interesting Zabbix puzzle - recursive items

    Hi all,

    I have an interesting puzzle for your perusal - if it works, it can easily go in the Zabbix cookbook.

    Summary
    I have a calculated item "Consumption" that is doing calculations using its own previous data, that is, the item is calling itself along with other items. If ever a dependency of Consumption becomes unsupported and resupported, Consumption will go unsupported and STAY unsupported. This is because when it tries to access its own previous data (ie, max("Consumption",86400)), Consumption recognizes ITSELF as a dependency that is unsupported... and doesn't do the calculation. Can I force it to ignore itself as a dependency?

    Current Situation
    I'm using Zabbix as a drop-in replacement for cacti. It's taken a little setting up, but it's much more robust - and our calculations are much more accurate.

    One of the tasks I ran up against was to figure out the bandwidth consumption out of a port since a given time. Namely, I want to know how much data is consumed on a monthly basis - not a 30 day basis, a MONTHLY basis. Month-to-date. Week-to-date.

    This means I can't just chart delta SNMP on ifHCInOctets and ifHCOutOctets, then sum them up, since there's no easy way to select the sum time range.

    I'd settled on charting exact values of ifHCInOctets and ifHCOutOctets, and using that to track absolute consumption. I used one item to chart these values once on the beginning of every month, one item for once on the beginning of every week, then one item for the current consumption. I could then subtract 'beginning of the month' from 'current' and get month-to-date. Success!

    ...except that every time a router is restarted, the 'current' resets back to zero. Crud!

    My current solution, then, is to have an item that charts the SIMPLE CHANGE in ifHCInOctets - which will always be positive except for the FIRST poll after a restart - and continually sum that simple change, ie:

    Code:
    Item key: "Consumption"
    Formula: max("Consumption",86400) + last("ifHCInOctets.Delta",0)
    As you can see, "Consumption" actually polls itself, looking for its last maximum. In this way, if ifHCInOctets ever drops to 0 because of a reboot, "Consumption" won't notice - ifHCInOctets.Delta will, and will fail for only 30 seconds, which is not long enough for Consumption to go unsupported.
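    To make the failure mode concrete, here is a toy model (plain Python, not Zabbix internals) of a calculated item that refuses to evaluate whenever any referenced item - including itself - is unsupported. Names and values are illustrative.

```python
# A minimal simulation of the self-referencing "Consumption" item, assuming
# the behavior described in this thread: a calculated item goes unsupported
# whenever any item referenced in its formula -- including itself -- is
# unsupported. This is an illustrative sketch, not Zabbix code.

UNSUPPORTED = None  # stand-in for Zabbix's "not supported" state

def evaluate_consumption(prev_consumption, delta):
    """Models: max("Consumption",86400) + last("ifHCInOctets.Delta",0)."""
    if prev_consumption is UNSUPPORTED or delta is UNSUPPORTED:
        return UNSUPPORTED  # refuse to evaluate: a dependency is unsupported
    return prev_consumption + delta

consumption = 0
for delta in [100, 250, 90]:          # normal polling: the accumulator grows
    consumption = evaluate_consumption(consumption, delta)
assert consumption == 440

# ifHCInOctets.Delta goes unsupported (e.g. bad SNMP credentials)...
consumption = evaluate_consumption(consumption, UNSUPPORTED)
assert consumption is UNSUPPORTED

# ...and even after the delta item recovers, Consumption stays unsupported
# forever, because it now sees ITSELF as an unsupported dependency.
consumption = evaluate_consumption(consumption, 120)
assert consumption is UNSUPPORTED
```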

    The Problem

    Recently, we had a situation where a router had the wrong SNMP credentials. This is long enough that ifHCInOctets.Delta went unsupported... and caused Consumption to go unsupported.

    After the problem was fixed, ifHCInOctets.Delta came back online... but Consumption did not. The reason?

    Consumption is 'unsupported' because Consumption is 'unsupported'. That is: Consumption couldn't 'support' itself, because it depended on itself, and it was already 'unsupported', so it would quit then and there (without even attempting the calculation). The exact error message is gone, but it was something along the lines of "Could not find max(86400), item 'Consumption' not supported."

    The Request

    I feel like recursive Consumption is a bit of a janky way to solve my problem, but I haven't found a better solution. One of these three things could help:

    1. Is there a way to skip the 'not supported' check, that is, force Consumption to do the math on a 'not supported' item?
    2. Is there a way to add conditionals to the formula, that is, "if X is not supported, formula = 1, else formula = max("Consumption") + etc?
    3. Is there a better way to get data-by-month? Not 30-day or 7-day data, mind you, but data-by-month (sometimes 31, sometimes 28 days)?

    Thanks!
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    In other words, your "Consumption" is "how much data has been transferred in a given period?" Is that right?
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates


    • dampersand
      Junior Member
      • Apr 2016
      • 16

      #3
      Yep, that's correct.

      Drives me nuts when people say "bandwidth" and mean either "data per second" or "data." Ambiguity is the worst.

      ifHCInOctets counts bytes transferred since the last reset. Consumption is really just ifHCXOctets mapped to time points, but without ever resetting to zero... If you've got the hard drive space for that, you can calculate anything.


      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #4
        So all you need is, instead of storing ifHCInOctets and ifHCOutOctets as raw counter values, to store the per-second speed of those OIDs and integrate those speeds over the given time period.

        Such integration may be a bit tricky, because once the item's "History storage period" (or the global housekeeping history period) has passed, you no longer have raw data in the Zabbix database.
        Even then, the integral can still be calculated over an arbitrary time period: integrate the avg trends data up to the last full hour, then add the integral of the last partial hour computed from history data.
        This is also the fastest possible approach, because the numerical integration uses fewer points.

        I also think such an integration function could become a generic calculated-item function. It could even come in two variants: the first as above, and a second that uses only raw history data.
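        The integration described above can be sketched as a simple Riemann sum (a toy model; the data shapes are assumptions, not a real Zabbix API response):

```python
# Total bytes over a period = (hourly avg-speed trends * 3600) for the full
# hours, plus a Riemann sum over raw history samples for the last partial
# hour. All numbers below are illustrative.

def integrate_trends(trend_avgs_bps, hour_seconds=3600):
    """Each entry is the average speed (bytes/s) over one full hour."""
    return sum(avg * hour_seconds for avg in trend_avgs_bps)

def integrate_history(samples):
    """samples: list of (timestamp, bytes_per_second) for the partial hour."""
    total = 0.0
    for (t0, v0), (t1, _) in zip(samples, samples[1:]):
        total += v0 * (t1 - t0)  # left Riemann sum over the actual intervals
    return total

trends = [1000.0, 1200.0]                                  # two full hours
history = [(7200, 900.0), (7230, 1100.0), (7260, 1000.0)]  # 30 s samples
total_bytes = integrate_trends(trends) + integrate_history(history)
assert total_bytes == 1000*3600 + 1200*3600 + 900*30 + 1100*30
```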


        • dampersand
          Junior Member
          • Apr 2016
          • 16

          #5
          Thanks for your suggestion, but...

          I fail to see how to integrate over a specified arbitrary period with any sort of automation. I can integrate over 30 days, or 7 days, sure, but how can I say "integrate over a continually increasing time frame starting at October 1st, stopping at October 31st, store that data, then start again November first etc?" Ie, show month to date consumption, maxing out not after 30days, but after 31... Followed by 30... Followed by 31, etc?

           The end goal, to be clear, would be to have a single item to put on a screen that says "data used so far this month" and that always maxes out on the last day of the month. My recursive trick does this, but any time a router goes out, it fails and needs to be manually reinitialized... Not often, but a real pain.

          The total bandwidth method you suggest is the same that cacti uses, and when using that method, I had to hand-zoom 150+ different graphs to get total consumption used... Every month. That's one of the reasons we DROPPED cacti.

          If, once per hour, I could integrate over that hour and store the data, sure, but that would fall into the max-of-yourself-plus-delta scheme that has the recursion issue.
          Last edited by dampersand; 14-08-2016, 20:32.


          • kloczek
            Senior Member
            • Jun 2006
            • 1771

            #6
            Obtaining raw trends/history data between the current time and an arbitrary date is easy.
            On the presentation layer, though, I would also like to have a derivative function: the derivative of the speed (i.e. the second derivative of the counter) shows the trend.
            https://en.wikipedia.org/wiki/Second_derivative

            More importantly in your case: why do you need to present your counter exactly this way, and/or what exactly do you want to "squeeze" out of data presented that way? Maybe there are more effective ways to produce the facts that interest you.


            • dampersand
              Junior Member
              • Apr 2016
              • 16

              #7
              Integrating between an arbitrary time and now is easy if you have a symbolic equation, or if you have a sheet of numbers (using a glorified Riemann sum)...

              ...but how is it easy in Zabbix? You keep saying it's easy, but that's what I said until I tackled the problem, too. It IS mathematically easy - give me some numbers, and I'll Riemann-sum them out. It's NOT Zabbix-a-matically easy.

              I mean, like, I could make a million MySQL calls if I wanted, but I want this displayed in Zabbix.

              Thank you for the suggestion regarding second derivatives, but it's not really the information I'm looking for.

              To answer your other question, I want to be able to do the following, when given access to ifHCInOctets and ifHCOutOctets:
              1. See how much data was consumed by a given router/port in each month (ie, 12:01 on the 1st of the month to 12:01 on the 1st of the next month, NOT 30-day sums)
              2. See the current data consumption by a given router/port this month thus-far (ie, 12:01 AM on 1st of the month to today)
              3. Data consumption split by different routers/ports, displayed on the same graph (ie, 200TB on 1/1/4, 120 TB on 1/1/3)
              4. Have all of this displayed in Zabbix

              It's worth noting that my original approach WORKS, and IS WORKING NICELY, except that whenever a router goes down for any reason the item needs re-initialized (ie, set to a static number until it becomes 'supported,' then set back to the correct formula). All I want is to eliminate this weakness. I have a backup method (simply charting consumption data without using the max + delta method - when a router goes down, I can check the uncounted bytes and manually add them back in to the mysql table), but it's super tedious, and I may not be around forever to do it.

              If there's a way to show these four things, I'm also all ears. But a pure-math way of doing this is of no help to me whatsoever - I require a practical method of doing this in Zabbix.


              • kloczek
                Senior Member
                • Jun 2006
                • 1771

                #8
                Originally posted by dampersand
                ...but how is it easy in Zabbix? You keep saying it's easy, but that's what I said until I tackled the problem, too. It IS mathematically easy - give me some numbers, and I'll Riemann-sum them out. It's NOT Zabbix-a-matically easy.
                Misunderstanding.
                I only meant to say that it is (relatively) easy to implement a new function in Zabbix.
                Not that it is easy in current Zabbix :P


                • dampersand
                  Junior Member
                  • Apr 2016
                  • 16

                  #9
                  Oh, hah! Here I thought there was some obvious setting you were talking about that I simply couldn't find! xD

                  Well, crud, I'm still stuck, then. Maybe I'll have to come up with a script that directly edits the MySQL tables, instead :/


                  • kloczek
                    Senior Member
                    • Jun 2006
                    • 1771

                    #10
                    Originally posted by dampersand
                    To answer your other question, I want to be able to do the following, when given access to ifHCInOctets and ifHCOutOctets:
                    1. See how much data was consumed by a given router/port in each month (ie, 12:01 on the 1st of the month to 12:01 on the 1st of the next month, NOT 30-day sums)
                    2. See the current data consumption by a given router/port this month thus-far (ie, 12:01 AM on 1st of the month to today)
                    3. Data consumption split by different routers/ports, displayed on the same graph (ie, 200TB on 1/1/4, 120 TB on 1/1/3)
                    4. Have all of this displayed in Zabbix
                     Let me rephrase your goal, to confirm that I understand your needs.
                     It seems you want a kind of histogram graph: data taken from the counters is accumulated into one point per interval, forming a kind of bar chart.

                     I'll continue thinking out loud from here, assuming that's true.

                     Normally such a problem would be solved by sampling the counter over a given, very long period. There are two problems on top of this one: if the router is restarted or its statistics are cleared, a once-a-month reading will be disrupted by such operational changes; and you cannot observe the bars growing in the meantime.
                     My understanding is that you want to use the heights of these bars to balance the flow of data inside the router: with a known topology of connections between the backplane, extension cards, and groups of ports, the bar heights might let you spot problems created by saturation of some physical paths.

                     Again, I'll assume for now that this is your goal (if not, we can discard the part below). If it is, I think it may be a suboptimal way of preventing such pathological situations: it would be better to have access to data showing the current saturation of the physical paths, because counting traffic into periodically reset counters hides temporary saturations.
                     My understanding is that your goal is to detect long-term congestion caused by passing too much traffic over particular ports.

                     Generally, I see some possibilities for a kind of histogram data type, where each new value of the metric either replaces the current value of the Zabbix item or is added to it on update.

                     Let's call such data histogram data.

                     I see a possible use for such data in monitoring a Zabbix database backend with partitioned tables.
                     I've been thinking for quite a long time about how to organise monitoring of the size of the history* table partitions. With daily partitions, for example, I want to see how big those partitions are every day. The period itself is not relevant here.
                     It is easy to obtain the current partition size and store it as-is in a Zabbix item.
                     Such a graph forms a kind of sawtooth which, observed at a longer scale, shows the daily growth of the data allocated in the DB.
                     This monitoring works, but with a histogram data type it would be possible to present the same data in a more natural form.
                     So, as you see, I've found yet another possible use for a "histogram data type" :P
                     Probably other people will find more useful cases.

                     Back to your dilemmas. If you are worried about unbalanced traffic across the router, you should worry not only about long-term imbalance but about the very short term as well, because temporary saturation of some ports or paths will cause retransmissions, and even temporary congestion can have longer consequences by increasing the total volume of traffic.

                     Conclusion: in your case you should classify the traffic data on these ports as fast-changing, and the ifHCInOctets and ifHCOutOctets OIDs should be sampled very often. That way you will be able to spot all the early signs of path saturation on [all]-type graphs (not [avg] or others). Sampling these counters only, say, once a minute leaves you unaware of the real fluctuations in the flow of data across particular paths.
                     (IMO, in most cases the [all] graph type is the best for long-term observation.)

                     I had the same situation on a very heavily loaded system, with saturation of involuntary context switches/s. As long as I sampled system.switches[] only once or twice per minute, I was unable to see the real band of fluctuation in cs/s, and I was not aware that from time to time the system was hitting a plateau.

                     Back to the calculated item with an integral between the current time and an arbitrary point in each day/week/month: with more frequent sampling, the consequences of those counters being zeroed will be smaller, and the integral value will therefore be more precise.

                     The paragraph above could be used to define our imaginary "histogram data type": a single-point-per-period metric, where each new value either replaces the current value or is added to/subtracted from it, and after the arbitrary period has passed a new point is started.
                     Such a data type would automatically create points containing the integrals.

                     Another variation I see: instead of using a calculated item, use an integral() function on top of normal speed items, to raise alarms about, for example, exceeding some total data volume transferred in the last hour(s)/day(s)/week(s)/month(s).

                     I think I should stop here, to let you verify/confirm my thoughts and add or remove some parts from the picture we are trying to draw together.

                     I'm looking forward to your comments on how closely my thoughts align with yours.


                    • dampersand
                      Junior Member
                      • Apr 2016
                      • 16

                      #11
                      Er... not quite... Sorry! That was an interesting train of thought, but you're answering my request for one type of data by telling me I don't want that data, and that some OTHER data would be better.

                      I know exactly what data I'm looking for - amount of data consumed on a given port over a given time. My question isn't even how to find that data - I'm finding that data. We're getting good, solid, strong data. My question is how to get that data in a robust fashion that can survive a router restart.

                      Here, perhaps I can clear up the misunderstanding a little better:

                      I work for a small hosting company. We have lots of dedicated servers, each occupying a different port on a switch. We're curious how much data each server is using in a given month. We also have multiple ISPs - we're curious how much data we're sending/receiving from each ISP in a given month.

                      You mention coming up with methods for seeing traffic discrepancy and bandwidth reporting, or traffic fluctuations - The reason I'm not asking for that is that I've already set that up. I'm polling ifHCInOctets and ifHCOutOctets on 30 second intervals already, and recording the time-based delta. I'm using that to chart bandwidth (B/s) already. I even have the php-weathermap plugin working on Zabbix (which, let me tell you, was a messy endeavor). My Zabbix install is able to do ABSOLUTELY everything cacti could do...

                      EXCEPT

                      Show 'total bandwidth used' over an arbitrary time period.

                      Customers want to know how much data they're using per month (actual amount, not 95th percentile). The HA team and the shared team want to know how much data they're using per month. The systems guys want to know how much data is being sent and received from their backup units per month. The execs and accountants want to know how much actual data is being sent to the ISPs per month.

                      Hopefully that clears it up a little

                      Now, you mentioned charting OIDs once on the 1st and once on the 31st... and you also pointed out why that doesn't work well (a router restart breaks it). My next option was just to do a really slow additive method:

                      Item 1 (Delta) = Simple delta value of ifHCInOctets once per 30 seconds, positive integers only
                      Item 2 (Consumption) = Item 2 + Item 1, once per 30 seconds
                      Item 3 (Start-Of-Month Consumption) = Item 2, polled at 12:01 AM on the 1st of every month
                       Item 4 (Month-To-Date Consumption) = Item 2 - Item 3, once per minute.

                      As you can see, this method SHOULD be foolproof - upon a router restart, Item 1 will go negative for 30 seconds, then immediately go positive. The trouble is that since Item 2 relies on Item 2 AND Item 1, whenever Item 1 goes negative (unsupported), Item 2 immediately goes unsupported. Although Item 1 comes back to supported, Item 2 makes a check of its dependencies, sees that Item 2 is unsupported, and quits - Item 2, then, never re-supports on its own.


                      • dampersand
                        Junior Member
                        • Apr 2016
                        • 16

                        #12
                        Also, this chart shows an example of how I might use the data (numbers and switch names removed).



                        On the right, the graph is of Item 4 - Month-To-Date Consumption. Notice how on the first of the month it drops to 0. As you progress through the month, you can see the consumption increase, and it should peak out on the final day of the month.

                        On the left, you'll see that I'm charting current month-to-date consumption, and last month's consumption (found by the formula last("item3", #1) - last("item3", #2) ).

                        Effectively, I've created an executive report that runs perfectly... right up until a router goes down. Once a router goes down, I need to set Item 2 to a static number until it becomes re-supported, then swap it out for the Item 2 + Item 1 formula.


                        • kloczek
                          Senior Member
                          • Jun 2006
                          • 1771

                          #13
                           So generally, I see that you are interested in exactly this type of data because you have accounting needs.
                           In that case, the only question (IMO) is: do you need those accounting data once a month, or do you want access to partial data in the meantime as well?

                           In the first case, my advice would be to stop trying to bend Zabbix to provide such business data, and instead write a short script which, over the Zabbix API, generates the list of itemids and then produces the accounting summary data by calculating the integral over the given period from trends data (only).
                           If you have a slave DB, you can run such a script against the slave so as not to disturb the master DB's performance.
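                           Such a script could look roughly like this sketch (the standard Zabbix trends_uint schema is assumed; the itemid and time window are placeholders, and the API step for collecting itemids is omitted):

```python
# Sketch of an out-of-band accounting script: Riemann-sum the hourly rows of
# the trends_uint table (itemid, clock, value_avg) on a slave DB. value_avg
# is assumed to hold the hourly mean speed in bytes/s for a delta-per-second
# item, so SUM(value_avg) * 3600 is the volume for the covered hours.

HOUR = 3600  # one trends row covers one hour

def monthly_volume_query(itemid, month_start, month_end):
    # SQL to run against a (slave) Zabbix database; itemid/timestamps are
    # placeholders resolved elsewhere (e.g. via the item.get API method).
    return (
        "SELECT SUM(value_avg) * 3600 AS bytes FROM trends_uint "
        f"WHERE itemid = {itemid} "
        f"AND clock >= {month_start} AND clock < {month_end}"
    )

def volume_from_rows(rows):
    """The same sum done client-side; rows are (clock, value_avg) tuples."""
    return sum(avg * HOUR for _, avg in rows)

q = monthly_volume_query(10042, 1470009600, 1472688000)
assert "trends_uint" in q and "10042" in q
assert volume_from_rows([(0, 1000.0), (3600, 2000.0)]) == 10800000.0
```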

                           Some general thoughts about using item delta instead of delta per second.
                           A plain delta is useful only for very slowly changing counters, where you care more about the fact that a change happened than about the value of the change.
                           Using delta instead of delta per second creates problems when you want to change the sampling rate; always using delta per second avoids that issue.

                           BTW: your graph name is a bit misleading. You have "Monthly Bandwidth Usage" when in reality it is "Monthly Total Transferred Data", i.e. volume.
                           Bandwidth is a speed factor (the first derivative over time); in your case it is the total accounted data volume.

                           I have an idea about how to organise the accounted-data-volume calculated item definition.
                           You will need:
                           - normal items with delta per second on your ifHCInOctets and ifHCOutOctets OIDs
                           - calculated items with the definitions:
                           TotalTransferIn = last(TotalTransferIn) + last(ifHCInOctets)*history_interval
                           TotalTransferOut = last(TotalTransferOut) + last(ifHCOutOctets)*history_interval

                           Additionally, you will need to generate the full list of those TotalTransferIn and TotalTransferOut items and, once a month from crontab, put 0 into the last value of each of them. One caveat: the history interval of each ifHCInOctets/TotalTransferIn pair needs to be exactly the same.
                           Another issue may be the first value of such counters. Every newly created item will probably need a first data point equal to 0, because last() would otherwise return an error and the item could end up permanently unsupported without that initialisation. The one-liner used to reset the data every month could be used for initialisation as well.

                           The precision of TotalTransferIn/TotalTransferOut may not be perfect, but at the scale of counting a whole month's data volume it should be enough to produce values with an error probably below 0.01%.
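                           One hedged sketch of that monthly reset: inserting a zero datapoint directly into the history table for each item. This is an assumption about how "put 0 in last values" could be done (the history_uint schema is standard, but writing to it behind the server's back is not a documented mechanism):

```python
# Builds the SQL a monthly crontab job might run for each TotalTransferIn /
# TotalTransferOut itemid. The table/column names follow the standard Zabbix
# schema (history_uint: itemid, clock, ns, value); everything else here is
# an illustrative assumption, not a documented Zabbix API.
import time

def reset_statement(itemid, now=None):
    clock = int(now if now is not None else time.time())
    return (
        "INSERT INTO history_uint (itemid, clock, ns, value) "
        f"VALUES ({itemid}, {clock}, 0, 0)"
    )

# A cron job at 00:01 on the 1st of each month could run this for every
# itemid in the generated list (e.g. collected via the item.get API method).
stmt = reset_statement(12345, now=1470009600)
assert stmt.endswith("VALUES (12345, 1470009600, 0, 0)")
```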
                          Last edited by kloczek; 16-08-2016, 03:19.


                          • dampersand
                            Junior Member
                            • Apr 2016
                            • 16

                            #14
                            Originally posted by kloczek
                             So generally, I see that you are interested in exactly this type of data because you have accounting needs.
                             In that case, the only question (IMO) is: do you need those accounting data once a month, or do you want access to partial data in the meantime as well?
                            Yup! Hit it on the head. I would prefer partial data in the meantime.

                             In the first case, my advice would be to stop trying to bend Zabbix to provide such business data, and instead write a short script which, over the Zabbix API, generates the list of itemids and then produces the accounting summary data by calculating the integral over the given period from trends data (only).
                             This is an option, of course, but I don't really want to exercise it - direct DB queries are something we did in Cacti, too, and no matter how well you document them, whoever inherits them will forget to maintain them - I guarantee it. Naturally, I'm writing on a Zabbix forum because I'm looking for a Zabbix answer. :P

                             Some general thoughts about using item delta instead of delta per second.
                             A plain delta is useful only for very slowly changing counters, where you care more about the fact that a change happened than about the value of the change.
                             Using delta instead of delta per second creates problems when you want to change the sampling rate; always using delta per second avoids that issue.
                            Using delta-per-second and then re-integrating over time is something that I will absolutely disagree with you on. Mainly because sometimes I don't KNOW the time variable to integrate on (if I pay attention to the logs, a 30 second polling interval may sometimes take a few extra seconds here or there, which means an integral over a 'nominal' 30 seconds will lose bytes), but also because frankly, it's an unnecessary extra step. If you've got data A and need data A, you don't convert A to B and then B to A. That's like taking the Laplace transform of 2 + 2 in order to find that it's 4 - no sense taking extra steps.
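                             The lost-bytes argument is easy to check numerically. In this illustrative sketch, one poll arrives 3 seconds late; summing raw deltas stays exact, while re-integrating the stored per-second rate over the nominal interval undercounts:

```python
# (time, ifHCInOctets) polls: the second interval is 33 s, not the nominal 30.
polls = [(0, 0), (30, 3000), (63, 6300), (93, 9300)]

# Method 1: sum the raw deltas -- exact by construction.
raw = sum(c1 - c0 for (_, c0), (_, c1) in zip(polls, polls[1:]))
assert raw == 9300

# Method 2: store a per-second rate, then re-integrate over the NOMINAL 30 s.
NOMINAL = 30
reintegrated = 0.0
for (t0, c0), (t1, c1) in zip(polls, polls[1:]):
    rate = (c1 - c0) / (t1 - t0)     # what a delta-per-second item stores
    reintegrated += rate * NOMINAL   # integral assuming the nominal interval
assert reintegrated == 9000.0        # the 33 s gap dropped 300 bytes
```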

                             BTW: your graph name is a bit misleading. You have "Monthly Bandwidth Usage" when in reality it is "Monthly Total Transferred Data", i.e. volume.
                             Bandwidth is a speed factor (the first derivative over time); in your case it is the total accounted data volume.
                            Or, to be more consistent with this entire thread, "Consumption."

                             I have an idea about how to organise the accounted-data-volume calculated item definition.
                             You will need:
                             - normal items with delta per second on your ifHCInOctets and ifHCOutOctets OIDs
                             - calculated items with the definitions:
                             TotalTransferIn = last(TotalTransferIn) + last(ifHCInOctets)*history_interval
                             TotalTransferOut = last(TotalTransferOut) + last(ifHCOutOctets)*history_interval
                            I don't mean to be rude, but I don't think you're reading my posts - this is almost word for word what I'm already doing, except that my method uses 'max(86400)' instead of 'last' to account for any crappy numbers that might accidentally find their way in. I outlined both in the last post and in my first post the problem with this method - if ever "ifHCInOctets" goes unsupported, it will cause "TotalTransferIn" (in my words, "ConsumptionIn") to go unsupported, which will start a 'not supported' loop.

                             Additionally, you will need to generate the full list of those TotalTransferIn and TotalTransferOut items and, once a month from crontab, put 0 into the last value of each of them. One caveat: the history interval of each ifHCInOctets/TotalTransferIn pair needs to be exactly the same.
                            The problem you're having with the history interval is exactly why I'm using 'delta' instead of 'delta per second.' You are having problems because you're changing A -> B -> A.

                             Another issue may be the first value of such counters. Every newly created item will probably need a first data point equal to 0, because last() would otherwise return an error and the item could end up permanently unsupported without that initialisation. The one-liner used to reset the data every month could be used for initialisation as well.
                            Yup. I already use this initialization method. In fact, when the problem in my initial post occurred, I summed up the deltas and added them to the last known 'good' number on Consumption, then initialized with that number. It's a right pain in the ass - it must be done manually for every router that goes under. It's a backup method, and not very acceptable.


                            Listen, thank you very much for your help, I appreciate it a lot, but... thus far you've told me that the data I want is not the data I want, that the solution should be easy for zabbix to implement (but doesn't exist), and finally you've parroted the solution that I've been asking for an upgrade to. I really really appreciate you taking a look, but I'm not really sure we're on the same page here.


                            • kloczek
                              Senior Member
                              • Jun 2006
                              • 1771

                              #15
                              Originally posted by dampersand
                              Listen, thank you very much for your help, I appreciate it a lot, but... thus far you've told me that the data I want is not the data I want, that the solution should be easy for zabbix to implement (but doesn't exist), and finally you've parroted the solution that I've been asking for an upgrade to. I really really appreciate you taking a look, but I'm not really sure we're on the same page here.
                               Sorry, my fault. It was a bit too late..
                               I'll try to look over the whole conversation one more time, maybe at the end of today.

