Designing "checks", best practices

  • ibtanhe
    Junior Member
    • Feb 2014
    • 17

    #1

    Designing "checks", best practices

    Hi all!

    We've decided to make the switch to Zabbix and I'm currently looking at recreating everything we have running in Nagios. I've spent the last few weeks looking at different parts of Zabbix and I really like what I see in most areas, except one: the "checks".

    I've googled like a mad man for information regarding how to create checks "the Zabbix way" and I've come to the conclusion that I'm either missing something really important (very likely, hence this post) or that we're gonna have huge difficulties in recreating all our current checks.

    An example:

    I've got this Nagios-check that connects to our VMware Virtual Center-machines and looks at every datastore, determining (using warning and critical thresholds) whether we need to alert the techies or not. This check is run every X minutes against each of our VC-servers, regardless of how many datastores exist on each VC.

    Now, AFAICT, the Zabbix way consists of creating some kind of "magic" autodiscovery of datastores on each VC, which creates the appropriate items/triggers/whatnot for _each_ enumerated datastore.

    So, assuming I have a script on the Zabbix-server which takes hostname/IP, credentials and datastore as arguments, it ought to run this check for each datastore, on each VC. Is this correct? If so, it's a terrible, terrible solution which must perform catastrophically, and I'm sincerely hoping there's a smarter solution which you might help me out with!

    The other thing which I'm missing from Nagios big-time is the fact that a check can return both a message and a return-code. Is there any way to do the same in Zabbix? I'll show you an example:

    Say that I want to check whether a host is NTP-synchronized or not. Issuing a remote script which in turn runs for example "ntpdate -q server" and extracting the offset is easy enough, and that can be used as an item. But say for example that something unexpected happens, maybe the NTP-server cannot be reached for the time being. ntpdate will return offset 0.00000, so potentially this could be interpreted as OK. In Nagios, I would of course check the exitcode from ntpdate and use that to compose the correct check-returncode and message, but how would I correctly recreate this check in Zabbix? I would like the trigger to alert us if the time difference is too big or if the NTP-server is unreachable. One really ugly way would be for the script to return a predetermined bogus value if ntpdate fails (say -666 or anything else unlikely to happen IRL) and use that in the trigger somehow.

    Cheers!

    -- Andy
  • steveboyson
    Senior Member
    • Jul 2013
    • 582

    #2
    Zabbix separates the logic for metrics gathering from the logic for threshold evaluation, unlike Nagios, which binds the two together.

    In "Zabbix language":
    - metrics is "item"
    - threshold evaluation is "trigger"

    Basically, in a first step you collect data via an "item". This is the actual check to perform. It just receives a data value.
    In a second step, whenever a data value comes in, any configured triggers are evaluated. Based on the trigger conditions, this will change the trigger's state and optionally execute configured alerting actions (that is: mail, script, jabber message and the like).

    For the vSphere part I would like to refer to the available documentation ...


    • ibtanhe
      Junior Member
      • Feb 2014
      • 17

      #3
      Umm, thanks for the reply steveboyson, but it seems to me that you didn't even read my post before answering. I know what items/triggers are, that's not the issue.


      • aib
        Senior Member
        • Jan 2014
        • 1615

        #4
        Originally posted by ibtanhe
        I've got this Nagios-check that connects to our VMware Virtual Center-machines and looks at every datastore, determining (using warning and critical thresholds) whether we need to alert the techies or not. This check is run every X minutes against each of our VC-servers, regardless of how many datastores exist on each VC.

        Now, AFAICT, the Zabbix way consists of creating some kind of "magic" autodiscovery of datastores on each VC, which creates the appropriate items/triggers/whatnot for _each_ enumerated datastore.

        So, assuming I have a script on the Zabbix-server which takes hostname/IP, credentials and datastore as arguments, it ought to run this check for each datastore, on each VC. Is this correct? If so, it's a terrible, terrible solution which must perform catastrophically, and I'm sincerely hoping there's a smarter solution which you might help me out with!
        Please explain to me the difference between the NAGIOS-way and the ZABBIX-way of checking two VMware VCs with five datastores on each.
        - Nagios has to ask each datastore and decide "whether we need to alert the techies or not"
        - Zabbix has to collect information from each datastore and check the triggers to decide "whether we need to alert the techies or not"

        Discovery rules are used once to create all Items/Triggers/Graphs/Guest VMs. Yes, you can configure Discovery to repeat cyclically every hour/day/week/whatever, but that's not important if you have a stable environment where guest VMs/datastores/etc. are not created or deleted often.
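As a rough illustration of what a custom discovery rule consumes: a low-level discovery script usually prints a JSON document listing the discovered entities, and Zabbix creates items and triggers from prototypes for each entry. The sketch below assumes the `{"data": [...]}` layout used by Zabbix 2.x low-level discovery; the macro name `{#DSNAME}` and the datastore names are made up for the example.

```python
import json


def build_discovery_payload(datastore_names):
    """Return LLD-style JSON with one entry per datastore.

    The {#DSNAME} macro name is a hypothetical choice for this sketch.
    """
    entries = [{"{#DSNAME}": name} for name in sorted(datastore_names)]
    return json.dumps({"data": entries})


if __name__ == "__main__":
    # In a real setup the names would come from one query against the VC.
    print(build_discovery_payload(["ds-gold-01", "ds-silver-02"]))
```

The point is that enumeration happens in one call per VC; the per-datastore items are only definitions created from the result.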

        What do you worry about?
        Sincerely yours,
        Aleksey


        • steveboyson
          Senior Member
          • Jul 2013
          • 582

          #5
          Originally posted by ibtanhe
          Umm, thanks for the reply steveboyson, but it seems to me that you didn't even read my post before answering. I know what items/triggers are, that's not the issue.
          Well, and it seems to me that you did not read my statement carefully. Zabbix works differently from Nagios, as it follows the scheme "one check - one value".

          That means, "NTP server cannot be reached" and "NTP server's time offset is > $NUM" are TWO zabbix checks with at least one trigger each, although each item can have a .nodata() trigger which would cover unreachability.

          Nota bene: nobody prevents you from emitting a "ZBX_NOTSUPPORTED" up to zabbix in your ntp_check script if the upstream server fails ...
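A minimal sketch of such an ntp_check wrapper, under a few assumptions: the script is handed the exit code and text output of `ntpdate -q`, the result line has the shape shown in the comment, a stratum of 0 means the server gave no usable answer, and the marker string Zabbix recognises is spelled `ZBX_NOTSUPPORTED`. The function name `item_value` is hypothetical.

```python
import re

NOT_SUPPORTED = "ZBX_NOTSUPPORTED"

# Assumed shape of an `ntpdate -q` result line, e.g.
# "server 192.0.2.10, stratum 2, offset -0.001234, delay 0.02571"
_SERVER_LINE = re.compile(r"stratum (\d+), offset ([-+]?\d+\.\d+)")


def item_value(exit_code, output):
    """Map one ntpdate run to the single value the item should report."""
    if exit_code != 0:
        return NOT_SUPPORTED
    for line in output.splitlines():
        match = _SERVER_LINE.search(line)
        # Stratum 0 is treated here as "no usable answer from that server".
        if match and int(match.group(1)) > 0:
            return match.group(2)
    return NOT_SUPPORTED
```

This avoids the bogus "-666" value: the item either carries a real offset or no value at all, and the unreachable case is handled by a separate trigger.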
          Last edited by steveboyson; 20-03-2014, 16:35.


          • coreychristian
            Senior Member
            Zabbix Certified Specialist
            • Jun 2012
            • 159

            #6
            Originally posted by ibtanhe
            Umm, thanks for the reply steveboyson, but it seems to me that you didn't even read my post before answering. I know what items/triggers are, that's not the issue.
            Just to point out, you mentioned you were new and it's always best to check the obvious first, especially with someone who is new to the tool or trying to understand the differences.


            Originally posted by ibtanhe
            Say that I want to check whether a host is NTP-synchronized or not. Issuing a remote script which in turn runs for example "ntpdate -q server" and extracting the offset is easy enough, and that can be used as an item. But say for example that something unexpected happens, maybe the NTP-server cannot be reached for the time being. ntpdate will return offset 0.00000, so potentially this could be interpreted as OK. In Nagios, I would of course check the exitcode from ntpdate and use that to compose the correct check-returncode and message, but how would I correctly recreate this check in Zabbix? I would like the trigger to alert us if the time difference is too big or if the NTP-server is unreachable. One really ugly way would be for the script to return a predetermined bogus value if ntpdate fails (say -666 or anything else unlikely to happen IRL) and use that in the trigger somehow.
            Just to verify: you are essentially checking two things. One, does ntpdate run? Two, what is the time difference returned?

            There are a few different ways you could do this within zabbix.

            I would probably do something like the following.

            1. As you mentioned, have an item that the offset is populated in.
            2. Then create two triggers: one trigger if there has been no data (example: 'nodata(10m)'), another trigger if the offset is greater than X.
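To make the two-trigger idea concrete, here is a small Python model of the logic. It is not Zabbix code or trigger syntax; the function name, the sample layout, and the thresholds are all assumptions chosen for illustration.

```python
def evaluate_triggers(samples, now, max_silence, max_offset):
    """Return the names of the problems that would be raised.

    samples: list of (timestamp, offset_seconds) tuples, oldest first.
    """
    problems = []
    recent = [ts for ts, _ in samples if now - ts <= max_silence]
    if not recent:
        # Models the idea behind a nodata(10m)-style trigger.
        problems.append("no data")
    if samples and abs(samples[-1][1]) > max_offset:
        # Models a threshold trigger on the latest collected value.
        problems.append("offset too large")
    return problems
```

The two conditions are independent: a silent item raises "no data" even when its last known offset was fine, which is what replaces the Nagios exit-code check.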


            • ibtanhe
              Junior Member
              • Feb 2014
              • 17

              #7
              Hey!

              Originally posted by aib
              Please explain to me the difference between the NAGIOS-way and the ZABBIX-way of checking two VMware VCs with five datastores on each.
              - Nagios has to ask each datastore and decide "whether we need to alert the techies or not"
              - Zabbix has to collect information from each datastore and check the triggers to decide "whether we need to alert the techies or not"
              The (my) Nagios-way would be running the check once per VC, no matter how many datastores there are.

              If no thresholds are reached (for any datastore), return code 0 and a simple "OK".
              If warning-thresholds are reached, return code 1 and the name(s) of datastore(s) that have reached said threshold.
              If critical-thresholds are reached, return code 2 and the name(s) of datastore(s) that have reached said threshold.
              If the check fails for some reason, I can return 2 (CRITICAL) or 3 (UNKNOWN) depending on how I want to proceed.
              That's a helluva lot of information returned.
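For readers unfamiliar with this style of check, the behaviour described above can be sketched roughly as follows. The function name, the use of "percent used" as the metric, and the thresholds are assumptions; the real check queries the VC, which is omitted here.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def check_datastores(usage_percent, warn, crit):
    """Collapse per-datastore usage into one (return code, message) pair."""
    critical = sorted(n for n, used in usage_percent.items() if used >= crit)
    warning = sorted(n for n, used in usage_percent.items() if warn <= used < crit)
    if critical:
        return CRITICAL, "CRITICAL: " + ", ".join(critical)
    if warning:
        return WARNING, "WARNING: " + ", ".join(warning)
    return OK, "OK"
```

One run per VC yields a single state plus the names of the offending datastores, which is the information a one-value-per-item model has to spread across several items.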

              Now the Zabbix-way (and this is where I stumble) seems to be to have an item per datastore plus additional trigger(s). The way I see it, Zabbix makes one itemdata-collection per datastore per VC, is this correct?

              Originally posted by aib
              Discovery rules are used once to create all Items/Triggers/Graphs/Guest VMs. Yes, you can configure Discovery to repeat cyclically every hour/day/week/whatever, but that's not important if you have a stable environment where guest VMs/datastores/etc. are not created or deleted often.

              What do you worry about?
              Well, we're running a datacenter. Our current Nagios-setup monitors 900+ hosts and 12000+ services. It's not unstable, but it sure isn't static.

              The discovery is a great feature. In Nagios, I'm just letting the check do that inline every time, so it's just two different ways of doing the same thing.


              • ibtanhe
                Junior Member
                • Feb 2014
                • 17

                #8
                Originally posted by steveboyson
                Well, and it seems to me that you did not read my statement carefully. Zabbix works differently from Nagios, as it follows the scheme "one check - one value".

                That means, "NTP server cannot be reached" and "NTP server's time offset is > $NUM" are TWO zabbix checks with at least one trigger each, although each item can have a .nodata() trigger which would cover unreachability.

                Nota bene: nobody prevents you from emitting a "ZBX_NOTSUPPORTED" up to zabbix in your ntp_check script if the upstream server fails ...
                I did read your statement. You briefly explained items and triggers and referred to the available vSphere documentation which had little to do with my question. This answer, however, was much better so thank you!

                I had not run into .nodata() yet, will have to check that out! Also, the ZBX_NOTSUPPORTED tip could be useful!

                Cheers!


                • ibtanhe
                  Junior Member
                  • Feb 2014
                  • 17

                  #9
                  Hey!

                  Originally posted by coreychristian
                  Just to point out, you mentioned you were new and it's always best to check the obvious first, especially with someone who is new to the tool or trying to understand the differences.

                  Just to verify: you are essentially checking two things. One, does ntpdate run? Two, what is the time difference returned?

                  There are a few different ways you could do this within zabbix.

                  I would probably do something like the following.

                  1. As you mentioned, have an item that the offset is populated in.
                  2. Then create two triggers: one trigger if there has been no data (example: 'nodata(10m)'), another trigger if the offset is greater than X.
                  Thanks for the suggestion! So it would suffice if the check returns no output to trigger (umm, no pun intended) the trigger?

                  Cheers!


                  • steveboyson
                    Senior Member
                    • Jul 2013
                    • 582

                    #10
                    For returning more than one value in a single item call you may want to use trapper items.

                    That is: perform one check against your vcenter server (means: one item), let this script calculate different metrics, send them all back to zabbix via trapper items and finally let your script return a success value.

                    That way you can collect several item metrics with just a single item call and, of course, evaluate several triggers.
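A minimal sketch of the "send them all back" step, assuming the values are handed to the `zabbix_sender` utility as whitespace-separated `<host> <key> <value>` lines. The item key `datastore.free[...]`, the host name, and the numbers are hypothetical; the keys must match trapper items that already exist on the Zabbix side.

```python
def sender_lines(host, free_percent_by_datastore):
    """Format one '<host> <key> <value>' line per datastore."""
    lines = []
    for name, free in sorted(free_percent_by_datastore.items()):
        # Hypothetical trapper item key; names containing spaces would
        # need quoting, which this sketch does not handle.
        key = "datastore.free[{}]".format(name)
        lines.append("{} {} {}".format(host, key, free))
    return lines


if __name__ == "__main__":
    for line in sender_lines("vc01.example.com", {"ds1": 42.5, "ds2": 17.0}):
        print(line)
```

The resulting lines could be written to a file and passed to `zabbix_sender` with its `-i` option, so one expensive query against the VC feeds many items.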


                    • steveboyson
                      Senior Member
                      • Jul 2013
                      • 582

                      #11
                      Originally posted by ibtanhe
                      Hey!
                      Thanks for the suggestion! So it would suffice if the check returns no output to trigger (umm, no pun intended) the trigger?
                      Cheers!
                      Regular agent items do have an interval (you might know this already). If an item does not receive a value within the given interval, an internal mechanism kicks in that fires a configured ".nodata" trigger automatically. You just have to specify that .nodata($TIME|$VALUES) trigger.

                      If all that was known to you then I do not understand your question. Sorry.


                      • aib
                        Senior Member
                        • Jan 2014
                        • 1615

                        #12
                        Originally posted by ibtanhe
                        The (my) Nagios-way would be running the check once per VC, no matter how many datastores there are.

                        If no thresholds are reached (for any datastore), return code 0 and a simple "OK".
                        If warning-thresholds are reached, return code 1 and the name(s) of datastore(s) that have reached said threshold.
                        If critical-thresholds are reached, return code 2 and the name(s) of datastore(s) that have reached said threshold.
                        If the check fails for some reason, I can return 2 (CRITICAL) or 3 (UNKNOWN) depending on how I want to proceed.
                        That's a helluva lot of information returned.
                        Ah, the good old days when I was using Nagios + MRTG.
                        I miss them, but not much.

                        In NAGIOS you still have some scripts which request and check all datastores, right?
                        And you create only one trigger on the frontend side.
                        1) Nobody can stop you from using the same script as a UserParameter and creating one Item to show the script result.
                        Then you can create one trigger for that one Item and - Profit!

                        2) In Zabbix you can also create as many Items and as many Triggers as you need.
                        You can also create one MEGA-trigger which checks all datastore thresholds and fires only if there is a problem with any of them.
                        It will fully emulate your Nagios behavior.

                        So far you already have two different ways to accomplish the task.
                        Right?
                        Well, we're running a datacenter. Our current Nagios-setup monitors 900+ hosts and 12000+ services. It's not unstable, but it sure isn't static.

                        The discovery is a great feature. In Nagios, I'm just letting the check do that inline every time, so it's just two different ways of doing the same thing.
                        One more great thing is that you can create a Template which will be automatically assigned to discovered hosts, so you collect only the information which you qualify as important for that type of guest VM.
                        (for example, for Windows DB server - some metrics of DB can be included into Windows DB template; for LAMP VM - some metrics for Mysql/Apache/OS can be included into LAMP template)


                        Sorry, I like Zabbix so much that I cannot stop talking about it.
                        Please don't hesitate to ask more questions; I will do my best to answer them.
                        Sincerely yours,
                        Aleksey


                        • ibtanhe
                          Junior Member
                          • Feb 2014
                          • 17

                          #13
                          Originally posted by steveboyson
                          For returning more than one value in a single item call you may want to use trapper items.

                          That is: perform one check against your vcenter server (means: one item), let this script calculate different metrics, send them all back to zabbix via trapper items and finally let your script return a success value.

                          That way you can collect several item metrics with just a single item call and, of course, evaluate several triggers.
                          This seems very interesting. You see, one of my concerns is to avoid putting unnecessary strain on the servers (both the Zabbix server and the monitored hosts). This particular datastore check I've mentioned is very expensive resource-wise and it would be pure madness to run it once per datastore. Using trapper items, I can see another way: use Discovery to set up the necessary trapper items and triggers, then use a modified version of my script to collect data for all datastores in one go and finish off by submitting the data, item by item, to Zabbix. Also, I'm counting on .nodata() being applicable to trapper items as well, to catch potential problems.


                          • steveboyson
                            Senior Member
                            • Jul 2013
                            • 582

                            #14
                            Nope, since trapper items have no interval, zabbix does not know in which time period values were expected and thus cannot fire up .nodata triggers.
                            Pretty obvious, I might think.


                            • steveboyson
                              Senior Member
                              • Jul 2013
                              • 582

                              #15
                              You might have noticed that querying a vSphere datastore via the Perl or Python API is a time & resource consuming task.

                              Therefore I doubt you can handle that in a single item call. At least in our vSphere environment a single call lasts up to 45 seconds per ESX host which is 15 seconds longer than the configurable maximum agent or server timeout value.

                              We perform periodic checks on the ESX hosts (running on the vMA against our vcenter server via a cron job), store the gathered values in a parseable text file, send that text file to zabbix and let the zabbix server do the parsing and item delivery stuff.
                              We check the file date of that file and fire a trigger if it is older than $NUMBER minutes, so we have control over the workflow.
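The freshness check on the results file can be sketched like this. It is a rough stand-alone illustration, not the poster's actual setup: the function names and the choice to treat a missing file as stale are assumptions.

```python
import os
import time


def results_file_age(path, now=None):
    """Return the age of the results file in seconds, based on its mtime."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_stale(path, max_age_seconds, now=None):
    """True if the file is missing or older than max_age_seconds."""
    try:
        return results_file_age(path, now) > max_age_seconds
    except OSError:
        # A missing or unreadable file counts as stale in this sketch.
        return True
```

Watching the file's age gives the cron-driven workflow its own "no data" signal, independent of how the values themselves reach Zabbix.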

                              Of course, this periodic cron job could emit the values directly via trapper items to zabbix. But as mentioned before, they have no .nodata triggers.
