**steveboyson** · 20-03-2014, 12:48

Zabbix separates the logic for metrics gathering from threshold evaluation, unlike Nagios, which binds the two together.

In "Zabbix language":
- a metric is an "item"
- a threshold evaluation is a "trigger"

Basically, in a first step you collect data via an "item". This is the actual check to perform. It just receives a data value.
In a second step, whenever a data value comes in, any configured triggers are evaluated. Depending on the trigger conditions, this changes the trigger's state and optionally executes configured alerting actions (that is: mail, script, Jabber message and the like).
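To make the two steps concrete, here is a minimal sketch in Python. It is only a toy model of the idea, not how Zabbix is implemented; the class names `Item` and `Trigger` and the key `datastore.free.pct` are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trigger:
    """A named condition evaluated against an item's latest value."""
    name: str
    condition: Callable[[float], bool]  # returns True when there is a problem
    in_problem: bool = False


@dataclass
class Item:
    """Stores collected values; it knows nothing about thresholds itself."""
    key: str
    history: List[float] = field(default_factory=list)
    triggers: List[Trigger] = field(default_factory=list)

    def receive(self, value: float) -> List[str]:
        """Step 1: store the value. Step 2: re-evaluate attached triggers.

        Returns the names of the triggers whose state changed.
        """
        self.history.append(value)
        changed = []
        for trigger in self.triggers:
            new_state = trigger.condition(value)
            if new_state != trigger.in_problem:
                trigger.in_problem = new_state
                changed.append(trigger.name)
        return changed


# Hypothetical item with one threshold trigger attached.
disk = Item(key="datastore.free.pct")
disk.triggers.append(Trigger("free space below 10%", lambda v: v < 10))

print(disk.receive(42.0))  # value is stored, no trigger changes state
print(disk.receive(7.5))   # the trigger switches to its problem state
```

The point of the split is that the collection code never has to know the thresholds: you can add, remove or change triggers without touching the check itself.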

For the vSphere part I would like to refer to the available documentation ...

**ibtanhe** · 20-03-2014, 16:17

Umm, thanks for the reply steveboyson, but it seems to me that you didn't even read my post before answering. I know what items/triggers are, that's not the issue.

**aib** · 20-03-2014, 16:22

Originally posted by ibtanhe

I've got this Nagios-check that connects to our VMware Virtual Center-machines and looks at every datastore, determining (using warning and critical thresholds) whether we need to alert the techies or not. This check is run every X minutes against each of our VC-servers, regardless of how many datastores exist on each VC.

Now, AFAICT, the Zabbix way consists of creating some kind of "magic" autodiscovery of datastores on each VC, which creates the appropriate items/triggers/whatnot for _each_ enumerated datastore.

So, assuming I have a script on the Zabbix-server which takes hostname/IP, credentials and datastore as arguments, it ought to run this check for each datastore, on each VC. Is this correct? If so, it's a terrible, terrible solution which must be catastrophic performance-wise and I'm sincerely hoping there's a smarter solution which you might help me out with!

Please explain to me the difference between the Nagios way and the Zabbix way of checking two VMware VCs with five datastores on each.
- Nagios has to ask each datastore and decide "whether we need to alert the techies or not"
- Zabbix has to collect information from each datastore and check the triggers to decide "whether we need to alert the techies or not"

A discovery rule is used once to create all the items/triggers/graphs/guest VMs. Yes, you can configure discovery to repeat every hour/day/week/whatever, but that's not important if you have a stable environment where guest VMs/datastores/etc. are rarely created or deleted.

What do you worry about?
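For illustration, a discovery script only has to list the datastores; it does not measure anything. Below is a minimal Python sketch that prints a low-level discovery payload. The `{"data": [...]}` layout is the format as I understand it for Zabbix 2.x, and the macro name `{#DATASTORE}` and the datastore names are made up, so check them against the documentation for your version.

```python
import json
from typing import Iterable


def discovery_payload(datastores: Iterable[str]) -> str:
    """Build discovery JSON with one {#DATASTORE} macro per datastore."""
    data = [{"{#DATASTORE}": name} for name in datastores]
    return json.dumps({"data": data}, indent=2)


# Hypothetical names; a real script would list them via the vSphere API.
print(discovery_payload(["ds-prod-01", "ds-prod-02"]))
```

Item and trigger prototypes can then reference the macro, so one prototype turns into one item and one trigger per discovered datastore.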

**steveboyson** · 20-03-2014, 16:32

Originally posted by ibtanhe

Umm, thanks for the reply steveboyson, but it seems to me that you didn't even read my post before answering. I know what items/triggers are, that's not the issue.

Well, and it seems to me that you did not read my statement carefully. Zabbix works differently from Nagios, as it follows the scheme "one check - one value".

That means, "NTP server cannot be reached" and "NTP server's time offset is > $NUM" are TWO zabbix checks with at least one trigger each, although each item can have a .nodata() trigger which would cover unreachability.

Nota bene: nobody prevents you from emitting a "ZBX_NOTSUPPORTED" up to Zabbix in your ntp_check script if the upstream server fails ...

**coreychristian** · 20-03-2014, 16:43

Originally posted by ibtanhe

Umm, thanks for the reply steveboyson, but it seems to me that you didn't even read my post before answering. I know what items/triggers are, that's not the issue.

Just to point out, you mentioned you were new and it's always best to check the obvious first, especially with someone who is new to the tool or trying to understand the differences.

Originally posted by ibtanhe

Say that I want to check whether a host is NTP-synchronized or not. Issuing a remote script which in turn runs for example "ntpdate -q server" and extracting the offset is easy enough, that can be used as an item. But say for example that something unexpected happens, maybe the NTP-server cannot be reached for the time being. ntpdate will return offset 0.00000, so potentially this could be interpreted as OK. In Nagios, I would of course check the exit code from ntpdate and use that to compose the correct check return code and message, but how would I correctly recreate this check in Zabbix? I would like the trigger to alert us if the time difference is too big or if the NTP-server is unreachable. One really ugly way would be if the script returns a predetermined bogus-item-data if ntpdate fails (say -666 or anything else less likely to happen IRL) and use that in the trigger somehow.

Just to verify: you are essentially checking two things. One, does ntpdate run? Two, what is the time difference returned?

There are a few different ways you could do this within zabbix.

I would probably do something like the following.

1. As you mentioned, have an item that the offset is populated in.
2. Then create two triggers: one that fires if there has been no data (example: 'nodata(10m)'), and another that fires if the offset is greater than X.
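A rough sketch of the item side of this, in Python: the script reports the offset only when ntpdate actually succeeded, so a failure shows up as missing data rather than as a misleading 0. The sample output format is an assumption based on typical `ntpdate -q` output and may differ between versions; how Zabbix treats an empty result also depends on the item type, so verify that part in your setup.

```python
import re
from typing import Optional

# Matches the "offset <seconds>" field, with an optional sign.
OFFSET_RE = re.compile(r"offset\s+([-+]?\d+(?:\.\d+)?)")


def extract_offset(exit_code: int, output: str) -> Optional[float]:
    """Return the offset in seconds, or None if ntpdate failed.

    None means "report nothing", which leaves the item without a fresh
    value so that a nodata() trigger can catch the failure.
    """
    if exit_code != 0:
        return None
    matches = OFFSET_RE.findall(output)
    if not matches:
        return None
    return float(matches[-1])  # use the last offset that was reported


# Assumed sample of `ntpdate -q` output, for illustration only.
sample = (
    "server 192.0.2.10, stratum 2, offset 0.001234, delay 0.02567\n"
    "20 Mar 16:43:00 ntpdate[1234]: adjust time server 192.0.2.10 "
    "offset 0.001234 sec\n"
)
print(extract_offset(0, sample))
print(extract_offset(1, "no server suitable for synchronization found"))
```

The wrapper that actually runs ntpdate would print the returned value when it is a number and print nothing otherwise; the offset trigger and the nodata trigger then cover the two failure modes separately.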

**ibtanhe** · 20-03-2014, 16:58

Hey!

Originally posted by aib

Please explain to me the difference between the Nagios way and the Zabbix way of checking two VMware VCs with five datastores on each.
- Nagios has to ask each datastore and decide "whether we need to alert the techies or not"
- Zabbix has to collect information from each datastore and check the triggers to decide "whether we need to alert the techies or not"

The (my) Nagios-way would be running the check once per VC, no matter how many datastores there are.

If no thresholds are reached (for any datastore), return code 0 and a simple "OK".
If warning-thresholds are reached, return code 1 and the name(s) of datastore(s) that have reached said threshold.
If critical-thresholds are reached, return code 2 and the name(s) of datastore(s) that have reached said threshold.
If the check fails for some reason, I can return 2 (CRITICAL) or 3 (UNKNOWN) depending on how I want to proceed.
That's a helluva lot of information returned.
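As a rough sketch of what such an aggregated check does (not my actual script; the function name, the usage figures and the thresholds are invented for illustration):

```python
from typing import Dict, Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_datastores(usage_pct: Dict[str, float],
                     warn: float, crit: float) -> Tuple[int, str]:
    """Fold all datastores of one VC into a single Nagios result."""
    if not usage_pct:
        # Nothing came back from the VC, so the check itself failed.
        return UNKNOWN, "UNKNOWN: no datastore data"
    critical = sorted(n for n, pct in usage_pct.items() if pct >= crit)
    warning = sorted(n for n, pct in usage_pct.items() if warn <= pct < crit)
    if critical:
        return CRITICAL, "CRITICAL: " + ", ".join(critical)
    if warning:
        return WARNING, "WARNING: " + ", ".join(warning)
    return OK, "OK"


# Hypothetical usage percentages gathered in one pass over the VC.
code, message = check_datastores(
    {"ds-01": 55.0, "ds-02": 86.0, "ds-03": 97.5}, warn=80, crit=90)
print(code, message)
```

One call, one pass over the VC, one result: that is the behaviour I'd like to keep on the Zabbix side.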

Now the Zabbix-way (and this is where I stumble) seems to be to have an item per datastore plus additional trigger(s). The way I see it, Zabbix makes one itemdata-collection per datastore per VC, is this correct?

Originally posted by aib

A discovery rule is used once to create all the items/triggers/graphs/guest VMs. Yes, you can configure discovery to repeat every hour/day/week/whatever, but that's not important if you have a stable environment where guest VMs/datastores/etc. are rarely created or deleted.

What do you worry about?

Well, we're running a datacenter. Our current Nagios-setup monitors 900+ hosts and 12000+ services. It's not unstable, but it sure isn't static.

The discovery is a great feature. In Nagios, I'm just letting the check do that inline every time, so it's just two different ways of doing the same thing.

**ibtanhe** · 20-03-2014, 17:06

Originally posted by steveboyson

Well, and it seems to me that you did not read my statement carefully. Zabbix works differently from Nagios, as it follows the scheme "one check - one value".

That means, "NTP server cannot be reached" and "NTP server's time offset is > $NUM" are TWO zabbix checks with at least one trigger each, although each item can have a .nodata() trigger which would cover unreachability.

Nota bene: nobody prevents you from emitting a "ZBX_NOTSUPPORTED" up to Zabbix in your ntp_check script if the upstream server fails ...

I did read your statement. You briefly explained items and triggers and referred to the available vSphere documentation which had little to do with my question. This answer, however, was much better so thank you!

I had not run into .nodata() yet, will have to check that out! Also, the ZBX_NOTSUPPORTED tip could be useful!

Cheers!

**ibtanhe** · 20-03-2014, 17:10

Hey!

Originally posted by coreychristian

Just to point out, you mentioned you were new and it's always best to check the obvious first, especially with someone who is new to the tool or trying to understand the differences.

Just to verify: you are essentially checking two things. One, does ntpdate run? Two, what is the time difference returned?

There are a few different ways you could do this within zabbix.

I would probably do something like the following.

1. As you mentioned, have an item that the offset is populated in.
2. Then create two triggers: one that fires if there has been no data (example: 'nodata(10m)'), and another that fires if the offset is greater than X.

Thanks for the suggestion! So it would suffice if the check returns no output to trigger (umm, no pun intended) the trigger?

Cheers!

**steveboyson** · 20-03-2014, 17:12

For returning more than one value in a single item call you may want to use trapper items.

That is: perform one check against your vCenter server (meaning: one item), let the script calculate the different metrics, send them all back to Zabbix via trapper items and finally let your script return a success value.

That way you can collect several item metrics with just a single item call. And, of course, evaluate several triggers.
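A minimal sketch of the sending part in Python, assuming the `zabbix_sender` utility is installed and the trapper items already exist on the host. The host name `vc01` and the item keys are hypothetical; the `<host> <key> <value>` line format is the one zabbix_sender reads with `-i`.

```python
import subprocess
import tempfile
from typing import Dict, List


def sender_lines(host: str, metrics: Dict[str, float]) -> List[str]:
    """Format metrics as '<host> <key> <value>' lines for zabbix_sender -i.

    Host names, keys or values containing spaces would need quoting,
    which this sketch does not handle.
    """
    return [f"{host} {key} {value}" for key, value in sorted(metrics.items())]


def send(server: str, lines: List[str]) -> int:
    """Write the lines to a temporary file and pass it to zabbix_sender."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt") as handle:
        handle.write("\n".join(lines) + "\n")
        handle.flush()
        result = subprocess.run(
            ["zabbix_sender", "-z", server, "-i", handle.name], check=False)
    return result.returncode


# Hypothetical values gathered in a single pass over the vCenter server.
metrics = {"datastore.free[ds-01]": 42.0, "datastore.free[ds-02]": 7.5}
for line in sender_lines("vc01", metrics):
    print(line)
```

The script that Zabbix calls as the regular item would run the collection, call `send(...)`, and then print something like 1 as its own success value.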

**steveboyson** · 20-03-2014, 17:15

Originally posted by ibtanhe

Hey!
Thanks for the suggestion! So it would suffice if the check returns no output to trigger (umm, no pun intended) the trigger?
Cheers!

Regular agent items do have an interval (you might know this already). If an item does not receive a value within the given interval, an internal mechanism kicks in that fires a configured ".nodata" trigger automatically. You just have to specify that .nodata($TIME|$VALUES) trigger.

If all that was known to you then I do not understand your question. Sorry.

**aib** · 20-03-2014, 17:18

Originally posted by ibtanhe

The (my) Nagios-way would be running the check once per VC, no matter how many datastores there are.

If no thresholds are reached (for any datastore), return code 0 and a simple "OK".
If warning-thresholds are reached, return code 1 and the name(s) of datastore(s) that have reached said threshold.
If critical-thresholds are reached, return code 2 and the name(s) of datastore(s) that have reached said threshold.
If the check fails for some reason, I can return 2 (CRITICAL) or 3 (UNKNOWN) depending on how I want to proceed.
That's a helluva lot of information returned.

Ah, the good old days when I was using Nagios + MRTG.
I miss them, but not much.

In Nagios you still have a script which requests and checks all the datastores, right?
And you create only one trigger on the frontend side.
1) Nobody can stop you from using the same script as a UserParameter and creating one item to show the script result.
Then you can create one trigger for that one item and - profit!

2) In Zabbix you can also create as many items and as many triggers as you need.
You can also create one MEGA-trigger which checks all the datastore thresholds and switches only if any of them has a problem.
It will fully emulate your Nagios behavior.

So far you already have two different ways to accomplish the task.
Right?

Well, we're running a datacenter. Our current Nagios-setup monitors 900+ hosts and 12000+ services. It's not unstable, but it sure isn't static.

The discovery is a great feature. In Nagios, I'm just letting the check do that inline every time, so it's just two different ways of doing the same thing.

One more great thing is that you can create a template which is automatically assigned to discovered hosts, so you collect only the information you consider important for that type of guest VM.
(For example, for a Windows DB server, some DB metrics can be included in a Windows DB template; for a LAMP VM, some MySQL/Apache/OS metrics can be included in a LAMP template.)

Sorry, I like Zabbix so much that I cannot stop talking about it.
Please don't hesitate to ask more questions; I will do my best to answer them.

**ibtanhe** · 20-03-2014, 17:22

Originally posted by steveboyson

For returning more than one value in a single item call you may want to use trapper items.

That is: perform one check against your vCenter server (meaning: one item), let the script calculate the different metrics, send them all back to Zabbix via trapper items and finally let your script return a success value.

That way you can collect several item metrics with just a single item call. And, of course, evaluate several triggers.

This seems very interesting. You see, one of my concerns is to avoid putting unnecessary strain on the servers (both the Zabbix server and the monitored hosts). This particular datastore check I've mentioned is very expensive resource-wise and it would be pure madness to run it once per datastore. Using trapper items, I can see another way: use discovery to set up the necessary trapper items and triggers, then use a modified version of my script to collect data for all datastores in one go and finish off by submitting the data, item by item, to Zabbix. Also, I'm counting on .nodata() being applicable to trapper items as well, to catch potential problems.

**steveboyson** · 20-03-2014, 17:27

Nope. Since trapper items have no interval, Zabbix does not know in which time period values are expected and thus cannot fire .nodata triggers.
Pretty obvious, I would think.

**steveboyson** · 20-03-2014, 17:33

You might have noticed that querying a vSphere datastore via the Perl or Python API is a time & resource consuming task.

Therefore I doubt you can handle that in a single item call. At least in our vSphere environment, a single call takes up to 45 seconds per ESX host, which is 15 seconds longer than the maximum configurable agent or server timeout value.

We perform periodic checks on the ESX hosts (running on the vMA against our vcenter server via a cron job), store the gathered values in a parseable text file, send that text file to zabbix and let the zabbix server do the parsing and item delivery stuff.
We check the modification date of that file and fire a trigger if it is older than $NUMBER minutes, so we have control over the workflow.
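As a sketch of the idea only (this is not our production code; the `key=value` file layout and the function names are assumptions for illustration), the cron side and the freshness check could look like this in Python:

```python
import os
import time
from typing import Dict, Optional


def write_metrics(path: str, metrics: Dict[str, float]) -> None:
    """Write 'key=value' lines via a temp file and rename, so a reader
    never sees a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        for key, value in sorted(metrics.items()):
            handle.write(f"{key}={value}\n")
    os.replace(tmp_path, path)


def file_age_seconds(path: str, now: Optional[float] = None) -> float:
    """Seconds elapsed since the file was last modified."""
    current = time.time() if now is None else now
    return current - os.path.getmtime(path)


def is_stale(path: str, max_age_minutes: float,
             now: Optional[float] = None) -> bool:
    """True if the collector has not refreshed the file recently enough."""
    return file_age_seconds(path, now) > max_age_minutes * 60
```

The cron job would call `write_metrics(...)` at the end of each run, and the staleness check plays the role that a .nodata trigger would otherwise play. The Zabbix agent can also report a file's modification time itself through its `vfs.file.time` item key, which avoids a custom check for that part.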

Of course, this periodic cron job could emit the values directly via trapper items to zabbix. But as mentioned before, they have no .nodata triggers.

Designing "checks", best practices
