Thoughts from a Nagios user


dminstrel
05-04-2005, 18:20
I'm a long-time open source NMS user. I've been using Nagios since it was called Netsaint and dabbled in Big Brother.

We're currently monitoring over 80 hosts and 200 services with Nagios and it's been working great.

BUT...I'm migrating to Zabbix!

The main reason is that Nagios is mainly a real-time monitoring solution and it's very good at it. Once you want to introduce a trending aspect, you need to slap in 3rd-party apps like APAN, Perfparse, nagiosgraph, etc. I've even tried an unholy integration with Cacti. It works, but since we're a small IT team at my company, I want to minimize the number of systems we use and manage. We also want to do SLA-related trending and it's a pain to do that in Nagios. Another strength of Zabbix is the network map feature – it's a nightmare to create a useful map of more than 20 hosts in Nagios.

One aspect that's the most confusing for a Nagios admin coming to Zabbix is the Hosts/Items/Triggers/Actions aspect versus the Hosts/Services aspect in Nagios. I find that the tight coupling of hosts and services in Nagios is much simpler to understand, set-up and manage than in Zabbix.

A suggestion I could make would be to add a Monitor option to each Item added to a Host that would do a basic trigger setup (a Wizard-type dialog maybe?).

I've also been looking for the equivalent of the Nagios Service Detail screen in Zabbix; the Overview screen is not quite there yet.

Does anybody else have migration stories to share?

Cheers,

Jonathan

Alexei
05-04-2005, 19:27
Hi Jonathan,

Thanks for sharing your experience! I appreciate it. May I ask you to tell me what, in your opinion, is missing in the Overview screen and how it can be improved? What do you expect from an ideal Overview screen? It would be very useful!

dminstrel
06-04-2005, 00:37
Here's a quick mock-up of what would be great.
(39kb per image is really low!)

What's useful about this screen is that the Host/Service relationships are clear and easy to understand. What I don't like about Alpha 7 is that I don't have a quick, in-your-face status screen as soon as I get into Zabbix.

It could be customized to show only what's wrong (triggered items) or all items/triggers for all hosts. As another poster suggested, the "Top 10" view of what's wrong could also be shown here. You could have a "Nagios compatibility mode" or a "Top 10 mode" or a "Zabbix mode", etc. This has a different purpose from Screens as Screens is used more for trending purposes (display graphs) than real-time monitoring purposes.

You would get this screen instead of the little "..." you get when you click View. An admin could then drill down with the rightmost drop-down lists to display groups, triggers, etc.

Cheers,

Jonathan

Alexei
06-04-2005, 10:08
Thanks for the screenshot! Actually, I think a combination of the "Latest values" screen with some sort of trigger status display is what you want.

For example, if there is no free disk space on volume /var, the item "Free Disk Space on /var" could be shown in a different color depending on the severity of the problem. Does it make sense?

dminstrel
06-04-2005, 23:38
Yeah, something like that.

An even more compact display could be

|HOST|Last check|Latest values|

The latest values field background (cell background) could be white if the latest value of the item is not linked to a trigger, pale green if linked to a trigger but OK, red if trigger is activated, etc.
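The colour rule above can be written down as a small lookup. This is only a sketch: the function name cell_colour, the state names (none, ok, problem) and the grey fallback are invented for illustration and are not part of Zabbix.

```shell
#!/bin/sh
# Hypothetical helper: map an item's trigger state to a cell background colour.
# The state names (none, ok, problem) are invented for this sketch.
cell_colour() {
    case "$1" in
        none)    echo "white" ;;      # latest value is not linked to any trigger
        ok)      echo "palegreen" ;;  # linked to a trigger, and the trigger is OK
        problem) echo "red" ;;        # the trigger is activated
        *)       echo "grey" ;;       # assumption: fallback for any other state
    esac
}
```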

What are the thoughts of other board members on this?

Cheers,

Jonathan

Tonyb
07-04-2005, 04:04
I am also a Netsaint->Nagios user thinking about the move to Zabbix.

I think that would be a great addition to Zabbix.

One other problem I see with Zabbix is its lack of external command support.

We use several custom-written plugins for Nagios, which are just external programs that exit with a specific exit code. For example, we have one plugin that sends an email through one of our mail servers the first time it is run; the next time it is run, it checks the specified account via POP3, which tests our entire email system.

I think if Zabbix had a way to define external Item Types it would take it one step higher.

I don't think it would be very hard. There would need to be a new configuration section for "Item Types" where you could set the path to the external program and define common parameters (like hostname/IP). Then when Zabbix runs the command it could also pass the item key as parameters.

Maybe that's outside the bounds of Zabbix, but I really think it would make Zabbix much more customizable.
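For readers who have not written a Nagios plugin, here is a minimal sketch of the exit-code convention mentioned above (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The function name check_percent_used and its thresholds are hypothetical; it is not one of the plugins described in this post.

```shell
#!/bin/sh
# Sketch of a Nagios-style check: print one status line and return the
# conventional plugin exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
# check_percent_used and its arguments are hypothetical names.
check_percent_used() {
    used="$1"; warn="$2"; crit="$3"
    # Anything that is not a plain non-negative integer counts as UNKNOWN.
    case "$used" in
        ''|*[!0-9]*) echo "UNKNOWN: no numeric value given"; return 3 ;;
    esac
    if [ "$used" -ge "$crit" ]; then
        echo "CRITICAL: ${used}% used"; return 2
    elif [ "$used" -ge "$warn" ]; then
        echo "WARNING: ${used}% used"; return 1
    fi
    echo "OK: ${used}% used"; return 0
}

# A standalone plugin would end with something like:
#   check_percent_used "$1" 80 90; exit $?
```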

charles
08-04-2005, 01:23
I already made the move to Zabbix from Nagios a while ago. No regrets, but I have not used Zabbix to its full potential either.

External Item Types already exist in Zabbix; they are called User Parameters. Search for UserParameter in the docs.

hth
charles

Tonyb
08-04-2005, 08:14
One key advantage over Nagios is that the web-based config and monitoring are all in one.

What if you want to monitor the status of a host from the perspective of the NMS server? It would be much more beneficial to be able to define external commands from the web interface, which could then be built into templates, than to add them manually to the zabbix_agent.

If you have to rely on zabbix_agent to run external commands from the same server that zabbix_server runs on, you're going to get into the same situation that Nagios is currently in, where the only way to simplify the use of external commands is a third-party program. This would greatly reduce the benefits of Zabbix's all-in-one solution.

charles
08-04-2005, 19:06
Yes, it would be nice to be able to define them in the gui, but in many/most cases you still have to configure or install something on the server to be monitored anyway.

The script is run on the server monitored, not the monitoring server btw

charles

Tonyb
08-04-2005, 19:26
Many times you want the monitoring server to run the script. For example, when checking DNS servers, instead of just checking whether the process is running, the server could be queried from the monitoring server to see whether a valid response was received.

With a plug-in type architecture you would just have to copy the plugin into a specified directory and make sure it is executable. Then the entire configuration would be done through the web interface.

I’m not proposing this as a replacement to using agentd to run commands on the remote host, but rather to extend the number of checks the monitoring server can do.

dminstrel
08-04-2005, 21:12
I second that. This is an aspect where Nagios shines compared to Zabbix, as it allows you to run external scripts on the monitoring server. So for instance, if I want more detail from my ping checks (to do something similar to SmokePing), I'd just run my own script instead of recompiling Zabbix.

Alexei, could the Simple Checks parameter be extended to allow an external_script(/path/to/my/script) Key?

Cheers,

Jonathan

jyoung
09-04-2005, 03:32
Many times you want the monitoring server to run the script, for example for checking DNS servers instead of just checking if the process is running the server could be queried from the monitoring server and checked to see if a value response was received.


I currently do this with Zabbix, and I think it is what you're asking for.

I am a Nagios user as well and love many of the Nagios checks that are not provided yet in Zabbix (NTP, DNS checks as you've described, HTTPS cert checks).

I was running Zabbix 1.0 and just went to 1.1alpha7. I can't recall exactly which alpha this was included in, perhaps it was 7, but you can pass arguments in User Parameters. The downside has been that they must be in order, i.e.:
UserParameter=cust_dns_check ,/usr/local/zabbix/bin/check_dns
(yeah, you might recognize that 'check_dns' script, it's the one you use with Nagios.)

Now when I add my items I can add,
cust_dns_check[ns1.mynameserver.com]
and it's executed as
/usr/local/zabbix/bin/check_dns ns1.mynameserver.com

likewise I could add the item as
cust_dns_check[ns1.mynameserver.com 10.30.0.4]
and it would be executed as
/usr/local/zabbix/bin/check_dns ns1.mynameserver.com 10.30.0.4

Now you say, "Wait, that does nothing for me, it's not even a valid argument nor would it return anything I can use."

Here is what I use for my "custom" dns check:

UserParameter=cust_check_dns[ns1],/usr/local/nagios/libexec/check_dns -H site.tocheck.net -s ns1.nameserver.com -a 10.30.0.3 |awk '{if ($2=="ok") print "1"; else print"0"};'

If you use the nagios check_dns command you know what this does.
I'm asking it to check the DNS entry of site.tocheck.net from nameserver ns1.nameserver.com for the address 10.30.0.3. This returns a line of text that we can filter with awk to pick out good results.

But this is completely different from what I was talking about before, right?
Yes, because passing many arguments in a UserParameter gets VERY UGLY in the WebUI. (Perhaps this is something that can be looked into.)
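As an alternative to matching the plugin's text with awk, a wrapper can look at the plugin's exit status instead. This is only a sketch: plugin_to_flag is a hypothetical name, and unlike the awk filter above it treats any non-zero exit (including WARNING) as a failure.

```shell
#!/bin/sh
# Run any command (for example a Nagios plugin) and print 1 if it exits
# with status 0 (OK), otherwise print 0. The plugin's own text is discarded.
# plugin_to_flag is a hypothetical name.
plugin_to_flag() {
    if "$@" >/dev/null 2>&1; then
        echo 1
    else
        echo 0
    fi
}

# Possible use from a wrapper script (arguments as in the example above):
#   plugin_to_flag /usr/local/nagios/libexec/check_dns -H site.tocheck.net -s ns1.nameserver.com -a 10.30.0.3
```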


Now to combine the two examples I've just given. When passing just one argument, the look of the Zabbix UI is good. I MUST check my NTP servers because time is crucial. If I cannot access an NTP server I want to know right away so that I can ensure my time is correct.

/etc/zabbix/zabbix_agentd.conf:
UserParameter=cust_check_ntp ,/usr/local/zabbix/bin/check_ntp.sh

/usr/local/zabbix/bin/check_ntp.sh:
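#!/bin/bash

# First argument: the NTP server to check (passed in from the item key).
SERVER="$1"

# Print 1 if the second word of the plugin output is "OK:", otherwise 0.
/usr/local/zabbix/bin/check_ntp "$SERVER" | awk '{if ($2 == "OK:") print "1"; else print "0"}'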

--//--

Now I can add custom checks for different NTP servers without even editing the agentd.conf file again. All I do is add an item for "cust_check_ntp[ip.addr.here]" and I'm ready to roll. If this one goes down and out for some odd reason, I simply wipe out the item and add a new one for the NTP server I have replaced it with.

Of course, with NTP servers you want to be nice, so don't check whether you have access more than once every 5 minutes. Your clock won't stray too far in 5 minutes anyway.

Ick. I just looked at all that and it's messy. Messy messy.

I'm too lazy to edit, it's Friday and time to head home from work. Ask if this isn't clear and I'll try to re-explain it all.

Side note: I do all my NTP, DNS and HTTPS checks from one server. Then each of the servers themselves check if the service is up as well.

Jesse

klavs
09-04-2005, 19:21
But doing the checks in the agent means you have to "assign" the check to a host (which could of course be the server's agent) that is not really hosting the service, i.e. have a "scripts server". That's pretty ugly IMHO.
With HTTPS response times, for example, I'd like to check remotely and attach the check to the server which actually hosts the HTTP site. This can only be done if it's a test run on the server, like simple_check is.

jyoung
09-04-2005, 21:15
But doing the checks in the agent - means you have to "assign" the check to a host(could ofcourse be the servers agent), that is not really hosting the service - ie. have a "scriptsserver". That's pretty ugly IMHO.

Partially ugly, just because it requires a bit of work. Zabbix has not been around nearly as long as Netsaint/Nagios, so it must still build on its offerings. An NTP check or HTTPS cert check could easily be added to Zabbix at a later date once we devise a way to do it. Then it becomes less ugly.

IMHO it is NOT very ugly as it stands. If I wanted to do the same thing in Nagios I would be required to add the service checks to one of my servers. Most of my UserParameter checks are one line in zabbix_agent.conf, and I can add many checks off the sole UserParameter to check the service status of multiple servers.

All you're really saying is Zabbix needs a few plug-in packages like Nagios and it won't be ugly anymore.


with f.ex. https responsetimes - I'd like to check remotely - and attach the check to the server which actually hosts the http-site. This can only be done, if it's a test run on the server, like simple_check is.

How would you even propose to get this? If you want to check the response time you HAVE to have a Zabbix Agent running on the remote side. If you don't, your "response" time will not be qualified. Your HTTPS server does not have any idea when the remote site issued its request for service. Your HTTPS site will only know two thirds of the story: that it has acknowledged the connection and that the remote host is now requesting a page.

klavs
09-04-2005, 22:04
Well - this patch: http://www.zabbix.com/forum/showthread.php?t=445 seems to support it.

In regards to measuring HTTP response times, it is actually rather common to do it from the server. Usually it is on the same LAN, so there's no network delay, and as such the response time is an accurate measurement. Then of course, one should check connectivity outwards, but that's another story.

Big Brother, Big Sister, Nagios, etc. all have checks (or items, as they are called in Zabbix) which the server performs directly, and not through a local agent.

I can see no argument against Zabbix having a patch for adding checks to zabbix_server.conf, like userparams are added to zabbix-agentd.conf. Preferably it could be the same code that agentd has for this, reused in the server.

Tonyb
09-04-2005, 22:21
All you're really saying is Zabbix needs a few plug-in packages like Nagios and it won't be ugly anymore.

That's not what we are saying at all. It is ugly because if we want to run a script on the monitoring server to monitor a remote host, the checks are listed as items on the monitoring server and not on the host that you are checking. The items for the monitoring server will quickly grow into the hundreds and become difficult to manage.

It would be much cleaner if there were a way in Zabbix to define external commands. For example you could define an external command at the server
like:
ServerCommand=check_dns ,/usr/local/nagios/libexec/check_dns -H $1 -s $HOSTNAME -a $2

This way you can attach the item to the actual host that it is checking and you don't have to write a shell script for each external plug-in.

This does pose one other problem though. Nagios plug-ins (for example) allow the plug-in to decide whether the check should be OK, WARNING, CRITICAL, or UNKNOWN. One plug-in might check more than one thing; for example, the DNS plug-in checks whether the server responds at all and also checks whether it responds with the correct address. I don't know how you could use that with the historical monitoring features of Zabbix. If you wanted to graph DNS server response time, then you would have to have the plug-in return the response time. You could then of course set up a trigger to check whether the response time was under a specific amount of time, but what happens if the server doesn't reply at all?
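One possible answer to the "no reply" question is to reserve a sentinel value: return the response time on success and -1 on failure, so a graph gets real numbers and a trigger can look for a negative value. This is a sketch of that convention only; timed_or_sentinel is a hypothetical name, the -1 sentinel is made up here, and timing in whole seconds via `date +%s` is far coarser than a real check would want.

```shell
#!/bin/sh
# Time a command and print the elapsed whole seconds, or -1 if the command
# fails. The -1 sentinel is a convention invented for this sketch.
timed_or_sentinel() {
    start=$(date +%s)
    if "$@" >/dev/null 2>&1; then
        end=$(date +%s)
        echo $((end - start))
    else
        echo "-1"
    fi
}
```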

charles
10-04-2005, 04:17
I see your point, guys, and it would be a good addition to Zabbix imo :)

jyoung
10-04-2005, 04:40
That’s not what we are saying at all. It is ugly because if we want to run a script on the monitoring server to monitor remote host they checks are listed as items on the monitoring server and not the host that you are checking. The items for the monitoring server will quickly grow into the hundreds and become difficult to manage.
Okay, I understand now. I deal with a relatively small cluster, so the grouping has not exceeded 100 checks on the main monitoring server. For a cluster of fewer than 30-35 machines, I can see having all checks on the monitoring server being easier to monitor, although when the number of servers is greater than that the page would get rather ugly and hard to manage.


It would be much cleaner if there were a way in Zabbix to define external commands. For example you could define an external command at the server
like:
ServerCommand=check_dns ,/usr/local/nagios/libexec/check_dns -H $1 -s $HOSTNAME -a $2

This way you can attach the item to the actual host that it is checking and you don't have to write a shell script for each external plug-in.

Agreed. I wrote the shell scripts as a hack for something that has not yet been implemented.

http://www.zabbix.com/forum/showthread.php?t=419&highlight=UserParameter

See post #4: it looks like Alexei has plans to implement this. Let's hope he's able to get it into the final 1.1 release.


This does pose one other problem though. Nagios plug-ins (for example) allow for the plug-in to decide if the check should be Ok, Warning, Critical, or UNKNOWN. One plug-in might check more than one thing, for example the dns plug-in checks to see if the server responds at all and also check to see if it responds with the correct address. I don't know how you could use that with the historical monitoring features of Zabbix. If you wanted to graph DNS server response time then you would have to have the plug-in return the response time. You could then of course setup a trigger to check if the response time was under a specific amount of time but what happens if the server doesn't reply at all?


Indeed, I ran into a snag here as well. The Nagios NTP check monitors both access to the NTP server and the offset in relation to that server (among other things). To monitor both of these I had to write two shell scripts, one for each.

No response would be marked as a '-', would it not? I'm unsure how this is/could_be analyzed within triggers. Is it analyzed as a check.last(0)=0?

jyoung
10-04-2005, 05:03
Well - this patch:
In regards to measuring http-responsetimes, it is actually rather common to do it from the server. Usually it is on the same LAN - so there's no network delay, and as such the response-time is an accurate measurement. Then ofcourse, one should check connectivity outwards, but that's another story.

Both BigBrother, BigSister, Nagios. etc has checks (or items as they are called in Zabbix), which the server checks for directly, and not through a local agent.

I can see no argument for Zabbix not having a patch, for adding checks to the zabbix_server.conf - like userparams are added to zabbix-agentd.conf. Pref. it could be the same code that agentd has for this, reused in the server.
Okay, I believe I just did not see what you were wanting. You want the remote agent to ask the server to carry out the action. Thus the agent on host#2 asks the server on host#1 to do an HTTP GET. The server then stores this response time in the DB under a trigger owned by host#2. Do I have that right?

That indeed would be very useful. My apologies for not understanding earlier.

jyoung
10-04-2005, 05:08
As a Nagios user, another "feature" I wish were around is repeated alerting after an allotted amount of time. I found a reference to this here:
http://www.zabbix.com/forum/showthread.php?t=309&highlight=cron
but it seems like an ugly, however workable, hack to me. This would be excellent if it were built in. It was nice having Nagios re-alert me after 4 hours if the problem had not been dealt with yet.

In the case of HTTPS cert checking, this re-alerted every 24 hours just as a constant reminder that the cert was about to expire and that I should have started the re-issuing process already.

Do any other ex-Nagios/waning-Nagios users miss this functionality as well?

klavs
10-04-2005, 11:48
You want the remote agent to ask the server to carry out the action. Thus the agent on host#2 asks ther server on host#1 to do an HTTP get. The server then stores this response time in the DB under a trigger owned by host#2. Do I have that right?
Almost. I want to be able to set userparams in the server - so I can add checks, which the server does not even try to retrieve from an agent - but executes itself - with the result owned by #2 as you say - but with #1 being the server - not an agent. Perhaps a dedicated "serveragent" - which is used for "remotechecks".

That indeed would be very useful. My apologies for not understanding earlier.
No apologies needed :)

hrabbit
29-07-2005, 09:07
I realise this thread is rather old now and possibly outdated but recently after having a good nose around Zabbix I have decided to give it a go at the large and ugly task of monitoring our current network.

Nagios handles the network at the moment.. One centralised server does everything in one hit. We use plugins for everything....

HTTP
DNS
FTP
SSH

These are run from the central Nagios server and connect to the given service's port, which gives the status of the service the plugin is checking.

This may seem like a problem from the perspective of time differences but in the real world, can a user trying to access a web page from one of your servers or get a DNS result from your server see the difference in speed and time variants? Of course they can but they see it "across" the network, not from localhost.

We monitor 250 hosts and over 1500 services across these hosts. Configuring monitoring agents on all these machines is a very large headache.

We use NRPE to allow collection of details on load, memory and disk space to be brought back to the central Nagios server.

Nagios allows us to have thresholding of services as well.. (eg. We have a web server that may be getting clobbered so we see that it timed out once. Does this mean that the service is unavailable? of course not.. it has a problem however.. we set up some rules that say if after 5 minutes and 5 checks over this period that the server still has a problem, notify somebody about it.)
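The "5 checks before notifying" rule boils down to counting consecutive failures. Below is a minimal sketch of that logic only, not of Nagios itself; needs_notify and the one-result-per-line input format (1 = up, 0 = down) are invented for illustration.

```shell
#!/bin/sh
# Read one check result per line on stdin (1 = up, 0 = down, oldest first).
# Print "NOTIFY" if the most recent results form a run of failures at least
# as long as the limit given as $1, otherwise "QUIET".
# needs_notify is a hypothetical name.
needs_notify() {
    awk -v limit="$1" '
        NF == 0 { next }              # ignore blank lines
        $1 == 0 { streak++; next }    # another failure in a row
                { streak = 0 }        # any success resets the count
        END {
            if (streak >= limit) print "NOTIFY"; else print "QUIET"
        }
    '
}
```

With a limit of 5 and one check per minute, a single timeout stays quiet, while five failed checks in a row produce a notification.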

We have plugins for notifications, allowing us to set up multiple sms gateways (eg. GSM modem locally, cheap international http gateway for low priority notifications). Jabber and ICQ to name a few.

Nagios also allows for large scale templating while active. EG. I can specify by default that every host has to be pinged.. I have a rule set that says ping * and its all done.

Hostgroups allow for notification of all hosts inside this group to get a page about a particular subset of hosts that have issues.

We have on average about 60 plugins and of these about 45 are custom written and maintained. This number only includes the Nagios local plugins and does not include the many that are written to be run from NRPE itself.

Saying all this, I have quite taken to Zabbix due to the MySQL backend and Web Interface for configuration (+ the added benefit of the Screens Feature + Map features) but in the grand scale.. without the use of mass plugins I can't see how to implement this without some fairly major headaches.

I suppose the real reason for this post is that unless I can get Zabbix to handle the multitude of plugins I have written to handle the network at present (this includes NRPE client -> server connectivity), I would have to pass on this project.

The main feature that Nagios has that I cannot live without is:
Tactical Overview of problem only hosts and services
We have a projector (wallboard) that displays this information 24/7 and without something with as much simplicity I would simply go crazy.

Anyway, I may be jumping the gun on some of these details and as such, please blatantly shoot me down in a pile of dust.
I do love the Zabbix project and love the concept overall so please don't anybody take this as a flame war about Nagios vs Zabbix at all.

Alexei
20-08-2005, 09:44
Thanks for the post. I appreciate it.

I have a couple of comments though:

1. A Tactical Overview screen will be introduced in 1.1. I already have a design; it just has to be coded.

2. You're saying that configuring ZABBIX agents is difficult for a large number of hosts. I've never used Nagios, so I'm just curious: does Nagios require configuration and setup of the plugins on monitored servers? How does it work?

3. In ZABBIX you may define that an event will be triggered if a WEB server is unavailable for, say, 5 minutes. Use the trigger expression {host:http.max(300)}=0.

4. ZABBIX does provide interface to SMS, pager, Windows messaging, whatever. Just write your own shell or Perl script, configure new media, and the script will be used for notifications. Easy!
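As a rough illustration of point 4, a notification script can be very small. This sketch ASSUMES the script is called with three arguments (recipient, subject, message); that calling convention should be checked against the media script documentation for your ZABBIX version. The name format_alert and the log path are made up for the example.

```shell
#!/bin/sh
# Sketch of a notification script body. ASSUMPTION: the script is called
# with three arguments (recipient, subject, message). format_alert and the
# log path below are hypothetical.
format_alert() {
    printf 'to=%s subject=%s message=%s\n' "$1" "$2" "$3"
}

# A real script would hand the text to an SMS gateway, pager, etc.
# For a first test it could simply append to a log file:
#   format_alert "$@" >> /tmp/zabbix_alerts.log
```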

Jon
21-08-2005, 15:31
I followed this thread through because I also have a need to use an external script to do extended monitoring of remote services, e.g. to check that a HTTP response contains a certain string.

I came up with this very small patch (attached) to allow simple checks to be defined in the web GUI using the syntax e.g. ext[/usr/bin/myscript], which runs myscript (in zabbix_server) and uses the floating point result on stdout.

I'm new to zabbix and don't know the code at all and I've done limited testing on this patch, so USE WITH CAUTION. I just thought it might be useful to put the patch out there so that those who know the code can comment on the wisdom or otherwise of what I've done.

(In addition to ext[/usr/bin/myscript] I threw in ext_str[...] for string values but a quick test suggests the latter is not working).
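To make the use case concrete, here is a sketch of the kind of script that could sit behind ext[/usr/bin/myscript]: it prints 1 if a response body contains a given string and 0 otherwise, so the result on stdout is a number. The function name body_contains and the example URL are placeholders, not part of the patch.

```shell
#!/bin/sh
# Read a response body on stdin and print 1 if it contains the given fixed
# string, otherwise 0. body_contains is a hypothetical name.
body_contains() {
    if grep -F -q -- "$1"; then
        echo 1
    else
        echo 0
    fi
}

# Possible use inside the external script (the URL is a placeholder):
#   curl -s http://www.example.com/ | body_contains "Welcome"
```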

primos
21-08-2005, 15:39
A well-known topic, by now I hope resolved by the addition of external checks in beta1 (I haven't seen the beta yet).