Zabbix vs Nagios

  • skogan
    Member
    • Nov 2007
    • 70

    #16
    Originally posted by Aly
    Originally Posted by swaterhouse
    I know that well written patches are almost always incorporated into the project.
    Good point.
    Really. So, when people report a bug and submit a patch that fixes it and you ignore the whole thing, it's always because the patch doesn't meet your strict code quality standards?

    Maybe I should remind you of some of the problems with the 1.4 version? Like, for example, inability to effectively schedule downtimes? Or, maybe, the virtually useless "IT Services" section? Have you counted the number of threads that were posted here on this board about this problem and how long did it take to get any kind of response from you about it? And the empty section of the document? So please, spare me the "well written" thing.

    You made a good product, but it still needs work until it really matches the expectations you set on your home page. And, most of all, you need to stop hiding your head in the sand, and be more open to outside input. Your approach to it right now is of outright dismissal.

    I don't know, maybe you have decided to dismiss the 1.4 problems and concentrate on 1.6, maybe 1.6 is going to fix EVERYTHING - I really hope so. But I'm pessimistic


    • skogan
      Member
      • Nov 2007
      • 70

      #17
      Originally posted by Emir Imamagic
      Hi,

      2. Nagios plugins describe the state of a monitored service (or item, in Zabbix terms) with three parameters: status (a number), summary data and long data (only in version 3). This lets you track state and get more detailed information about what happened to your service (e.g. the output of the underlying command used by the plugin, or the response from the monitored service). Zabbix items have only one value. So you need to choose between (1) using the status and losing the details, (2) using the detailed output and losing the availability graphs, or (3) using two separate items for status and output (which makes triggers much more complicated). (If there is another way that I'm still not aware of, I would be glad to hear it.)
      I agree, this is a very important thing. Right now I simulate this functionality by having an external script save the second data item for the next retrieval, but it would be nice to have that built in.
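
A minimal sketch of that kind of workaround, assuming the check returns a numeric status plus a detail string. The stash directory, file layout and function names are made up for illustration and are not part of Zabbix or Nagios: one call returns the status for the first item and parks the detail text, and a second call (for a separate text item) reads it back.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical location where the detail text is parked between polls.
STASH_DIR = Path(tempfile.gettempdir()) / "check_stash"


def run_and_stash(name, check):
    """Run check(), save its detail text, and return the numeric status."""
    status, detail = check()
    STASH_DIR.mkdir(parents=True, exist_ok=True)
    payload = {"status": status, "detail": detail}
    (STASH_DIR / f"{name}.json").write_text(json.dumps(payload))
    return status


def read_stashed_detail(name):
    """Return the detail text saved by the last run, or '' if none exists."""
    path = STASH_DIR / f"{name}.json"
    if not path.exists():
        return ""
    return json.loads(path.read_text())["detail"]
```

The first function would back the numeric item (for triggers and graphs), the second a text item polled right after it. The two values can drift apart if the polls are not aligned, which is part of why a built-in solution would be nicer.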

      3. Ability to schedule a check for specific service/host through the web interface. This feature is very useful if you want to check that you really repaired something and don't want to wait for the next scheduled check.
      I agree - a very important thing.

      The thing with Zabbix is that it's built around certain assumptions, one of them being that checks are run very frequently. It's not like that in real life - especially with web scenarios. Some clients just don't like their sites pounded by monitoring software.

      However, due to all these issues with availability monitoring and our great deal of experience with Nagios, we're planning to keep Nagios for monitoring the availability of services and use Zabbix only for gathering performance data. Does anyone have comments on such an approach?
      Here we are attempting to migrate fully to Zabbix. Right now we are running Zabbix and Nagios concurrently and analyzing the differences between the two. So far, Zabbix seems to be much more false-alarm prone, especially in web scenarios. There are two reasons for that:

      - Nagios checks have soft and hard states. A service status change is reported (and recorded) only when it reaches the hard state, and service check intervals can be set differently for soft and hard states.
      So, for example, if I have a webinject scenario running every 15 minutes and it fails, the status changes to Critical soft, and the next check is run in 1 minute. Only if this one fails too, would it go to Critical hard and notify about the problem. In the same situation, Zabbix will report a problem immediately. Using the min(#2) function and the like does not solve the problem, because then real problems would be reported 14 minutes too late.
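
The soft/hard behaviour described above can be sketched as a small state machine. This is a simplified illustration, not Nagios's actual implementation; the class name, field names and defaults (15-minute normal interval, 1-minute retry, 2 attempts) are assumptions taken from the example in this post.

```python
from dataclasses import dataclass

OK, CRITICAL = "OK", "CRITICAL"


@dataclass
class SoftHardCheck:
    normal_interval: int = 15  # minutes between checks while healthy
    retry_interval: int = 1    # minutes between rechecks in a soft state
    max_attempts: int = 2      # consecutive failures needed for a hard state
    state: str = OK
    hard: bool = True
    attempts: int = 0

    def record(self, result_ok):
        """Feed one check result; return (minutes until next check, notify?)."""
        if result_ok:
            # Send a recovery notice only if a hard problem was reported.
            notify = self.state == CRITICAL and self.hard
            self.state, self.hard, self.attempts = OK, True, 0
            return self.normal_interval, notify
        if self.state == CRITICAL and self.hard:
            # Already reported; keep checking at the normal rate.
            return self.normal_interval, False
        self.attempts += 1
        self.state = CRITICAL
        if self.attempts >= self.max_attempts:
            self.hard = True   # confirmed problem: notify now
            return self.normal_interval, True
        self.hard = False      # soft state: recheck sooner, stay quiet
        return self.retry_interval, False
```

With these defaults a single failed check followed by a success produces no notification at all, while two failures in a row produce one notification a minute after the first failure.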

      - The second possible problem is the way timeouts are set in web scenario steps. Web scenarios frequently fail due to timeouts no matter how high one sets them. This could be a bug.

      From a data collection standpoint Zabbix is simply beautiful. I haven't seen any other system doing it so gracefully, no doubt about that.
      Last edited by skogan; 26-03-2008, 20:17.


      • Alexei
        Founder, CEO
        Zabbix Certified Trainer
        Zabbix Certified Specialist
        Zabbix Certified Professional
        • Sep 2004
        • 5654

        #18
        Originally posted by skogan
        So, for example, if I have a webinject scenario running every 15 minutes and it fails, the status changes to Critical soft, and the next check is run in 1 minute. Only if this one fails too, would it go to Critical hard and notify about the problem.
        In other words the notifications will be sent after 16 minutes of WEB downtime. What do you monitor, blogs?

        Originally posted by skogan
        In the same situation, Zabbix will report a problem immediately. Using the min(#2) function and the like does not solve the problem, because then real problems would be reported 14 minutes too late.
        ZABBIX has absolutely fine control over when and how you'd like to be notified. After N seconds of WEB downtime? Ok. After M unsuccessful checks? No problem. If the number of unsuccessful checks exceeds 10% within the last 15 minutes? Certainly. If the average response time for WEB checks exceeds 2.5 seconds within the last 20 minutes? Sure.

        You are free to define as many triggers as you want. Trigger dependencies are your friend.

        Info: "WEB failed. Hmm, something to look at if I have time."
        Average: "WEB is down for 3 minutes. Send me an email."
        Disaster: "WEB is down for 10 minutes. Send me an SMS and restart Apache."

        Think about it.
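
As a rough illustration of layered triggers like these, here is a sketch that derives a severity from a history of check results. It is plain Python, not Zabbix trigger syntax; the `Sample` type, the function names and the 3-minute/10-minute thresholds (taken from the example above) are assumptions.

```python
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    ts: int    # time of the check, in seconds
    ok: bool   # True if the check passed


def downtime_seconds(samples, now):
    """Seconds since the current unbroken run of failures began (0 if up)."""
    start = None
    for sample in reversed(samples):   # samples are ordered oldest first
        if sample.ok:
            break
        start = sample.ts
    return 0 if start is None else now - start


def failure_ratio(samples, now, window):
    """Fraction of failed checks among those in the last `window` seconds."""
    recent = [s for s in samples if now - window < s.ts <= now]
    if not recent:
        return 0.0
    return sum(1 for s in recent if not s.ok) / len(recent)


def severity(samples, now) -> Optional[str]:
    """Map the current downtime onto the three example severities."""
    if not samples or samples[-1].ok:
        return None
    down = downtime_seconds(samples, now)
    if down >= 600:
        return "Disaster"
    if down >= 180:
        return "Average"
    return "Info"
```

Note that the downtime estimate here is only as fine-grained as the spacing of the samples it is computed from.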
        Alexei Vladishev
        Creator of Zabbix, Product manager
        New York | Tokyo | Riga
        My Twitter


        • skogan
          Member
          • Nov 2007
          • 70

          #19
          Originally posted by Alexei
          In other words the notifications will be sent after 16 minutes of WEB downtime. What do you monitor, blogs?
          For the record, I monitor a heavily loaded eCommerce web site for a big client. And I have multiple scenarios checking the site in different areas. Running these scenarios too frequently is simply counterproductive. The more critical ones I run every 5 minutes, the less critical ones every 15 minutes or more. What I don't like is when I get an alarm that clears itself 5 minutes later - that's not nice. On the other hand, when shit really hits the fan I don't want the alarms to start arriving 5 minutes after the fact.

          ZABBIX has absolutely fine control over when and how you'd like to be notified. After N seconds of WEB downtime? Ok. After M unsuccessful checks? No problem. If the number of unsuccessful checks exceeds 10% within the last 15 minutes? Certainly. If the average response time for WEB checks exceeds 2.5 seconds within the last 20 minutes? Sure.
          Again, all this is good and cool but ONLY if you check frequently. I guess that explains why check intervals are measured in seconds and the default interval for web scenarios is 60 seconds.

          You are free to define as many triggers as you want. Trigger dependencies are your friend.
          Actually, now that you mention it, trigger dependencies have a little problem. Slave triggers preserve their state when their master triggers override them. What they should do is change their state to "unknown".

          Info: "WEB failed. Hmm, something to look at if I have time."
          Average: "WEB is down for 3 minutes. Send me an email."
          Disaster: "WEB is down for 10 minutes. Send me an SMS and restart Apache."

          Think about it.
          I've been thinking about it for 6 months now - all this is good only if you check frequently. It becomes totally useless when the check interval rises to 5 minutes or more - as you said, we are not monitoring blogs.


          • Emir Imamagic
            Member
            • Mar 2008
            • 67

            #20
            Originally posted by Alexei
            ZABBIX has absolutely fine control over when and how you'd like to be notified. After N seconds of WEB downtime? Ok. After M unsuccessful checks? No problem. If the number of unsuccessful checks exceeds 10% within the last 15 minutes? Certainly. If the average response time for WEB checks exceeds 2.5 seconds within the last 20 minutes? Sure.
            This is all true.

            But Nagios has the ability to change the frequency of checks automatically on a soft state change. This means that once the first problem occurs, subsequent checks are performed at a different frequency until the hard state is reached. The advantage here is that you can avoid false alarms but still get the notification of a real problem within a reasonable period.

            In the case of Zabbix, you can only rely on the normal check frequency. In order to perform multiple checks before the trigger fires, you have to wait for N normal periods. So, if you want to get notified within a reasonable period, you need to make checks more frequent. In some cases, increasing the check frequency is unnecessary overhead.

            For example (not really a good one, but hopefully it illustrates my point), say you check the certificate lifetime of an SSL server. A reasonable period for this check would be something like one day. Let's say that at some moment the SSL server is overloaded and the check times out. With Nagios you can simply say "ok, check again in 15 minutes". With Zabbix, using the concept of multiple checks, you would need to wait another 24 hours for the next check.
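
To put numbers on the difference, here is a small sketch of the arithmetic, assuming a problem is confirmed after a fixed number of consecutive failed checks. The function names are hypothetical; the intervals come from the examples in this thread.

```python
def confirm_delay(recheck_interval, attempts):
    """Minutes from the first failed check to the last confirming failure."""
    return (attempts - 1) * recheck_interval


def worst_case_downtime(normal_interval, recheck_interval, attempts):
    """Upper bound on downtime before notification: the outage starts
    right after a passing check, then has to be confirmed."""
    return normal_interval + confirm_delay(recheck_interval, attempts)


# Web scenario: 15-minute interval, 2 failures needed.
with_retry = confirm_delay(1, 2)       # 1 minute (recheck after 1 minute)
without_retry = confirm_delay(15, 2)   # 15 minutes (recheck at normal rate)

# SSL certificate check: once a day, 2 failures needed.
ssl_with_retry = confirm_delay(15, 2)          # 15 minutes
ssl_without_retry = confirm_delay(24 * 60, 2)  # 1440 minutes, i.e. 24 hours
```

The 14-minute gap mentioned earlier in the thread is `without_retry - with_retry`, and the 16 minutes of downtime is `worst_case_downtime(15, 1, 2)`.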

            Cheers,
            emir


            • bee
              Senior Member
              • Jun 2007
              • 133

              #21
              Hi All,
              A few months ago, I made a comparison between WhatsUp Gold and ZABBIX. The main idea of this comparison is "Why should I choose WhatsUp Gold when ZABBIX can provide all of WhatsUp Gold's functions/features?"

              The comparison is based on the functions/features of WhatsUp Gold that are covered by ZABBIX. It may not be 100% accurate, so feel free to give some feedback on it. If you think you can complete this initial comparison, that would be great.

              In my own opinion, I'm more comfortable with ZABBIX than with WhatsUp Gold *even though it's sometimes very buggy, and I believe the ZABBIX team manages that well*. ZABBIX's data presentation is simpler and easier to understand compared with WhatsUp Gold.

              Thanks,
              BEE
              Attached Files


              • sege
                Member
                • Jan 2008
                • 40

                #22
                The soft/hard state from Nagios is something I would also love to have in Zabbix. Not by adding several more triggers, which would bloat my views in the GUI, but built into the trigger/item itself.

                Make the trigger able to change the item's check interval?

                I would also like to run a web scenario every 15 minutes, but every minute once it has failed. Anything to make the system alert only when something is really broken. One of the worst things is having to wake up, and when you finally get to your computer after 2-3 minutes, everything works. Not ok.

                Something I REALLY would like to have is scheduled downtime. I don't know if this is planned for 1.6?


                • bbrendon
                  Senior Member
                  • Sep 2005
                  • 870

                  #23
                  WhatsUp Gold and Zabbix discussed on the same screen??? Am I the only one who doesn't get it? Let me find the nearest garbage so I can unload my lunch.
                  Unofficial Zabbix Expert
                  Blog, Corporate Site


                  • Alexei
                    Founder, CEO
                    Zabbix Certified Trainer
                    Zabbix Certified Specialist
                    Zabbix Certified Professional
                    • Sep 2004
                    • 5654

                    #24
                    Originally posted by Emir Imamagic
                    But Nagios has the ability to change the frequency of checks automatically on a soft state change. This means that once the first problem occurs, subsequent checks are performed at a different frequency until the hard state is reached. The advantage here is that you can avoid false alarms but still get the notification of a real problem within a reasonable period.
                    I agree. This seems to be a useful piece of functionality. I am not quite sure if this can be easily implemented in ZABBIX...

                    • skogan
                      Member
                      • Nov 2007
                      • 70

                      #25
                      Originally posted by Alexei
                      I agree. This seems to be a useful piece of functionality. I am not quite sure if this can be easily implemented in ZABBIX...
                      I have actually been working on introducing soft states to Zabbix for some time now. There seem to be two viable minimum-impact solutions:

                      1) Introduction of additional "soft" (softmax, softmin, softavg, etc...) functions that, when executed, advance the next check time for the item processed by the functions. Combined with clever trigger expressions this would provide an elegant solution. However, it's not without problems:

                      - The entire logic of function value updates will need to be revised: "soft" functions will have to be evaluated only when the expression requires it, and not every time there is a new value. The logical expressions will need to be evaluated in a similar manner to programming languages. For example:

                      A | B --- B is evaluated only if A is false.
                      A & B --- B is evaluated only if A is true.

                      - Trigger conditions will become somewhat complex to construct and read.

                      A big advantage of this solution is that it is completely clean: no changes to the database schema, no modifications to the interface and no breakage of existing data.
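
The short-circuit rule in solution 1 can be sketched as follows. This is only an illustration of lazy evaluation, not Zabbix code: `SoftFunction`, `lazy_or` and `lazy_and` are hypothetical names, and the "advance the next check" side effect is stood in for by appending to a log.

```python
class SoftFunction:
    """Hypothetical 'soft' function: evaluating it has a side effect
    (here, a log entry standing in for 'advance the next check time')."""

    def __init__(self, name, value, log):
        self.name = name
        self.value = value
        self.log = log

    def __call__(self):
        self.log.append(self.name)  # side effect happens only on evaluation
        return self.value


def lazy_or(a, b):
    """A | B: B is evaluated only if A is false."""
    return bool(a() or b())


def lazy_and(a, b):
    """A & B: B is evaluated only if A is true."""
    return bool(a() and b())


log = []
a_true = SoftFunction("A", True, log)
b_any = SoftFunction("B", True, log)
result = lazy_or(a_true, b_any)  # B is never called, so only "A" is logged
```

Because the operands are passed as callables rather than precomputed values, a soft function that is skipped never triggers its side effect, which is the property the revised evaluation logic would need.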

                      2) Addition of a table in the database, used to describe the soft state of the trigger. This will also necessitate the addition of a configuration item in the interface, as well as modification of the logic that handles trigger conditions to implement the soft/hard state.

                      Neither solution is very hard to implement. The first solution is more elegant in the code and hard on the user (the trigger expressions become simply MONSTROUS); the second one is hard on the code (and history data) and easy on the user.

                      So, which one do you prefer?


                      • bbrendon
                        Senior Member
                        • Sep 2005
                        • 870

                        #26
                        I think the 'easy on the user' is a better solution.

                        Though, it may make sense to change some of the Zabbix architecture so that it's both elegant and easy on the user.

                        • Alexei
                          Founder, CEO
                          Zabbix Certified Trainer
                          Zabbix Certified Specialist
                          Zabbix Certified Professional
                          • Sep 2004
                          • 5654

                          #27
                          Originally posted by infinity005
                          I think the 'easy on the user' is a better solution.
                          I would prefer to have a solution that is easy for both users and ZABBIX.

                          • skogan
                            Member
                            • Nov 2007
                            • 70

                            #28
                            Originally posted by Alexei
                            I would prefer to have a solution that is easy for both users and ZABBIX.
                            That may not be possible this time. This is a pretty big change, after all. And, by the way, the #2 solution I presented in my earlier post has another drawback: In complex triggers, it would be hard to decide which one of the data items to advance. #1 solution is much better in this area - the items are practically specified by the user.


                            • Alexei
                              Founder, CEO
                              Zabbix Certified Trainer
                              Zabbix Certified Specialist
                              Zabbix Certified Professional
                              • Sep 2004
                              • 5654

                              #29
                              This is off-topic in this thread. We may continue discussing this elsewhere.

                              • skogan
                                Member
                                • Nov 2007
                                • 70

                                #30
                                Originally posted by Alexei
                                This is off-topic in this thread. We may continue discussing this elsewhere.
                                That's a capital idea. Can you please separate this discussion into a new topic somewhere?
