IT Services Configuration Problem | Status always set to "OK" when changing config

  • henjin
    Junior Member
    • Jul 2007
    • 8

    #1

    IT Services Configuration Problem | Status always set to "OK" when changing config

    When adding a new IT service for SLA monitoring or changing the configuration of an already existing one, its status is set to "OK", regardless of its actual state.

    Steps to reproduce (adding new service)
    1. Create a new trigger or use an existing one ("syslogd not running" is used in this example)
    2. Activate the trigger's error condition
    3. Create a new IT Service and associate it with the trigger (this also works when adding it as a child)
    Result: IT Service status is "OK", regardless of the actual trigger status (which is not "OK")

    Steps to reproduce (changing service configuration)
    1. Create a new trigger or use an existing one ("syslogd not running" is used in this example)
    2. Create a new IT Service and associate it with the trigger (this also works when adding it as a child)
    3. Activate the trigger's error condition -> the IT Service changes its status to the error condition
    4. Change the configuration of the IT Service (for example, decrement the SLA by 0.01)
    Result: IT Service status is "OK", regardless of the actual trigger status (which is not "OK")

    This can be reproduced at all levels of the IT Services tree. Only the element that was added or changed is set to "OK". Higher levels keep the error state but lose the "reason" attribute inherited from the lower levels.
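
The expected behaviour can be illustrated with a minimal Python sketch. This is not Zabbix code: the `Service` class, the `save_service` helper and the status encoding (0 for "OK", 1 for a problem) are all assumptions made for illustration. The point is only that saving a service should take its status from the linked trigger instead of resetting it.

```python
from dataclasses import dataclass
from typing import Dict, Optional

OK = 0        # assumed encoding: 0 means "OK"
PROBLEM = 1   # assumed encoding: non-zero means "problem"


@dataclass
class Service:
    name: str
    sla: float
    trigger: Optional[str] = None  # name of the linked trigger, if any
    status: int = OK


def save_service(service: Service, trigger_states: Dict[str, int]) -> Service:
    """Store a new or edited service without losing its real status.

    Instead of resetting the status to OK, look up the current state of
    the linked trigger and use that.
    """
    if service.trigger is not None:
        service.status = trigger_states.get(service.trigger, OK)
    return service


# The trigger from the example is currently firing.
triggers = {"syslogd not running": PROBLEM}

svc = Service("Syslog trigger", sla=99.05, trigger="syslogd not running")
save_service(svc, triggers)
status_after_add = svc.status

# Changing the configuration (lowering the SLA by 0.01) must not clear it.
svc.sla -= 0.01
save_service(svc, triggers)
status_after_change = svc.status
```

In both cases the service should end up in the problem state, which is the opposite of what the steps above produce.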

    The Zabbix Version used is 1.4.1 on Fedora Core 7
    Last edited by henjin; 24-07-2007, 20:26.
  • Aly
    ZABBIX developer
    • May 2007
    • 1126

    #2
    Thanks for reporting this.

    That patch should solve it.
    Attached Files
    Zabbix | ex GUI developer

    • henjin
      Junior Member
      • Jul 2007
      • 8

      #3
      Thank you for the patch.

      I applied and tested it, but it fixes the problem only partially.
      When a service is added or changed, an error condition is not propagated to its parents. Only services directly associated with a trigger are updated correctly.

      Consider the following service setup:
      Code:
      root
        |
        Test
           |
           Syslog Trigger
      Error Condition:
      Code:
      root -> OK
        |
        Test -> Problem: Syslogd not running
           |
           Syslog trigger -> Problem: Syslogd not running
      Changing configuration of "Test" or "Syslog trigger":
      Code:
      root -> OK
        |
        Test -> OK
           |
           Syslog trigger -> Problem: Syslogd not running
      Same situation when adding a trigger to "Test" that already is in an error condition.
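
A rough sketch of the propagation being asked for, written in Python for illustration only. The `Node` class and `propagate_up` function are hypothetical names, the display-only root is left out, and the rule "a parent has a problem if any child has one" is an assumption about how the parent status is calculated.

```python
from typing import List, Optional

OK, PROBLEM = 0, 1  # assumed status encoding


class Node:
    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: List["Node"] = []
        self.status = OK
        if parent is not None:
            parent.children.append(self)


def propagate_up(node: Node) -> None:
    """Recompute every ancestor of `node` from its children's statuses."""
    current = node.parent
    while current is not None:
        # Assumed rule: a parent has a problem if any child has one.
        current.status = max((c.status for c in current.children), default=OK)
        current = current.parent


test = Node("Test")
syslog = Node("Syslog trigger", parent=test)

syslog.status = PROBLEM   # the trigger fires
propagate_up(syslog)

test.status = OK          # what saving the "Test" configuration does today
propagate_up(syslog)      # walking up again restores the real state
```

Running the upward walk after every add or change would keep "Test" in the problem state for as long as its child is failing.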

      • Aly
        ZABBIX developer
        • May 2007
        • 1126

        #4
        So true, my mistake: I hadn't calculated the service status properly.

        P.S. The 'root' node is only there to give a more tree-like view; nothing actually affects its status or other columns, so its status display has been removed.

        Hope nothing is missed this time.

        Fix (apply on top of the previous patch):
        Attached Files

        • henjin
          Junior Member
          • Jul 2007
          • 8

          #5
          Thank you again for the patch.
          But I have to bother you again

          Your services.patch fixes only part of the problem, and I discovered a new issue.

          First to the remaining "change config" problem.
          Situation as before:
          Code:
          root
            |
            Test -> Problem: Syslogd not running
               |
               Syslog trigger -> Problem: Syslogd not running
          Changing the configuration of the "Test" node will set its status to "OK":
          Code:
          root
            |
            Test -> OK
               |
               Syslog trigger -> Problem: Syslogd not running
          The new issue I discovered while testing your last patch happens when you delete a trigger node with an error condition. Its parent node keeps the error status even though the failing service node was removed. Even when the removed trigger regains "OK" status, the service keeps the error condition.

          Before deletion:
          Code:
          root
            |
            Test -> Problem: Syslogd not running
               |
               Syslog trigger -> Problem: Syslogd not running
          After deletion:
          Code:
          root
            |
            Test -> Problem: Syslogd not running
          Next problem with this issue: re-adding the trigger after its status changed to "OK" will propagate the error condition of the parent to its child.

          Before addition:
          Code:
          root
            |
            Test -> Problem: Syslogd not running
          After addition (remember, the actual status of the "Syslogd not running" trigger is "OK"!)
          Code:
          root
            |
            Test -> Problem: Syslogd not running
               |
               Syslog trigger -> Problem: Syslogd not running
          This only changes after activating the trigger error condition and then deactivating it (= setting the trigger status to "OK").
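
The delete and re-add cases can be sketched in a few lines of Python. This is an illustration under assumptions, not the Zabbix implementation: `parent_status` is a hypothetical helper, statuses are encoded as 0 ("OK") and 1 (problem), and a parent is assumed to take the worst status of the children it currently has.

```python
from typing import Dict

OK, PROBLEM = 0, 1  # assumed status encoding


def parent_status(children: Dict[str, int]) -> int:
    """Status of a parent, derived only from the children it still has."""
    return max(children.values(), default=OK)


# "Test" with one failing child.
children = {"Syslog trigger": PROBLEM}
before_delete = parent_status(children)

# Deleting the child leaves the parent with nothing to inherit from.
del children["Syslog trigger"]
after_delete = parent_status(children)

# Re-adding the child: it starts from the trigger's real state (now OK),
# not from whatever the parent happened to show.
trigger_state = OK
children["Syslog trigger"] = trigger_state
after_readd = parent_status(children)
```

Here the parent goes from problem to "OK" on deletion and stays "OK" after the re-add, because the status only ever flows from child to parent.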

          I'm looking forward to the next patch I can test

          P.S.: I did not expect the root node to change its status, but I think it was a good idea to remove the status display from it. One less piece of potentially misleading information.
          Last edited by henjin; 27-07-2007, 18:01.

          • maxijose
            Member
            • May 2007
            • 36

            #6
            What do you do to apply this patch? I'm having some problems with it. I'm running zabbix_server 1.4.1.
            I saw in the patch that it modifies triggers.inc.php and services.inc.php.

            I used patch -p0 services.inc.php < services.patch

            Thanks for your help

            • henjin
              Junior Member
              • Jul 2007
              • 8

              #7
              As mentioned in my first post I am using FC7, so when it comes to file locations, consider that your files could be placed at a different location.

              To apply the patches do the following things (after downloading them to /root):

              Code:
              cd /usr/share/zabbix
              patch -p2 < ~/patch.patch
              patch -p2 < ~/services.patch
              /usr/share/zabbix is the directory where the zabbix php files are placed by my distribution.

              You must apply the patches in the correct order or it will not work (at least I assume they build on each other).
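
Why `-p2` and not `-p0`: the `-pN` option tells `patch` how many leading path components to strip from the file names inside the patch before looking for them under the current directory. The small Python sketch below mimics that stripping rule. The `strip_components` function is written for illustration, and `header_path` is a hypothetical example; the real paths inside the attached patches are not shown in this thread.

```python
import re


def strip_components(path: str, count: int) -> str:
    """Mimic how `patch -pN` shortens a file name found in a patch.

    Runs of slashes count as one, and the first `count` slash-terminated
    components are dropped.
    """
    parts = re.sub(r"/+", "/", path).split("/")
    return "/".join(parts[count:])


# Hypothetical path as it might appear in a patch header.
header_path = "frontends/php/include/services.inc.php"

for level in (0, 1, 2):
    print(f"-p{level}: {strip_components(header_path, level)}")
```

With a header path shaped like this one, only `-p2` yields a name that exists relative to the directory holding the PHP files, which is why the commands are run from there.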

              • ProTON
                Member
                • Oct 2005
                • 77

                #8
                I have an issue with IT services after upgrading to 1.4, where one of the services constantly reports status FALSE too. Would these patches solve the problem?

                • henjin
                  Junior Member
                  • Jul 2007
                  • 8

                  #9
                  Apply them and test for yourself.

                  Or you could describe your problem more precisely so others can verify it

                  • Aly
                    ZABBIX developer
                    • May 2007
                    • 1126

                    #10
                    Here goes another patch (apply on top of the previous patches).

                    IMPORTANT!!!
                    One difference from now on: if a service is an internal node, then it can't be linked to a trigger. The link will be removed (from all internal nodes) silently on add, update or delete. Only leaves can be linked to a trigger; parent nodes gain their status from the leaves according to the algorithm type.
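
A small Python sketch of the rule as described, for illustration only. It is not the patch itself: the `Service` class, `normalise` and `recalculate` are hypothetical names, the two algorithm modes and the numeric status encoding (0 is "OK", larger is worse) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

OK = 0  # assumed: 0 is OK, larger numbers are worse

# Assumed calculation modes standing in for the "algorithm type".
ANY_CHILD = "problem if at least one child has a problem"
ALL_CHILDREN = "problem if all children have a problem"


@dataclass
class Service:
    name: str
    algorithm: str = ANY_CHILD
    trigger: Optional[str] = None
    status: int = OK
    children: List["Service"] = field(default_factory=list)


def normalise(service: Service) -> None:
    """Drop trigger links from internal nodes; only leaves keep them."""
    if service.children:
        service.trigger = None
    for child in service.children:
        normalise(child)


def recalculate(service: Service) -> int:
    """Leaves keep their trigger-driven status; parents derive theirs."""
    if not service.children:
        return service.status
    statuses = [recalculate(child) for child in service.children]
    if service.algorithm == ALL_CHILDREN:
        service.status = min(statuses)
    else:
        service.status = max(statuses)
    return service.status


syslog = Service("Syslog trigger", trigger="syslogd not running", status=1)
smtp = Service("Smtp trigger", trigger="smtp not running", status=OK)
test = Service("Test", trigger="stale link", children=[syslog, smtp])

normalise(test)                 # "Test" is internal, so its link is dropped
any_result = recalculate(test)  # one failing child is enough

test.algorithm = ALL_CHILDREN
all_result = recalculate(test)  # not all children fail, so "Test" is OK
```

The status is always computed bottom-up here, so an internal node never needs a trigger of its own.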

                    Waiting for comments from henjin.
                    Attached Files
                    Last edited by Aly; 30-07-2007, 15:49.

                    • henjin
                      Junior Member
                      • Jul 2007
                      • 8

                      #11
                      There goes another patch through my testing environment
                      Sadly it still does not fix all problems. The second issue mentioned in my last testing post is there.

                      After seeing that the problem still exists, I tested some more cases; maybe this helps track it down.

                      Starting point is the following situation (as above):

                      Code:
                      root
                        |
                        Test -> Problem: Syslogd not running
                           |
                           Syslog trigger -> Problem: Syslogd not running
                      As mentioned above, removing the "Syslogd not running" trigger, will leave its parent node "Test" in the error condition:

                      Code:
                      root
                        |
                        Test -> Problem: Syslogd not running
                      In this case, the SLA level would constantly go down, independent of the status of the former "Syslogd not running" trigger child.

                      I would expect that removing a trigger in error condition would remove this error condition also from its parent nodes.

                      Expected result:

                      Code:
                      root
                        |
                        Test -> OK
                      Setting the "Syslogd not running" trigger to "OK" status and re-adding it as a child of the "Test" node leads to the following situation:

                      Code:
                      root
                        |
                        Test -> Problem: Syslogd not running
                           |
                           Syslog trigger -> Problem: Syslogd not running
                      Up to this point I had tested before I made my last post in this topic. This time I continued from this situation by adding another trigger child. Adding a child associated with the "smtp not running" trigger (status = "OK") then leads to this:

                      Code:
                      root
                        |
                        Test -> Problem: Syslogd not running
                           |       Problem: Smtp not running
                           |
                           Syslog trigger -> Problem: Syslogd not running
                           |
                           Smtp trigger -> Problem: Smtp not running
                      This makes clear what is happening: all children inherit the status condition of their parents.

                      The next step was to force the Syslog trigger to "OK" status. To achieve this, syslog had to be deactivated (remember, the trigger is actually in "OK" status, contrary to what can be seen in IT Services monitoring) and activated again. The final situation for this test is as follows:

                      Code:
                      root
                        |
                        Test -> Problem: Smtp not running
                           |
                           Syslog trigger -> OK
                           |
                           Smtp trigger -> Problem: Smtp not running
                      This is pretty far from what I would expect to see. Syslogd is running and its trigger is "OK". Smtp is running; its trigger was never touched and is still in status "OK". Instead, the IT Services monitoring tells me the SLA is going down and smtp is the reason for this.
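
One way to make this mismatch explicit is to compare what is displayed with what the trigger states imply. The Python sketch below does that for the final situation; it is an illustration under assumptions (0 is "OK", 1 is a problem, a parent takes the worst status of its children), and the names `expected_status` and `find_stale` are hypothetical.

```python
from typing import Dict, List

OK, PROBLEM = 0, 1  # assumed status encoding

# Statuses as displayed in the final situation above.
displayed: Dict[str, int] = {
    "Test": PROBLEM,
    "Syslog trigger": OK,
    "Smtp trigger": PROBLEM,
}

# Real trigger states: both daemons are running.
trigger_state: Dict[str, int] = {"Syslog trigger": OK, "Smtp trigger": OK}

tree: Dict[str, List[str]] = {"Test": ["Syslog trigger", "Smtp trigger"]}


def expected_status(name: str) -> int:
    """Leaves follow their trigger; parents follow their worst child."""
    kids = tree.get(name, [])
    if not kids:
        return trigger_state[name]
    return max(expected_status(kid) for kid in kids)


def find_stale(names: List[str]) -> List[str]:
    """Return the services whose displayed status is not what we expect."""
    return [n for n in names if displayed[n] != expected_status(n)]


stale = find_stale(["Test", "Syslog trigger", "Smtp trigger"])
print(stale)
```

For the situation above this flags "Test" and "Smtp trigger": both show a problem that no trigger is actually reporting.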

                      I hope this is enough information to fix this problem, if not, I just discovered another strange behavior that can be triggered from this last situation.

                      Let's see what changes you come up with in patch number 4, Aly
                      Last edited by henjin; 31-07-2007, 20:44.

                      • Aly
                        ZABBIX developer
                        • May 2007
                        • 1126

                        #12
                        I don't know what to say, just BIG THANKS to henjin for his tests.
                        P.S. My problem was that I tested it on a more complex service tree...

                        It would be wiser to wait until 1.4.2 is released. But here goes:
                        Attached Files
                        Last edited by Aly; 01-08-2007, 10:56.

                        • henjin
                          Junior Member
                          • Jul 2007
                          • 8

                          #13
                          And again another patch from Aly

                          And again I experimented with it
                          First the good news: I can verify that the status problem when removing a trigger node is gone.

                          Now the sad news: all other (not already fixed by previous patches) problems remain.

                          This means: although the parent node's status changes to "OK" as expected when a trigger child node is removed, the SLA level still decrements as if the failing trigger node were still there. Fixing the trigger issue and setting it to "OK" has no effect on the SLA behavior of the parent node.

                          My conclusion is that, although the node displays "OK" status, internally it is still in the error condition state and behaves accordingly.
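
To show why a stuck internal state matters for the SLA, here is a simplified Python sketch. It is not how Zabbix calculates SLA; `sla_percent` is a hypothetical helper that simply measures the share of time spent in "OK", and the timeline values are made up for illustration.

```python
from typing import List, Tuple

OK, PROBLEM = 0, 1  # assumed status encoding


def sla_percent(changes: List[Tuple[int, int]], end: int) -> float:
    """Share of time spent in OK, given (timestamp, status) changes.

    `changes` must be sorted by timestamp; the period runs from the first
    change until `end`.
    """
    ok_time = 0
    for (start, status), (stop, _) in zip(changes, changes[1:] + [(end, OK)]):
        if status == OK:
            ok_time += stop - start
    total = end - changes[0][0]
    return 100.0 * ok_time / total


# Hypothetical timeline in seconds: the problem starts at t=600.
# If the internal state is cleared when the child is removed at t=900,
# the SLA stops falling; if it stays stuck, it keeps falling until `end`.
cleared = sla_percent([(0, OK), (600, PROBLEM), (900, OK)], end=3600)
stuck = sla_percent([(0, OK), (600, PROBLEM)], end=3600)
```

The displayed status is the same "OK" in both cases, but the stuck variant keeps accumulating downtime, which matches the behaviour described above.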

                          This can also be seen with the "child nodes inherit error status from parents" problem (examples taken from my previous post):

                          Code:
                          root
                            |
                            Test -> OK
                          Setting the "Syslogd not running" trigger to "OK" status and re-adding it as a child of the "Test" node leads to the following situation:

                          Code:
                          root
                            |
                            Test -> Problem: Syslogd not running
                               |
                               Syslog trigger -> Problem: Syslogd not running
                          Up to this point I had tested before I made my last post in this topic. This time I continued from this situation by adding another trigger child. Adding a child associated with the "smtp not running" trigger (status = "OK") then leads to this:

                          Code:
                          root
                            |
                            Test -> Problem: Syslogd not running
                               |       Problem: Smtp not running
                               |
                               Syslog trigger -> Problem: Syslogd not running
                               |
                               Smtp trigger -> Problem: Smtp not running
                          This makes clear what is happening: all children inherit the status condition of their parents.

                          The next step was to force the Syslog trigger to "OK" status. To achieve this, syslog had to be deactivated (remember, the trigger is actually in "OK" status, contrary to what can be seen in IT Services monitoring) and activated again. The final situation for this test is as follows:

                          Code:
                          root
                            |
                            Test -> Problem: Smtp not running
                               |
                               Syslog trigger -> OK
                               |
                               Smtp trigger -> Problem: Smtp not running
                          The behavior described above can still be seen with your current patch, Aly.

                          Will there be a patch number 5? I hope so

                          • Palmertree
                            Senior Member
                            • Sep 2005
                            • 746

                            #14
                            I'm having problems with the IT Services and SLAs as well. Looking into the code to see if I can figure it out.

                            • henjin
                              Junior Member
                              • Jul 2007
                              • 8

                              #15
                              Cool

                              Whatever you come up with, I will test it
