Host status always green ?

  • joshuamcdo
    • Jan 2026

    #1

    Host status always green ?

    I was asked to verify that monitoring and alerting were working as expected on some servers. I thought this would be quick because they are all green, with the exception of two that are red and have been for a long time. So I SSHed into one of the nodes and stopped the agent. I waited and waited for the agent status to turn red. It's not happening. I have tried just about everything to understand why and have failed. I get no alerts; the host just sits there all green and pretty even though I stopped the agent hours ago.

    Any ideas?

    j
  • neogan
    Senior Member
    Zabbix Certified Trainer
    Zabbix Certified Specialist
    Zabbix Certified Professional
    • Sep 2011
    • 118

    #2
    Which Zabbix version do you use?
    What is in the zabbix_server log about these hosts?


    • joshuamcdo

      #3
      Interesting...

      Originally posted by neogan
      Which Zabbix version do you use?
      What is in the zabbix_server log about these hosts?
      You ask great questions...

      Zabbix server v2.0.3 (revision 30485) (03 October 2012)
      Compilation time: Nov 30 2012 07:52:44

      Running on Ubuntu.

      The Zabbix server logs don't contain much...
      A lot of this, and I mean A LOT.

      <snip>
      Use "--url" to define URL.
      Use "--url" to define URL.
      Use "--url" to define URL.
      Use "--url" to define URL.
      Use "--url" to define URL.
      ....
      ......
      </snip>

      I checked the dmesg logs and /var/log/syslog and didn't find any 'system' network related errors.


      Some strange errors..

      10816:20130613:104228.191 Zabbix agent item [vfs.fs.size[/,free]] on host [hostname] failed: first network error, wait for 15 seconds
      10813:20130613:104231.769 Zabbix agent item [vfs.fs.size[/,pused]] on host [hostname] failed: another network error, wait for 15 seconds
      10812:20130613:104231.775 Zabbix agent item [vfs.fs.size[/,pfree]] on host [hostname] failed: another network error, wait for 15 seconds


      10829:20130613:105930.915 item [hostname:AELoginCheckS1] became not supported: Not supported by Zabbix Agent
      Use "--url" to define URL.
      10831:20130613:110339.322 item [hostname:AELoginCheckS1] became supported


      Any thoughts? I know we have to use DNS to reach these agents. I'm wondering whether DNS is acting a little wonky.
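
One way to test the DNS suspicion is to time name resolution for an agent's hostname repeatedly and look for slow or failed lookups. Below is a minimal Python sketch, assuming the agents listen on the default port 10050. The names `time_lookup` and `resolver` are hypothetical helpers introduced for illustration and are not part of Zabbix.

```python
import socket
import time


def time_lookup(hostname, port=10050, resolver=socket.getaddrinfo):
    """Resolve hostname once and return (succeeded, elapsed_seconds).

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so
    the function can be exercised without touching the network.
    """
    start = time.monotonic()
    try:
        resolver(hostname, port)
        succeeded = True
    except socket.gaierror:
        # Name resolution failed (unknown host, resolver timeout, ...).
        succeeded = False
    return succeeded, time.monotonic() - start


# Example use (hostname is a placeholder, as in the thread):
#   ok, elapsed = time_lookup("nodenamehere.amazonaws.com")
#   print(ok, round(elapsed, 3))
```

Lookups that regularly take several seconds, or that fail intermittently, would point toward DNS rather than the agent itself. Consistently fast lookups would suggest looking elsewhere.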

      J


      • tchjts1
        Senior Member
        • May 2008
        • 1605

        #4
        2.0.3 has some bugs. If I were you, I would move up to 2.0.6

        How many hosts are you monitoring? Have you looked at your graphs for the Zabbix internal checks that provide information on performance of the various Zabbix server processes?

        I personally build a screen for my Zabbix App server, Zabbix DB server and my Zabbix proxy servers using these graphs. See my screenshots below for my App server. This information can guide you in dynamically tuning your Zabbix server parameters and give you insight into how well it is functioning at any given time.

        Regarding your network errors in the Zabbix log, I had similar issues with mine and I resolved the bulk of them by increasing my Timeout= value to 10 (in zabbix_server.conf). The default value is 3. If you change that, you need to restart your Zabbix server process for it to take effect.

        Why do you say you have to use DNS to reach your agents?
        Attached Files


        • joshuamcdo

          #5
          Dns.

          I have to use DNS because of how the network is set up (cloud).

          I am monitoring around 100 hosts tops, but the server itself never has a load.

          I was reading and found that "zabbix[host,agent,available]" has been deprecated.
          So I created the trigger that was suggested in the post.


          I looked at mine, and have the following :

          Server {HOST.NAME} is unreachable
          {Template_Lx:zabbix[host,agent,available].last(0)}=0

          So I created the one they suggested...
          Zabbix agent on {HOST.NAME} is unreachable for 5 minutes
          {Template_Lx:agent.ping.nodata(5m)}=1

          While the host status stayed green, the trigger started emailing alerts like mad. Some warned that the host had been unreachable for 5 minutes, others for 1 hour and 47 minutes. I am at a loss as to what is causing that too.
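
For reference, `nodata(5m)` evaluates to 1 when no value for the item has been received during the last 5 minutes, so the suggested trigger depends on the `agent.ping` values the server actually stored, not on the host's availability icon. Below is a simplified Python model of that check, assuming timestamps in Unix seconds. The function names are hypothetical, and the sketch ignores how and when the Zabbix server schedules trigger evaluation.

```python
def nodata(last_value_clock, now, period_seconds=300):
    """Simplified model of nodata(5m).

    Returns 1 if the most recent value is older than the period,
    otherwise 0. Boundary handling here is an assumption.
    """
    return 1 if now - last_value_clock > period_seconds else 0


def trigger_fires(last_value_clock, now):
    # Mirrors the expression {Template_Lx:agent.ping.nodata(5m)}=1
    return nodata(last_value_clock, now, period_seconds=300) == 1
```

Under this model, a host whose last stored `agent.ping` value is 1 hour and 47 minutes old fires the trigger exactly like one that went silent 6 minutes ago, which would be consistent with alerts reporting very different durations.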

          J


          • tchjts1
            Senior Member
            • May 2008
            • 1605

            #6
            Originally posted by joshuamcdo

            I am monitoring around 100 hosts tops, but the server itself never has a load.

            J
            Server has never had a load on what? CPU? That isn't really a good indicator of Zabbix processes performance though.

            Take a look at those graphs I mentioned above. They are already pre-made for you and you can find them in Monitoring --> Graphs --> <server>
            Attached Files


            • joshuamcdo

              #7
              Originally posted by tchjts1
              Server has never had a load on what? CPU? That isn't really a good indicator of Zabbix processes performance though.

              Take a look at those graphs I mentioned above. They are already pre-made for you and you can find them in Monitoring --> Graphs --> <server>
              I inherited this setup, and I am starting to think it's just plain hosed.
              I don't have the same server groups you mentioned in your post, nor do I have the charts you highlighted. I think I am going to declare no joy and rebuild this from the ground up, dumping and importing what I can from the old setup. But to be honest, I am not sure I want ANYTHING from the current setup to enter the new one.

              J -


              • tchjts1
                Senior Member
                • May 2008
                • 1605

                #8
                Slow down on the rebuilding.

                Groups are named whatever anybody wants to name them. I am just basically showing you how to get to those graphs. Your groups can certainly be named differently. Look for whatever group your Zabbix server is in.

                If those graphs aren't there for your Zabbix server, it may simply be that you don't have that template attached to your Zabbix server. The template in question is Template App Zabbix Server.


                • joshuamcdo

                  #9
                  There are so many problems with this install.

                  Originally posted by tchjts1
                  The template in question is Template App Zabbix Server.

                  The template you suggested does not exist on the server I am working on.
                  I am mostly monitoring things in EC2, which is where the server resides, so there shouldn't be connectivity issues. The firewall rules are all correct and in place. Plus, they installed it on Ubuntu, which is just a no-no for us.

                  J


                  • joshuamcdo

                    #10
                    Screen shots

                    Originally posted by tchjts1
                    The template in question is Template App Zabbix Server.
                    I meant to attach a couple of screenshots with the hostnames chopped out of them.
                    I tried, but I can't: exceeded quota limit.

                    J
                    Last edited by Guest; 13-06-2013, 22:44.


                    • joshuamcdo

                      #11
                      Screen shots


                      • joshuamcdo

                        #12
                        Originally posted by tchjts1
                        Regarding your network errors in the Zabbix log, I had similar issues with mine and I resolved the bulk of them by increasing my Timeout= value to 10 (in zabbix_server.conf). The default value is 3. If you change that, you need to restart your Zabbix server process for it to take effect.
                        My current value is set to 30 ..

                        <snip>
                        ### Option: Timeout
                        # Specifies how long we wait for agent, SNMP device or external check (in seconds).
                        #
                        # Mandatory: no
                        # Range: 1-30
                        # Default:
                        Timeout=30
                        </snip>

                        Thanks,
                        J


                        • joshuamcdo

                          #13
                          Re: Help..

                          Originally posted by tchjts1
                          Server has never had a load on what? CPU? That isn't really a good indicator of Zabbix processes performance though.

                          Take a look at those graphs I mentioned above. They are already pre-made for you and you can find them in Monitoring --> Graphs --> <server>

                          I don't have most of the options you mention in your post/pic.

                          I can't seem to attach anything worth viewing. I am not sure how you're able to, but at 125k there is not enough space. The screenshots I am taking average around 256k.

                          I am trying to understand why the server I am running doesn't have those graph options and how to get them. I can tell you that the average system load on this machine is very, very low, around 1%.

                          I did complete another test..
                          while :; do echo -en "`date +%T-%D`\n Status=`zabbix_get -s nodenamehere.amazonaws.com -p 10050 -k "agent.ping"`\n" | tee -a /tmp/uptime.test;sleep 1s; done

                          I found several similar sections to the one below.

                          08:49:12-06/15/13
                          Status=1
                          08:49:13-06/15/13
                          Status=1
                          08:49:14-06/15/13
                          Status=0
                          08:49:36-06/15/13
                          Status=0
                          08:49:57-06/15/13
                          Status=0
                          08:50:13-06/15/13
                          Status=1
                          08:50:14-06/15/13
                          Status=0
                          08:50:28-06/15/13
                          Status=1
                          08:50:31-06/15/13
                          Status=1
                          08:50:32-06/15/13

                          Look at 08:49:14-06/15/13: the very next iteration occurs at 08:49:36-06/15/13, so 22 seconds are lost. The next iteration, also without success, happens at 08:49:57-06/15/13, with 21 more seconds lost. After the failure at 08:50:14-06/15/13, the next iteration happens at 08:50:28-06/15/13, with 14 more seconds lost. I don't know whether it was waiting for the zabbix_get command to connect or what the deal is there. Since the check works at other times, either the server slowed to a crawl and that process hung, or the zabbix_get command timed out or hit a network error.
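
The gaps described above can also be pulled out of the log mechanically. Below is a minimal Python sketch, assuming the file contains lines in the `date +%T-%D` format produced by the loop, with `Status=` lines in between. The function name `find_gaps` and the 5-second threshold are assumptions chosen for illustration.

```python
from datetime import datetime

STAMP_FORMAT = "%H:%M:%S-%m/%d/%y"  # matches `date +%T-%D`


def find_gaps(lines, threshold_seconds=5):
    """Return (previous, current, gap_seconds) for each pair of
    consecutive timestamps more than threshold_seconds apart."""
    stamps = []
    for line in lines:
        try:
            stamps.append(datetime.strptime(line.strip(), STAMP_FORMAT))
        except ValueError:
            continue  # not a timestamp line, e.g. "Status=1"
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = int((cur - prev).total_seconds())
        if delta > threshold_seconds:
            gaps.append((prev.strftime("%H:%M:%S"),
                         cur.strftime("%H:%M:%S"), delta))
    return gaps


# The excerpt posted above, used as sample input.
SAMPLE = """08:49:12-06/15/13
 Status=1
08:49:13-06/15/13
 Status=1
08:49:14-06/15/13
 Status=0
08:49:36-06/15/13
 Status=0
08:49:57-06/15/13
 Status=0
08:50:13-06/15/13
 Status=1
08:50:14-06/15/13
 Status=0
08:50:28-06/15/13
 Status=1
08:50:31-06/15/13
 Status=1
08:50:32-06/15/13
"""

for prev, cur, seconds in find_gaps(SAMPLE.splitlines()):
    print(f"{prev} -> {cur}: {seconds} s")
```

Running the same function over the whole of /tmp/uptime.test would show how often the stalls occur and whether they cluster around particular times.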

                          I appreciate your help on this so far, and in the future.

                          J


                          • joshuamcdo

                            #14
                            Hrmmm..

                            Did I say something wrong ?

                            J


                            • tchjts1
                              Senior Member
                              • May 2008
                              • 1605

                              #15
                              No. I'm just out of suggestions.
