Server status problem

  • PiotrIr
    Member
    • Jun 2013
    • 45

    #1

    Server status problem

    I have many sites with many servers behind NAT, monitored using Zabbix 2.0.6 active agents. My problem is monitoring server status: Zabbix shows me when some problems come up, but it doesn’t show when a server is down or has been restarted. I’ve tried {ACTIVE Template Windows Baseline:agent.ping.nodata(15m)}=1, increasing the period to as much as 24 hours, but this shows nearly all servers as down. In the Windows Baseline template I can see the following enabled triggers:

    Server {HOSTNAME} is unreachable ({ACTIVE Template Windows Baseline:status.last(0)}=2)&({ACTIVE Template Windows Baseline:status.nodata(600)}=1)

    {HOSTNAME} has just been restarted {ACTIVE Template Windows Baseline:system.uptime.last(0)}<600

    But they don't seem to be working.
    Could you advise how to get notified when a server is restarted or down, please?
  • tchjts1
    Senior Member
    • May 2008
    • 1605

    #2
    What is your polling interval for those 2 items? I am using agent version 2.0.9 with those (active) items at 60 second intervals.

    I use the stock triggers and have no issues detecting a server reboot or agent unavailable.

    These are the triggers I have:

    Server restarted:
    Code:
    {Template OS Windows:system.uptime.last(0)}<600
    Agent unreachable:
    Code:
    {Template OS Windows:agent.ping.nodata(5m)}=1
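
To make the logic of these two triggers concrete, here is a minimal Python sketch of what each expression checks. It is only an illustration, not Zabbix code: the function names are hypothetical, and it assumes timestamps are in seconds and refer to when the server last received a value.

```python
from typing import Optional


def uptime_trigger_fires(last_uptime_seconds: int, threshold: int = 600) -> bool:
    """Mirrors system.uptime.last(0) < 600: true while uptime is under 10 minutes."""
    return last_uptime_seconds < threshold


def nodata_trigger_fires(last_value_ts: Optional[float], now: float, period: int = 300) -> bool:
    """Mirrors agent.ping.nodata(5m) = 1: true when no value arrived within `period` seconds."""
    if last_value_ts is None:
        # Nothing has ever been received for this item.
        return True
    return now - last_value_ts > period


# A host that booted 2 minutes ago counts as "just restarted".
print(uptime_trigger_fires(120))
# A ping received 200 seconds ago is still inside the 5 minute window.
print(nodata_trigger_fires(last_value_ts=1000.0, now=1200.0))
# The same ping is outside the window 400 seconds later.
print(nodata_trigger_fires(last_value_ts=1000.0, now=1400.0))
```

Under that assumption, a healthy agent can still look "down" if its values reach the server too late, which is worth keeping in mind when nearly every host trips the nodata trigger at once.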


    • PiotrIr
      Member
      • Jun 2013
      • 45

      #3
      tchjts1,

      Thank you for your reply.

      My server restarted trigger is exactly the same as in your example.

      For agent.ping.nodata I've tried 5m, 15m, 60m and 1440m.


      • tchjts1
        Senior Member
        • May 2008
        • 1605

        #4
        As mentioned in my earlier response:
        What is your polling interval for those 2 items?

        Also, when looking at one of your hosts under items and then under triggers, do they both have green checkmarks under the "Error" column, or are they showing red with an actual error?
        Attached Files


        • PiotrIr
          Member
          • Jun 2013
          • 45

          #5
          Sorry, I misunderstood you.

          I'm not sure if the polling interval is the same as the update interval (if not, could you tell me how to check this?). If so:

          agent.ping 30s
          system.uptime 300s

          Both have a green mark under the "Error" column.


          • tchjts1
            Senior Member
            • May 2008
            • 1605

            #6
            Polling interval and update interval... same to me. Your settings are fine there.

            When you say they are green, you are looking at a host, and not at the template, right?

            The next step I would take is to see whether the Zabbix internal processes are sufficient to handle the workload.

            Take a look at this post, at the last paragraph and the graphs that follow.

            If you assign that template to your Zabbix server, you will be able to see if Zabbix is struggling or not. Here is the post:


            • PiotrIr
              Member
              • Jun 2013
              • 45

              #7
              Thank you for your reply.

              Yes, green is on host.

              Could you help me interpret the data, please? Some processes are very busy, but I'm not sure whether they relate to the issue or, if they do, how to resolve the problem.
              Last edited by PiotrIr; 26-06-2014, 10:17.


              • tchjts1
                Senior Member
                • May 2008
                • 1605

                #8
                Those images are very hard to see because they are so small. But I will certainly be glad to help you interpret them.

                Can you take a screenshot of each graph and attach them as separate images? (I use MWSnap3 for this purpose)

                Also, instead of using a 1 hour timeframe in your graphs, please use 1 day (24 hours) instead. This will give a better overall picture of your process usage.

                MWSnap3 will allow your screenshots to look like this when you upload them:
                Attached Files


                • PiotrIr
                  Member
                  • Jun 2013
                  • 45

                  #9
                  tchjts1,

                  Once again, thank you for your help.
                  Pictures are below. You will see a break of around 45 minutes in the data; I shut down a server for that time.
                  Last edited by PiotrIr; 28-08-2014, 11:14.


                  • tchjts1
                    Senior Member
                    • May 2008
                    • 1605

                    #10
                    I would make a few adjustments to your zabbix_server.conf file.

                    You can see that your trapper processes are 100% busy all the time.
                    I would also increase your pollers a bit.
                    I would also allocate some more configuration cache.

                    So these are the settings you should adjust, then after you do that, you need to restart your Zabbix server process:

                    (Note that I leave the defaults in place with the comment sign # preceding that line so that I always know what the default setting is, and I put the new values on a new line without the # sign)

                    I am only guessing that you are running with all the stock default values. If you have already modified those values, then increase them in small chunks until you get the desired results. If you are running on the default values, then the settings suggested below may work for you.

                    For trappers:
                    Code:
                    ### Option: StartTrappers
                    #       Number of pre-forked instances of trappers.
                    #
                    # Mandatory: no
                    # Range: 0-1000
                    # Default:
                    # StartTrappers=5
                    StartTrappers=15
                    For pollers:
                    Code:
                    ### Option: StartPollers
                    #       Number of pre-forked instances of pollers.
                    #
                    # Mandatory: no
                    # Range: 0-1000
                    # Default:
                    # StartPollers=5
                    StartPollers=35
                    I would also bump up the unreachable pollers a bit from default:
                    Code:
                    ### Option: StartPollersUnreachable
                    #       Number of pre-forked instances of pollers for unreachable hosts (including IPMI).
                    #
                    # Mandatory: no
                    # Range: 0-1000
                    # Default:
                    # StartPollersUnreachable=1
                    StartPollersUnreachable=5
                    For configuration cache:
                    Code:
                    ### Option: CacheSize
                    #       Size of configuration cache, in bytes.
                    #       Shared memory size for storing host, item and trigger data.
                    #
                    # Mandatory: no
                    # Range: 128K-1G
                    # Default:
                    # CacheSize=8M
                    CacheSize=64M
                    Do not change the StartDBSyncers value. Leave that at 4.

                    Remember to restart the Zabbix server process after making the adjustments. Let things run for a few hours, then re-check your internal process graphs and see if things have improved.
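
As a quick sanity check after editing, the snippet below sketches how to list which values are actually active in a config file written in the style above (commented defaults kept, new values on their own line). It is a hedged illustration only: `effective_settings` and `SAMPLE` are hypothetical names, and the parser assumes plain `Key=Value` lines with `#` marking comments.

```python
def effective_settings(conf_text: str) -> dict:
    """Return the active Key=Value pairs, skipping blank lines and # comments."""
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # commented-out defaults are kept for reference only
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Sample text mirroring the settings suggested in this post.
SAMPLE = """\
# StartTrappers=5
StartTrappers=15
# StartPollers=5
StartPollers=35
# StartPollersUnreachable=1
StartPollersUnreachable=5
# CacheSize=8M
CacheSize=64M
"""

for name, value in effective_settings(SAMPLE).items():
    print(f"{name} = {value}")
```

To run it against a real file, read that file's contents into a string and pass it in place of `SAMPLE`; the path depends on your installation.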


                    • PiotrIr
                      Member
                      • Jun 2013
                      • 45

                      #11
                      tchjts1,

                      You are a genius. This works like a charm! Thank you so much!


                      • tchjts1
                        Senior Member
                        • May 2008
                        • 1605

                        #12
                        You're welcome. Keep in mind my suggested settings are just that - suggestions. You may need to tweak them further for optimal performance.

                        These internal graphs should be reviewed on a regular basis, and particularly if you are adding in more hosts/items/triggers as time goes on.

                        One other point is that the template the internal items belong to also has built in triggers that you should have seen on the Zabbix dashboard. Specifically the one that would have triggered for trappers being more than 75% busy.

                        I would suggest that these triggers be taken seriously as it certainly affects Zabbix performance.
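
The idea behind a "more than 75% busy" trigger can be sketched as an average over recent samples compared against a threshold. This is a rough illustration and not the template's actual expression: the function name, the sample values, and the choice of a simple mean are all assumptions.

```python
from statistics import mean


def overloaded_processes(busy_samples: dict, threshold: float = 75.0) -> list:
    """Return process types whose average busy percentage exceeds the threshold."""
    return sorted(
        name
        for name, samples in busy_samples.items()
        if samples and mean(samples) > threshold
    )


# Hypothetical busy percentages collected over a review window.
samples = {
    "trapper": [100.0, 100.0, 98.5],
    "poller": [40.0, 55.0, 62.0],
    "housekeeper": [],  # no data collected yet
}

print(overloaded_processes(samples))  # ['trapper']
```

Averaging over a window rather than reacting to a single sample avoids alerting on short spikes, which is why a sustained 100% like the trappers in this thread stands out.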


                        • PiotrIr
                          Member
                          • Jun 2013
                          • 45

                          #13
                          I will keep an eye on this. Once again, thank you so much.


                          • tchjts1
                            Senior Member
                            • May 2008
                            • 1605

                            #14
                            I have an action set up to notify me by e-mail any time these items trip the thresholds.
                            Attached Files


                            • PiotrIr
                              Member
                              • Jun 2013
                              • 45

                              #15
                              This makes perfect sense, thank you for the advice.

                              I’ve also noticed a small problem with my Zabbix housekeeper process. When it runs at 100% it slows the server down a lot. I decreased items from 500 to 100 and will see whether this helps. I read some posts and optimized MySQL (it didn't help much), but is there any other way to bring the 100% down? I noticed the problem is HDD speed, but I can't improve that as there is no budget for new hardware.

                              As I’m playing with Zabbix (I must say I like it), new things have come to mind and I wonder if you could help me.

                              I have bandwidth monitoring on a couple of routers using Template SNMP Interfaces and this is working perfectly fine. However, to troubleshoot issues I would like to monitor this per source (internal) IP address. Is there any way to monitor bandwidth per IP address directly on the router instead of the switch?

                              The other thing is recording connections on the router (source -> destination). I realize this may take a lot of resources, but sometimes, when I need to find an infected PC on the network, it could be very helpful.

                              Once again, thank you for your help.

