Zabbix External Checks - How much is too much?

  • jonxor
    Junior Member
    • Jun 2016
    • 24

    #1


    Hello!
    I have a fairly large zabbix environment:
    4081 hosts
    344991 Items
    2339 NVPS
    12 proxies

    For about 2000 hosts, I want to run an external check, every 60 seconds, which should return in under 15 seconds.

    Are these like web tests, where a proxy will perform the check, or do all checks happen from the main zabbix collector?

    Which pool of threads perform external checks? I am assuming it's normal pollers.

    The documentation states not to "overuse" external checks, for risk of degrading performance, so I am wondering how well they scale.

    My collector and proxies are mostly high spec (2x 6-core, hyperthreaded, with 128 GB RAM). Is the degradation in performance related to CPU, disk I/O, RAM, etc.?
    Which spec would I increase to be able to run more external checks?

    Do external checks still obey the "timeout=" option defined in zabbix_server.conf or zabbix_proxy.conf?

    I am guessing that if the script were to become broken, then the poller threads would each lock up for 15 seconds while performing the test.
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    Originally posted by jonxor
    The documentation states not to "overuse" external checks, for risk of degrading performance, so I am wondering how well they scale.
    In general, the impact of external checks is similar to that of "Zabbix agent" (passive) items.
    First: monitor nothing directly from the server. That makes the workload on the server host more predictable. All of your hosts (dummy ones as well) should be monitored through proxies. A side effect of this change is that you can restart the server without interrupting the collection of monitoring data.
    Second: try to group hosts so that those using different external checks sit on different proxies.

    The goal of both changes is to reduce CPU cache misses. The more different types of processes run on the same host, the more likely you are to saturate the bandwidth between RAM and the CPU cache before you saturate the CPU itself.
    Put simply, the CPU spends more and more time waiting for the right memory page to arrive in its cache instead of executing code on data that is already there.
    The more you execute in parallel, the more sense it makes to use hardware with more CPU cores.
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates


    • Linwood
      Senior Member
      • Dec 2013
      • 398

      #3
      Originally posted by jonxor
      Which spec would I increase to be able to run more external checks?

      Do external checks still obey the "timeout=" option defined in zabbix_server.conf or zabbix_proxy.conf?

      I am guessing that if the script were to become broken, then the poller threads would each lock up for 15 seconds while performing the test.
      As mentioned, offloading to proxies can be a big help. The short answer is that an external check forks off multiple processes (2-3 if I recall) while it runs your script. This does not require massive resources, but if your check takes a long time to run, it ties those processes up for that time, including process slots. External checks run in the normal poller context, so increasing the number of pollers lets you run more of them. They also honor the same timeout, and yes, a hung check will hang for at least the timeout period (a check that fails with an error might return sooner, of course).
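Since a hung check holds a poller until the timeout expires, one option is to enforce a shorter limit inside the script itself, so it returns cleanly before Zabbix has to kill it. The following is a minimal Python sketch of that idea, not something described in the thread; `run_with_timeout` is a hypothetical helper name and the 10-second value in the usage comment is an arbitrary assumption.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (ok, stdout).

    ok is False if the command times out, exits non-zero, or cannot be started.
    subprocess.run() kills the child process itself when the timeout expires.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, ""
    except OSError:
        # e.g. the binary is missing or not executable
        return False, ""
    return proc.returncode == 0, proc.stdout.strip()


# Hypothetical usage inside an external check script:
#   ok, out = run_with_timeout(["ipmitool", "..."], 10)
#   print(out if ok else "unknown")
```

The limit chosen here should stay below the Timeout configured on the server or proxy, otherwise the poller gives up first and the inner limit has no effect.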

      One possibility, if you have external checks that are related, is to run several of them together in one external routine, let it return one value normally (as its output), and internally use zabbix_sender for the others. For example, I have an external check that polls for certain OIDs (but they vary, so it does a bit of hunting around each time, which is why it doesn't use an SNMP agent check). Instead of calling this once for the description, then again for a metric, then another metric -- I call it once, and return about 6 items together, 1 as output, 5 in a batch with zabbix_sender.
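A rough Python sketch of this "one value as output, the rest via zabbix_sender" pattern follows. It is an illustration under assumptions, not the script described above: the item keys, the `127.0.0.1` address, and the helper names are hypothetical, values are assumed to contain no whitespace, and the extra items would need to exist on the host as Zabbix trapper items.

```python
import subprocess


def build_sender_input(host, values):
    """Format {key: value} as zabbix_sender input lines: '<host> <key> <value>'."""
    return "".join(f"{host} {key} {value}\n" for key, value in values.items())


def send_batch(host, values, server="127.0.0.1", sender="zabbix_sender"):
    """Push all values in one zabbix_sender call; '-i -' reads the lines from stdin."""
    payload = build_sender_input(host, values)
    proc = subprocess.run(
        [sender, "-z", server, "-i", "-"],
        input=payload, capture_output=True, text=True,
    )
    return proc.returncode == 0


# Hypothetical usage at the end of an external check script:
#   extras = {"psu.temp": 41, "psu.fan": 5200}   # made-up keys
#   send_batch("web01", extras)                  # five or so trapper items in one batch
#   print("ok")                                  # the one value returned as script output
```

Batching this way means one fork of zabbix_sender per poll instead of one full external check per item.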

      It is also possible (though there are some real downsides) to move your checks outside of zabbix entirely, so they are not polled by zabbix but just sit and run, and push their own data over via zabbix_sender in batches. Having to coordinate this with changes in the zabbix configuration makes it rarely a good idea, though. It can be useful, however, for spontaneously generated items, like events in an application that you want as they occur rather than polling for them (like SNMP traps, but not needing SNMP).


      • jonxor
        Junior Member
        • Jun 2016
        • 24

        #4
        OK, This is good info. Thank you for your replies.

        I will try this out and let you know how it works. I'll try to benchmark the load before and after.


        • jonxor
          Junior Member
          • Jun 2016
          • 24

          #5
          Well, I implemented this.
          The script is a simple python script designed to connect to a server's IPMI module (we have populated our DNS such that <hostname>.ipmi.ourdomain.lan resolves to the system's IPMI interface)
          This made it easy to make one dynamic item to apply to all servers, using the {HOST.NAME} Macro.
          The script accepts a hostname as an argument.

          The script queries the interface using ipmitool, and returns with status codes for the power supplies.

          The time for running the script directly from the command line is under 1 second, usually.

          I put the script in the externalscripts folder on the proxies, and added it to the template for about 2400 hosts. My largest proxy has about 1200 of these hosts, and is configured for 600 poller threads. I used a polling interval of 30 seconds initially, and watched with the "ps ax" command on the proxy to see how many instances of the script were running. It bounced up and down between 30 and 110, averaging around 60. I changed the polling interval to 240, and it dropped to 2 - 10 simultaneous instances.
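As a rough cross-check of these numbers (a back-of-the-envelope sketch, not something measured in the thread), steady-state concurrency is approximately the check rate multiplied by the average check duration (Little's law). The function names below are made up for illustration, and the 1.5 s duration is inferred from the observed average of about 60 instances, not reported directly.

```python
def check_rate(hosts, interval_s):
    """Checks started per second when every host is polled once per interval."""
    return hosts / interval_s


def implied_duration(hosts, interval_s, observed_concurrency):
    """Average seconds per check implied by an observed number of running instances."""
    return observed_concurrency / check_rate(hosts, interval_s)


def expected_concurrency(hosts, interval_s, avg_duration_s):
    """Little's law: instances in flight = arrival rate * time each one takes."""
    return check_rate(hosts, interval_s) * avg_duration_s


rate_30 = check_rate(1200, 30)                   # 40 checks per second
duration = implied_duration(1200, 30, 60)        # 1.5 s per check, inferred
at_240 = expected_concurrency(1200, 240, duration)  # 7.5 instances in flight
print(rate_30, duration, at_240)
```

The estimate of 7.5 instances at a 240-second interval falls inside the observed 2 - 10 range. The inferred 1.5 s is somewhat higher than the "under 1 second" seen on the command line, which could reflect load on the proxy or extra helper processes being counted in the "ps ax" output; either explanation is a guess.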

          I haven't seen any performance issues yet, so I think this should be OK.

          I didn't get a chance to benchmark performance before and after. A brief glance looks to be about the same amount of load as it was under before, so I'm not too worried.

          The thing I am more concerned with now is DNS, since each run of the script is another DNS query. I'll have to keep an eye on my DNS infrastructure to make sure this doesn't blow it up.

          Thanks for input, everybody!
          Last edited by jonxor; 15-07-2016, 23:49. Reason: added more info.


          • kloczek
            Senior Member
            • Jun 2006
            • 1771

            #6
            Originally posted by jonxor
            The script queries the interface using ipmitool, and returns with status codes for the power supplies.

            The time for running the script directly from the command line is under 1 second, usually.
            I'm preparing to do IPMI monitoring in a larger environment, so I would be glad to see more details so that I can avoid repeating your mistakes.

            IMO, providing IPMI monitoring as something that runs from the server/proxy was a mistake, because it creates at least two types of problems:

            1) Security: as long as IPMI items must be monitored from a central point, a security bug in the IPMI remote interface makes it possible to penetrate many hosts from that single point.

            2) Scalability: even if IPMI metrics could be collected in a less resource-demanding way (without your python script), the proxy is still a kind of SPOF, and at some scale these checks alone will make it necessary to add yet another proxy.

            A question: is it really not possible to use the server/proxy IPMI interface to do what you are doing with your python script? Where is the problem?

            IMO the best way to solve the above issues would be to have IPMI keys as zabbix agent items.

            My personal plan for organizing IPMI monitoring is to take the existing zabbix code and convert it into an agent loadable module, which should be relatively easy to do.
            The only obstacle could be some other reason why the zabbix IPMI interface was implemented the way it is now. From what I know about IPMI so far, I don't see any reason why it should not be possible to implement this as zabbix agent items.

            Ideally the agent items would have exactly the same key names as they have now, so another thing that needs to change is the ability to use IPMI keys as zabbix agent and/or zabbix agent (active) item types.

            The situation with the zabbix ODBC interface is very similar. IMO both interfaces should be usable as agent items.
            Last edited by kloczek; 16-07-2016, 11:53.

            • jonxor
              Junior Member
              • Jun 2016
              • 24

              #7
              I had a problem where my IPMI modules wouldn't return values for the keys "ps1Status" and "ps2status". I confirmed this by querying the IPMI module directly with ipmitool. We did find that when we used "ipmitool -I lanplus sdr type 'Power Supply'", some of our IPMI modules would return just an OK or Bad if a PSU had failed. Other motherboards provided more granular statuses, such as whether AC input was lost on either power supply, whether a power supply had failed, or whether the state could not be determined.

              I didn't dig into how zabbix's IPMI agent worked, since one of my interns had already written the script and demonstrated that it worked. Deploying the script was the path of least resistance.

              The script determines which motherboard model the system is, and then interprets the output.
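The actual script was not posted, so the following is only a guess at the general shape of the parsing step, in Python. It assumes ipmitool prints one sensor per line as five pipe-separated fields (name, sensor ID, status, entity, reading); the sample lines, the function names, and the 0/1/2 return codes are all hypothetical, and the per-motherboard interpretation described above is not reproduced.

```python
def parse_sdr(text):
    """Parse 'name | id | status | entity | reading' lines into {name: (status, reading)}."""
    sensors = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 5:
            continue  # skip blank or unexpected lines
        name, _sensor_id, status, _entity, reading = fields[:5]
        sensors[name] = (status, reading)
    return sensors


def psu_status_code(sensors):
    """Hypothetical codes: 0 = all PSUs ok, 1 = at least one not ok, 2 = no data."""
    if not sensors:
        return 2
    all_ok = all(status == "ok" for status, _reading in sensors.values())
    return 0 if all_ok else 1


# Made-up sample in the assumed output format
sample = (
    "PS1 Status       | C8h | ok  | 10.1 | Presence detected\n"
    "PS2 Status       | C9h | cr  | 10.2 | Presence detected, Power Supply AC lost\n"
)
print(psu_status_code(parse_sdr(sample)))
```

Boards that only report OK or Bad and boards that report granular readings could then share the status check, with the reading text used for the extra detail where it exists.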
