Zabbix Proxy performance as environment grows

  • tmroberts
    Member
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Jan 2017
    • 73

    #1

    I am in the process of merging two separate Zabbix environments into a new, larger system. Currently the server resides in AWS, is experiencing no performance problems whatsoever, has approximately 1700 hosts, and is running about 1900 NVPS. The proxy servers, of which there are 20, are generally not experiencing any performance issues, except for one that is currently monitoring about 480 hosts. The proxy servers all have the following specifications:

    CentOS 7
    Zabbix v3.4.10
    4 CPUs (a couple have 8)
    16GB memory (a couple have 24GB)


    All of the proxy servers have a loopback network interface with the same IP address that gets routed to the server, so that regardless of where you are in our network, the agent config files all use the same Server and ServerActive IP address. This works really well, by the way.

    They all have the following configuration:

    CacheSize=512M
    ConfigFrequency=300
    DBName=zabbix
    DBPassword=zabbixaws
    DBSocket=/var/lib/mysql/mysql.sock
    DBUser=zabbix
    DataSenderFrequency=5
    DebugLevel=3
    EnableRemoteCommands=1
    ExternalScripts=/usr/lib/zabbix/externalscripts
    HeartbeatFrequency=60
    Hostname=--------zbprx01
    HostnameItem=--------zbprx01
    HousekeepingFrequency=1
    ListenIP=xx.xxx.x.170,yy.yy.yyy.50
    LogFile=/var/log/zabbix/zabbix_proxy.log
    LogFileSize=512
    LogSlowQueries=3000
    PidFile=/var/run/zabbix/zabbix_proxy.pid
    ProxyMode=0
    ProxyOfflineBuffer=24
    SNMPTrapperFile=/var/log/snmptrap/snmptrap.log
    Server=xx.xxx.x.xxx
    SocketDir=/var/run/zabbix
    SourceIP=xx.xxx.x.170
    StartDBSyncers=4
    StartDiscoverers=10
    StartHTTPPollers=10
    StartIPMIPollers=5
    StartPingers=30
    StartPollers=40
    StartPollersUnreachable=30
    StartTrappers=5
    StartVMwareCollectors=1
    Timeout=30
    UnavailableDelay=30
    UnreachableDelay=30
    VMwareCacheSize=256M


    xx.xxx.x.170 is the IP address of the loopback interface and yy.yy.yyy.50 is the actual IP address of the proxy server. I've also masked the proxy's hostname for privacy's sake; that's what --------zbprx01 is.

    The proxy server I am experiencing issues with has StartPollers set to 50.

    I can't seem to get the Zabbix busy poller value below about 80% on average. If I were done merging systems I wouldn't be too concerned, but I still have about 3000 hosts between the two older systems to migrate, so I'm a bit concerned about performance going forward and would really like to know how to scale this up further.
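    As a rough, back-of-the-envelope aid (not something Zabbix itself provides), the poller numbers above can be turned into a StartPollers estimate if you assume that poller work is spread evenly across processes, so that busy% multiplied by StartPollers stays roughly constant for a given load. The function name and the load_growth factor are hypothetical; real poller load also depends on item types, check durations and Timeout, so treat the result as a starting point for tuning rather than a target.

```python
import math


def pollers_for_target(current_pollers: int, busy_pct: float,
                       target_pct: float, load_growth: float = 1.0) -> int:
    """Estimate a StartPollers value that would bring average busy% to target_pct.

    Assumption (hypothetical linear model): the number of "fully busy" poller
    processes equals current_pollers * busy_pct / 100, and it grows in
    proportion to load_growth as hosts are added.
    """
    if current_pollers <= 0:
        raise ValueError("current_pollers must be positive")
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in (0, 100]")
    # Equivalent number of pollers that are busy 100% of the time.
    busy_equivalents = current_pollers * busy_pct / 100 * load_growth
    # Spread that work so each poller is only target_pct busy; round up.
    return math.ceil(busy_equivalents / (target_pct / 100))


if __name__ == "__main__":
    # 50 pollers at ~80% busy, aiming for ~50% busy at today's load.
    print(pollers_for_target(50, 80, 50))  # prints 80
    # Same target if the poller workload were to double (hypothetical).
    print(pollers_for_target(50, 80, 50, load_growth=2.0))  # prints 160
```

    Under this linear model, every poller is a separate process on the proxy, so check CPU and memory headroom before raising StartPollers that far.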
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    Originally posted by tmroberts
    I can't seem to get the Zabbix busy poller value below about 80% on average. If I were done merging systems I wouldn't be too concerned, but I still have about 3000 hosts between the two older systems to migrate, so I'm a bit concerned about performance going forward and would really like to know how to scale this up further.
    You need to start moving away from passive agent monitoring, because it does not scale.
    Passive monitoring has a bottleneck in polling from the proxy and/or server.
    Switching to active monitoring lets you use the CPU context-switch (ctx) bandwidth available on each monitored host, instead of scaling it only on the proxies and/or server.
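    The scaling argument above can be made concrete with Little's law: the average number of busy poller processes is roughly the rate of passive checks per second multiplied by the average time each check holds a poller. Values from active agent items are pushed to the proxy and received by trapper processes, so they do not occupy pollers, while a passive check that gets no answer can hold a poller for up to Timeout seconds (30 in the configuration above). The rates, durations and timeout fraction below are made-up illustrations, not measurements from this proxy.

```python
def mean_check_seconds(normal_seconds: float, timeout_seconds: float,
                       timeout_fraction: float) -> float:
    """Average time a passive check occupies a poller.

    Assumption: timeout_fraction of checks get no reply and wait the full
    Timeout; the rest complete in normal_seconds.
    """
    if not 0.0 <= timeout_fraction <= 1.0:
        raise ValueError("timeout_fraction must be between 0 and 1")
    return (1.0 - timeout_fraction) * normal_seconds + timeout_fraction * timeout_seconds


def busy_pollers(total_vps: float, active_fraction: float, check_seconds: float) -> float:
    """Little's law: busy pollers = passive checks per second * seconds per check.

    Values from active items are excluded because they arrive via trappers.
    """
    passive_vps = total_vps * (1.0 - active_fraction)
    return passive_vps * check_seconds


if __name__ == "__main__":
    healthy = mean_check_seconds(0.05, 30.0, 0.0)
    degraded = mean_check_seconds(0.05, 30.0, 0.01)
    # Hypothetical 500 values/s, all passive.
    print(round(busy_pollers(500, 0.0, healthy), 2))
    print(round(busy_pollers(500, 0.0, degraded), 2))
    # Same load with 90% of the values moved to active items.
    print(round(busy_pollers(500, 0.9, degraded), 2))
```

    In this toy model, one check in a hundred waiting the full 30-second Timeout raises the busy-poller figure about sevenfold (from 25 to about 175), which is why unresponsive targets and long Timeout values can matter as much as raw NVPS for passive checks.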
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates

    • tmroberts
      Member
      Zabbix Certified Specialist, Zabbix Certified Professional
      • Jan 2017
      • 73

      #3
      As a policy I only allow the use of non-active items unless absolutely necessary. So the only passive metrics from agents are the Zabbix Agent template based metrics. Having said that, the vast majority of the items are SNMPv2 and SNMPv3 based.

      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #4
        SNMP metrics are by definition passive, and they use polling.
        However, for Zabbix agent items, choosing passive monitoring is a bad decision not only from a scalability point of view but from a security point of view as well.


        • tmroberts
          Agreed. I may not have been clear: by policy we allow only Zabbix active items unless absolutely necessary, so 95% of our non-SNMP items are active items. With SNMP items being passive by definition, is there any way to get around the load the proxy is experiencing? Given the way we locate our proxy servers and use OSPF/loopback routing for the proxy address, I don't think we'll be able to place two proxy servers in the same network. Further, from what I have read on this forum, there doesn't seem to be any way to load balance a proxy (behind an F5 or anything like that), unless that is changing with v4.0. So what would you recommend? This particular proxy server is already under quite a load and can expect a significant number of additional hosts/items.
      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #5
        You need to say a bit more about your issue.
        The number of hosts monitored through a proxy is not a good metric.
        It would help to know: the proxy NVPS, the type of DB backend used, whether you have patched the proxy to use a higher-than-standard ZBX_MAX_HRECORDS, and what exactly bothers you.

        PS. I don't know of any scenario in which it is necessary to use passive items, or in which it is not possible to replace them with active ones.


        • tmroberts
          First, to address the issue of passive items: I did a quick search, and currently the only passive agent item is the agent ping item, of which we have only 430 in the entire system. So anything else that is passive is SNMP-based.

          To answer the questions about the proxy specifics:
          Proxy NVPS is currently about 860 (from the Admin --> Proxies page).
          Database is MySQL 5.7.
          As for a "patched proxy", I installed from packages, so I'm not sure about ZBX_MAX_HRECORDS.

          What is bothering me is that I am getting alerts off and on throughout the day saying that the Zabbix busy poller is more than 75% busy, and in fact it quite frequently peaks at 100%. As of writing this response, it's averaging about 70%. My concern isn't that it's running at 70% right now (with or without peaks), but that I still have nearly 2500 hosts to migrate to this system, with the vast majority of them to be pointed at this particular proxy. So my biggest concern is scalability and the maximum number of hosts/NVPS. I'm pretty sure I have "most" of our networking gear already moved over, which is by far the biggest consumer of SNMP traffic in Zabbix, so most of the remaining hosts should be agent-based hosts only... I hope.
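          The growth concern above can be sized with simple arithmetic. Taking the roughly 860 NVPS this proxy reports for its roughly 480 hosts, the sketch below extrapolates the per-host rate to the hosts still to be migrated. The function and its per_host_ratio knob are hypothetical: the remaining hosts are expected to be mostly agent-based rather than SNMP network gear, so their per-host rate (and especially their poller load) may differ a lot from today's average, and the output is a rough scenario, not a forecast.

```python
def project_nvps(current_nvps: float, current_hosts: int, added_hosts: int,
                 per_host_ratio: float = 1.0) -> float:
    """Linear NVPS projection for a proxy after adding hosts.

    per_host_ratio scales the NVPS of each new host relative to the current
    per-host average (1.0 = same as today, 0.5 = half, and so on).
    """
    if current_hosts <= 0:
        raise ValueError("current_hosts must be positive")
    per_host = current_nvps / current_hosts
    return current_nvps + added_hosts * per_host * per_host_ratio


if __name__ == "__main__":
    # Figures from this thread: ~860 NVPS over ~480 hosts, ~2500 hosts to come.
    print(round(project_nvps(860, 480, 2500)))       # same per-host rate
    print(round(project_nvps(860, 480, 2500, 0.5)))  # new hosts half as busy
```

          With these numbers the projection is about 5,339 NVPS at today's per-host rate, or about 3,100 NVPS if new hosts average half of it. Since only the "vast majority" of the 2500 hosts would go to this proxy, 2500 is an upper bound for added_hosts.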

          It might also help to understand how we have the network interface/routing set up. Each proxy has a loopback address that is the same on ALL proxies (e.g. 10.10.200.10), and the config for each proxy uses it as the source IP address. Each agent config uses that same IP as both the active and the passive server IP, so regardless of where the agent resides in the company, the configuration files are identical. On the network side, the loopback address routes to the closest proxy in the network (by route, not physical distance).

          So far this has worked incredibly well, with one exception: if we have two proxies on the same subnet, the network seems to favor one proxy over the other, or bounce back and forth, so we aren't able to put more than one proxy in a subnet. I haven't yet tried putting a second proxy in the same subnet without using the loopback address, so I don't know if that would work. If I did that, I think I could then manually move non-discovered SNMP hosts to that proxy. Something to try, I guess.
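          One way to picture the same-subnet problem described above is a toy model of how the shared loopback address is reached: the network prefers the proxy with the lowest routing cost, and when two proxies advertise the address at equal cost (as two proxies on one subnet likely would), there is no single winner, so traffic may be split or may shift between them depending on tie-breaking or equal-cost multipath behavior. This is an assumption about the OSPF setup, not something confirmed in the thread, and the proxy names and costs below are hypothetical.

```python
from typing import Dict, List


def anycast_winners(route_costs: Dict[str, int]) -> List[str]:
    """Return the proxies that tie for the lowest routing cost to the shared IP.

    Hypothetical model: one element means a single proxy attracts the traffic;
    more than one means an equal-cost tie, where traffic may be split or may
    move between proxies.
    """
    if not route_costs:
        raise ValueError("no route to the shared loopback address")
    best = min(route_costs.values())
    return sorted(name for name, cost in route_costs.items() if cost == best)


if __name__ == "__main__":
    # Proxies in different parts of the network: one clear winner.
    print(anycast_winners({"proxy-a": 10, "proxy-b": 30}))
    # Two proxies on the same subnet advertise at equal cost: a tie.
    print(anycast_winners({"proxy-a": 10, "proxy-b": 10, "proxy-c": 40}))
```

          In the tie case, which proxy an agent reaches is no longer predictable, and that matters because each host is assigned to one specific proxy in Zabbix.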