Zabbix HA success stories

  • jerdmann
    Junior Member
    • Aug 2013
    • 1

    #1

    Zabbix HA success stories

    Hi everyone, I have a question I hope you can help me with. We currently have a Zabbix node-based distributed monitoring configuration with 8 regional child nodes and a single central master. We monitor about 1500 machines, 70000 items, and run about 800 values per second on the master.

    We are looking to move to a proxy-based configuration due to some parent-child replication problems we've run into, and also because nodes are unsupported and are going away in the near future. However, our biggest concern is that a proxy-based configuration seems a bit less 'survivable' than nodes. We have all email actions configured on the child nodes so that if we lose the master, we still have visibility and emails on all nodes. We just lose the single pane of glass (if that makes any sense).

    I was hoping to hear some 'success stories' from other Zabbix users that run a proxy-based distributed config in an environment our size or larger. Specifically, how is everything from an availability perspective? Do you find that a 2-node HA master is 'enough' redundancy? Does replication between the two nodes keep up with your workload? Anything else to add?

    It would be great to hear from other people's experience before we take the plunge for ourselves. Let me know what you think. Thanks for the help!
  • mushero
    Senior Member
    • May 2010
    • 101

    #2
    Our system is about your size, and we are moving to proxies for everything, in part to support HA: we can fail over to our standby Zabbix server in a different country by changing just a few proxy options (rather than every host, since they are all locked down to our Zabbix public IP). We run in 100 data centers globally, so our system is all public Internet-based.

    We are on 1.8.3, and the problem with proxies is that they don't tolerate connection issues very well. Actually they handle them quite poorly: they just get stuck and won't time out or retry, so we have to restart them. Soon we'll have a tool to detect this and restart the proxy when the local queue gets too large (using SQL).

    Sometimes we run two routes, via route or iptables NAT, so our proxies can route around the world in different ways. Some day that will be automatic too, so that after 5 bad restarts that still get stuck, we'll change routes.
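    The escalation described here (restart the proxy while its queue is too large, then change routes after 5 restarts that did not help) can be sketched as a small policy object. This is only an illustration: the class name, the action strings, and the queue threshold are hypothetical, and the actual restart and route switch are left to whatever tooling is in place.

```python
from dataclasses import dataclass


@dataclass
class ProxyWatchdog:
    """Hypothetical escalation policy for a stuck proxy (not a Zabbix API)."""

    queue_limit: int = 500      # assumed threshold; tune to your workload
    max_restarts: int = 5       # "5 bad restarts" from the post
    failed_restarts: int = 0    # consecutive restarts that did not drain the queue

    def next_action(self, queue_size: int) -> str:
        """Return "ok", "restart_proxy", or "switch_route" for this check."""
        if queue_size <= self.queue_limit:
            # Queue has drained, so the last restart (if any) worked.
            self.failed_restarts = 0
            return "ok"
        if self.failed_restarts >= self.max_restarts:
            # Restarting has not helped; try the alternate route instead.
            self.failed_restarts = 0
            return "switch_route"
        self.failed_restarts += 1
        return "restart_proxy"
```

    Calling next_action() once per check interval with the current local queue size keeps the policy separate from how the restart or route change is actually carried out.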

    I think/hope 2.x proxies will be better at this, as we REALLY need them to time out if no data is sent or no reply arrives within 30 seconds, and then reconnect; that would solve a lot for us.
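    The behaviour wished for here (give up after 30 seconds without a reply, then reconnect) looks roughly like the following at the socket level. This is a generic sketch, not the Zabbix proxy protocol or its code; the function name, the retry count, and the payload handling are assumptions made for illustration.

```python
import socket


def send_with_timeout(address, payload, timeout=30.0, attempts=3):
    """Send payload and wait for a reply, reconnecting on timeout.

    Hypothetical helper: returns the reply bytes, or None if every
    attempt timed out or failed.
    """
    for _ in range(attempts):
        try:
            # The timeout applies to connect, send, and recv on this socket.
            with socket.create_connection(address, timeout=timeout) as sock:
                sock.sendall(payload)
                reply = sock.recv(4096)
                if reply:
                    return reply
        except OSError:
            # Covers timeouts and connection errors; fall through and reconnect.
            pass
    return None
```

    Each attempt opens a fresh connection, so a connection that has silently gone dead is dropped rather than waited on indefinitely.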

    Also, be sure to monitor the queues for the proxies, both at the proxy and in Zabbix, with graphs and triggers, so that a trigger fires if a proxy is more than x00 items behind. I can send you SQL for this if you want.
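    The SQL offered in this post is not shown in the thread. As a stand-in, here is a minimal sketch of one way to count unsent values in a proxy's local database. It assumes the proxy stores collected values in a proxy_history table and records the last id sent to the server in the ids table (table_name 'proxy_history', field_name 'history_lastid'); check this against the schema of your own version before relying on it. The demo database is an in-memory SQLite stand-in with only the columns the query needs.

```python
import sqlite3

# Assumed schema: proxy_history(id, ...) plus an ids row holding the last sent id.
QUEUE_SQL = """
SELECT COUNT(*) FROM proxy_history
WHERE id > COALESCE(
    (SELECT nextid FROM ids
     WHERE table_name = 'proxy_history' AND field_name = 'history_lastid'),
    0)
"""


def proxy_queue_size(conn):
    """Number of locally stored values not yet sent to the server."""
    return conn.execute(QUEUE_SQL).fetchone()[0]


def make_demo_db(total_rows, last_sent_id):
    """Build a tiny in-memory stand-in for a proxy database (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE proxy_history (id INTEGER PRIMARY KEY, value TEXT)")
    conn.execute("CREATE TABLE ids (table_name TEXT, field_name TEXT, nextid INTEGER)")
    conn.executemany(
        "INSERT INTO proxy_history (id, value) VALUES (?, ?)",
        [(i, "v") for i in range(1, total_rows + 1)],
    )
    conn.execute(
        "INSERT INTO ids VALUES ('proxy_history', 'history_lastid', ?)",
        (last_sent_id,),
    )
    return conn
```

    The returned count can be fed into a Zabbix item on the proxy host and graphed or used in a trigger, as suggested above.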

    Overall, we want a central system as our triggers/templates are very complex (about 200 items, 50 triggers/host, lots of custom parts, GUI, etc.) so we love the proxy idea and are working to improve it.

    We're also working on a PHP GUI for the proxy to help show the queue, local data, servers managed, and simple things. Also to refresh the config, etc. We'll share this when it's usable.


    • BHG_2008
      Junior Member
      • Jan 2008
      • 23

      #3
      Master-child hierarchy is essential

      We are planning to use localized alerting as well, which is only possible via a node. We have about 2,000 sites in which we are putting a node, so that an aggregated data feed comes in from each site. Also, when the master is in maintenance mode (optimizing the database or upgrading, etc), the local child nodes cache the data points and resume in "catch-up" mode until near real time again. I do not believe the mechanism for caching on proxies is sufficient for this purpose. I feel that removing the child node option is the wrong direction. In fact, I would like to see it expanded in 3 ways:
      1) Allow configuration of hosts on the master that belong to a child of a child
      2) Expand the node limit to 10,000
      3) Allow multiple masters for children, so higher fault tolerance is achieved where necessary


      • neominder
        Junior Member
        Zabbix Certified Specialist
        • Feb 2012
        • 11

        #4
        Originally posted by BHG_2008
        We are planning to use localized alerting as well, which is only possible via a node. We have about 2,000 sites in which we are putting a node, so that an aggregated data feed comes in from each site. Also, when the master is in maintenance mode (optimizing the database or upgrading, etc), the local child nodes cache the data points and resume in "catch-up" mode until near real time again. I do not believe the mechanism for caching on proxies is sufficient for this purpose. I feel that removing the child node option is the wrong direction. In fact, I would like to see it expanded in 3 ways:
        1) Allow configuration of hosts on the master that belong to a child of a child
        2) Expand the node limit to 10,000
        3) Allow multiple masters for children, so higher fault tolerance is achieved where necessary
        I agree with this for the most part. One major issue we've run into with proxies is that even though the data gets stored on the proxy when the server is unavailable, calculated items don't get recorded during that time, since the calculations happen on the server, not the proxy. With a child node setup, the child nodes would be doing the calculated items themselves.


        • richlv
          Senior Member
          Zabbix Certified Trainer
          Zabbix Certified Specialist
          Zabbix Certified Professional
          • Oct 2005
          • 3112

          #5
          Originally posted by neominder
          One major issue we've run into with proxies is that even though the data gets stored on the proxy when the server is unavailable, calculated items don't get recorded during that time, since the calculations happen on the server, not the proxy.
          Doing calculated items on proxies might be an interesting feature request, but there could be a problem: a calculated item might reference items that are monitored by different proxies.
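          To make the concern concrete: a calculated item could only be evaluated on a proxy if every item it references is collected by that same proxy. A minimal sketch of that check follows; the item keys, proxy names, and function names are hypothetical and do not correspond to anything in Zabbix itself.

```python
def proxies_for(referenced_items, item_to_proxy):
    """Set of proxies that collect the items a calculated item references."""
    return {item_to_proxy[item] for item in referenced_items}


def computable_on_single_proxy(referenced_items, item_to_proxy):
    """True only if all referenced items are collected by one proxy."""
    return len(proxies_for(referenced_items, item_to_proxy)) == 1


# Hypothetical mapping of item keys to the proxy that monitors them.
item_to_proxy = {
    "web1:cpu.load": "proxy-eu",
    "web2:cpu.load": "proxy-eu",
    "db1:cpu.load": "proxy-us",
}
```

          With this mapping, a calculated item over the two web hosts could in principle be evaluated on proxy-eu, while one that also includes db1 spans two proxies and would still need the server.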
          Zabbix 3.0 Network Monitoring book
