Zabbix_Server periodically stops accepting Active connections

  • gleepwurp
    Senior Member
    • Mar 2014
    • 119

    #16
    hehe, yeah, well, living on the edge might change once we officially go into production...

    I have about 17 Zabbix Server instances (one for each client network zone) with about 60 proxies spread out between them...

    I'm just tired of running old software... We still have our Tivoli ITM 5.1.2 from 2004 up and running, but we'll be migrating most of that stuff to Zabbix.

    G.

    • ingus.vilnis
      Senior Member
      Zabbix Certified Trainer
      Zabbix Certified Specialist
      Zabbix Certified Professional
      • Mar 2014
      • 908

      #17
      Hi,

      Thank you for the graphs. Since your issue was already some time ago, the graphs, as you might have noticed, are not from history tables but from trends, thus giving more straightened lines and not showing spikes.

      I don't have much to add here and our friend tchjts1 already gave you good advice on some improvements.

      The open files limit might also be a factor, as well as some other Linux settings. But please monitor your environment and maybe provide some newer data and graphs if you want. Maybe together we can find some clues.

      Best Regards,
      Ingus

      • gleepwurp
        Senior Member
        • Mar 2014
        • 119

        #18
        Hi Ingus,

        so far it's smooth sailing... I'll try to re-enable some Zabbix Proxies to Active and see how it goes..

        I'll keep you guys posted, thank you for the insights and explanations!

        Gleepwurp.

        • tchjts1
          Senior Member
          • May 2008
          • 1605

          #19
          Originally posted by gleepwurp
          I have about 17 Zabbix Server instances (one for each client network zone) with about 60 proxies spread out between them...
          Now I'm curious about this. Are you saying you have 17 Zabbix App servers?

          As an experienced Zabbix Admin, I can only imagine applying for your job and at the interview when they tell me I would have 17 Zabbix App servers to manage, I would simply stand up while chuckling and tell them they don't have enough money to pay me to manage that... as I was heading out the door laughing.

          1 setup can truly piss me off sometimes. 17 would drive me insane.
          On the other hand, I think 60 proxies would be manageable. Maybe I am already insane.

          • kloczek
            Senior Member
            • Jun 2006
            • 1771

            #20
            Originally posted by tchjts1
            Code:
            StartPollers=80
            StartPollersUnreachable=40
            StartTrappers=100
            StartPingers=20
            StartDiscoverers=10
            CacheSize=512M            <---- Increment to 1G
            CacheUpdateFrequency=300
            StartDBSyncers=32         <---- As Ingus mentioned, put this back to 4
            HistoryCacheSize=256M
            TrendCacheSize=128M
            Timeout=20                   <---- I would put this to 30
            ProxyConfigFrequency=300
            StartVMwareCollectors=20  <---- Increment to maybe 40
            VMwareFrequency=300
            VMwarePerfFrequency=300
            VMwareTimeout=30
            VMwareCacheSize=512M
            ValueCacheSize=512M
            I am happy when my graphs are looking like this
            The size of the configuration cache depends on the total number of items, both monitored and not monitored (the configuration of monitored hosts is kept in the config cache as well, because it is needed for escalations and calculated items).
            I have CacheSize=128M and it is enough to monitor about 125k items.
            So with CacheSize=1G it should be possible to monitor ~1 million items.
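As a rough sketch of that arithmetic: the estimate is a straight linear extrapolation from the single data point given above (128M for about 125k items). Linear scaling is an assumption here, and the function names are made up for illustration; real usage also depends on hosts, triggers and unmonitored items, so check the cache usage graphs rather than trusting the number.

```python
# Linear extrapolation from one observed data point (128M ~ 125k items).
# Assumption: config cache usage grows proportionally with item count.
REF_ITEMS = 125_000
REF_CACHE_MB = 128


def estimate_items(cache_mb, ref_items=REF_ITEMS, ref_cache_mb=REF_CACHE_MB):
    """Approximate number of items a given CacheSize (in MB) could hold."""
    return cache_mb * ref_items / ref_cache_mb


def estimate_cache_mb(items, ref_items=REF_ITEMS, ref_cache_mb=REF_CACHE_MB):
    """Approximate CacheSize (in MB) needed for a given number of items."""
    return items * ref_cache_mb / ref_items


if __name__ == "__main__":
    print(estimate_items(1024))          # items for CacheSize=1G
    print(estimate_cache_mb(1_000_000))  # MB for one million items
```

With these inputs, 1024 MB works out to 1,000,000 items, which matches the "~1 million" figure.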

            The value of StartDBSyncers depends on a few factors. On the server, it depends on the number of proxies connected to the server and the number of hosts monitored directly by the server (without proxies).
            The same is largely true of all the Start* variables.

            On tuning zabbix and proxy good enough is zabbix server template which comes OOTB.

            Timeout=30 on the server does not make any sense when the server is only monitoring agents working in active mode. Why? Because in this mode the agents connect to the server/proxy, so the timeout needs to be tuned on the agent side. The same goes for active proxies.
            IMO touching Timeout is asking for trouble, because when active monitoring is used it may only hide network-layer issues on the physical paths between agents, servers and proxies. Using passive proxies and agents does not scale at all above a certain size of monitored environment. Instead of tweaking timeouts, it is better to switch to active monitoring.
            Last edited by kloczek; 30-03-2015, 20:04.
            http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
            https://kloczek.wordpress.com/
            zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
            My zabbix templates https://github.com/kloczek/zabbix-templates

            • gleepwurp
              Senior Member
              • Mar 2014
              • 119

              #21
              Originally posted by tchjts1
              Now I'm curious about this. Are you saying you have 17 Zabbix App servers?

              <snip>
              1 setup can truly piss me off sometimes. 17 would drive me insane.
              On the other hand, I think 60 proxies would be manageable. Maybe I am already insane.
              I'm well on my way to getting insane!

              I'm managing the 17 Zabbix App servers, 67 proxies and that's just for starters, as I'm developing Zabbix Auditing and Runbook generation + integrating data collection of those 17 Servers in a centralized data warehouse.

              I do have _some_ help, but it's a lot of work!

              We had envisioned having the "Multi-node" setup, before we found out that it was a one-way street, and that it was being dumped in 2.4. The real reason for having these 17 Zabbix Servers is to spread the risk of having the Zabbix Server down affecting the entire monitoring operation, and the sheer size of the database required to run said Zabbix server.

              G.

              • tchjts1
                Senior Member
                • May 2008
                • 1605

                #22
                I started out using 1.4

                I have always used proxies. Never multi-node. But I do recall seeing a fair number of issues with node based distributed setups. Hence why they did away with it. I like and hate the proxies at the same time. They function well and consolidate the TCP connects going over the WAN... but...

                If you are using active checks for hosts that go to a proxy, there is basically no way to point those hosts to a different proxy if their original proxy goes down. Yeah, you can select a different proxy from the dropdown list for them to report to, but the active checks won't work unless you change the zabbix_agentd.conf file on each host.

                I guess the way I would deal with that is to have a duplicate standby template with all items set to passive, unlink the active template and attach the passive template. There would be a few items that wouldn't work where they require an item to be active agent, such as logs... but oh well.

                Another thing to keep in mind, and I learned this from experience, is to not set the proxy offline buffer too high. We lost the network for our Zabbix App and DB server for almost a full day. The proxies went on about their business happily collecting data and holding it in the buffer. I believe I had offline buffer set to 12 or 24 hours. What I quickly learned was this...

                It took Zabbix App server 1 hour to process 2 hours worth of proxy data. Meaning it would take 6 hours to process 12 hours of proxy data. And then it would have to play catch up for the last 6 hours it just spent processing proxy data to get to "now". I ended up just killing the proxy process and blew away the DB and recreated it. It gave us a large gap in missing data.
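A minimal sketch of that catch-up arithmetic, assuming the proxy keeps collecting at the normal rate while the server drains the buffer, and that the 2:1 processing ratio stays constant (both are assumptions; the function names are hypothetical). The first pass over a 12-hour backlog takes 6 hours, but 6 more hours of data pile up in the meantime, then 3, then 1.5, and so on.

```python
def first_pass_hours(backlog_hours, speedup=2.0):
    """Hours needed to process the backlog that exists right now."""
    return backlog_hours / speedup


def catch_up_hours(backlog_hours, speedup=2.0):
    """Hours until the server is back at "now".

    speedup is hours of proxy data processed per wall-clock hour.
    New data arrives at 1 hour per hour, so the backlog only shrinks
    at (speedup - 1) hours per hour.
    """
    if speedup <= 1.0:
        raise ValueError("the server never catches up unless speedup > 1")
    return backlog_hours / (speedup - 1.0)


if __name__ == "__main__":
    for backlog in (12, 24, 1):
        print(backlog, first_pass_hours(backlog), catch_up_hours(backlog))
```

At a 2:1 ratio the total catch-up time equals the outage length, so a 12-hour buffer means roughly 12 hours of delayed data after the network comes back, while a 1-hour buffer caps that at about an hour.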

                Sorry I am rambling, but since you have so many proxies, these may be good points to ponder. I now set my proxy offline buffer to only 1 hour.

                • tchjts1
                  Senior Member
                  • May 2008
                  • 1605

                  #23
                  Originally posted by kloczek
                  On tuning zabbix and proxy good enough is zabbix server template which comes OOTB.
                  I am not quite sure what you are trying to communicate on this. I don't recall where templates came into question in this thread. But the Zabbix server template, as long as it includes the cache items and Zabbix internal items, then yes, it is good enough to monitor Zabbix performance. I think that data should be reviewed on a regular basis to be proactive rather than reactive.

                  Unfortunately in my current 2.0.9 version, those internals only report data on the Zabbix App server and will not give anything for proxies. I believe that has been changed in the newer releases to include proxies.

                  • gleepwurp
                    Senior Member
                    • Mar 2014
                    • 119

                    #24
                    Originally posted by tchjts1

                    Sorry I am rambling, but since you have so many proxies, these may be good points to ponder. I now set my proxy offline buffer to only 1 hour.
                    Oh, I'll take all the rambling and advice I can get, no worries!

                    We deployed (and are still deploying) the Zabbix Agents with both active and passive settings enabled, which is really a no-brainer I think.

                    What we did to solve that "fail-over/load-balancing" is that each site has at least 2 Zabbix proxies, and the agents get configured with the IP addresses of both proxies. So sure, I'll get an entry in the Zabbix agent log file saying that it can't get its item list from one of the servers, but it means that I can load-balance the Zabbix Agents on both of the Zabbix Proxies, or switch them all to the working Zabbix proxy if one of the proxies in the pair goes down. And no need to reconfigure any agents in the process...


                    I just got a hit tonight on those Trapper sockets freezing up, and as usual, as soon as I started looking around, they all came back to normal...

                    I had time to run a "ss -s" command (socket statistics):

                    Code:
                    Total: 549 (kernel 641)
                    TCP:   2889 (estab 297, closed 2461, orphaned 3, synrecv 0, timewait 2461/0), ports 3131
                    
                    Transport Total     IP        IPv6
                    *	  641       -         -        
                    RAW	  4         4         0        
                    UDP	  33        29        4        
                    TCP	  428       421       7        
                    INET	  465       454       11       
                    FRAG	  0         0         0
                    Sockets don't seem too busy, but it might be because it was while it was starting to work again...
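For comparing snapshots over time, the TCP line of that summary can be pulled apart with a few lines of Python. This is only a sketch: the sample line is copied from the output above, the function name is made up, and the exact `ss -s` layout can differ between iproute2 versions, so the regular expression may need adjusting.

```python
import re

# TCP summary line copied from the "ss -s" output above.
SAMPLE = ("TCP:   2889 (estab 297, closed 2461, orphaned 3, "
          "synrecv 0, timewait 2461/0), ports 3131")


def parse_tcp_summary(line):
    """Return the counters from the 'TCP:' line of `ss -s` as a dict."""
    m = re.match(r"TCP:\s+(\d+)\s+\((.*)\)", line.strip())
    if m is None:
        raise ValueError("not a TCP summary line")
    stats = {"total": int(m.group(1))}
    for part in m.group(2).split(","):
        name, value = part.split()
        # "timewait 2461/0": keep only the first number
        stats[name] = int(value.split("/")[0])
    return stats


if __name__ == "__main__":
    stats = parse_tcp_summary(SAMPLE)
    print(stats)
    print("share in timewait:", stats["timewait"] / stats["total"])
```

In this snapshot roughly 85% of the TCP sockets counted are in timewait and only 297 are established, which fits the impression that it was taken just as things were recovering.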

                    I've just set up an action, so that some diagnostics commands are automatically run when my "Zabbix Queue (10m) over 2000 items" trigger happens:

                    Code:
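                    # Snapshot socket, open-file and process state when the trigger fires
                    ss -s > /tmp/ss_summary.problem
                    ss > /tmp/ss_list.problem
                    lsof > /tmp/lsof.problem
                    # [z] keeps grep from matching its own command line
                    ps -ef | grep '[z]abbix_server' > /tmp/ps.problem
                    # my own helper script
                    increase_trapper_verbosity.sh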
                    As always, Insights/Suggestions welcomed!

                    G.

                    • gleepwurp
                      Senior Member
                      • Mar 2014
                      • 119

                      #25
                      Originally posted by tchjts1
                      Unfortunately in my current 2.0.9 version, those internals only report data on the Zabbix App server and will not give anything for proxies. I believe that has been changed in the newer releases to include proxies.
                      It does... However you have to mind how and where you assign those "Zabbix internal" items...

                      The "Zabbix Internal" has to be assigned to a host that uses the Zabbix proxy in question.

                      I originally had Zabbix Agents installed on my Zabbix proxies, and they were reporting directly to my Zabbix Server. And each time I looked at the "Proxy" metrics, they would be eerily similar to the Server's metrics...

                      To get them working right (and now I understand why), that Zabbix Agent has to be reporting to that Zabbix proxy for the Zabbix Internal item to reflect the Proxy's data...

                      Just so you don't get that "What the Heck" moment I got when looking at all my proxies and seeing the exact same metrics everywhere!

                      G.

                      • tchjts1
                        Senior Member
                        • May 2008
                        • 1605

                        #26
                        Originally posted by gleepwurp
                        What we did to solve that "fail-over/load-balancing" is that each site has at least 2 Zabbix proxies, and the agents get configured with the IP addresses of both proxies. So sure, I'll get an entry in the Zabbix agent log file saying that it can't get its item list from one of the servers, but it means that I can load-balance the Zabbix Agents on both of the Zabbix Proxies, or switch them all to the working Zabbix proxy if one of the proxies in the pair goes down. And no need to reconfigure any agents in the process...
                        I wonder if this is not part of the problem you are seeing. Are you populating "ServerActive" with 2 proxy IP's/DNS? The "ServerActive" parameter is not really meant to be used as a failover solution. You say your logs show you can't get a list of active checks from one of the proxies... is that because you have it off-line until you need it in a failover situation?

                        I ran into a similar issue when I did that. What happens is that you get this non-stop flapping going on. The host is switching on and off between the 2 proxies you have defined.

                        What made me aware of this was that I use auto-registration for my hosts. I get an e-mail any time a new host self-registers. When I put 2 entries in "ServerActive", I got a flood of e-mails about these hosts registering in Zabbix because they would first use one proxy, then snap over to the other proxy, then back again... and again... and again. I could watch the assigned proxy in the GUI change back and forth automatically as well. It also became apparent in gaps in the data on the graphs.

                        There was some talk about changing the way "ServerActive" worked in 2.2.x but I don't know if they made any improvements.

                        Richlv, who is a Zabbix Guru (you're welcome Rich) wrote a blog post about implementing ServerActive at this link: http://blog.zabbix.com/multiple-serv...gent-sure/858/ which to me made it sound like you could use it as a proxy failover solution, or perhaps that is simply how I interpreted it. When I opened a ticket to Zabbix support with my proxy flapping issues, they stated it wasn't meant to work as a failover scenario.

                        Maybe I will point out this post to Rich and get his opinion on it. As I said, perhaps they improved the way ServerActive works somewhere in 2.2 or 2.4

                        • richlv
                          Senior Member
                          Zabbix Certified Trainer
                          Zabbix Certified Specialist
                          Zabbix Certified Professional
                          • Oct 2005
                          • 3112

                          #27
                          not a guru, but some quick notes on this topic

                          indeed, serveractive is not meant to be used as a failover solution. agent tries to work with both proxies (or servers - it has no idea what's on the other end) in parallel.
                          switching a host from one proxy to another happens with a delay (on the proxies).
                          this means that both proxies could have host configuration active at some point, and you would get duplicate data.
                          on the other hand, this also means that there could be a case where no proxy has the host config data, and they would reject agent connections - that would result in missing data.
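A toy illustration of the two cases above, not a model of Zabbix internals: it only compares two hypothetical moments, the time the old proxy drops the host's configuration and the time the new proxy picks it up, both in seconds after the host is reassigned. The function name and the timings are assumptions for illustration.

```python
def classify_switch(old_drops_at, new_picks_up_at):
    """Classify a proxy switch-over by comparing two config-sync times.

    Both arguments are seconds after the host was moved to the other proxy.
    Returns (outcome, window_seconds).
    """
    if new_picks_up_at < old_drops_at:
        # Both proxies hold the host config for a while: agent data is
        # accepted twice.
        return ("duplicate data", old_drops_at - new_picks_up_at)
    if new_picks_up_at > old_drops_at:
        # Neither proxy holds the host config: agent connections are
        # rejected.
        return ("missing data", new_picks_up_at - old_drops_at)
    return ("clean handover", 0)


if __name__ == "__main__":
    print(classify_switch(old_drops_at=300, new_picks_up_at=120))
    print(classify_switch(old_drops_at=60, new_picks_up_at=300))
```

Since the two proxies refresh their configuration independently, which case you hit on any given switch is essentially a matter of timing.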

                          i would probably suggest a cluster solution with a virtual ip if proxy failover is needed - should be more robust
                          Zabbix 3.0 Network Monitoring book

                          • tchjts1
                            Senior Member
                            • May 2008
                            • 1605

                            #28
                            Originally posted by richlv
                            i would probably suggest a cluster solution
                            Thanks for the fast reply Rich. Dang, I knew I should have taken that part of the training!

                            • gleepwurp
                              Senior Member
                              • Mar 2014
                              • 119

                              #29
                              Oddly enough, it seems to be working fine, since only one of the proxies has the actual configuration for the individual agents, but I agree we might get the Agent simultaneously reporting to both of the proxies on changes/failover.

                              I don't see how it might impact the Trappers on the Zabbix Server though...?

                              I'll go and read the link you posted and educate myself!

                              Thanks for the input guys, this is really good information!

                              G.

                              • Andreas Bollhalder
                                Senior Member
                                Zabbix Certified Specialist
                                • Apr 2007
                                • 144

                                #30
                                Originally posted by gleepwurp
                                Just so you don't get that "What the Heck" moment I got when looking at all my proxies and seeing the exact same metrics everywhere!
                                I had the same experience. I recently switched to letting the proxy's agent send its data directly to the server, because I want to know if the proxy daemon has died... Now I would have to revert that again.

                                Andreas
                                Zabbix statistics
                                Total hosts: 380 - Total items: 12190 - Total triggers: 4530 - Required server performance: 224.2
