Zabbix_Server periodically stops accepting Active connections

  • gleepwurp
    Senior Member
    • Mar 2014
    • 119

    #1

    Hi,

    For the past month, I have been having issues with the Zabbix trapper components (Active). They randomly freeze and stop accepting connections for 6 to 12 hours at a time. Investigating the issue "jars" the server and it starts processing again. This problem does not affect passive polling or VMware polling, only the receiving of Active Zabbix Agent and Active Proxy item metrics.


    I have run an strace on one of the Zabbix trappers that seems to be stuck in the "[processing]" state:

    Code:
    [gleepwurp@ServerX ~]$ sudo strace -s 256 -p 13518 -tdt
    Process 13518 attached - interrupt to quit
     [wait(0x137f) = 13518]
    pid 13518 stopped, [SIGSTOP]
     [wait(0x57f) = 13518]
    pid 13518 stopped, [SIGTRAP]
    23:07:20.669408 read(7,
    The trapper seems to be stuck just after that "read(7," line. Changing the log level through the "-R log_level_increase" command seems to unstick the trapper, as it starts processing immediately afterwards.
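
A possible next step is to find out what descriptor 7 actually refers to. The following is a minimal sketch, assuming a Linux host with /proc mounted; `describe_fd` and its `proc_root` parameter are hypothetical names used only for illustration, and the PID and descriptor are the ones from the strace output above.

```python
import os


def describe_fd(pid, fd, proc_root="/proc"):
    """Return what a file descriptor of a process points to, or None."""
    link = os.path.join(proc_root, str(pid), "fd", str(fd))
    try:
        # On Linux each entry under /proc/<pid>/fd is a symlink to its target,
        # e.g. "socket:[123456]" for a socket or a path for a regular file.
        return os.readlink(link)
    except OSError:
        # Process gone, descriptor closed, or insufficient permissions.
        return None


if __name__ == "__main__":
    # PID 13518 and descriptor 7 come from the strace session above.
    print(describe_fd(13518, 7))
```

This needs to run as root or as the user that owns the process. If the result is a socket, it still does not say which peer the trapper is waiting on; that would need a further lookup of the socket inode.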

    The Active agents connecting to the Zabbix Server have these error messages:

    Code:
    15368:20150316:221136.147 active check data upload to [www.xxx.yyy.zzz:10051] is working again
     15368:20150316:221139.147 active check data upload to [www.xxx.yyy.zzz:10051] started to fail ([connect] cannot connect to [[www.xxx.yyy.zzz]:10051]: [4] Interrupted system call)
     15368:20150316:221206.148 active check data upload to [www.xxx.yyy.zzz:10051] is working again
     15368:20150316:221218.149 active check data upload to [www.xxx.yyy.zzz:10051] started to fail ([connect] cannot connect to [[www.xxx.yyy.zzz]:10051]: [4] Interrupted system call)
     15368:20150316:221239.150 active check data upload to [www.xxx.yyy.zzz:10051] is working again
     15368:20150316:221242.215 active check data upload to [www.xxx.yyy.zzz:10051] started to fail ([connect] cannot connect to [[www.xxx.yyy.zzz]:10051]: [4] Interrupted system call)
     15368:20150316:221257.292 active check data upload to [www.xxx.yyy.zzz:10051] is working again
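
To quantify how long each interruption lasts, the agent log can be paired up into fail/recover intervals. This is a rough sketch, assuming the log line layout shown above (PID, date, time with milliseconds, then the message); `outage_durations` is a hypothetical helper, not part of Zabbix.

```python
import re
from datetime import datetime

# Agent log lines look like "PID:YYYYMMDD:HHMMSS.mmm message".
LINE_RE = re.compile(r"^\s*\d+:(\d{8}):(\d{6}\.\d{3})\s+(.*)$")


def outage_durations(lines):
    """Return the length in seconds of each fail -> recover interval."""
    durations = []
    failed_at = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not follow the expected layout
        date, time, message = match.groups()
        stamp = datetime.strptime(date + time, "%Y%m%d%H%M%S.%f")
        if "started to fail" in message:
            failed_at = stamp
        elif "is working again" in message and failed_at is not None:
            durations.append((stamp - failed_at).total_seconds())
            failed_at = None
    return durations
```

Feeding it the lines of the agent log file gives one duration per outage, which makes it easier to tell short flapping apart from the multi-hour freezes.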
    The Zabbix server is at 682 NVPS for 4130 hosts (332057 items), mostly VMWare monitoring.

    The problem affects Active items only; Passive and VMWare monitoring are not affected.

    The Zabbix DB is on a separate server, and both the Zabbix DB and the Zabbix Server are pretty much idle (LoadAVG ~ 0.5, even during those "issues").

    Server has 8GB RAM, Database has 32GB RAM, no IOWaits on either.

    Here are the Zabbix_server config values:

    Code:
    DebugLevel=3
    StartPollers=80
    StartPollersUnreachable=40
    StartTrappers=100
    StartPingers=20
    StartDiscoverers=10
    CacheSize=512M
    CacheUpdateFrequency=300
    StartDBSyncers=32
    HistoryCacheSize=256M
    TrendCacheSize=128M
    Timeout=20
    ProxyConfigFrequency=300
    StartVMwareCollectors=20
    VMwareFrequency=300
    VMwarePerfFrequency=300
    VMwareTimeout=30
    VMwareCacheSize=512M
    ValueCacheSize=512M

    Has anyone ever encountered this? Help?

    I was able to temporarily minimize the issue by changing all my 6 Zabbix Proxies to Passive in the meantime, but I'd really like to fix that "Active" issue.

    Thanks!

    Gleepwurp.
  • ingus.vilnis
    Senior Member
    Zabbix Certified Trainer
    Zabbix Certified Specialist
    Zabbix Certified Professional
    • Mar 2014
    • 908

    #2
    Hi,

    Not much of a help but two things that I can add here.

    1. Is it possible that you are having network or firewall issues at those times, so that connections from agents simply can't get through? This can happen with some smart firewalls that block specific ports once traffic reaches a certain volume.

    2. Unrelated to your Active checks issue: StartDBSyncers=32 is far too many for your 682 NVPS. Each DB syncer can process roughly 1000 NVPS, so you should be safe with the default of 4.
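
The sizing argument in point 2 is simple arithmetic. As a small sketch, taking the figure of about 1000 NVPS per DB syncer quoted above as a rule of thumb rather than a guarantee (`syncers_needed` is a hypothetical helper):

```python
import math


def syncers_needed(nvps, per_syncer_nvps=1000):
    """Estimate how many DB syncers a given NVPS load needs (at least one)."""
    return max(1, math.ceil(nvps / per_syncer_nvps))


# 682 NVPS is the load reported for this server.
print(syncers_needed(682))
```

By this estimate a single syncer already covers 682 NVPS, so the default of 4 leaves ample headroom and 32 is far more than the load calls for.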

    Best Regards,
    Ingus


    • gleepwurp
      Senior Member
      • Mar 2014
      • 119

      #3
      Thanks for your reply Ingus!

      I too thought network contention might be the problem, but what made me doubt it is that the Zabbix Agent (ACTIVE) running locally on the Zabbix Server itself has the same issue when trying to connect locally to "itself" using Active...

      I'll try replacing the ServerActive IP from the server's IP to 127.0.0.1 to see if that makes a difference next time... That will at least be a good indication if the IP Stack/Port is the issue, or the Zabbix Trappers.
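
A quick way to compare the two addresses while the problem is occurring is a plain TCP connect test against the trapper port. This is only a sketch: `can_connect` is a hypothetical helper, 10051 is the port from the agent log above, and a successful TCP connect only shows that the listening socket accepted the connection, not that a trapper went on to process any data.

```python
import socket


def can_connect(host, port=10051, timeout=5.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling `can_connect("127.0.0.1")` and `can_connect` with the server's own IP during a freeze would show whether both paths fail in the same way.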

      Thanks for the DBSyncer advice, I'll drop it down to your recommended value...

      Thanks!

      Gleepwurp.


      • ingus.vilnis
        Senior Member
        Zabbix Certified Trainer
        Zabbix Certified Specialist
        Zabbix Certified Professional
        • Mar 2014
        • 908

        #4
        Hi Gleepwurp,

        Yes, strange that the checks fail also on server. Try the 127.0.0.1 for sure.

        Also have a look at "Zabbix data gathering process busy" graph with period when you had these issues. Maybe your trappers are overloaded and you need to add more than 100 in server.conf?

        Best Regards,
        Ingus


        • ingus.vilnis
          Senior Member
          Zabbix Certified Trainer
          Zabbix Certified Specialist
          Zabbix Certified Professional
          • Mar 2014
          • 908

          #5
          And check all the other graphs and parameters for high spikes as well.


          • gleepwurp
            Senior Member
            • Mar 2014
            • 119

            #6
            Hello Ingus,

            I don't have any spikes to speak of, and the CPU is idle...

            I'm posting the graphs for the Zabbix Server stats below, when I had this issue:

            Well, forget posting the Perf/Process graphs; it seems I have a 100k quota for total picture attachments... I'll just post the Min/Max/Avg for Poller and Processes during the period when there were about 80k items in the queue (the period is about 1 day, 20 hours).

            The High MAX values are usually at the end, when Zabbix seems to wake up and process all the back log (lasts about 2-3 minutes at most).

            Gleepwurp.


            • ingus.vilnis
              Senior Member
              Zabbix Certified Trainer
              Zabbix Certified Specialist
              Zabbix Certified Professional
              • Mar 2014
              • 908

              #7
              Hi,

              Hard to tell much from these figures. AVG 1.1% trappers is not optimal. Alerter and history syncer could also be better. CPU is not so important here. I can hardly remember a case when there was any significant CPU load at all. But that's all.

              Is there still a way to see complete graphs and get the overall picture?

              Best Regards,
              Ingus


              • gleepwurp
                Senior Member
                • Mar 2014
                • 119

                #8
                Hi Ingus,

                I have historical data/graphs from the last time it happened (less than 7 days ago)... however, I can only have 100k of attachment/graphics total in all my posts throughout this site, so each time I try to post a picture, I have to remove one from my earlier post... And most of my graphs are more than 100k, so they can't be posted here...

                Do you have a 3rd party image-hosting site to suggest?

                Thx,

                Gleepwurp.


                • ingus.vilnis
                  Senior Member
                  Zabbix Certified Trainer
                  Zabbix Certified Specialist
                  Zabbix Certified Professional
                  • Mar 2014
                  • 908

                  #9
                  Hi,

                  I have never used such hosting before, so I cannot recommend a good one; you will have to search for one yourself. Or maybe share a Dropbox link.

                  Best Regards,
                  Ingus


                  • gleepwurp
                    Senior Member
                    • Mar 2014
                    • 119

                    #10
                    Ok,

                    found a place to post the graphs...

                    Here it is: http://postimg.org/gallery/3if7qwxs/cbf1b4ca/

                    Let me know if you need more graphs...

                    Thanks for your help!

                    Gleepwurp.


                    • tchjts1
                      Senior Member
                      • May 2008
                      • 1605

                      #11
                      My 2 cents here:

                      Even though your NVPS is not astronomical, you are working with many hosts (~4,000), collecting mostly VMWare data. Also, looking at your graphs (if they were for my setup), I would want to adjust the cache settings so they are a bit more efficient. Of course, this all depends on whether you have available resources to allocate for these.

                      Anyway, here are your settings as you show above, and my comments on which ones I would increment. These are simply suggestions. Any changes require a restart of Zabbix server process.

                      I am unsure about this one as I am going from memory at the moment, but it may also help to add the parameter UnreachablePeriod=120 to your Zabbix server.conf file. I am not sure what it is by default. Maybe 60.

                      Code:
                      DebugLevel=3
                      StartPollers=80
                      StartPollersUnreachable=40
                      StartTrappers=100
                      StartPingers=20
                      StartDiscoverers=10
                      CacheSize=512M            <---- Increment to 1G
                      CacheUpdateFrequency=300
                      StartDBSyncers=32         <---- As Ingus mentioned, put this back to 4
                      HistoryCacheSize=256M
                      TrendCacheSize=128M
                      Timeout=20                <---- I would put this to 30
                      ProxyConfigFrequency=300
                      StartVMwareCollectors=20  <---- Increment to maybe 40
                      VMwareFrequency=300
                      VMwarePerfFrequency=300
                      VMwareTimeout=30
                      VMwareCacheSize=512M
                      ValueCacheSize=512M
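
Since the cache suggestions depend on available memory, it may help to add up what the configured caches come to before and after the change. This is a rough sketch only: `size_to_mib` and `total_cache_mib` are hypothetical helpers, the K/M/G suffix handling is an assumption, and the total covers just the five cache parameters listed above, not the memory used by the server processes themselves.

```python
CACHE_KEYS = (
    "CacheSize",
    "HistoryCacheSize",
    "TrendCacheSize",
    "VMwareCacheSize",
    "ValueCacheSize",
)


def size_to_mib(value):
    """Convert a size such as '512M' or '1G' to MiB (suffix handling assumed)."""
    value = value.strip()
    factor = {"K": 1 / 1024, "M": 1, "G": 1024}.get(value[-1].upper())
    if factor is None:
        return int(value) / (1024 * 1024)  # no suffix: treat as bytes
    return int(value[:-1]) * factor


def total_cache_mib(settings):
    """Sum the configured cache sizes, ignoring keys that are not set."""
    return sum(size_to_mib(settings[key]) for key in CACHE_KEYS if key in settings)


current = {
    "CacheSize": "512M",
    "HistoryCacheSize": "256M",
    "TrendCacheSize": "128M",
    "VMwareCacheSize": "512M",
    "ValueCacheSize": "512M",
}
proposed = dict(current, CacheSize="1G")
print(total_cache_mib(current), total_cache_mib(proposed))
```

With the values in this thread, the caches add up to 1920 MiB now and 2432 MiB with CacheSize at 1G, to be weighed against the 8GB of RAM on the server.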
                      I am happy when my graphs look like this (attached image not preserved).
                      Last edited by tchjts1; 26-03-2015, 11:06.


                      • gleepwurp
                        Senior Member
                        • Mar 2014
                        • 119

                        #12
                        Hi,

                        Thanks to both of you (tchjts1 and Ingus) for the suggestions, I will give them a try!

                        A Linux-knowledgeable colleague of mine suggested that maybe I'm running out of sockets... The current ulimit for the Zabbix user (open files) is set at 1024... Have either of you ever needed to increase this limit for Zabbix?
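
The limit that matters is the one the running server processes actually have, which may differ from what `ulimit` shows in a login shell. On Linux it can be read from /proc/<pid>/limits. The sketch below parses that file's "Max open files" row; `max_open_files` is a hypothetical helper and the layout of the file is assumed to be the usual Linux one.

```python
import re


def max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' limits from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        match = re.match(r"Max open files\s+(\S+)\s+(\S+)", line)
        if match:
            # A limit of "unlimited" is reported as None.
            return tuple(
                None if value == "unlimited" else int(value)
                for value in match.groups()
            )
    return None
```

Reading /proc/<pid>/limits for a trapper PID and passing its contents to this function gives the effective limits. Comparing the soft limit with the number of entries under /proc/<pid>/fd for the same PID would show how close that process is to the limit, keeping in mind that the open-files limit applies per process.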

                        Thanks again for all your insights!

                        Gleepwurp.


                        • tchjts1
                          Senior Member
                          • May 2008
                          • 1605

                          #13
                          I will need to check my setup when I get into work and see exactly what I have set.

                          In the meantime, what version of Zabbix are you running on Zabbix Server?
                          Do your proxies match this same version?


                          • gleepwurp
                            Senior Member
                            • Mar 2014
                            • 119

                            #14
                            I've made the suggested changes and I'll follow up with new stat graphs tomorrow to see what impact they have...

                            I'm running 2.4.4 on both the Server and the Proxies (I upgrade them every month)...

                            G.


                            • tchjts1
                              Senior Member
                              • May 2008
                              • 1605

                              #15
                              Originally posted by gleepwurp
                              (I upgrade them every month)...

                              G.
                              You like living on the edge, eh?

                              Just my personal rule of thumb with upgrading Zabbix - I never implement a new major version of Zabbix until it is at least on the x.x.5 release. That gives them time to work out the majority of issues.

                              And I don't upgrade unless there is a need for it (new features/security/functionality that I want). I am still on version 2.0.9 lol.
