Agents killed

  • ortz
    Junior Member
    • Jun 2012
    • 14

    #1

    Agents killed

    Hi,

    In the last couple of days I'm having big issues with Zabbix.
    Every 1-2 hours all of my agents are killed and all servers become unavailable for this reason.
    In zabbix_server.log I see errors from time to time like the following:
    Zabbix agent item [system.cpu.util[,user,avg1]] on host [SERVER01] failed: another network error, wait for 15 seconds
    This error I used to see even before the problem.

    Since the problem started I see also:
    temporarily disabling Zabbix agent checks on host [SERVER01]: host unavailable
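
To see which hosts are affected and how often, the two message types quoted above can be tallied per host. This is a minimal sketch: the patterns are copied from the two log lines in this post, and the function name `count_agent_failures` is made up for illustration. Real log lines carry a prefix before the message, which `search` simply skips.

```python
import re
from collections import Counter

# Message shapes taken from the two zabbix_server.log lines quoted in the post.
NETWORK_ERROR = re.compile(
    r"on host \[(?P<host>[^\]]+)\] failed: another network error")
HOST_DISABLED = re.compile(
    r"temporarily disabling Zabbix agent checks on host \[(?P<host>[^\]]+)\]")


def count_agent_failures(lines):
    """Return (network_errors, disabled) Counters keyed by host name."""
    network_errors = Counter()
    disabled = Counter()
    for line in lines:
        match = NETWORK_ERROR.search(line)
        if match:
            network_errors[match.group("host")] += 1
            continue
        match = HOST_DISABLED.search(line)
        if match:
            disabled[match.group("host")] += 1
    return network_errors, disabled


# The two sample lines from the post; in practice, pass an open log file instead.
sample = [
    "Zabbix agent item [system.cpu.util[,user,avg1]] on host [SERVER01] "
    "failed: another network error, wait for 15 seconds",
    "temporarily disabling Zabbix agent checks on host [SERVER01]: host unavailable",
]
errors, disabled = count_agent_failures(sample)
print(errors, disabled)
```

If nearly every host shows up with similar counts at the same timestamps, that points at the server side rather than at individual agents.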

    I have no idea why it happens.
    I have 166 hosts monitored.
    8054 active items.
    2496 triggers.
    Required server performance is 361.

    I'm using a Zabbix server with 2 vCPUs and 17GB of RAM that runs zabbix_server, mysql_server and the frontend.
    I also have a Zabbix proxy with about 70 servers behind it.

    Note that the servers are running in Amazon Cloud.

    When this problem occurs, all servers appear unavailable and I must start the agents again for the problem to resolve itself.

    I've tried various methods to fix the problem, such as removing unnecessary items/triggers/hosts. I also recovered the server from a backup image because I thought the server might be running on bad hardware.

    Has anyone encountered such an issue?
    Anything I can do?
  • tchjts1
    Senior Member
    • May 2008
    • 1605

    #2
    What version of Zabbix are you running? What OS are you installed on?
    What versions of Apache? PHP? MySql?

    If you are using the Zabbix 2.x release, can you give us screenshots of the graphs that are attached to your Zabbix server? There are 2 graphs for internal items that show Zabbix processes busy %. Make the time period for the graphs like 14 days. Go to Monitoring --> Graphs -->Zabbix server. The 2 graphs I am talking about have around 10 or 12 items each.

    Regarding the "failed: another network error", barring any actual network problems causing the issue, you can go into zabbix_server.conf and increase the value for Timeout=. By default that is at 3 seconds. I have changed mine to 10 and rarely see those errors anymore. You'll have to restart Zabbix server after you make that change.

    • ortz
      Junior Member
      • Jun 2012
      • 14

      #3
      Thank you for the quick reply.
      Zabbix version is 2.0.0
      OS is RHEL 6.2
      Apache (httpd) is 2.2.15
      PHP is 5.3.3
      MySQL is 5.1

      About the graphs you asked for: I don't see them in my system (maybe because I upgraded from 1.8 to 2.0 instead of doing a fresh install?).

      I increased the Timeout to 10 seconds; I hope it will help a bit.

      Anything else?

      Thanks again!

      • tchjts1
        Senior Member
        • May 2008
        • 1605

        #4
        I can attach that template here when I get into work, or you can get the raw XML from this link and do an import, then attach it to your Zabbix server.



        It provides very valuable information of what is happening with your Zabbix processes.

        In the meantime, is your zabbix_server.log giving you any indication of what is happening?
        When you say the agents are "killed", are the services still running when you experience the issue?
        When you upgraded the Zabbix server, did you also upgrade your agents?
        Last edited by tchjts1; 02-05-2013, 16:10.

        • ortz
          Junior Member
          • Jun 2012
          • 14

          #5
          Hi, I've added the template and the graphs are attached.
          Please note that I only have 5-10 minutes of data at the moment.

          Hope it helps also.

          Once again, thank you!




          • tchjts1
            Senior Member
            • May 2008
            • 1605

            #6
            It is good that you got the template attached and the graphs going. That short period of time isn't going to tell the real story, but it will help with troubleshooting going forward.

            • tchjts1
              Senior Member
              • May 2008
              • 1605

              #7
              These questions were in a previous reply above:

              In the meantime, is your zabbix_server.log giving you any indication of what is happening?
              When you say the agents are "killed", are the services still running when you experience the issue?
              When you upgraded the Zabbix server, did you also upgrade your agents?

              Also, if you are actually on release 2.0.0, I would definitely upgrade to the latest stable release of 2.0.6 on your Zabbix server if you can.

              • ortz
                Junior Member
                • Jun 2012
                • 14

                #8
                Hi,

                zabbix_server.log only says "network error, wait for 15 seconds", and after 15 seconds "host unavailable".
                When the agents are killed, their service is dead as well (agent-side).
                When I upgraded Zabbix I also upgraded the agents, of course (I removed the old installation and installed from scratch).

                I'll have to schedule maintenance for this, but I'll do it as fast as I can.
                In the meantime I created a PHP script that connects to the API every 10 minutes and restarts the agent on any host it finds with an "agent unavailable" problem.
                Of course this is not the way to fix the problem, but it helps for now.
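
For readers who want to try the same stopgap, here is a rough Python sketch of the detection half (the original script was PHP and was not posted). It assumes the JSON-RPC endpoint `api_jsonrpc.php`, the `user.login` and `host.get` methods, and the host `available` property where 2 means unavailable. The URL and the function names are placeholders, and the restart step is left out because the thread does not say how it was done.

```python
import json
import urllib.request

# Hypothetical endpoint; the frontend URL is not given in the thread.
API_URL = "http://zabbix.example.com/zabbix/api_jsonrpc.php"
HOST_UNAVAILABLE = "2"  # host "available": 0 unknown, 1 available, 2 unavailable


def build_request(method, params, auth=None, request_id=1):
    """Build a JSON-RPC 2.0 request body for the Zabbix API."""
    body = {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id}
    if auth is not None:
        body["auth"] = auth  # session token returned by user.login
    return body


def call_api(method, params, auth=None, url=API_URL):
    """POST one request and return the "result" member of the reply."""
    data = json.dumps(build_request(method, params, auth)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json-rpc"})
    with urllib.request.urlopen(request, timeout=10) as response:
        reply = json.loads(response.read().decode("utf-8"))
    if "error" in reply:
        raise RuntimeError(reply["error"])
    return reply["result"]


def unavailable_hosts(hosts):
    """Pick the names of hosts whose agent is marked unavailable."""
    return [h["host"] for h in hosts
            if str(h.get("available")) == HOST_UNAVAILABLE]


def find_unavailable(user, password):
    """Log in, fetch all hosts, and return the names marked unavailable."""
    token = call_api("user.login", {"user": user, "password": password})
    hosts = call_api("host.get", {"output": "extend"}, auth=token)
    return unavailable_hosts(hosts)
```

Run from cron every 10 minutes, `find_unavailable(...)` would return the host names to act on. As noted above, this only masks the symptom.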

                I've attached the graphs for 3 days now instead of couple of hours, hope it helps finding the problem.




                • tchjts1
                  Senior Member
                  • May 2008
                  • 1605

                  #9
                  Looks like there is some tweaking you can do, based on those graphs.
                  One thing I would do to help alleviate the "another network error" message is to go into zabbix_server.conf and change your Timeout= value from the default of 3 to 10. This helped me tremendously when I was seeing a lot of those errors.

                  Another change I would make while in that config file is to increase the value you have for your configuration cache (CacheSize=). While 60% free isn't horrible, I try to keep mine at 80% free or above. (This setting depends on how much memory you have available.)

                  Can you provide a screenshot of your graphs for memory (free/used) and swap space (free/used) for your Zabbix server? A 7 day time period would be best.

                  Although, none of this would explain why your Zabbix agent service is being killed on your hosts. That would have zero to do with your Zabbix server setup or performance.
                  Agents don't just shut down en masse.
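
One quick way to separate "the agent is really down" from "the server just isn't reaching it in time" is to open a TCP connection to the agent port from the Zabbix server while the host is flagged unavailable. A minimal sketch, assuming the agents listen on the default port 10050 (the helper name `agent_port_open` is made up here):

```python
import socket

AGENT_PORT = 10050  # default Zabbix agent listen port


def agent_port_open(host, port=AGENT_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `agent_port_open("SERVER01")` returning True while the frontend shows the host as unavailable suggests the agent process is alive and the problem is on the polling side. A successful connect only proves something is listening on that port, so treat it as a hint rather than proof.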
                  Last edited by tchjts1; 05-05-2013, 22:45.

                  • ortz
                    Junior Member
                    • Jun 2012
                    • 14

                    #10
                    Hi,

                    I already increased the timeout to 10 seconds.
                    As for memory, I'm using almost all of it (100MB free out of 17GB). Note that a lot of this memory is used by MySQL, and about 6GB is cached (images attached below).

                    I can reduce the MySQL buffers in favor of Zabbix server if that will help...

                    I don't think the problem is on the Zabbix agents, because that wouldn't explain why, every time the agents are killed, it happens simultaneously on roughly 80-150 hosts in the very same second...



                    • tchjts1
                      Senior Member
                      • May 2008
                      • 1605

                      #11
                      I'd be a little curious as to what causes your free memory to dive like that over a 10 hour period.

                      Anyway, for comparison purposes, here are my pertinent zabbix_server.conf and my.cnf settings.

                      These are configured for the Zabbix app server and DB on 2 different VMs running Red Hat Linux. The Zabbix app server has 4 vCPUs and 8GB of memory. The Zabbix DB server has 8 vCPUs and 16GB of memory.

                      I would still be curious to see the graphs for your swap space usage.

                      zabbix_server.conf
                      Code:
                      ### Option: StartPollers
                      #       Number of pre-forked instances of pollers.
                      #
                      # Mandatory: no
                      # Range: 0-1000
                      # Default:
                      # StartPollers=5
                      StartPollers=100
                      
                      ### Option: CacheSize
                      #       Size of configuration cache, in bytes.
                      #       Shared memory size for storing host, item and trigger data.
                      #
                      # Mandatory: no
                      # Range: 128K-1G
                      # Default:
                      # CacheSize=8M
                      CacheSize=128M
                      
                      ### Option: HistoryCacheSize
                      #       Size of history cache, in bytes.
                      #       Shared memory size for storing history data.
                      #
                      # Mandatory: no
                      # Range: 128K-1G
                      # Default:
                      # HistoryCacheSize=8M
                      HistoryCacheSize=128M
                      
                      ### Option: HistoryTextCacheSize
                      #       Size of text history cache, in bytes.
                      #       Shared memory size for storing character, text or log history data.
                      #
                      # Mandatory: no
                      # Range: 128K-1G
                      # Default:
                      # HistoryTextCacheSize=16M
                      HistoryTextCacheSize=128M
                      my.cnf
                      Code:
                      ## Added 10/10/12
                      port = 3306
                      skip-external-locking
                      max_allowed_packet = 1M
                      table_open_cache = 512
                      read_buffer_size = 2M
                      read_rnd_buffer_size = 8M
                      myisam_sort_buffer_size = 64M
                      #thread_concurrency = 8
                      
                      # Zabbix parameters
                      innodb_file_per_table
                      max_allowed_packet = 16M
                      innodb_data_home_dir = /data/mysql
                      innodb_data_file_path = ibdata1:10M:autoextend
                      innodb_log_group_home_dir = /data/mysql
                      innodb_buffer_pool_size = 8G
                      innodb_additional_mem_pool_size = 32M
                      innodb_lock_wait_timeout = 120
                      innodb_log_file_size = 120M
                      innodb_thread_concurrency = 8
                      key_buffer_size = 512M
                      max_connections=512
                      table_cache=4096
                      query_cache_size = 128M
                      tmp_table_size = 8M
                      thread_cache_size = 64
                      sort_buffer_size = 16M
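
A side note for anyone copying the my.cnf excerpt: max_allowed_packet is set twice (1M, then 16M). MySQL uses the last occurrence of an option, so 16M is the effective value there. The following is a small sketch for spotting such repeats in an option file; `duplicate_options` is an illustrative helper, not part of any MySQL tooling, and the excerpt is a shortened copy of the settings above.

```python
from collections import defaultdict


def duplicate_options(text):
    """Map each option that is set more than once to its values, in file order."""
    seen = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and [group] headers.
        if not line or line.startswith(("#", ";", "[")):
            continue
        name, _, value = line.partition("=")
        # MySQL treats dashes and underscores in option names alike.
        seen[name.strip().replace("-", "_")].append(value.strip())
    return {name: values for name, values in seen.items() if len(values) > 1}


excerpt = """
max_allowed_packet = 1M
table_open_cache = 512
# Zabbix parameters
innodb_file_per_table
max_allowed_packet = 16M
table_cache=4096
"""
print(duplicate_options(excerpt))
```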

                      • ortz
                        Junior Member
                        • Jun 2012
                        • 14

                        #12
                        Problem solved!

                        Hi,

                        Just a little update: yesterday I changed the StartPollers parameter (it was at the default of 5, and I raised it to 50) and everything works fine now.
                        My required server performance was around 360, which is probably too much work for 5 pollers. After the increase, all queues cleared and the network errors stopped.
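
A back-of-the-envelope calculation shows why 5 pollers could not keep up. It rests on a deliberately pessimistic assumption: that all of the roughly 360 values per second are passive checks fetched by the server's own pollers. In this setup part of that load goes through the proxy, so the real per-poller figure was lower. The function name `per_poller_budget` is just for illustration.

```python
def per_poller_budget(values_per_second, pollers):
    """Return (checks per second per poller, time budget per check in ms)."""
    rate = values_per_second / pollers
    return rate, 1000.0 / rate


for pollers in (5, 50):
    rate, budget_ms = per_poller_budget(360, pollers)
    print(f"{pollers} pollers: {rate:.1f} checks/s each, "
          f"{budget_ms:.1f} ms per check")
# 5 pollers: 72.0 checks/s each, 13.9 ms per check
# 50 pollers: 7.2 checks/s each, 138.9 ms per check
```

A poller handles one check at a time and waits for the agent's reply, so a budget of about 14 ms per check leaves little room for network round trips between cloud instances. Checks then queue up and time out, which would fit hosts being flagged unavailable even when their agents are healthy.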

                        Thank you very much for helping in this issue.
