Sparse data & big queue with active agents only

  • myth0s
    Junior Member
    • Jul 2014
    • 8

    #1

    Hi,

    I'm having trouble with my active-only agent configuration. All of my agents are behind a firewall, and the only real way to monitor them is through an "active agent" configuration.

    However, I am seeing many agents flap on and off the "agent unreachable for more than 5 minutes" alert. I don't understand why, because they are communicating with the Zabbix Server on a regular basis. It just seems that they are not sending all the data?

    I've started one of my agents with DebugLevel=4 and it seems very busy with the CPU checks (see zipped log in the attachment). One thing that puzzles me very much is that for all of my hosts, the active item "ping" is set to refresh every 60 seconds, but sometimes it goes 9 minutes without updating. See the following grep:
    Code:
    /var/log/zabbix# tail -f zabbix_agentd.log | grep --line-buffered ping
                            "key":"agent.ping",
     36763:20140725:154939.031 In add_check() key:'agent.ping' refresh:60 lastlogsize:0 mtime:0
     36763:20140725:154939.033 for key [agent.ping] received value [1]
     36763:20140725:154939.033 In process_value() key:'qatvas-solr01:agent.ping' value:'1'
                            "key":"agent.ping",
                            "key":"agent.ping",
     36763:20140725:155803.531 In add_check() key:'agent.ping' refresh:60 lastlogsize:0 mtime:0
     36763:20140725:155803.537 for key [agent.ping] received value [1]
     36763:20140725:155803.538 In process_value() key:'qatvas-solr01:agent.ping' value:'1'
                            "key":"agent.ping",
                            "key":"agent.ping",
     36763:20140725:160625.294 In add_check() key:'agent.ping' refresh:60 lastlogsize:0 mtime:0
     36763:20140725:160625.296 for key [agent.ping] received value [1]
     36763:20140725:160625.296 In process_value() key:'qatvas-solr01:agent.ping' value:'1'
                            "key":"agent.ping",
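To put a number on the gaps in the grep output above, here is a minimal Python sketch. It assumes the log lines keep the "PID:YYYYMMDD:HHMMSS.mmm" prefix shown in the excerpt; the function name `ping_gaps` and the sample list are made up for illustration and are not part of Zabbix.

```python
import re
from datetime import datetime, timedelta

# Matches the "PID:YYYYMMDD:HHMMSS.mmm" prefix seen in the log excerpt.
STAMP = re.compile(r"^\s*\d+:(\d{8}):(\d{6})\.(\d{3})\s")


def ping_gaps(lines, marker="for key [agent.ping] received value"):
    """Return the gaps, in seconds, between consecutive agent.ping values."""
    times = []
    for line in lines:
        if marker not in line:
            continue  # skip add_check(), process_value() and JSON lines
        match = STAMP.match(line)
        if match is None:
            continue  # line does not carry the expected timestamp prefix
        date, clock, millis = match.groups()
        stamp = datetime.strptime(date + clock, "%Y%m%d%H%M%S")
        times.append(stamp + timedelta(milliseconds=int(millis)))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


# Three lines copied from the excerpt above.
SAMPLE = [
    " 36763:20140725:154939.033 for key [agent.ping] received value [1]",
    " 36763:20140725:155803.537 for key [agent.ping] received value [1]",
    " 36763:20140725:160625.296 for key [agent.ping] received value [1]",
]

if __name__ == "__main__":
    for gap in ping_gaps(SAMPLE):
        print(f"{gap:.1f} s between pings (configured refresh: 60 s)")
```

On the three sample lines this reports gaps of a bit over 8 minutes each, against a configured refresh of 60 seconds. To run it on a whole log, pass an open file object instead of `SAMPLE`.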
    Also, see the oddly sparse data I am getting for the CPU (screenshot below). Looking at the agent's log file, it seems super busy checking the CPU, taking measurements every second, but this is not what I see when looking at the graph.



    It is noteworthy that the agent's host is doing OK (no CPU strain, memory good, network to Zabbix Server fine, etc). Same for the server: it's a pretty powerful machine with CPUs at almost 99% idle.

    Finally, the Zabbix Queue is pretty big. Some hosts are lagging behind by more than 10 minutes. I cannot explain why some hosts are OK (under 30 seconds) while the rest lag. They all use the same template, the same route between agent and server, etc.

    The Monitoring --> Overview tab is mostly green (only a few agents show as unreachable, when the server has not received the ping for a long time).

    My Zabbix server version is 2.2.2.

    I have no idea how to investigate this. Please help.

    Thank you
    - Alexandre
    Last edited by myth0s; 28-07-2014, 15:48.
  • tchjts1
    Senior Member
    • May 2008
    • 1605

    #2
    On the Zabbix server, in zabbix_server.conf, try increasing your Timeout= value to 30 and restart your Zabbix server process. I would do the same for one of your hosts in zabbix_agentd.conf and restart the agent. See if you start getting solid data for that host.

    Next, I would look at this post, specifically the last paragraph, which describes how to check your Zabbix internal processes. Maybe something is overloaded. If you post graphs similar to mine in that post, please post a 24-hour view.

    • myth0s
      Junior Member
      • Jul 2014
      • 8

      #3
      I deleted all my hosts so they would reconnect with only the "Linux active" template. They had first connected to Zabbix before I created the "Linux active" template, so I had to unlink them from the first template and relink them to the active one.

      The server graph had no data (my Zabbix server in the dropdown was red), so I rebooted the server.

      Finally, even though the server CPU and memory were fine, Zabbix reported as much as 24% iowait. So I decided to switch to a machine with more IOPS.

      All in all, the queue is still pretty high (some hosts at 1+ minutes, but only one at 10+ minutes), but it is better and data is coming in.

      Also, since the hosts are all reporting within 5 minutes, the host unreachable alert is gone for all but one host.

      Thank you for the assistance

      • tchjts1
        Senior Member
        • May 2008
        • 1605

        #4
        Originally posted by myth0s

        Finally, even though the server CPU and memory was fine, Zabbix reported as much as 24% iowait. So I decided to switch to a machine with more IOPS.
        Regarding your high iowait, it may be worth reading this post: https://www.zabbix.com/forum/showthread.php?t=38575

        I know you switched machines, but that may still be worth noting.

        • myth0s
          Junior Member
          • Jul 2014
          • 8

          #5
          Let me do a follow-up in case anyone ends up here with Google's help.

          I think we managed to solve the issue. To do so, we optimized and changed several things:
          • We changed from MySQL MyISAM to MySQL InnoDB
          • We changed the MySQL storage mount to have the noatime flag
          • We disabled Agent ping, Host name and Version of zabbix_agent in the Template App Zabbix Agent Active (hoping that it would solve the "host unreachable" issue)
          • We increased the number of Unreachable pollers from 1 to 15 (the Unreachable poller showed up as 100% busy on the graph)
          • We also reviewed the discovery rules of our "Active" template. It turns out our default template was cloned from Template OS Linux. The items' type had been changed from "Zabbix agent" to "Zabbix agent (active)", but the discovery rules and item prototypes had not. (This was probably the actual cause of all our "host unreachable" alerts.)
          • Finally, we deleted all the hosts from Zabbix to "refresh" everything. Upon startup, the "Queue of items to be updated" went from the previous 600+ items to a small 50-100ish, and they were processed quite promptly (we believe the earlier queue was erroneous and caused by ZBX-8488)
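For anyone wanting to check a cloned template for the same problem (discovery rules or item prototypes still set to the passive "Zabbix agent" type), here is a rough Python sketch built around the Zabbix JSON-RPC API. It only builds the request body and filters a response; it makes no network call. The helper names, the template ID and the auth token are placeholders invented for this example, and the type codes 0 and 7 are the API's values for "Zabbix agent" and "Zabbix agent (active)". Treat it as a starting point under those assumptions, not as what we actually ran.

```python
import json

PASSIVE_AGENT = 0  # API item type for "Zabbix agent"
ACTIVE_AGENT = 7   # API item type for "Zabbix agent (active)"


def build_request(method, template_id, auth_token, request_id=1):
    """Build a JSON-RPC body for discoveryrule.get or itemprototype.get.

    template_id and auth_token are placeholders; the token would come from
    a prior user.login call, which is not shown here.
    """
    return json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": {
            "output": "extend",
            "templateids": [template_id],
        },
        "auth": auth_token,
        "id": request_id,
    })


def still_passive(items):
    """Return the entries whose type is still the passive "Zabbix agent"."""
    return [item for item in items if int(item["type"]) == PASSIVE_AGENT]


if __name__ == "__main__":
    # Hypothetical response fragment, shaped like the API's "result" list.
    result = [
        {"itemid": "101", "key_": "vfs.fs.discovery", "type": "0"},
        {"itemid": "102", "key_": "net.if.discovery", "type": "7"},
    ]
    print(build_request("discoveryrule.get", "10001", "PLACEHOLDER_TOKEN"))
    for item in still_passive(result):
        print("still passive:", item["key_"])
```

Sending the same body with the method set to itemprototype.get covers the item prototypes as well. Anything the filter returns would still need its type switched to "Zabbix agent (active)" in the template.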


          The Administration --> Queue overview table now shows most of our items under the 5-minute delay.
