Zabbix getting 2 values for each (active?) agent item check

  • zillions
    Junior Member
    • Jan 2013
    • 22

    #1

    Hi guys,
    Running Zabbix 2.0.5 in a CentOS clustered environment with a remote clustered MySQL database (corosync and pacemaker for both clusters). We have about the following:

    Number of hosts (monitored/not monitored/templates): 2158 (1999 / 25 / 134)
    Number of items (monitored/disabled/not supported): 354330 (234822 / 16500 / 103008)
    Number of triggers (enabled/disabled)[problem/unknown/ok]: 27151 (21129 / 6022 [142 / 0 / 20987])
    Required server performance, new values per second: 1300.24

    We have run into a fairly annoying issue.
    I've done a bunch of checking as to possible causes, as well as searched online, and I can't seem to find anyone with a similar issue.

    Basically, a large number of our checks seems to be getting double results returned.

    For example a simple free disk space % check:
    2013.Oct.30 11:16:29 99.746
    2013.Oct.30 11:16:19 99.7459
    2013.Oct.30 11:11:28 99.746
    2013.Oct.30 11:11:18 99.746
    2013.Oct.30 11:06:27 99.7461
    2013.Oct.30 11:06:18 99.7461
    2013.Oct.30 11:01:27 99.7462
    2013.Oct.30 11:01:17 99.7462
    2013.Oct.30 10:56:26 99.7462
    2013.Oct.30 10:56:17 99.7462

    This check is scheduled to run every 5 minutes:
    Name: Free disk space on $1 (percentage)
    Type: Zabbix agent (active)
    Key: vfs.fs.size[{#FSNAME},pfree]
    Type of information: Numeric (float)
    Units: %
    Update interval (in sec): 300
    Keep history (in days): 20
    Keep trends (in days): 365

    I didn't notice this right away, because the datapoints don't always line up so closely. On some machines they are offset/staggered by 30 seconds or so, so you just get a constant flow of data. For example, if a check is scheduled to run every 2 min, I might end up getting data every 1 minute (alternating between the two sets of checks).
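
A quick way to spot this kind of doubling from pasted history is to compare how many samples arrive against the configured update interval. The following is only a minimal sketch in Python: the function names are made up for illustration, the timestamp format is the one pasted in this thread, and `%b` assumes an English locale.

```python
from datetime import datetime


def parse_ts(text):
    # History timestamps as pasted above, e.g. "2013.Oct.30 11:16:29".
    # "%b" matches "Oct" only under an English/C locale (assumption).
    return datetime.strptime(text, "%Y.%b.%d %H:%M:%S")


def samples_per_interval(timestamps, interval_s):
    """Average number of samples received per configured update interval."""
    times = sorted(parse_ts(t) for t in timestamps)
    span = (times[-1] - times[0]).total_seconds()
    if span <= 0:
        raise ValueError("need samples spread over a non-zero time span")
    # n samples cover n - 1 gaps; a healthy item gives a ratio close to 1.
    return (len(times) - 1) / (span / interval_s)


# The free disk space history from the post (update interval: 300 s).
disk_history = [
    "2013.Oct.30 11:16:29", "2013.Oct.30 11:16:19",
    "2013.Oct.30 11:11:28", "2013.Oct.30 11:11:18",
    "2013.Oct.30 11:06:27", "2013.Oct.30 11:06:18",
    "2013.Oct.30 11:01:27", "2013.Oct.30 11:01:17",
    "2013.Oct.30 10:56:26", "2013.Oct.30 10:56:17",
]

ratio = samples_per_interval(disk_history, 300)
print(f"{ratio:.2f} samples per interval")
```

For the disk space item this comes out at a bit over 2 samples per interval, where a healthy item should sit close to 1.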

    I noticed the problem when I started getting really inconsistent graphs for a check:
    2013.Oct.30 11:32:39 0
    2013.Oct.30 11:32:24 120
    2013.Oct.30 11:30:39 0
    2013.Oct.30 11:30:24 120
    2013.Oct.30 11:28:39 0
    2013.Oct.30 11:28:24 120
    2013.Oct.30 11:26:39 0
    2013.Oct.30 11:26:24 120
    2013.Oct.30 11:24:39 0
    2013.Oct.30 11:24:24 120


    What I've noticed:
    - This is across multiple templates/items, and multiple types of checks (both out of the box and custom-defined userparameters, some of which run external scripts)
    - It has been going on for a while, I just didn't see it right away

    What I've checked so far:
    - I've validated that the zabbix_agent.conf files and the zabbix_server.conf files are correct (they appear to be).
    - I've tried stopping and starting the agent and server services
    - I checked that the cluster secondary node wasn't polling as well (its services are disabled unless the primary server goes down)
    - I checked that the logs aren't showing any errors
    - I tried changing the RefreshActiveChecks parameter to make sure it wasn't that
    - I've checked that neither the server nor the agent machines have extra processes running, which might return 2 sets of item data for each request
    - I've run the checks manually via zabbix_get, and I get one set of results (sanitized):
    [<user>@<server>]~% zabbix_get -s <servername> -k "system.cpu.util[,idle]"
    51.507022

    The only lead I have so far:
    I've found that it seems like it *might* be somehow tied to active checks. When I switch a check to passive, the data comes in as I'd expect. For example:
    2013.Oct.30 11:18:14 0.305
    2013.Oct.30 11:16:10 0.3575
    2013.Oct.30 11:14:05 0.2675
    2013.Oct.30 11:12:23 0.0825
    2013.Oct.30 11:09:53 0.3275
    2013.Oct.30 11:08:12 0.165
    2013.Oct.30 11:06:21 0.2425
    2013.Oct.30 11:05:07 0.1625 <- First results after I switched the check from active to passive
    2013.Oct.30 11:03:48 0.255
    2013.Oct.30 11:03:48 0.255
    2013.Oct.30 11:01:48 0.27
    2013.Oct.30 11:01:48 0.27
    2013.Oct.30 10:59:48 0.46
    2013.Oct.30 10:59:48 0.46
    (Granted, there seems to be more variation in when the data comes in: it should be every 2 min, but it fluctuates more.)

    I've done a bunch of searching, and can't seem to find anyone with the same issue, or anything in the documentation that seems to indicate what it might be.
    I figured I'd reach out and see if anyone can think of anything.
    Any/all help is appreciated, thanks!
    -Zillions
    Last edited by zillions; 30-10-2013, 21:11.
  • zillions
    Junior Member
    • Jan 2013
    • 22

    #2
    To update, I think it's not related to the agent, as I just checked a bunch of our network gear, and our SNMP based polling from the server shows the same issue.

    (An snmp based network traffic item for an interface on a Cisco device)
    2013.Oct.30 11:52:52 2200584
    2013.Oct.30 11:51:32 2399416
    2013.Oct.30 11:50:32 2011440
    2013.Oct.30 11:49:31 3255760
    2013.Oct.30 11:48:18 3383824
    2013.Oct.30 11:47:14 2767448
    2013.Oct.30 11:46:17 2313736
    2013.Oct.30 11:45:17 2355464

    This check is set to check every 120 seconds.
    These values don't arrive at the same moment, but you can see that the :45, :47, :49 checks form one 2-minute series and the :46, :48, :50 checks form another, etc...
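
The "two series woven together" pattern can be checked mechanically: the gap between every second sample should match the configured interval, while the gap between neighbouring samples should not. This is just a rough sketch in Python; the function names and the 25% tolerance are assumptions picked for illustration, not anything Zabbix provides.

```python
from datetime import datetime


def gaps(times, stride):
    """Seconds between each sample and the one `stride` positions later."""
    return [(b - a).total_seconds() for a, b in zip(times, times[stride:])]


def looks_interleaved(timestamps, interval_s, tolerance=0.25):
    """True if the data looks like two streams, each polled every interval_s."""
    # "%b" matches "Oct" only under an English/C locale (assumption).
    times = sorted(datetime.strptime(t, "%Y.%b.%d %H:%M:%S") for t in timestamps)
    low, high = interval_s * (1 - tolerance), interval_s * (1 + tolerance)
    every_other = gaps(times, 2)
    if not every_other:
        return False  # fewer than three samples: nothing to compare
    adjacent = gaps(times, 1)
    return (all(low <= g <= high for g in every_other)
            and not any(low <= g <= high for g in adjacent))


# The SNMP traffic item from this post (update interval: 120 s).
snmp_history = [
    "2013.Oct.30 11:52:52", "2013.Oct.30 11:51:32",
    "2013.Oct.30 11:50:32", "2013.Oct.30 11:49:31",
    "2013.Oct.30 11:48:18", "2013.Oct.30 11:47:14",
    "2013.Oct.30 11:46:17", "2013.Oct.30 11:45:17",
]

print(looks_interleaved(snmp_history, 120))
```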

    So it's something on the server I think, but not sure what.

    • zillions
      Junior Member
      • Jan 2013
      • 22

      #3
      I figured it out!!!!
      The SNMP side I need to do more digging on, but I ended up figuring out the agent problem!

      The issue was caused by a configuration setting for our clustered servers. When we first set up the Zabbix servers, we were having problems with forward/reverse DNS, so we put both the behind-the-scenes IPs of the individual servers and the load-balanced, clustered IP in the ServerActive line in the /etc/zabbix/zabbix_agent.conf file.

      The problem is that this actually causes the agent to send the data to ALL the IPs you put there.
      Once the DNS stuff got fixed, we never went back and removed the direct IPs. That is what caused the problem.

      In zabbix_agent.conf, I did the following:
      OLD: ServerActive=<Cluster IP>, <Server1 IP>, <Server2 IP>
      NEW: ServerActive=<Cluster IP>
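
To audit agents for the same mistake, something like the following could flag any config whose ServerActive line lists more than one address. It is only a sketch in Python: the function name is hypothetical, the addresses are made-up placeholders, and if the parameter appears more than once the sketch simply keeps the last occurrence (that is a choice made here, not a statement about how the agent behaves).

```python
def server_active_entries(conf_text):
    """Return the comma-separated entries of ServerActive in an agent config."""
    entries = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "ServerActive":
            # Keep the last occurrence (assumption of this sketch).
            entries = [e.strip() for e in value.split(",") if e.strip()]
    return entries


# Placeholder addresses standing in for <Cluster IP>, <Server1 IP>, <Server2 IP>.
old_conf = """
# ServerActive=127.0.0.1
ServerActive=10.0.0.10, 10.0.0.11, 10.0.0.12
"""
new_conf = "ServerActive=10.0.0.10\n"

for name, text in (("OLD", old_conf), ("NEW", new_conf)):
    entries = server_active_entries(text)
    status = "check this" if len(entries) > 1 else "ok"
    print(name, entries, status)
```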

      The checks went from :
      2013.Oct.30 14:07:55 51.366
      2013.Oct.30 14:07:54 51.8421
      2013.Oct.30 14:05:54 62.8456
      2013.Oct.30 14:05:54 62.8456
      2013.Oct.30 14:03:54 51.3763
      2013.Oct.30 14:03:54 51.3763

      To:
      Timestamp Value
      2013.Oct.30 14:12:13 53.2207
      2013.Oct.30 14:10:13 56.9139
      2013.Oct.30 14:08:13 20.4989

      So the problem appears to be that when the Zabbix agent reports active check results, it sends them to EVERY server on that list. That means that <Cluster IP> gets the data, but so do <Server1 IP> and <Server2 IP>.
      Since the active Zabbix server is reachable via both the cluster IP and its own direct IP, it accepts both copies of the data.

      The issue didn't show up with the passive checks, because those are initiated on the server side, and only one request is made!
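
Here is a toy model of that behaviour in Python, just to make the counting explicit. It is not how Zabbix is implemented: the function names and addresses are invented for illustration, and it assumes (as described above) that the primary node answers on both the cluster IP and its own IP while the secondary's services are disabled.

```python
def active_pushes(server_active, value):
    """Active check: the agent pushes one copy per ServerActive entry."""
    return [(dest, value) for dest in server_active]


def rows_stored(pushes, addresses_on_running_server):
    """The running server stores every copy that reaches one of its addresses."""
    return [value for dest, value in pushes if dest in addresses_on_running_server]


# Placeholder addresses (assumptions, not from the original setup).
CLUSTER, SERVER1, SERVER2 = "10.0.0.10", "10.0.0.11", "10.0.0.12"
primary = {CLUSTER, SERVER1}  # secondary (SERVER2) is not accepting data

before = rows_stored(active_pushes([CLUSTER, SERVER1, SERVER2], 51.3763), primary)
after = rows_stored(active_pushes([CLUSTER], 53.2207), primary)

print(len(before), "value(s) per check before the fix")
print(len(after), "value(s) per check after the fix")
```

Under those assumptions the old three-entry list yields two stored values per check (one via the cluster IP, one via the primary's direct IP), and the single-entry list yields one.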

      This is going to result in a very high priority ticket for a puppet change, because this is screwing up a number of triggers/alerts that I've been fighting with. I need this changed to just the shared IP.

      I'm soooooo pumped!!!
      This was a very obscure issue to track down, but I got it!!

      • Pada
        Senior Member
        • Apr 2012
        • 236

        #4
        I've also had this kind of thing with Active Zabbix agents when I cloned my VM and forgot to change the cloned VM's Zabbix agent configuration file, which resulted in 2 virtual hosts sending data as a single host in Zabbix.
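
A rough way to catch the cloned-VM case is to collect the Hostname line from each machine's agent config and look for names used by more than one machine. A minimal sketch in Python, with hypothetical function names and made-up machine names; it only looks at explicit Hostname lines and ignores agents that leave the parameter unset.

```python
from collections import defaultdict


def explicit_hostname(conf_text):
    """Return the value of the first uncommented Hostname line, or None."""
    for raw in conf_text.splitlines():
        key, sep, value = raw.strip().partition("=")
        if sep and key.strip() == "Hostname":
            return value.strip()
    return None


def shared_hostnames(configs):
    """Map each Hostname used by more than one machine to those machines."""
    by_name = defaultdict(list)
    for machine, text in configs.items():
        name = explicit_hostname(text)
        if name is not None:
            by_name[name].append(machine)
    return {name: sorted(m) for name, m in by_name.items() if len(m) > 1}


# Made-up example: vm-b was cloned from vm-a and its config never updated.
configs = {
    "vm-a": "Hostname=web01\nServerActive=10.0.0.10\n",
    "vm-b": "Hostname=web01\nServerActive=10.0.0.10\n",
    "vm-c": "# Hostname=web01\nHostname=db01\n",
}

print(shared_hostnames(configs))
```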
