Hi guys,
Running Zabbix 2.0.5 in a CentOS clustered environment with a remote clustered MySQL database (corosync and pacemaker for both clusters). Our environment is approximately the following:
Number of hosts (monitored/not monitored/templates): 2158 (1999 / 25 / 134)
Number of items (monitored/disabled/not supported): 354330 (234822 / 16500 / 103008)
Number of triggers (enabled/disabled [problem/unknown/ok]): 27151 (21129 / 6022 [142 / 0 / 20987])
Required server performance, new values per second: 1300.24
We have run into a fairly annoying issue.
I've done a bunch of checking as to possible causes, as well as searched online, and I can't seem to find anyone with a similar issue.
Basically, a large number of our checks seem to be returning duplicate results.
For example, a simple free disk space % check:
2013.Oct.30 11:16:29 99.746
2013.Oct.30 11:16:19 99.7459
2013.Oct.30 11:11:28 99.746
2013.Oct.30 11:11:18 99.746
2013.Oct.30 11:06:27 99.7461
2013.Oct.30 11:06:18 99.7461
2013.Oct.30 11:01:27 99.7462
2013.Oct.30 11:01:17 99.7462
2013.Oct.30 10:56:26 99.7462
2013.Oct.30 10:56:17 99.7462
This check is scheduled to run every 5 minutes:
Name: Free disk space on $1 (percentage)
Type: Zabbix agent (active)
Key: vfs.fs.size[{#FSNAME},pfree]
Type of information: Numeric (float)
Units: %
Update interval (in sec): 300
Keep history (in days): 20
Keep trends (in days): 365
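To confirm the same pattern on other items, here is a minimal sketch (assuming history has been exported as timestamp/value lines like the ones above) that flags pairs of samples arriving far closer together than a 300-second update interval allows:

```python
from datetime import datetime

# Sample history lines as pasted above (timestamp, value)
history = """\
2013.Oct.30 11:16:29 99.746
2013.Oct.30 11:16:19 99.7459
2013.Oct.30 11:11:28 99.746
2013.Oct.30 11:11:18 99.746
2013.Oct.30 11:06:27 99.7461
2013.Oct.30 11:06:18 99.7461
2013.Oct.30 11:01:27 99.7462
2013.Oct.30 11:01:17 99.7462
2013.Oct.30 10:56:26 99.7462
2013.Oct.30 10:56:17 99.7462"""

def duplicate_pairs(lines, max_gap=30):
    """Return consecutive samples that arrived within max_gap seconds
    of each other -- i.e. two results for one scheduled check."""
    stamps = []
    for line in lines.splitlines():
        date, time, value = line.split()
        stamps.append((datetime.strptime(f"{date} {time}",
                                         "%Y.%b.%d %H:%M:%S"), float(value)))
    stamps.sort()  # oldest first
    dupes = []
    for (t1, _v1), (t2, _v2) in zip(stamps, stamps[1:]):
        gap = (t2 - t1).total_seconds()
        if gap < max_gap:
            dupes.append((t1, t2, gap))
    return dupes

for t1, t2, gap in duplicate_pairs(history):
    print(f"{t1} / {t2}  ({gap:.0f}s apart)")
```

On the data above this reports a pair roughly 10 seconds apart inside every 5-minute window.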
I didn't notice this right away, because the data points don't always line up so closely. On some machines they are offset/staggered by 30 seconds or so, so you just get a constant flow of data: for example, if a check is scheduled to run every 2 minutes, I might end up getting data every minute (interleaved between the two sets of checks).
I noticed the problem when I started getting really inconsistent graphs for a check:
2013.Oct.30 11:32:39 0
2013.Oct.30 11:32:24 120
2013.Oct.30 11:30:39 0
2013.Oct.30 11:30:24 120
2013.Oct.30 11:28:39 0
2013.Oct.30 11:28:24 120
2013.Oct.30 11:26:39 0
2013.Oct.30 11:26:24 120
2013.Oct.30 11:24:39 0
2013.Oct.30 11:24:24 120
What I've noticed:
- This happens across multiple templates/items and multiple types of checks (both out-of-the-box items and custom-defined UserParameters, some of which run external scripts)
- It has been going on for a while; I just didn't notice it right away
What I've checked so far:
- I've validated that the zabbix_agentd.conf and zabbix_server.conf files are correct (they appear to be)
- I've tried stopping and starting the agent and server services
- I checked that the cluster's secondary node wasn't also polling (its services are disabled unless the primary server goes down)
- I checked that the logs aren't showing any errors
- I tried changing the RefreshActiveChecks parameter to make sure it wasn't that
- I've checked that neither the server nor the agent machines have extra processes running that might return two sets of item data for each request
- I've run the checks manually via zabbix_get, and I get one set of results (sanitized):
[<user>@<server>]~% zabbix_get -s <servername> -k "system.cpu.util[,idle]"
51.507022
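For context, these are the agent-side parameters that govern active checks in 2.0 (hostnames and values below are illustrative, not our actual config). One thing worth mentioning: as I understand it, the agent runs an independent set of active checks for each entry in ServerActive, so a duplicate entry for the same server (e.g. both an IP and a DNS name, or both cluster nodes) would collect every active item twice:

```
# /etc/zabbix/zabbix_agentd.conf (illustrative values)
Server=zabbix.example.com        # passive checks: who may poll this agent
ServerActive=zabbix.example.com  # active checks: each comma-separated entry
                                 # here is served independently, so listing
                                 # the same server twice doubles the data
Hostname=myhost.example.com      # must match the host name in the frontend
RefreshActiveChecks=120          # how often the active item list is re-fetched
```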
The only lead I have so far:
I've found that it seems like it *might* be somehow tied to active checks. When I switch a check to passive, the data comes in as I'd expect. For example:
2013.Oct.30 11:18:14 0.305
2013.Oct.30 11:16:10 0.3575
2013.Oct.30 11:14:05 0.2675
2013.Oct.30 11:12:23 0.0825
2013.Oct.30 11:09:53 0.3275
2013.Oct.30 11:08:12 0.165
2013.Oct.30 11:06:21 0.2425
2013.Oct.30 11:05:07 0.1625 <- First results after I switched the check from active to passive
2013.Oct.30 11:03:48 0.255
2013.Oct.30 11:03:48 0.255
2013.Oct.30 11:01:48 0.27
2013.Oct.30 11:01:48 0.27
2013.Oct.30 10:59:48 0.46
2013.Oct.30 10:59:48 0.46
(Granted, there seems to be more variation in the times the data comes in; it should be every 2 minutes, but it fluctuates more.)
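Interestingly, the duplicates from before the switch share the exact same timestamp (unlike the ~10-second offsets on the disk-space item), which is easy to count from the exported lines above, a sketch:

```python
from collections import Counter

# The active-check rows from before the switch to passive, as pasted above
lines = """\
2013.Oct.30 11:03:48 0.255
2013.Oct.30 11:03:48 0.255
2013.Oct.30 11:01:48 0.27
2013.Oct.30 11:01:48 0.27
2013.Oct.30 10:59:48 0.46
2013.Oct.30 10:59:48 0.46""".splitlines()

# timestamp -> number of rows received at that exact second
counts = Counter(" ".join(line.split()[:2]) for line in lines)
dupes = {ts: n for ts, n in counts.items() if n > 1}
print(dupes)  # every active-check timestamp appears more than once
```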
I've done a bunch of searching, and can't seem to find anyone with the same issue, or anything in the documentation that seems to indicate what it might be.
I figured I'd reach out and see if anyone can think of anything.
Any/all help is appreciated, thanks!
-Zillions