Zabbix version: 1.1.7 (both server and agent)
OS: 32-bit CentOS 4.4 (server)
OS: 64-bit CentOS 4.4 (clients)
I'm seeing what I believe to be an overflow error in the Zabbix server.
One of my servers has been up for over 5 months and has sent a fair amount of data during that time.
The item I am monitoring is configured as follows:
Code:
Description: * Net: eth2: Out Bytes
Type: Zabbix Agent
Key: net.if.out[eth2]
Type Of Information: numeric(float)
Units: B/s
Use Multiplier: Do not use
Update Interval: 30
Keep History: 7
Keep Trends: 30
Status: Monitored
Store Value: Delta (speed per second)
The statistics from two successive invocations of ifconfig are as follows:
Code:
eth2      Link encap:Ethernet HWaddr 00:E0:81:42:A2:8A
          inet addr:66.237.62.70 Bcast:66.237.62.95 Mask:255.255.255.224
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:6206970074 errors:0 dropped:0 overruns:2 frame:2
          TX packets:5969353011 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1381256643448 (1.2 TiB) TX bytes:1959423311117 (1.7 TiB)
Code:
eth2      Link encap:Ethernet HWaddr 00:E0:81:42:A2:8A
          inet addr:66.237.62.70 Bcast:66.237.62.95 Mask:255.255.255.224
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:6206970561 errors:0 dropped:0 overruns:2 frame:2
          TX packets:5969353459 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1381256741733 (1.2 TiB) TX bytes:1959423430756 (1.7 TiB)
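For reference, here is a minimal sketch of what a "Delta (speed per second)" item would be expected to store for the two ifconfig readings above. The sampling interval is an assumption: the time between the two ifconfig runs was not recorded, so 30 seconds is borrowed from the item's update interval. The function and variable names are hypothetical, not anything from Zabbix.

```python
def speed_per_second(previous, current, seconds):
    """Counter difference divided by elapsed time, i.e. the expected delta value."""
    return (current - previous) / seconds

# Byte counters copied from the two ifconfig outputs above.
TX_FIRST, TX_SECOND = 1959423311117, 1959423430756
RX_FIRST, RX_SECOND = 1381256643448, 1381256741733

# ASSUMPTION: the real gap between the two invocations is unknown.
ASSUMED_INTERVAL = 30

tx_delta = TX_SECOND - TX_FIRST
rx_delta = RX_SECOND - RX_FIRST
tx_rate = speed_per_second(TX_FIRST, TX_SECOND, ASSUMED_INTERVAL)
rx_rate = speed_per_second(RX_FIRST, RX_SECOND, ASSUMED_INTERVAL)

print(f"TX delta: {tx_delta} bytes, about {tx_rate:.1f} B/s over {ASSUMED_INTERVAL}s")
print(f"RX delta: {rx_delta} bytes, about {rx_rate:.1f} B/s over {ASSUMED_INTERVAL}s")
```

Whatever the true interval was, the raw differences are 119,639 bytes (TX) and 98,285 bytes (RX), nowhere near billions of bytes.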
The deltas being stored in the database for this item look like this:
Code:
mysql> select * from history where itemid = 17631 limit 10;
+--------+------------+------------------+
| itemid | clock      | value            |
+--------+------------+------------------+
|  17631 | 1177312225 | 25380538891.4118 |
|  17631 | 1177312252 | 31960873547.0000 |
|  17631 | 1177312285 | 26149985370.9091 |
|  17631 | 1177312312 | 31961285623.2593 |
|  17631 | 1177312345 | 26150311678.3636 |
|  17631 | 1177312372 | 31961687237.2222 |
|  17631 | 1177312405 | 26150659855.6970 |
|  17631 | 1177312432 | 31962095095.3333 |
|  17631 | 1177312465 | 26150981869.8788 |
|  17631 | 1177312492 | 31962497902.9630 |
+--------+------------+------------------+
10 rows in set (0.00 sec)
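A quick sanity check on the history rows above, offered as a sketch rather than a diagnosis: multiplying each stored value by the seconds elapsed since the previous sample undoes the per-second division and shows the raw difference that was divided. The script hard-codes the ten rows from the query; the function and variable names are illustrative only and do not come from Zabbix.

```python
# Rows copied from the history query above: (clock, stored per-second value).
HISTORY = [
    (1177312225, 25380538891.4118),
    (1177312252, 31960873547.0000),
    (1177312285, 26149985370.9091),
    (1177312312, 31961285623.2593),
    (1177312345, 26150311678.3636),
    (1177312372, 31961687237.2222),
    (1177312405, 26150659855.6970),
    (1177312432, 31962095095.3333),
    (1177312465, 26150981869.8788),
    (1177312492, 31962497902.9630),
]

def raw_differences(rows):
    """Return (clock, seconds, value * seconds) for each row after the first."""
    out = []
    for (prev_clock, _), (clock, value) in zip(rows, rows[1:]):
        seconds = clock - prev_clock
        out.append((clock, seconds, value * seconds))
    return out

for clock, seconds, diff in raw_differences(HISTORY):
    print(f"{clock}  dt={seconds:2d}s  value*dt={diff:,.0f}")
```

The samples alternate between 27 and 33 seconds apart, which accounts for the alternating 25-26 billion and 31-32 billion values. Every product comes out at roughly 8.63e11 bytes, and consecutive products differ by only a few million. That pattern would be consistent with each new reading being compared against a previous value that is no longer advancing, although these rows alone do not show where that would be happening.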
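If I look at the trends early on in the history, I see the values I "expect" to see (sorry, I have had this problem for a long time and the data has already moved out of the history table):
Code: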
mysql> select * from trends where itemid = 17631 order by clock limit 5;
+--------+------------+-----+-----------+-----------+-----------+
| itemid | clock      | num | value_min | value_avg | value_max |
+--------+------------+-----+-----------+-----------+-----------+
|  17631 | 1154732400 |  51 |  107.6923 |  182.2010 |  582.5000 |
|  17631 | 1154736000 | 120 |   42.2800 |  174.7061 |  290.6757 |
|  17631 | 1154739600 | 120 |   34.3333 |  173.9369 |  291.3333 |
|  17631 | 1154743200 | 120 |   57.7778 |  171.0022 |  253.8974 |
|  17631 | 1154746800 | 120 |   34.9259 |  168.4361 |  274.3939 |
+--------+------------+-----+-----------+-----------+-----------+
5 rows in set (0.00 sec)
When I look at the latest trends, I see grossly exaggerated values:
Code:
mysql> select * from trends where itemid = 17631 order by clock desc limit 5;
+--------+------------+-----+------------------+------------------+-------------------+
| itemid | clock      | num | value_min        | value_avg        | value_max         |
+--------+------------+-----+------------------+------------------+-------------------+
|  17631 | 1177920000 |   5 | 29074763141.8485 | 31659416580.5771 |  35536204022.2222 |
|  17631 | 1177916400 | 117 |  8197531095.2222 | 32206601441.0039 |  43611314467.7273 |
|  17631 | 1177912800 | 121 | 21794439748.1818 | 33813439024.3971 | 159800714695.1667 |
|  17631 | 1177909200 | 120 | 27386203382.8286 | 32353125326.6845 |  38343269614.5200 |
|  17631 | 1177905600 | 120 | 24570471886.8462 | 32440411990.4769 |  47912528382.3500 |
+--------+------------+-----+------------------+------------------+-------------------+
5 rows in set (0.00 sec)
Chart 2c shows the weekly traffic graphs just before the problem appeared.
Chart 2d is from an hour later, when the problem appears on the outgoing traffic.
Chart 2e is from a week later; you can see the network traffic showing obviously bad values (if only my network card got faster the longer I used it!).
Chart 2b shows when the problem started occurring on the incoming traffic.
Finally, Chart 2a shows the last month of traffic, with my outgoing traffic listed as almost 30 GB/s.
My guess is that the numbers being returned to the Zabbix server are overflowing the counter that is used to store them.
I've verified that the Zabbix agent is returning the correct values, i.e. the same large numbers that I see in the ifconfig output (1,959,423,311,117).
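To help narrow down the overflow guess, here is a small check of which common numeric limits the raw counter value exceeds. This is only arithmetic on the number quoted above; it says nothing about which type Zabbix actually uses internally, which is not established here. The constant names are illustrative.

```python
# TX byte counter as reported by ifconfig.
TX_BYTES = 1_959_423_311_117

UINT32_MAX = 2**32 - 1            # largest unsigned 32-bit integer
INT64_MAX = 2**63 - 1             # largest signed 64-bit integer
FLOAT64_EXACT_INT_MAX = 2**53     # doubles represent every integer up to here exactly

exceeds_uint32 = TX_BYTES > UINT32_MAX
fits_int64 = TX_BYTES <= INT64_MAX
exact_in_float64 = (
    TX_BYTES <= FLOAT64_EXACT_INT_MAX and int(float(TX_BYTES)) == TX_BYTES
)

# How many times a 32-bit counter would have wrapped to reach this value.
wraps, remainder = divmod(TX_BYTES, 2**32)

print(f"exceeds uint32:      {exceeds_uint32}")
print(f"fits in int64:       {fits_int64}")
print(f"exact as a double:   {exact_in_float64}")
print(f"32-bit wraps so far: {wraps} (remainder {remainder})")
```

The value is far past the 32-bit range but well within what a 64-bit integer or a double can represent exactly.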
Has anyone seen this problem before, or does anyone have suggestions on how to work around it? (Yes, I could reboot the server, but I'd rather not, since these are production servers.)
Thanks in advance for your help.
- Paul