Hi,
I'm having a rather odd problem: the value cache usage increases by about 4% every day, and after the cache goes into low memory mode, calculations show the wrong value.
Setup
My setup is a Zabbix server that monitors - among others - some 25 Juniper switches with SNMP. Update intervals and history storage periods are all tuned to be useful yet not overwhelming for the system; I have about 80K values in total and 250 new values per second. The server is running well.
Now lately, I was asked whether it would be possible to detect available switch ports - i.e. ports that had not been used for a couple of months. After some pondering, I added an item prototype to the Template Juniper EX "autodiscovery" prototypes.
Calculated item
This template normally contains "{#SNMPVALUE} - Interface State", defined as OID 1.3.6.1.2.1.2.2.1.8.{#SNMPINDEX} with key name IfOperStatus.[{#SNMPINDEX}]. This item is measured every 5 minutes; history storage is 12 days and trend storage one year. For this SNMP value, "1" means "up", "2" means "down", and there are some other values that matter less here. I figured that calculating the minimum value of this item over a longer period would give me exactly what I wanted: if the minimum is "2", the port has never been "up" during that period.
I added a new item prototype called "{#SNMPVALUE} - 3 month uptime", key name IfOperStatus3month.[{#SNMPINDEX}], type "calculated", with the formula min("IfOperStatus.[{#SNMPINDEX}]",7776000).
Update interval: once per day; history storage 5 days and trend storage one year. I do realise this is a very heavy item prototype: it has to calculate the minimum of about 300 values per day, times 90 days, times roughly 1000+ interfaces, so close to 30,000,000 values have to be fetched from the database in order to calculate the 3 month uptime.
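As a back-of-the-envelope check on those numbers (illustrative figures only, taken from the setup described above):

```python
# Rough estimate of how many history values the 3-month calculated
# item has to read per evaluation. All figures are taken from the
# setup above and are approximate, not measured.

values_per_day = 24 * 60 // 5      # one IfOperStatus sample every 5 minutes -> 288
days = 90                          # 7776000 seconds = 90 days
interfaces = 1000                  # roughly 1000+ discovered interfaces

per_interface = values_per_day * days     # samples one evaluation must scan
total = per_interface * interfaces        # across all interfaces

print(per_interface)  # 25920
print(total)          # 25920000, i.e. close to 30 million
```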
Result
As a result, my value cache went berserk. It used to sit at 8M, but suddenly rose to about 250M, so I enlarged the value cache to 290M and later to 400M.
And now what happens:
- value cache usage climbs to 60% (of this 400M) in the day after a restart, which I consider normal. But then it slowly creeps up, gaining about 4% per day, so the remaining 40% would be gone in about 10 days; in practice the value cache goes into low memory mode after about 8 days.
- once Zabbix is in low memory mode, my IfOperStatus3month goes wild, too: I have found several instances with the wrong value, i.e. where "IfOperStatus" was clearly "1" at some point in the window, yet min(IfOperStatus...) came out as "2". Most notably, those items flip to the wrong status at exactly the moment the value cache enters low memory mode.
Question 1.
I suspect a memory leak here, but I wouldn't know how to test for one. I'm willing to tune the value cache once more if necessary, but I don't know how to tell a genuine value cache sizing issue apart from a memory leak. Does anyone know what I should test?
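One way I could imagine getting hard numbers (assuming a Zabbix version that supports the value cache internal items, 2.2 or later) would be to add internal items on the server host and graph them over a few weeks:

```
zabbix[vcache,buffer,pused]   # percent of value cache in use
zabbix[vcache,buffer,pfree]   # percent of value cache free
zabbix[vcache,cache,mode]     # cache operating mode; 1 = low memory
```

If pused plateaus at some level, that would look like a plain sizing issue where more cache helps; if it climbs without ever levelling off, that would point more towards a leak.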
Possible workaround
As a workaround, I figured that an intermediate value might help the calculation, and I'd welcome comments on whether this is a valid idea. I was thinking of:
- adding "{#SNMPVALUE} - 24 hour uptime", type calculated, key IfOperStatus24hours.[{#SNMPINDEX}], with formula min("IfOperStatus.[{#SNMPINDEX}]",86400), calculated every 16 hours, history storage 5 days and trend storage one year
- adding "{#SNMPVALUE} - 3 month uptime", type calculated, with formula min("IfOperStatus24hours.[{#SNMPINDEX}]",7776000).
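If my understanding of how the load scales is right (this is only an estimate, not something I have measured), the intermediate item would change the per-interface numbers roughly like this:

```python
# Rough comparison of values fetched per interface per evaluation of
# the 3-month item, with and without the intermediate 24-hour item.
# Illustrative figures from the setup above; not measured.

values_per_day = 288          # raw IfOperStatus samples per day (every 5 min)
days = 90
runs_per_day = 24 / 16        # the 24-hour item is evaluated every 16 hours

# Without the intermediate item: one evaluation scans all raw samples.
direct_fetch = values_per_day * days          # 25920 raw values

# With it: the 3-month item only reads the stored 24-hour minimums.
intermediate_fetch = days * runs_per_day      # 135 stored values

print(direct_fetch, intermediate_fetch)  # 25920 135.0
```

So each 3-month evaluation would touch two orders of magnitude fewer values, at the cost of the extra daily evaluations of the 24-hour item.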
Question 2.
Would this intermediate step help? Or would it just add to the confusion?
Any help is appreciated.