Zabbix 2.4.1 "Value Cache is Fully used"

  • Murmelantes
    Junior Member
    • Feb 2015
    • 6

    #1

    Hello,
    I'm using Zabbix to monitor a really small network: only 7 servers, the Zabbix server included.
    I'm using the exact same templates and setup on 2 other networks without issue; only this one is giving me trouble.
    Basically, I run out of value cache memory within 2 minutes, no matter which size is defined in zabbix_server.conf.

    I'm using the default values everywhere. I first increased ValueCacheSize to 16M, then to 32M, and finally raised kernel.shmmax so that I could try even higher values...

    ValueCacheSize is now 256M, and I'm seeing the exact same issue:

    [Attached image: cache usage.png]

    All data gathering and internal processes are close to 0%, and I don't see anything special in the logs:

    Code:
     10011:20150218:103343.352 Starting Zabbix Server. Zabbix 2.4.1 (revision 49643).
     10011:20150218:103343.352 ****** Enabled features ******
     10011:20150218:103343.352 SNMP monitoring:           YES
     10011:20150218:103343.352 IPMI monitoring:           YES
     10011:20150218:103343.352 WEB monitoring:            YES
     10011:20150218:103343.352 VMware monitoring:         YES
     10011:20150218:103343.352 Jabber notifications:      YES
     10011:20150218:103343.352 Ez Texting notifications:  YES
     10011:20150218:103343.352 ODBC:                      YES
     10011:20150218:103343.352 SSH2 support:              YES
     10011:20150218:103343.352 IPv6 support:              YES
     10011:20150218:103343.352 ******************************
     10011:20150218:103343.352 using configuration file: /etc/zabbix/zabbix_server.conf
     10011:20150218:103343.364 current database version (mandatory/optional): 02040000/02040000
     10011:20150218:103343.364 required mandatory version: 02040000
     10011:20150218:103343.396 server #0 started [main process]
     10032:20150218:103343.396 server #1 started [configuration syncer #1]
     10033:20150218:103343.396 server #2 started [db watchdog #1]
     10034:20150218:103343.397 server #3 started [poller #1]
     10038:20150218:103343.397 server #6 started [poller #4]
     10039:20150218:103343.399 server #7 started [poller #5]
     10045:20150218:103343.402 server #13 started [trapper #5]
     10049:20150218:103343.403 server #17 started [timer #1]
     10041:20150218:103343.404 server #9 started [trapper #1]
     10046:20150218:103343.404 server #14 started [icmp pinger #1]
     10044:20150218:103343.404 server #12 started [trapper #4]
     10050:20150218:103343.404 server #18 started [http poller #1]
     10048:20150218:103343.405 server #16 started [housekeeper #1]
     10055:20150218:103343.405 server #23 started [history syncer #4]
     10040:20150218:103343.405 server #8 started [unreachable poller #1]
     10042:20150218:103343.405 server #10 started [trapper #2]
     10059:20150218:103343.405 server #27 started [self-monitoring #1]
     10052:20150218:103343.405 server #20 started [history syncer #1]
     10037:20150218:103343.406 server #5 started [poller #3]
     10035:20150218:103343.406 server #4 started [poller #2]
     10047:20150218:103343.407 server #15 started [alerter #1]
     10053:20150218:103343.408 server #21 started [history syncer #2]
     10057:20150218:103343.409 server #25 started [snmp trapper #1]
     10056:20150218:103343.412 server #24 started [escalator #1]
     10043:20150218:103343.419 server #11 started [trapper #3]
     10054:20150218:103343.420 server #22 started [history syncer #3]
     10058:20150218:103343.420 server #26 started [proxy poller #1]
     10051:20150218:103344.051 server #19 started [discoverer #1]
     10052:20150218:103510.179 value cache is fully used: please increase ValueCacheSize configuration parameter
    Any idea what I could do to try to solve this issue?

    Thanks a lot!
  • ingus.vilnis
    Senior Member
    Zabbix Certified Trainer
    Zabbix Certified Specialist
    Zabbix Certified Professional
    • Mar 2014
    • 908

    #2
    Hi,

    Can you post the actual zabbix_server.conf file here?

    Maybe you can try temporarily disabling all your monitored hosts, so that no new values are received. Then restart the server and see what happens to the value cache.

    Best Regards,
    Ingus


    • Murmelantes
      Junior Member
      • Feb 2015
      • 6

      #3
      Hmm.
      Such a shame...

      PEBKAC:
      I was declaring ValueCacheSize in 2 different places in the conf file, so it was actually still at 8M! :|
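
      A minimal sketch of how a duplicated parameter like this could be spotted, assuming the usual `Name=value` layout with `#` comments. The function name and the sample text are hypothetical, and some parameters may legitimately appear more than once, so a hit is only a hint to go and look:

```python
from collections import defaultdict


def find_duplicate_parameters(conf_text):
    """Return {parameter: [(line_number, value), ...]} for parameters set more than once."""
    seen = defaultdict(list)
    for line_number, raw_line in enumerate(conf_text.splitlines(), start=1):
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        seen[name.strip()].append((line_number, value.strip()))
    return {name: hits for name, hits in seen.items() if len(hits) > 1}


# Hypothetical sample, not the poster's real file.
sample = """\
# ValueCacheSize=8M
ValueCacheSize=8M
LogFile=/var/log/zabbix/zabbix_server.log
ValueCacheSize=256M
"""

duplicates = find_duplicate_parameters(sample)
for name, hits in duplicates.items():
    print(name, hits)
```

      To check a real file, read it first, for example with `open("/etc/zabbix/zabbix_server.conf").read()`, and pass the text to the function.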

      I still don't quite understand why this specific network needs more than 8M of value cache:
      it only has 7 hosts, 824 items and 244 triggers, and Zabbix only has to process 11.39 new values per second...

      But I have now changed it to 16M, and it has looked more or less stable at 37% free for the last 30 minutes, and the cache effectiveness has really improved.

      So I guess I can live with it..

      Thanks a lot ingus.vilnis!
      Last edited by Murmelantes; 18-02-2015, 16:54.


      • ingus.vilnis
        Senior Member
        Zabbix Certified Trainer
        Zabbix Certified Specialist
        Zabbix Certified Professional
        • Mar 2014
        • 908

        #4
        Still strange, even after you found the duplicated line.

        But if so, maybe there are some other misconfigured settings? Only 37% of the value cache free is not good.

        Can you also show a graph with the value cache misses?

        Best Regards,
        Ingus


        • Murmelantes
          Junior Member
          • Feb 2015
          • 6

          #5
          zabbix_server.conf looks pretty simple now:

          Code:
          grep -Ev '(#.*$)|(^$)' /etc/zabbix/zabbix_server.conf
          
          LogFile=/var/log/zabbix/zabbix_server.log
          PidFile=/var/run/zabbix/zabbix_server.pid
          DBName=*****
          DBUser=*****
          DBPassword=*****
          SNMPTrapperFile=/var/log/snmptt/snmptt.log
          StartSNMPTrapper=1
          ValueCacheSize=16M
          AlertScriptsPath=/usr/lib/zabbix/alertscripts
          ExternalScripts=/usr/lib/zabbix/externalscripts
          I'm not sure what the cause could be:
          problems in the external scripts I'm using, issues on the monitored servers...
          No idea!

          Regarding the graphs, here is everything I have:
          Zabbix was installed 6 days ago...

          ValueCacheSize was changed to 16M at 15:08, and I restarted the zabbix-server service 3 times after that:

          6009:20150218:150822.683 Starting Zabbix Server. Zabbix 2.4.1 (revision 49643).
          11470:20150218:155535.174 Starting Zabbix Server. Zabbix 2.4.1 (revision 49643).
          11650:20150218:155605.228 Starting Zabbix Server. Zabbix 2.4.1 (revision 49643).
          15839:20150218:162943.585 Starting Zabbix Server. Zabbix 2.4.1 (revision 49643).

          So basically, except around those 3 restarts, there are almost no "misses" anymore.
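
          To line up restarts with dips in a cache graph, the start times can be pulled out of the server log. This is only a sketch: the `PID:YYYYMMDD:HHMMSS.mmm message` layout is inferred from the log lines quoted in this thread, and the function name is hypothetical:

```python
import re
from datetime import datetime

# PID:YYYYMMDD:HHMMSS.mmm message (layout assumed from the quoted log lines)
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\.(\d{3})\s+(.*)$")


def server_start_times(log_text):
    """Return a datetime for every 'Starting Zabbix Server' line in the log text."""
    starts = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(5).startswith("Starting Zabbix Server"):
            # Date and time are parsed to the second; milliseconds are ignored.
            stamp = datetime.strptime(match.group(2) + match.group(3), "%Y%m%d%H%M%S")
            starts.append(stamp)
    return starts


log_excerpt = """\
6009:20150218:150822.683 Starting Zabbix Server. Zabbix 2.4.1 (revision 49643).
11470:20150218:155535.174 Starting Zabbix Server. Zabbix 2.4.1 (revision 49643).
11650:20150218:155605.228 Starting Zabbix Server. Zabbix 2.4.1 (revision 49643).
15839:20150218:162943.585 Starting Zabbix Server. Zabbix 2.4.1 (revision 49643).
"""

starts = server_start_times(log_excerpt)
for stamp in starts:
    print(stamp.isoformat())
```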

          [Attached image: cache effectiveness total.jpg]

          [Attached image: cache effectiveness today.jpg]

          [Attached image: cache effectiveness 2h.jpg]

          Thanks a lot for your help!
          Last edited by Murmelantes; 18-02-2015, 18:47.


          • ingus.vilnis
            Senior Member
            Zabbix Certified Trainer
            Zabbix Certified Specialist
            Zabbix Certified Professional
            • Mar 2014
            • 908

            #6
            Hi,

            OMG, 16 Kvps of value cache hits with 7 hosts? Normally you should have 16 vps, not 16,000 vps, in an environment of this size.

            In that case set your ValueCacheSize to 128M, 256M or even 512M. 16M is not enough, and your server will crash again as soon as you add another host.

            Anyway, you must find out why this happens. So could you please show a screenshot of the triggers, and especially the trigger expressions, you have on your hosts? (Blur out confidential details, if any.)

            Best Regards,
            Ingus


            • Murmelantes
              Junior Member
              • Feb 2015
              • 6

              #7
              I found the culprit: I was really optimistic when setting up my triggers.

              I created 2 new triggers to compare the incoming and outgoing traffic on all interfaces to the average daily minimum.

              Basically, I was calculating the average daily minimum over the last week, and triggering an alarm when the traffic dropped below half of this average.

              The alarm was cleared when the traffic was back above half the average traffic over the last 7 days.

              Code:
              ({TRIGGER.VALUE}=0 &
              {Template SNMP Interfaces:ifInOctets[{#SNMPVALUE}].max(10m)}
              -(({Template SNMP Interfaces:ifInOctets[{#SNMPVALUE}].min(1d,1d)} + 
              {Template SNMP Interfaces:ifInOctets[{#SNMPVALUE}].min(1d,2d)} + 
              {Template SNMP Interfaces:ifInOctets[{#SNMPVALUE}].min(1d,3d)} + 
              {Template SNMP Interfaces:ifInOctets[{#SNMPVALUE}].min(1d,4d)} + 
              {Template SNMP Interfaces:ifInOctets[{#SNMPVALUE}].min(1d,5d)} + 
              {Template SNMP Interfaces:ifInOctets[{#SNMPVALUE}].min(1d,6d)} + 
              {Template SNMP Interfaces:ifInOctets[{#SNMPVALUE}].min(1d,7d)})/14)<0) |
              ({TRIGGER.VALUE}=1 &
              {Template SNMP Interfaces:ifInOctets[{#SNMPVALUE}].min(10m)}
              -(({Template SNMP Interfaces:ifInOctets[{#SNMPVALUE}].avg(7d)})/2)<0)
              It was working perfectly, but was too heavy...
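
              The two branches of the expression above form a hysteresis: one condition raises the alarm, a different one clears it. A rough Python restatement of that logic, as a sketch only (the function and argument names are made up for illustration and are not Zabbix APIs):

```python
def next_trigger_state(in_problem, max_10m, min_10m, daily_minimums, avg_7d):
    """Mirror the two branches of the trigger expression; True means PROBLEM."""
    if len(daily_minimums) != 7:
        raise ValueError("expected one minimum per day for the last 7 days")
    if not in_problem:
        # sum / 7 is the average daily minimum; dividing by 14 halves it.
        threshold = sum(daily_minimums) / 14
        return max_10m < threshold
    # Once raised, the alarm stays up while min(10m) is below half the 7-day average.
    return min_10m < avg_7d / 2


# Hypothetical traffic figures, in octets.
week_of_minimums = [100] * 7
print(next_trigger_state(False, 40, 10, week_of_minimums, 400))    # raises the alarm
print(next_trigger_state(True, 300, 250, week_of_minimums, 400))   # clears the alarm
```

              Each evaluation needs data reaching back 8 days (the `min(1d,7d)` term), which is what makes the expression expensive.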

              Would it maybe be possible to recreate a similar behavior through calculated items?
              I really liked this alarm...

              Anyway, I'm back to more normal values (600 vps and 97% cache free), and I still have some "overly optimistic" triggers that I can disable.
              So I now know where the issue was coming from...

              Thanks a lot again for your help!
              Last edited by Murmelantes; 19-02-2015, 13:36.


              • ingus.vilnis
                Senior Member
                Zabbix Certified Trainer
                Zabbix Certified Specialist
                Zabbix Certified Professional
                • Mar 2014
                • 908

                #8
                Yes, you were basically storing all collected values for those items for 7 days in the value cache, and that is why it failed. Functions with long time periods and time shifts take up a lot of value cache.
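
                A back-of-the-envelope sketch of why this adds up, assuming the cache has to hold every value from now back to the oldest one a function asks for. The 60-second update interval and the count of 48 interface items are hypothetical, not figures from this thread:

```python
DAY = 86400  # seconds


def values_in_window(update_interval_s, period_s, time_shift_s=0):
    """Number of samples between now and the start of a (period, time shift) window."""
    return (period_s + time_shift_s) // update_interval_s


# min(1d,7d) reaches back 8 days; a 60 s update interval is assumed.
per_item = values_in_window(60, DAY, 7 * DAY)

# Hypothetical: 24 interfaces with one in and one out counter each.
item_count = 48
total = per_item * item_count

print(per_item, total)
```

                With these assumed numbers a single item already needs 11520 cached values, and the count grows linearly with every discovered interface.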

                Anyway, I am glad you found the problem!

                Yes, you can consider calculated items, but again be careful not to use time-based functions so heavily that they exhaust your value cache.

                Or, as another option, add more RAM. It is said that a fast Zabbix setup is one where the whole DB fits into RAM, so you still have room for improvement.

                Best Regards,
                Ingus
