Zabbix server crashing due to Trendsize cache

  • attilla
    Junior Member
    Zabbix Certified Specialist
    • Feb 2011
    • 25

    #1

    Zabbix server crashing due to Trendsize cache

    Hi,

    I have put my new server in production and started the discovery process.

    Unfortunately, as soon as a large part of my hosts were added, the server crashed with the notification that I should increase the TrendSizeCache, so I did and also increased the shared memory.

    Code:
    CacheSize = 1G           
    TrendSizeCache = 1G      
    HistoryCacheSize 128M    
    HistoryTextCacheSize 128M
    Code:
    kernel.shmall = 2097152                     
    kernel.shmmax = 4294967296                  
    kernel.shmmni = 4096                        
    # semaphores: semmsl, semmns, semopm, semmni
    kernel.sem = 250 32000 100 128
    As you can see, I even increased it to its maximum value, but it keeps crashing. The following are the debug logs:

    Code:
      7164:20110423:110459.997 In DCmass_add_history()                                                      
      7164:20110423:110459.997 Query [txnlev:1] [insert into history (itemid,clock,value) values (149407,1303549497,0.000000),(103507,1303549497,59988530.320083),(45307,1303549497,2.016871),(111307,1303549497,0.000000);]                                                                                                       
      7164:20110423:110459.997 In DCmass_update_triggers()                                                  
      7164:20110423:110459.997 Query [txnlev:1] [select distinct t.triggerid,t.type,t.value,t.error,t.expression,f.itemid from triggers t,functions f,items i where t.triggerid=f.triggerid and f.itemid=i.itemid and t.status=0 and f.itemid in (149407,103507,45307,111307) order by t.triggerid]                         
      7164:20110423:110459.998 End of DCmass_update_triggers()                                              
      7164:20110423:110459.998 In DCmass_update_trends()                                                    
      7164:20110423:110459.998 __mem_malloc: skipped 0 asked 24 skip_min 4294967295 skip_max 0              
      7164:20110423:110459.998 [file:dbcache.c,line:2662] zbx_mem_malloc(): out of memory (requested 24 bytes)                                                                                                      
      7164:20110423:110459.998 [file:dbcache.c,line:2662] zbx_mem_malloc(): please increase TrendCacheSize configuration parameter
    Available trend cache values:

    Code:
    2011.Apr.23 11:00:00	61.465
    2011.Apr.23 10:57:11	92.9304
    It takes about 9 minutes before the crash occurs, so these statistics are not really useful. Can anyone help me solve this issue?

    Specs:
    Dual quad-core Xeon
    36 GB memory
    2x Intel 510 256GB SSD

    Code:
    Number of hosts (monitored/not monitored/templates)	248	121 / 1 / 126
    Number of items (monitored/disabled/not supported)	72903	71616 / 1174 / 113
    Number of triggers (enabled/disabled)[problem/unknown/ok]	2649	2620 / 29  [10 / 765 / 1845]
    Number of users (online)	14	1
    Required server performance, new values per second	254.56	 -
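
A side note on the kernel settings quoted in this post: on Linux, kernel.shmall is counted in pages while kernel.shmmax is counted in bytes, so the two numbers are not directly comparable. The sketch below converts both to bytes. The 4096-byte page size is an assumption (check `getconf PAGE_SIZE` on the actual host), and the function name is made up for illustration.

```python
# Assumption: 4096-byte pages, which is typical for x86 Linux but not guaranteed.
PAGE_SIZE = 4096


def shm_limits_in_bytes(shmall_pages, shmmax_bytes, page_size=PAGE_SIZE):
    """Return (system-wide shared memory limit, per-segment limit) in bytes."""
    # shmall is expressed in pages, shmmax is already in bytes.
    return shmall_pages * page_size, shmmax_bytes


# Values taken from the sysctl settings in the post above.
total, per_segment = shm_limits_in_bytes(2097152, 4294967296)
print(f"kernel.shmall: {total / 1024**3:.0f} GiB system-wide")
print(f"kernel.shmmax: {per_segment / 1024**3:.0f} GiB per segment")
```

Under that page-size assumption, neither limit looks close to a 1G cache, which suggests the shared memory settings were not the bottleneck here.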
  • attilla
    Junior Member
    Zabbix Certified Specialist
    • Feb 2011
    • 25

    #2
    Forgot to mention that I'm running 1.8.5 on Debian squeeze.
    Last edited by attilla; 23-04-2011, 14:05.


    • richlv
      Senior Member
      Zabbix Certified Trainer
      Zabbix Certified Specialist, Zabbix Certified Professional
      • Oct 2005
      • 3112

      #3
      first, you probably don't need such a large cachesize - monitor config cache (buffer) usage and set this appropriately.

      second, are those figures for the trend cache free or used percentage ?
      Zabbix 3.0 Network Monitoring book


      • attilla
        Junior Member
        Zabbix Certified Specialist
        • Feb 2011
        • 25

        #4
        I only increased the values after the crashes started, and I didn't jump to the maximum value at once. I figured I wouldn't need such a ridiculous amount, but I wanted to make sure that the crash wasn't caused by the mentioned setting.

        The numbers are the % free cache. So I'm pretty sure that there is enough cache. ;-)

        As you can see from the numbers, I don't have a huge number of hosts, but some hosts have a large number of items (1500-5000, all SNMP). These hosts are Brocade switch chassis with multiple linecards with 10GE or 1GE ports. I'm polling about 25 items per port. Each host is linked to multiple templates, one for each linecard.
        Last edited by attilla; 23-04-2011, 18:43.


        • attilla
          Junior Member
          Zabbix Certified Specialist
          • Feb 2011
          • 25

          #5
          Well, I debugged the issue even further and I think that there is a bug in the trend cache of some sort. I enabled only half of my hosts and what you can see is that it doesn't crash that soon, so I now have nice trend cache graphs (see attached image).

          With half of my large hosts disabled, the server kept running. When I saw the available trend cache drop, I increased the limit again from 128M to 1G, but as you can see, it makes no difference. I then enabled some extra hosts and the free trend cache dropped towards zero, resulting in the crash again.

          So I'm guessing that the trend cache isn't emptied quickly enough. Is there something I can change myself to address this?
          Attached Files
          Last edited by attilla; 24-04-2011, 00:37.


          • attilla
            Junior Member
            Zabbix Certified Specialist
            • Feb 2011
            • 25

            #6
            Ok, well, things changed a bit last night. After I saw it drop again and posted my last reply, I went to bed. But somehow it stayed at 1.38% free trend cache and the server was polling all night without any noticeable problems.

            The cache was still set to 1G, and I wanted to see if the server kept running after decreasing the value again. It did, with the same amount of free trend cache.

            So I checked what would happen if I added another two hosts and yes, it crashed again. So I'm pretty sure that there is some kind of bug, because if you look at the statistics (see attached) everything looks fine and well within specs.

            Current settings
            Code:
            CacheSize = 64M       
            TrendSizeCache = 64M  
            HistoryCacheSize = 16M
            Some timestamps when I did what:

            9:17 restarted server to see if it kept running and deactivated some external checks to see if that was the cause
            9:42 added two extra hosts
            9:56 crashed
            9:57 restarted without the two extra hosts
            10:22 restarted with decreased cache (256M)
            10:37 restarted with decreased cache (64M)
            11:09 added two extra hosts (different ones than last time)
            11:10 crashed
            11:12 restarted without extra hosts
            Attached Files


            • attilla
              Junior Member
              Zabbix Certified Specialist
              • Feb 2011
              • 25

              #7
              Ok, downgraded to 1.8.4, still crashes. Downgraded to 1.8.2, no crashes anymore. :-)

              So there is a bug somewhere. ;-)
              Last edited by attilla; 24-04-2011, 21:11.


              • richlv
                Senior Member
                Zabbix Certified Trainer
                Zabbix Certified Specialist, Zabbix Certified Professional
                • Oct 2005
                • 3112

                #8
                a few things

                1. cachesize option is for config cache only and its usage will be fairly constant if the zabbix config does not change much. monitor it and set this option with some room for growth. but that's not related to the problem you have

                2. trend cache. not to dive too deep into it, but you are running with the default - 4 megs. you are monitoring percentage, so you didn't spot this

                why ? check the config variable name...
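
To illustrate why a percentage graph can hide this: the same "percent free" reading means very different absolute amounts depending on the real cache size. The sketch below is only arithmetic, using the 4M default mentioned above, the intended 1G, and the 1.38% reading from post #6. The helper names are made up, and the K/M/G suffix handling is an assumption about how sizes are written in the config file.

```python
def parse_size(text):
    """Convert a size such as '4M' or '1G' to bytes (K/M/G as powers of 1024)."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3}
    text = text.strip().upper()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def free_bytes(cache_size, percent_free):
    """Absolute free space implied by a 'percent free' reading."""
    return parse_size(cache_size) * percent_free / 100


# Same percentage, very different headroom.
for size in ("4M", "1G"):
    print(f"{size}: 1.38% free is about {free_bytes(size, 1.38) / 1024:.0f} KiB")
```

Graphing the absolute free value next to the percentage would have shown right away that the configured size was never applied.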
                Zabbix 3.0 Network Monitoring book


                • attilla
                  Junior Member
                  Zabbix Certified Specialist
                  • Feb 2011
                  • 25

                  #9
                  Ok, that is so stupid... :-(
                  I must have written it wrong somewhere in the beginning.

                  Please, delete this thread.


                  • efrain02
                    Banned
                    • Apr 2011
                    • 81

                    #10
                    I don't understand what you did to fix this. Could you tell me, please? I have the same problem.
                    Thanks


                    • attilla
                      Junior Member
                      Zabbix Certified Specialist
                      • Feb 2011
                      • 25

                      #11
                      Originally posted by efrain02
                      I don't understand what you did to fix this. Could you tell me, please? I have the same problem.
                      Thanks
                      TrendSizeCache != TrendCacheSize :-)

                      I misspelled it. :-)
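
For anyone who wants to catch this kind of typo before it costs a weekend: a minimal sketch, assuming a small hand-written list of valid parameter names, that flags unknown names in a config file and suggests the closest known one. The list below is an illustrative subset, not the full set for any Zabbix version, and the function name is made up. The thread shows the misspelled name being silently ignored, which is why a check like this has to live outside the server.

```python
import difflib

# Assumption: illustrative subset of parameter names; a real check would use
# the complete list for the installed server version.
KNOWN_PARAMETERS = {
    "CacheSize",
    "TrendCacheSize",
    "HistoryCacheSize",
    "HistoryTextCacheSize",
}


def find_unknown_parameters(conf_text, known=KNOWN_PARAMETERS):
    """Return (line_number, name, suggestion) for each unrecognised name."""
    problems = []
    for number, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Name is whatever precedes '=' (or the first word if '=' is missing).
        parts = line.split("=", 1)[0].split()
        if not parts:
            continue
        name = parts[0]
        if name not in known:
            close = difflib.get_close_matches(name, sorted(known), n=1)
            problems.append((number, name, close[0] if close else None))
    return problems


sample = """\
CacheSize = 1G
TrendSizeCache = 1G
HistoryCacheSize = 128M
"""
for number, name, suggestion in find_unknown_parameters(sample):
    print(f"line {number}: unknown parameter {name!r}, did you mean {suggestion!r}?")
```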


                      • efrain02
                        Banned
                        • Apr 2011
                        • 81

                        #12
                        Ok. My problem is similar to yours; the difference is that mine asks me to increase the cache size. I've done that, but the log file still shows:

                        Code:
                         31239:20110425:062615.895 Zabbix Server stopped. Zabbix 1.8.4 (revision 16604).
                         21640:20110427:013904.047 Starting Zabbix Server. Zabbix 1.8.4 (revision 16604).
                         21640:20110427:013904.075 ****** Enabled features ******
                         21640:20110427:013904.075 SNMP monitoring:           YES
                         21640:20110427:013904.075 IPMI monitoring:            NO
                         21640:20110427:013904.075 WEB monitoring:            YES
                         21640:20110427:013904.075 Jabber notifications:      YES
                         21640:20110427:013904.076 Ez Texting notifications:  YES
                         21640:20110427:013904.076 ODBC:                       NO
                         21640:20110427:013904.076 SSH2 support:               NO
                         21640:20110427:013904.076 IPv6 support:               NO
                         21640:20110427:013904.076 ******************************
                         21657:20110427:013904.551 server #1 started [DB Cache]
                         21664:20110427:013904.589 server #8 started [Trapper]
                         21665:20110427:013904.589 server #9 started [Trapper]
                         21666:20110427:013904.590 server #10 started [Trapper]
                         21667:20110427:013904.595 server #11 started [Trapper]
                         21668:20110427:013904.598 server #12 started [Trapper]
                         21669:20110427:013904.602 server #13 started [ICMP pinger]
                         21670:20110427:013904.602 server #14 started [Alerter]
                         21671:20110427:013904.603 server #15 started [Housekeeper]
                         21671:20110427:013904.603 Executing housekeeper
                         21674:20110427:013904.608 server #17 started [HTTP Poller]
                         21673:20110427:013904.608 server #16 started [Timer]
                         21677:20110427:013904.620 server #19 started [DB Syncer]
                         21679:20110427:013904.623 server #20 started [DB Syncer]
                         21681:20110427:013904.623 server #21 started [DB Syncer]
                         21682:20110427:013904.623 server #22 started [DB Syncer]
                         21683:20110427:013904.624 server #23 started [Escalator]
                         21684:20110427:013904.624 server #24 started [Proxy Poller]
                         21640:20110427:013904.632 server #0 started [Watchdog]
                         21660:20110427:013905.088 server #4 started [Poller. SNMP:YES]
                         21675:20110427:013905.089 server #18 started [Discoverer. SNMP:YES]
                         21659:20110427:013905.092 server #3 started [Poller. SNMP:YES]
                         21662:20110427:013905.093 server #6 started [Poller. SNMP:YES]
                         21661:20110427:013905.098 server #5 started [Poller. SNMP:YES]
                         21658:20110427:013905.100 server #2 started [Poller. SNMP:YES]
                         21663:20110427:013905.105 server #7 started [Poller for unreachable hosts. SNMP:YES]
                         21657:20110427:013906.664 __mem_malloc: skipped 0 asked 40 skip_min 4294967295 skip_max 0
                         21657:20110427:013906.664 [file:dbconfig.c,line:1221] zbx_mem_malloc(): out of memory (requested 36 bytes).
                         21657:20110427:013906.664 [file:dbconfig.c,line:1221] zbx_mem_malloc(): please increase CacheSize configuration parameter.
                         21640:20110427:013906.668 One child process died (PID:21657,exitcode/signal:255). Exiting ...
                         21640:20110427:013908.687 Syncing history data...
                         21640:20110427:013908.796 Syncing history data... done.
                         21640:20110427:013908.796 Syncing trends data...
                         21640:20110427:013908.906 Syncing trends data... done.
                         21640:20110427:013908.907 Zabbix Server stopped. Zabbix 1.8.4 (revision 16604).
                         22069:20110427:014032.785 Starting Zabbix Server. Zabbix 1.8.4 (revision 16604).
                         22069:20110427:014032.786 ****** Enabled features ******
                         22069:20110427:014032.786 SNMP monitoring:           YES
                         22069:20110427:014032.786 IPMI monitoring:            NO
                         22069:20110427:014032.786 WEB monitoring:            YES
                         22069:20110427:014032.786 Jabber notifications:      YES
                         22069:20110427:014032.786 Ez Texting notifications:  YES
                         22069:20110427:014032.786 ODBC:                       NO
                         22069:20110427:014032.786 SSH2 support:               NO
                         22069:20110427:014032.786 IPv6 support:               NO
                         22069:20110427:014032.786 ******************************
                         22070:20110427:014032.976 server #1 started [DB Cache]
                         22071:20110427:014033.030 server #2 started [Poller. SNMP:YES]
                         22077:20110427:014033.072 server #8 started [Trapper]
                         22074:20110427:014033.079 server #5 started [Poller. SNMP:YES]
                         22078:20110427:014033.081 server #9 started [Trapper]
                         22079:20110427:014033.083 server #10 started [Trapper]
                         22080:20110427:014033.085 server #11 started [Trapper]
                         22081:20110427:014033.089 server #12 started [Trapper]
                         22082:20110427:014033.091 server #13 started [ICMP pinger]
                         22084:20110427:014033.098 server #14 started [Alerter]
                         22086:20110427:014033.099 server #15 started [Housekeeper]
                         22086:20110427:014033.100 Executing housekeeper
                         22088:20110427:014033.102 server #16 started [Timer]
                         22089:20110427:014033.104 server #17 started [HTTP Poller]
                         22072:20110427:014033.107 server #3 started [Poller. SNMP:YES]
                         22073:20110427:014033.123 server #4 started [Poller. SNMP:YES]
                         22090:20110427:014033.165 server #18 started [Discoverer. SNMP:YES]
                         22096:20110427:014033.167 server #19 started [DB Syncer]
                         22098:20110427:014033.178 server #20 started [DB Syncer]
                         22076:20110427:014033.224 server #7 started [Poller for unreachable hosts. SNMP:YES]
                         22075:20110427:014033.258 server #6 started [Poller. SNMP:YES]
                         22102:20110427:014033.271 server #21 started [DB Syncer]
                         22105:20110427:014033.273 server #22 started [DB Syncer]
                         22107:20110427:014033.283 server #23 started [Escalator]
                         22108:20110427:014033.283 server #24 started [Proxy Poller]
                         22069:20110427:014033.285 server #0 started [Watchdog]
                         22070:20110427:014034.399 __mem_malloc: skipped 0 asked 40 skip_min 4294967295 skip_max 0
                         22070:20110427:014034.399 [file:dbconfig.c,line:1221] zbx_mem_malloc(): out of memory (requested 36 bytes).
                         22070:20110427:014034.399 [file:dbconfig.c,line:1221] zbx_mem_malloc(): please increase CacheSize configuration parameter.
                         22069:20110427:014034.403 One child process died (PID:22070,exitcode/signal:255). Exiting ...
                         22069:20110427:014036.435 Syncing history data...
                         22069:20110427:014036.465 Syncing history data... done.
                         22069:20110427:014036.465 Syncing trends data...
                         22069:20110427:014036.488 Syncing trends data... done.
                         22069:20110427:014036.488 Zabbix Server stopped. Zabbix 1.8.4 (revision 16604).
                         22116:20110427:014045.958 Starting Zabbix Server. Zabbix 1.8.4 (revision 16604).
                         22116:20110427:014045.958 ****** Enabled features ******
                         22116:20110427:014045.959 SNMP monitoring:           YES
                         22116:20110427:014045.959 IPMI monitoring:            NO
                         22116:20110427:014045.959 WEB monitoring:            YES
                         22116:20110427:014045.959 Jabber notifications:      YES
                         22116:20110427:014045.959 Ez Texting notifications:  YES
                         22116:20110427:014045.959 ODBC:                       NO
                         22116:20110427:014045.959 SSH2 support:               NO
                         22116:20110427:014045.959 IPv6 support:               NO
                         22116:20110427:014045.959 ******************************
                         22117:20110427:014046.167 server #1 started [DB Cache]
                         22119:20110427:014046.248 server #3 started [Poller. SNMP:YES]
                         22118:20110427:014046.248 server #2 started [Poller. SNMP:YES]
                         22126:20110427:014046.251 server #10 started [Trapper]
                         22124:20110427:014046.252 server #8 started [Trapper]
                         22121:20110427:014046.264 server #5 started [Poller. SNMP:YES]
                         22125:20110427:014046.265 server #9 started [Trapper]
                         22127:20110427:014046.269 server #11 started [Trapper]
                         22120:20110427:014046.272 server #4 started [Poller. SNMP:YES]
                         22123:20110427:014046.285 server #7 started [Poller for unreachable hosts. SNMP:YES]
                         22129:20110427:014046.287 server #12 started [Trapper]
                         22130:20110427:014046.288 server #13 started [ICMP pinger]
                         22134:20110427:014046.292 server #14 started [Alerter]
                         22136:20110427:014046.295 server #15 started [Housekeeper]
                         22136:20110427:014046.295 Executing housekeeper
                         22138:20110427:014046.297 server #16 started [Timer]
                         22122:20110427:014046.314 server #6 started [Poller. SNMP:YES]
                         22141:20110427:014046.319 server #17 started [HTTP Poller]
                         22144:20110427:014046.361 server #19 started [DB Syncer]
                         22146:20110427:014046.373 server #20 started [DB Syncer]
                         22142:20110427:014046.374 server #18 started [Discoverer. SNMP:YES]
                         22147:20110427:014046.375 server #21 started [DB Syncer]
                         22151:20110427:014046.377 server #22 started [DB Syncer]
                         22153:20110427:014046.387 server #23 started [Escalator]
                         22116:20110427:014046.416 server #0 started [Watchdog]
                         22155:20110427:014046.420 server #24 started [Proxy Poller]
                         22117:20110427:014047.537 __mem_malloc: skipped 0 asked 40 skip_min 4294967295 skip_max 0
                         22117:20110427:014047.537 [file:dbconfig.c,line:1221] zbx_mem_malloc(): out of memory (requested 36 bytes).
                         22117:20110427:014047.537 [file:dbconfig.c,line:1221] zbx_mem_malloc(): please increase CacheSize configuration parameter.
                         22116:20110427:014047.541 One child process died (PID:22117,exitcode/signal:255). Exiting ...
                         22116:20110427:014049.569 Syncing history data...
                         22116:20110427:014049.569 Syncing history data... done.
                         22116:20110427:014049.569 Syncing trends data...
                         22116:20110427:014049.570 Syncing trends data... done.
                         22116:20110427:014049.570 Zabbix Server stopped. Zabbix 1.8.4 (revision 16604).
                         23745:20110427:015000.633 Starting Zabbix Server. Zabbix 1.8.4 (revision 16604).
                         23745:20110427:015000.633 ****** Enabled features ******
                         23745:20110427:015000.633 SNMP monitoring:           YES
                         23745:20110427:015000.633 IPMI monitoring:            NO
                         23745:20110427:015000.633 WEB monitoring:            YES
                         23745:20110427:015000.633 Jabber notifications:      YES
                         23745:20110427:015000.634 Ez Texting notifications:  YES
                         23745:20110427:015000.634 ODBC:                       NO
                         23745:20110427:015000.634 SSH2 support:               NO
                         23745:20110427:015000.635 IPv6 support:               NO
                         23745:20110427:015000.636 ******************************
                         23746:20110427:015000.703 server #1 started [DB Cache]
                         23748:20110427:015000.785 server #3 started [Poller. SNMP:YES]
                         23747:20110427:015000.786 server #2 started [Poller. SNMP:YES]
                         23753:20110427:015000.788 server #8 started [Trapper]
                         23750:20110427:015000.808 server #5 started [Poller. SNMP:YES]
                         23752:20110427:015000.811 server #7 started [Poller for unreachable hosts. SNMP:YES]
                         23754:20110427:015000.812 server #9 started [Trapper]
                         23751:20110427:015000.814 server #6 started [Poller. SNMP:YES]
                         23755:20110427:015000.815 server #10 started [Trapper]
                         23757:20110427:015000.819 server #11 started [Trapper]
                         23758:20110427:015000.820 server #12 started [Trapper]
                         23761:20110427:015000.822 server #13 started [ICMP pinger]
                         23763:20110427:015000.824 server #14 started [Alerter]
                         23768:20110427:015000.824 server #19 started [DB Syncer]
                         23745:20110427:015000.824 server #0 started [Watchdog]
                         23769:20110427:015000.824 server #20 started [DB Syncer]
                         23772:20110427:015000.827 server #23 started [Escalator]
                         23771:20110427:015000.827 server #22 started [DB Syncer]
                         23770:20110427:015000.827 server #21 started [DB Syncer]
                         23773:20110427:015000.827 server #24 started [Proxy Poller]
                         23766:20110427:015000.848 server #17 started [HTTP Poller]
                         23765:20110427:015000.851 server #16 started [Timer]
                         23764:20110427:015000.863 server #15 started [Housekeeper]
                         23764:20110427:015000.863 Executing housekeeper
                         23767:20110427:015000.906 server #18 started [Discoverer. SNMP:YES]
                         23749:20110427:015000.909 server #4 started [Poller. SNMP:YES]
                         23746:20110427:015002.029 __mem_malloc: skipped 0 asked 40 skip_min 4294967295 skip_max 0
                         23746:20110427:015002.029 [file:dbconfig.c,line:1221] zbx_mem_malloc(): out of memory (requested 36 bytes).
                         23746:20110427:015002.029 [file:dbconfig.c,line:1221] zbx_mem_malloc(): please increase CacheSize configuration parameter.
                         23745:20110427:015002.033 One child process died (PID:23746,exitcode/signal:255). Exiting ...
                         23745:20110427:015004.048 Syncing history data...
                         23745:20110427:015004.048 Syncing history data... done.
                         23745:20110427:015004.048 Syncing trends data...
                         23745:20110427:015004.048 Syncing trends data... done.
                         23745:20110427:015004.048 Zabbix Server stopped. Zabbix 1.8.4 (revision 16604).
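
With a log this long, it helps to pull out only the parameter names the server is asking for and compare them, character by character, with what is in the config file. A minimal sketch, assuming the hint always has the "please increase ... configuration parameter" wording seen in this thread (the function name is made up for illustration):

```python
import re
from collections import Counter

# \s+ tolerates the extra spaces that appear when a terminal wraps long lines.
HINT = re.compile(r"please\s+increase\s+(\w+)\s+configuration\s+parameter")


def requested_parameters(log_text):
    """Count how often each parameter is named in an 'increase ...' hint."""
    return Counter(HINT.findall(log_text))


# Two hint lines copied from the logs posted in this thread.
sample_log = """\
7164:20110423:110459.998 [file:dbcache.c,line:2662] zbx_mem_malloc(): please increase TrendCacheSize configuration parameter
21657:20110427:013906.664 [file:dbconfig.c,line:1221] zbx_mem_malloc(): please increase CacheSize configuration parameter.
"""
for name, count in requested_parameters(sample_log).items():
    print(f"{name}: requested {count} time(s)")
```

The names printed here are the exact spellings the server expects, so they can be pasted straight into the config file.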


                        • attilla
                          Junior Member
                          Zabbix Certified Specialist
                          • Feb 2011
                          • 25

                          #13
                          How much is it set to and what are the specs of your setup?


                          • efrain02
                            Banned
                            • Apr 2011
                            • 81

                            #14
                            Well the cache sizes are:

                            Code:
                            CacheSize = 64M       
                            TrendCacheSize = 64M  
                            HistoryCacheSize = 16M
                            And the setup of zabbix:
                            Code:
                            global $DB;
                            
                            $DB["TYPE"]             = 'MYSQL';
                            $DB["SERVER"]           = 'localhost';
                            $DB["PORT"]             = '0';
                            $DB["DATABASE"]         = 'the name of my db';
                            $DB["USER"]             = 'the username';
                            $DB["PASSWORD"]         = 'and the password';
                            $ZBX_SERVER             = 'localhost';
                            $ZBX_SERVER_PORT        = '10051';
                            
                            $IMAGE_FORMAT_DEFAULT   = IMAGE_FORMAT_PNG;
                            I can't publish the database, username and password. But they are correct.
                            Last edited by efrain02; 06-05-2011, 19:04.


                            • attilla
                              Junior Member
                              Zabbix Certified Specialist
                              • Feb 2011
                              • 25

                              #15
                              I see you nicely copied my typo. :-)

                              Please change TrendSizeCache into TrendCacheSize and your problem should be solved.
