Ridiculously low performance threshold in 1.8

  • untergeek
    Senior Member
    Zabbix Certified Specialist
    • Jun 2009
    • 512

    #1

    Some points of interest:
    Solaris 10 running on a SPARC-Enterprise-T5120 with 16G of RAM.
    Zabbix 1.8 (or trunk, it flunks with either)
    Oracle 10.2.0.1 (running on a remote host, accessed via GigE).

    Presently, zabbix_server.conf is left at the defaults for everything except the mandatory settings, like hostname and DB setup. We've tried it with differing values cranked up to the max and it doesn't make a bit of difference (which leads me to believe it's database performance related).

    Number of hosts (monitored/not monitored/templates): 89 (49 / 0 / 40)
    Number of items (monitored/disabled/not supported): 1694 (1677 / 0 / 17)
    Number of triggers (enabled/disabled) [true/unknown/false]: 1163 (1163 / 0) [5 / 810 / 348]
    Required server performance, new values per second: 19.013055555556

    You'd think a server like this could handle just about anything you could throw at it, but somehow that's not the case with Zabbix Server 1.8 and Oracle 10.2.0.1 on the backend.

    Preliminary data shows that we can sustain up to around 12 values per second before the system decides to simply not appear to update any more. After about 15 to 30 minutes, my QUEUE is pegged at 1100 for Zabbix Agent and 111 for Simple Checks in the "Longer than 10 minutes" column. I am overwhelmed by false positives on a huge number of triggers simply because the data which is coming in is not getting recorded or something. Even with debuglevel at 4 I'm only seeing SUCCESS for data incoming. It just doesn't seem to be getting in fast enough. Here's an example:
    Code:
     23266:20091228:220143.479 Get value from agent result: '2335078905'
     23266:20091228:220143.479 End of get_value():SUCCEED
     23266:20091228:220143.479 In calculate_item_nextcheck (1663,60,"",1262059303)
     23266:20091228:220143.479 End calculate_item_nextcheck (result:1262059363)
     23266:20091228:220143.479 In substitute_simple_macros (data:'vm.memory.size[free]')
     23266:20091228:220143.479 In get_value() key:'vm.memory.size[free]'
     23266:20091228:220143.480 In get_value_agent() host:'REDACTED' addr:'REDACTED' key:'vm.memory.size[free]'
     23266:20091228:220143.481 Sending [vm.memory.size[free]
    I am also flummoxed by how long it's taking to restart zabbix. First it takes forever to write the values out to the database and then it takes forever to start again after:

    Code:
     22157:20091228:213915.653 tr value [0] event_prev_value [2] event_last_status [0] new_value [2]
     22157:20091228:213915.653 Updating trigger
     22157:20091228:213915.653 Query [txnlev:1] [update triggers set value=2,lastchange=1262056518,error='Zabbix was restarted.' where triggerid=13645]
    All of the bajillion similar entries go by at the speed of slow. Is it really updating every trigger with "Zabbix was restarted"?

    We did NOT have any problems like this with 1.6.x. What gives? OCI was supposed to be faster than libsqlora8. It doesn't seem that way to me.

    I guess I want to know whether there is a) a way to improve Oracle performance, b) a magical configuration setting I'm missing, or c) something else?
  • Alexei
    Founder, CEO
    Zabbix Certified Trainer
    Zabbix Certified Specialist
    Zabbix Certified Professional
    • Sep 2004
    • 5654

    #2
    Zabbix 1.8 is supposed to run much faster than 1.6.x. The after-restart trigger update logic is exactly the same as in 1.6. Embedded hardware can easily handle 19 values per second; your box is capable of monitoring 50x more hosts, items and triggers (provided disk I/O is fast).
    Alexei Vladishev
    Creator of Zabbix, Product manager
    New York | Tokyo | Riga

    • untergeek
      Senior Member
      Zabbix Certified Specialist
      • Jun 2009
      • 512

      #3
      Thanks for the reply, Alexei. I just want to know why the discrepancy exists. We were handling 50 values per second with 1.6.8 and the server wasn't breaking a sweat. Why in 1.8 am I suffering? OCI vs. libsqlora8 can't explain this, can it? It was bound to the lib32 oracle libraries and so is the OCI 1.8 server. I just don't get it. I will paste in a screen capture of my queue so you can see how backed up it all is (I'm replying from my iPhone right now so I can't).

      • Alexei
        Founder, CEO
        Zabbix Certified Trainer
        Zabbix Certified Specialist
        Zabbix Certified Professional
        • Sep 2004
        • 5654

        #4
        Yes, OCI is supposed to work faster. I cannot answer your question without seeing more details.

        • untergeek
          Senior Member
          Zabbix Certified Specialist
          • Jun 2009
          • 512

          #5
          What details would you like? I will harvest anything I can get from the debug log.

          • untergeek
            Senior Member
            Zabbix Certified Specialist
            • Jun 2009
            • 512

            #6
            Here's the screen cap of my queue (attached).

            • untergeek
              Senior Member
              Zabbix Certified Specialist
              • Jun 2009
              • 512

              #7
              Configuration:
              Code:
              Configuration:
              
                Detected OS:           solaris2.10
                Install path:          /usr/local
                Compilation arch:      solaris
              
                Compiler:              /usr/bin/cc
                Compiler flags:        -I/usr/local/include -I/opt/sfw/include -I/opt/oracle/product/10.2.0.1/rdbms/public -I/opt/oracle/product/10.2.0.1/rdbms/demo       -I/usr/sfw/include -I/usr/local/include -I. -I/usr/local/include    -I/usr/local/include 
              
                Enable server:         yes
                With database:         Oracle
                WEB Monitoring via:    cURL
                Native Jabber:         no
                SNMP:                  net-snmp
                IPMI:                  no
                Linker flags:          -L/usr/local/include -L/opt/oracle/product/10.2.0.1/lib32 -R/usr/local/include -R/opt/oracle/product/10.2.0.1/lib32  -L/usr/local/lib   -L/opt/oracle/product/10.2.0.1/lib      -L/opt/sfw/lib -lcurl -L/usr/sfw/lib -lssl -lcrypto -lsocket -lnsl -lssl -lcrypto -lsocket -lnsl -ldl -lz  -L/usr/local/lib -L/usr/sfw/lib -L/usr/local/lib -lnetsnmp -lgen -lelf -lnsl -lsocket -lcrypto  -L/usr/local/lib -L/usr/sfw/lib -L/usr/local/lib -lnetsnmp -lgen -lelf -lnsl -lsocket -lcrypto 
                Libraries:             -lkvm -lm -lnsl -lkstat -lsocket  -lresolv -liconv  -lclntsh -lnnz10     -lcurl  -lnetsnmp  
              
                Enable proxy:          yes
                With database:         Oracle
                WEB Monitoring via:    cURL
                SNMP:                  net-snmp
                IPMI:                  no
                Linker flags:          -L/usr/local/include -L/opt/oracle/product/10.2.0.1/lib32 -R/usr/local/include -R/opt/oracle/product/10.2.0.1/lib32  -L/usr/local/lib   -L/opt/oracle/product/10.2.0.1/lib     -L/opt/sfw/lib -lcurl -L/usr/sfw/lib -lssl -lcrypto -lsocket -lnsl -lssl -lcrypto -lsocket -lnsl -ldl -lz  -L/usr/local/lib -L/usr/sfw/lib -L/usr/local/lib -lnetsnmp -lgen -lelf -lnsl -lsocket -lcrypto  -L/usr/local/lib -L/usr/sfw/lib -L/usr/local/lib -lnetsnmp -lgen -lelf -lnsl -lsocket -lcrypto 
                Libraries:             -lkvm -lm -lnsl -lkstat -lsocket  -lresolv -liconv  -lclntsh -lnnz10    -lcurl  -lnetsnmp  
              
                Enable agent:          yes
                Linker flags:          -L/usr/local/include -L/opt/oracle/product/10.2.0.1/lib32 -R/usr/local/include -R/opt/oracle/product/10.2.0.1/lib32  -L/usr/local/lib 
                Libraries:             -lkvm -lm -lnsl -lkstat -lsocket  -lresolv -liconv
              
                LDAP support:          no
                IPv6 support:          no
              We're not using any proxy for monitoring. All hosts are directly reachable by the Zabbix Server. I merely compiled it in case we wanted it in the future.
              Last edited by untergeek; 29-12-2009, 16:33. Reason: Appending comments

              • untergeek
                Senior Member
                Zabbix Certified Specialist
                • Jun 2009
                • 512

                #8
                Zabbix Shutdown time

                How long should it take to shut down a Zabbix 1.8 server?

                I understand that it is performing history syncing. How long should it take for this to complete?

                Here's how long it takes for the above server:
                Code:
                  2945:20091229:092625.677 One child process died (PID:3236). Exiting ...
                  2945:20091229:092627.769 Syncing history data...
                  2945:20091229:092923.772 Syncing history data...3.637686%
                  2945:20091229:093310.670 Syncing history data...7.275373%
                  2945:20091229:093705.617 Syncing history data...10.913059%
                  2945:20091229:093939.803 Syncing history data...14.550746%
                  2945:20091229:094218.742 Syncing history data...18.188432%
                  2945:20091229:094613.130 Syncing history data...21.826119%
                  2945:20091229:095004.805 Syncing history data...25.463805%
                  2945:20091229:095250.587 Syncing history data...29.101491%
                  2945:20091229:095520.293 Syncing history data...32.739178%
                  2945:20091229:095912.743 Syncing history data...36.376864%
                I'm not even going to bother making you wait for the end of this. There are simply not enough items for this to take this long, are there?
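
One rough way to put a number on "how long should this take" is to extrapolate from the progress lines. The sketch below assumes the `PID:YYYYMMDD:HHMMSS.mmm` prefix seen in the log above and assumes progress is linear, which it may not be; `parse_ts` and `projected_total_seconds` are illustrative names, not part of Zabbix.

```python
from datetime import datetime

def parse_ts(line: str) -> datetime:
    """Parse the 'PID:YYYYMMDD:HHMMSS.mmm' prefix of a log line."""
    _pid, date, time = line.split()[0].split(":")
    return datetime.strptime(f"{date} {time}", "%Y%m%d %H%M%S.%f")

def projected_total_seconds(start_line: str, progress_line: str) -> float:
    """Extrapolate total sync time from one progress line (linear assumption)."""
    # The percentage is whatever follows the last '...' on the line.
    percent = float(progress_line.strip().rsplit("...", 1)[1].rstrip("%"))
    elapsed = (parse_ts(progress_line) - parse_ts(start_line)).total_seconds()
    return elapsed / (percent / 100.0)

# First and last "Syncing history data" lines from the log above.
start = "2945:20091229:092627.769 Syncing history data..."
latest = "2945:20091229:095912.743 Syncing history data...36.376864%"

projected = projected_total_seconds(start, latest)
print(f"projected total sync time: {projected / 60:.0f} min")
```

On these two lines the projection comes out at roughly 90 minutes for the whole sync, under the strong assumption that the rate stays constant.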

                • untergeek
                  Senior Member
                  Zabbix Certified Specialist
                  • Jun 2009
                  • 512

                  #9
                  Here's the complete story.
                  It took from 9:26AM until 10:30AM to sync history data.

                  This can't be right.

                  Code:
                    2945:20091229:092625.677 One child process died (PID:3236). Exiting ...
                    2945:20091229:092627.769 Syncing history data...
                    2945:20091229:092923.772 Syncing history data...3.637686%
                    2945:20091229:093310.670 Syncing history data...7.275373%
                    2945:20091229:093705.617 Syncing history data...10.913059%
                    2945:20091229:093939.803 Syncing history data...14.550746%
                    2945:20091229:094218.742 Syncing history data...18.188432%
                    2945:20091229:094613.130 Syncing history data...21.826119%
                    2945:20091229:095004.805 Syncing history data...25.463805%
                    2945:20091229:095250.587 Syncing history data...29.101491%
                    2945:20091229:095520.293 Syncing history data...32.739178%
                    2945:20091229:095912.743 Syncing history data...36.376864%
                    2945:20091229:100307.935 Syncing history data...40.014551%
                    2945:20091229:100541.387 Syncing history data...43.652237%
                    2945:20091229:100800.715 Syncing history data...47.289924%
                    2945:20091229:101212.558 Syncing history data...50.927610%
                    2945:20091229:101614.931 Syncing history data...54.565296%
                    2945:20091229:101933.392 Syncing history data...58.202983%
                    2945:20091229:102131.802 Syncing history data...61.840669%
                    2945:20091229:102508.659 Syncing history data...65.478356%
                    2945:20091229:102854.285 Syncing history data...69.116042%
                    2945:20091229:102904.472 Syncing history data...70.163696%
                    2945:20091229:102915.612 Syncing history data...75.489269%
                    2945:20091229:102926.746 Syncing history data...79.927246%
                    2945:20091229:102936.797 Syncing history data...83.477628%
                    2945:20091229:102946.252 Syncing history data...87.519098%
                    2945:20091229:102956.350 Syncing history data...92.204438%
                    2945:20091229:103006.552 Syncing history data...97.224445%
                    2945:20091229:103012.663 Syncing history data...done.
                    2945:20091229:103012.664 Syncing trends data...
                    2945:20091229:103018.604 Syncing trends data...done.
                    2945:20091229:103018.606 Zabbix Server stopped.
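
The elapsed time can be read straight off the log timestamps. This is a minimal sketch, assuming the `PID:YYYYMMDD:HHMMSS.mmm` prefix seen in the log above; `parse_ts` and `elapsed_seconds` are illustrative helper names, not Zabbix tools.

```python
from datetime import datetime

def parse_ts(line: str) -> datetime:
    """Parse the 'PID:YYYYMMDD:HHMMSS.mmm' prefix of a log line."""
    _pid, date, time = line.split()[0].split(":")
    return datetime.strptime(f"{date} {time}", "%Y%m%d %H%M%S.%f")

def elapsed_seconds(first_line: str, last_line: str) -> float:
    """Seconds between the timestamps of two log lines."""
    return (parse_ts(last_line) - parse_ts(first_line)).total_seconds()

# Start and end of the history sync, copied from the log above.
start = "2945:20091229:092627.769 Syncing history data..."
done = "2945:20091229:103012.663 Syncing history data...done."

secs = elapsed_seconds(start, done)
print(f"history sync took {secs:.0f} s ({secs / 60:.1f} min)")
```

For the log above this gives about 3825 seconds, consistent with the 9:26AM to 10:30AM window stated at the top of the post.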

                  • chivo
                    Junior Member
                    • Mar 2009
                    • 11

                    #10
                    Is this the same database and hardware you used with Zabbix 1.6.x? If you upgraded, did you follow the upgrade procedure for the database changes? Primarily I'm thinking about having to drop specific indexes and create new ones.

                    Second, is this Oracle database used for any other applications? What is the back-end storage like? Even with 1.6.x you should be able to handle much more than 50 new values per second. Using an HP Intel server with a similar amount of memory, I can run the Zabbix server and MySQL DB and have 538 hosts with 56496 items checked, and I'm not really pushing the system. (That's about 260 new values per second.)

                    Given that, I would check that your Oracle configuration is optimized and disk IO for your DB is good.
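
As a back-of-the-envelope check on the figures quoted in this thread, the new-values-per-second rate can be turned into an average update interval per item. The sketch assumes the rate is simply the item count divided by the average interval; the function name is illustrative.

```python
def average_interval_seconds(items: int, values_per_second: float) -> float:
    """Average update interval implied by an item count and a values-per-second rate."""
    return items / values_per_second

# 56496 items at about 260 new values per second (the MySQL setup in this post).
mysql_interval = average_interval_seconds(56496, 260)

# 1677 monitored items at 19.013055555556 new values per second (first post).
oracle_interval = average_interval_seconds(1677, 19.013055555556)

print(f"MySQL setup:  one value per item every {mysql_interval:.0f} s on average")
print(f"Oracle setup: one value per item every {oracle_interval:.0f} s on average")
```

That works out to roughly 217 seconds per item for the MySQL setup and roughly 88 seconds per item for the Oracle setup, so the Oracle server polls each item more often but still has to store far fewer values per second overall.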

                    • untergeek
                      Senior Member
                      Zabbix Certified Specialist
                      • Jun 2009
                      • 512

                      #11
                      It is the same hardware and Oracle server. We even started over from scratch with a clean schema for 1.8, just to be sure. It's not from having plugged-up indexes.

                      • ericgearhart
                        Senior Member
                        • Jan 2009
                        • 115

                        #12
                        Hmmm, this really feels like a possible bug in the Oracle driver to me, especially if it was changed from 1.6 -> 1.8.

                        If I had an extra Oracle box lying around I'd try to help... unfortunately my experience in DBA-related things is limited to MySQL and Postgres (and a tiny bit of MS SQL).

                        • untergeek
                          Senior Member
                          Zabbix Certified Specialist
                          • Jun 2009
                          • 512

                          #13
                          They say they've tried with Oracle 11g. Can anyone running Oracle 11g on Solaris confirm that they are not experiencing this problem?

                          • ericgearhart
                            Senior Member
                            • Jan 2009
                            • 115

                            #14
                            There's a "free" (as in beer) Oracle edition similar to MS SQL Express, aptly titled "Oracle Express Edition"

                            "Express Edition ('Oracle Database XE'), introduced in 2005, offers Oracle 10g free to distribute on Windows and Linux platforms. It has a footprint of only 150 MB and is restricted to the use of a single CPU, a maximum of 4 GB of user data. Although it can install on a server with any amount of memory, it uses a maximum of 1 GB. Support for this version comes exclusively through on-line forums and not through Oracle support." (from the WP article)

                            ... I don't see an Express Edition for 11g though. If there were an 11g Express version I'd be tempted to throw up an 11g Express Edition Linux VM and test this...

                            • untergeek
                              Senior Member
                              Zabbix Certified Specialist
                              • Jun 2009
                              • 512

                              #15
                              Thanks for the thought. Yeah, we're running full enterprise Oracle. I'd probably need an apples to apples comparison. I might be able to convince my boss to run an 11g install somewhere on our cluster.
