%CPU is 99. Many Host Items are not collected.

  • Robert Wagnon
    Member
    • Jan 2008
    • 47

    #1

    %CPU is 99. Many Host Items are not collected.

    We have found that one of the zabbix_server processes jumps from around 4% CPU utilization to 99%. This was a concern, but we ignored it until we noticed that the system stops collecting some of our Host Item data. Most Host Items continue to be collected, but some just stop. This is most evident when we look at a Last Week Graph of something like temperature.


    I have pasted information about the zabbix_server version, top, strace, and zabbix_server.log below. Any help is greatly appreciated!


    Here is the output from zabbix_server --version:

    administrator@MONITORING1:~$ zabbix_server --version
    ZABBIX Server (daemon) v1.4.5 (25 March 2008)
    Compilation time: Apr 21 2008 15:49:43


    Here is the output from the Linux top command:

    top - 06:00:32 up 2 days, 8:38, 1 user, load average: 2.24, 2.03, 2.05
    Tasks: 98 total, 4 running, 94 sleeping, 0 stopped, 0 zombie
    Cpu(s): 3.8%us, 41.8%sy, 16.5%ni, 17.2%id, 18.5%wa, 1.5%hi, 0.7%si, 0.0%st
    Mem: 2075908k total, 2021856k used, 54052k free, 126684k buffers
    Swap: 2947888k total, 96k used, 2947792k free, 1686916k cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    2068 zabbix 30 5 10576 2968 1852 R 98 0.1 461:00.16 zabbix_server
    3970 mysql 18 0 135m 47m 5328 S 19 2.4 694:52.93 mysqld
    2052 zabbix 20 5 12568 5176 1984 S 1 0.2 3:03.50 zabbix_server
    2053 zabbix 20 5 12496 5108 1984 S 1 0.2 2:48.90 zabbix_server
    2050 zabbix 22 5 12496 5120 1980 R 1 0.2 2:55.31 zabbix_server
    2054 zabbix 20 5 9904 1560 852 S 1 0.1 1:00.41 zabbix_server
    2049 zabbix 20 5 12568 5196 1984 S 0 0.3 3:13.47 zabbix_server
    2051 zabbix 20 5 12640 5256 1984 S 0 0.3 3:16.95 zabbix_server
    2056 zabbix 20 5 9904 1560 852 S 0 0.1 1:01.65 zabbix_server
    2058 zabbix 20 5 9904 1560 852 S 0 0.1 1:01.30 zabbix_server
    2076 zabbix 20 5 10008 2612 1820 S 0 0.1 0:08.03 zabbix_server
    4198 zabbix 20 5 4392 872 604 S 0 0.0 3:50.86 zabbix_agentd
    1 root 18 0 2948 1852 532 S 0 0.1 0:04.31 init


    Here is a sample of strace -p 2068 (the PID listed above):

    gettimeofday({1209899083, 753014}, NULL) = 0
    select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
    recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
    gettimeofday({1209899083, 753969}, NULL) = 0
    select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
    recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
    gettimeofday({1209899083, 754993}, NULL) = 0
    select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
    recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
    gettimeofday({1209899083, 755837}, NULL) = 0
    select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
    recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
    gettimeofday({1209899083, 756681}, NULL) = 0
    select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
    recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
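Reading the trace above: select() is called with a zero timeout and reports fd 0 as readable, recvmsg() on fd 0 then fails with ENOTSOCK, and the process retries right away without ever blocking. That pattern matches a process pinned near 100% CPU. The following is only a rough Python sketch for measuring the loop from a saved strace sample. It is not part of Zabbix, and the helper names (`loop_timestamps`, `iteration_gaps`, `count_failures`) are made up for illustration.

```python
import re

# The strace sample from the post, copied verbatim.
STRACE_SAMPLE = """\
gettimeofday({1209899083, 753014}, NULL) = 0
select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
gettimeofday({1209899083, 753969}, NULL) = 0
select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
gettimeofday({1209899083, 754993}, NULL) = 0
select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
gettimeofday({1209899083, 755837}, NULL) = 0
select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
gettimeofday({1209899083, 756681}, NULL) = 0
select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
"""

# Matches the {seconds, microseconds} pair that strace prints for gettimeofday().
GETTIME_RE = re.compile(r"gettimeofday\(\{(\d+), (\d+)\}")


def loop_timestamps(trace):
    """Return every gettimeofday() result in the trace as integer microseconds."""
    return [int(sec) * 1_000_000 + int(usec)
            for sec, usec in GETTIME_RE.findall(trace)]


def iteration_gaps(trace):
    """Return the microseconds elapsed between consecutive loop iterations."""
    stamps = loop_timestamps(trace)
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


def count_failures(trace, errno_name="ENOTSOCK"):
    """Count syscalls in the trace that returned -1 with the given errno name."""
    return sum(1 for line in trace.splitlines() if f"= -1 {errno_name}" in line)


if __name__ == "__main__":
    print("microseconds per iteration:", iteration_gaps(STRACE_SAMPLE))
    print("failed recvmsg calls:", count_failures(STRACE_SAMPLE))
```

On this sample the gaps come out at roughly one millisecond each while being traced, and every recvmsg() call fails. This only describes the symptom. It does not say why fd 0 is being treated as a socket.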


    Here is a sample of zabbix_server.log:

    administrator@MONITORING1:~$ tail /tmp/zabbix_server.log
    2050:20080504:060636 Expression [{18698}>0] cannot be evaluated [Unable to get value for functionid [18698]]
    2050:20080504:060636 Expression [{15579}>0] cannot be evaluated [Unable to get value for functionid [15579]]
    2050:20080504:060636 Expression [{12566}>0] cannot be evaluated [Unable to get value for functionid [12566]]
    2051:20080504:060637 Expression [{12387}>0] cannot be evaluated [Unable to get value for functionid [12387]]
    2051:20080504:060637 Expression [{15400}>100] cannot be evaluated [Unable to get value for functionid [15400]]
    2051:20080504:060637 Expression [{12567}>150000] cannot be evaluated [Unable to get value for functionid [12567]]
    2051:20080504:060637 Expression [{15580}>0] cannot be evaluated [Unable to get value for functionid [15580]]
    2051:20080504:060637 Expression [{13227}>0] cannot be evaluated [Unable to get value for functionid [13227]]
    2051:20080504:060638 Expression [{18642}>100] cannot be evaluated [Unable to get value for functionid [18642]]
    2051:20080504:060638 Expression [{13407}>150000] cannot be evaluated [Unable to get value for functionid [13407]]
  • Robert Wagnon
    Member
    • Jan 2008
    • 47

    #2
    Tried setting StartTrappers=50

    Also tried 25, but it didn't help.

    • Robert Wagnon
      Member
      • Jan 2008
      • 47

      #3
      md5sum of zabbix-1.4.5.tar.gz

      I understand that a bad 1.4.5 download might cause trouble. Here is our md5sum:

      md5sum zabbix-1.4.5.tar.gz
      f87d73852fdab33f99beebfd16c21c63 zabbix-1.4.5.tar.gz

      • Robert Wagnon
        Member
        • Jan 2008
        • 47

        #4
        Occurs on multiple Zabbix Servers

        We see this same behavior on the following configurations:

        vmware 1.0.4 Ubuntu 7.10
        vmware 2.0 beta 2 Ubuntu 7.10
        vmware 2.0 beta 2 Ubuntu 8.04
        HP DL360 G3 Ubuntu 7.10
        HP DL360 G3 Ubuntu 8.04

        • Robert Wagnon
          Member
          • Jan 2008
          • 47

          #5
          Trying nightly build

          I'm getting desperate, so I'm trying the nightly build to see if something has been fixed...

          1.4.5 Build 5717. No luck. Still problems with zabbix_server.
          Last edited by Robert Wagnon; 21-05-2008, 03:23.

          • boy01
            Junior Member
            • Dec 2007
            • 24

            #6
            Originally posted by Robert Wagnon
            This is most evident when we look at a Last Week Graph of something like temperature.
            So, how do you get temperature from hosts running zabbix_agent?
            You can test this with zabbix_get from zabbix_server host.

            zabbix_get -s client-ip -p zabbix-port -k temp-item
            E.g., on your zabbix_server do:
            zabbix_get -s client-ip-not-getting-temp-from -k temp-metric
            Does it get the temperature from the agents whose temperature isn't showing in Zabbix?

            I don't have "temp-metric" compiled in my 1.4.4 agents, so I
            can't say what that last parameter is exactly. You should find it from
            your Configuration/Items page.
            Oops, it's there: sensor[temp1] (or temp2, temp3).
            So, "zabbix_get -s client-ip -k sensor[temp1]" should work, but I believe
            you also need support for that on your host (e.g. a configured lm_sensors package?).

            Originally posted by Robert Wagnon
            Here is a sample of zabbix_server.log:

            administrator@MONITORING1:~$ tail /tmp/zabbix_server.log
            2050:20080504:060636 Expression [{18698}>0] cannot be evaluated [Unable to get value for functionid [18698]]
            2050:20080504:060636 Expression [{15579}>0] cannot be evaluated [Unable to get value for functionid [15579]]
            ...
            You could check:
            select * from functions where functionid = 18698; /* to get itemid */
            select * from items where itemid = "itemid-from-above";
            Look at hostid, description and key_.
            select * from hosts where hostid = "hostid-from-above";

            Now you should know which function isn't available for trigger evaluation,
            and on which host. These errors are quite normal, imho, because not all items
            are available on every host (e.g. sendmail isn't running on every host).
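With many failing function IDs in the log, the three lookups above can be collapsed into one joined query. Below is a minimal Python sketch, assuming the log format shown in the first post. It pulls the IDs out of the log text and prints a query to paste into the mysql client. The helper names are hypothetical, and the `hosts.host` column is an assumption about the schema. The other column names come from the queries above.

```python
import re

# A few lines copied from the zabbix_server.log sample in the first post.
LOG_SAMPLE = """\
2050:20080504:060636 Expression [{18698}>0] cannot be evaluated [Unable to get value for functionid [18698]]
2050:20080504:060636 Expression [{15579}>0] cannot be evaluated [Unable to get value for functionid [15579]]
2051:20080504:060637 Expression [{15400}>100] cannot be evaluated [Unable to get value for functionid [15400]]
"""

FUNCTIONID_RE = re.compile(r"Unable to get value for functionid \[(\d+)\]")


def failing_function_ids(log_text):
    """Return the distinct function IDs named in the log, in ascending order."""
    return sorted({int(fid) for fid in FUNCTIONID_RE.findall(log_text)})


def lookup_query(function_ids):
    """Build one SELECT that maps function IDs to their items and hosts.

    Assumption: hosts has a `host` column holding the host name.
    """
    if not function_ids:
        raise ValueError("no function IDs given")
    # int() guards against anything non-numeric ending up in the SQL text.
    id_list = ", ".join(str(int(fid)) for fid in function_ids)
    return (
        "SELECT f.functionid, i.itemid, i.description, i.key_, h.hostid, h.host\n"
        "FROM functions f\n"
        "JOIN items i ON i.itemid = f.itemid\n"
        "JOIN hosts h ON h.hostid = i.hostid\n"
        f"WHERE f.functionid IN ({id_list});"
    )


if __name__ == "__main__":
    print(lookup_query(failing_function_ids(LOG_SAMPLE)))
```

Sorting the result by host should show quickly whether the failures cluster on a few hosts or on one item type.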
            Last edited by boy01; 21-05-2008, 09:40. Reason: sensor item

            • Robert Wagnon
              Member
              • Jan 2008
              • 47

              #7
              Temperature

              Although it isn't really the topic of this post, I get temperature from special equipment like Dell servers by checking the Dell MIB specified OID.

              • Robert Wagnon
                Member
                • Jan 2008
                • 47

                #8
                Great info!

                Thank you for your suggestions. These are all great.

                I've gone through and disabled any "Not Supported" Items. Next, I'll disable any affected Triggers. I'm doing this to reduce known response failures that might be triggering a bug in Zabbix that locks up the zabbix_server process.

                I ran the SQL queries you suggested (which worked perfectly), but it appears that there is data for all the Triggers that fail due to lack of Item data. I don't understand that. I'm hoping the efforts described above will reduce these errors (on the off chance that I can bypass the Zabbix bug I'm experiencing).

                • boy01
                  Junior Member
                  • Dec 2007
                  • 24

                  #9
                  Originally posted by Robert Wagnon
                  Although it isn't really the topic of this post, I get temperature from special equipment like Dell servers by checking the Dell MIB specified OID.
                  Hmm...so, how does zabbix_server get this info?
                  Do you use UserParameter= in zabbix_agentd.conf to get it or
                  does zabbix_server host collect (how?) temp info remotely?
                  Did you try zabbix_get to verify the temp item is really available
                  (if zabbix_agent is being used)?

                  I think there have been some problems w/ UserParameter usage,
                  but I really don't know the current status.

                  • Robert Wagnon
                    Member
                    • Jan 2008
                    • 47

                    #10
                    Temperature

                    The Dell OpenManage software responds to SNMP requests. It answers the special Dell question "What is your temperature" (OID .1.3.6.1.4.1.674.10892.1.700.20.1.6.1.5).

                    It is basic SNMP. No UserParameters involved.

                    • boy01
                      Junior Member
                      • Dec 2007
                      • 24

                      #11
                      So, my question is: how does zabbix_server get that info?
                      I.e., how does it get into the Zabbix database...

                      Basic SNMP gets won't update the temp data in zabbix DB.

                      EDIT: Oops, sorry for my ignorance. I've never used SNMP OID items!
                      Just RTFM... You _can_ use snmp OIDs in items. I'm learning here, too.
                      Last edited by boy01; 22-05-2008, 14:10.

                      • Robert Wagnon
                        Member
                        • Jan 2008
                        • 47

                        #12
                        Next effort to fix...

                        The error logs were showing all those missing functions, and it was clear from the "select *" queries that they were all SNMP related. So, I've disabled almost all SNMP (except for one port I have to chart, and temperatures).

                        The log now shows far fewer entries. I hope this clears it up, then I'll start reactivating things one at a time...
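Re-enabling devices one at a time works, but with a long device list a bisection gets there in fewer rounds. Here is a hedged Python sketch of that idea. It assumes exactly one device is responsible, and that you can tell, after enabling a subset and waiting, whether the server still behaves. The `server_healthy` callback stands in for that manual check, and all names here are hypothetical.

```python
from typing import Callable, List, Sequence


def find_culprit(devices: Sequence[str],
                 server_healthy: Callable[[List[str]], bool]) -> str:
    """Bisect `devices` to find the single one that breaks polling.

    server_healthy(enabled) must return True when the server behaves
    normally with only the devices in `enabled` being polled.
    Assumption: exactly one device in `devices` causes the problem.
    """
    if not devices:
        raise ValueError("device list is empty")
    candidates = list(devices)
    while len(candidates) > 1:
        middle = len(candidates) // 2
        first_half = candidates[:middle]
        if server_healthy(first_half):
            # First half is clean, so the culprit is in the second half.
            candidates = candidates[middle:]
        else:
            candidates = first_half
    return candidates[0]


if __name__ == "__main__":
    # Stand-in for the manual check: pretend "device-d" is the bad one.
    devices = ["device-a", "device-b", "device-c", "device-d", "device-e"]
    print(find_culprit(devices, lambda enabled: "device-d" not in enabled))
```

Each round means enabling a subset and watching top for long enough to be sure, so the saving over one-at-a-time only matters when there are many devices.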

                        • Robert Wagnon
                          Member
                          • Jan 2008
                          • 47

                          #13
                          Resolution

                          We determined that our SNMP capable Ricoh printer causes Zabbix to freak out. We've removed that device from polling and now everything works.

                          • boy01
                            Junior Member
                            • Dec 2007
                            • 24

                            #14
                            Originally posted by Robert Wagnon
                            We determined that our SNMP capable Ricoh printer causes Zabbix to freak out. We've removed that device from polling and now everything works.
                            What a strange fix! Zabbix should handle any snmp device.
                            Would be good to find out the actual cause from that device's
                            snmp data...
