%CPU is 99 and some monitoring data is not collected

  • Robert Wagnon
    Member
    • Jan 2008
    • 47

    #1

    %CPU is 99 and some monitoring data is not collected

    We have found that one of the zabbix_server processes will jump from around 4% CPU utilization to 99%. This was a concern in itself, but we ignored it until we noticed that the system stops collecting data for some of our host items. Most items keep collecting, but some simply stop. This is most evident in a last-week graph of something like temperature.


    I have pasted information about the zabbix_server version, top, strace, and zabbix_server.log below. Any help is greatly appreciated!


    Here is the output from zabbix_server --version:

    administrator@MONITORING1:~$ zabbix_server --version
    ZABBIX Server (daemon) v1.4.5 (25 March 2008)
    Compilation time: Apr 21 2008 15:49:43


    Here is the output from the Linux top command:

    top - 06:00:32 up 2 days, 8:38, 1 user, load average: 2.24, 2.03, 2.05
    Tasks: 98 total, 4 running, 94 sleeping, 0 stopped, 0 zombie
    Cpu(s): 3.8%us, 41.8%sy, 16.5%ni, 17.2%id, 18.5%wa, 1.5%hi, 0.7%si, 0.0%st
    Mem: 2075908k total, 2021856k used, 54052k free, 126684k buffers
    Swap: 2947888k total, 96k used, 2947792k free, 1686916k cached

      PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
     2068 zabbix 30  5 10576 2968 1852 R   98  0.1 461:00.16 zabbix_server
     3970 mysql  18  0  135m  47m 5328 S   19  2.4 694:52.93 mysqld
     2052 zabbix 20  5 12568 5176 1984 S    1  0.2   3:03.50 zabbix_server
     2053 zabbix 20  5 12496 5108 1984 S    1  0.2   2:48.90 zabbix_server
     2050 zabbix 22  5 12496 5120 1980 R    1  0.2   2:55.31 zabbix_server
     2054 zabbix 20  5  9904 1560  852 S    1  0.1   1:00.41 zabbix_server
     2049 zabbix 20  5 12568 5196 1984 S    0  0.3   3:13.47 zabbix_server
     2051 zabbix 20  5 12640 5256 1984 S    0  0.3   3:16.95 zabbix_server
     2056 zabbix 20  5  9904 1560  852 S    0  0.1   1:01.65 zabbix_server
     2058 zabbix 20  5  9904 1560  852 S    0  0.1   1:01.30 zabbix_server
     2076 zabbix 20  5 10008 2612 1820 S    0  0.1   0:08.03 zabbix_server
     4198 zabbix 20  5  4392  872  604 S    0  0.0   3:50.86 zabbix_agentd
        1 root   18  0  2948 1852  532 S    0  0.1   0:04.31 init


    Here is a sample of strace -p 2068 (the PID listed above):

    gettimeofday({1209899083, 753014}, NULL) = 0
    select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
    recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
    gettimeofday({1209899083, 753969}, NULL) = 0
    select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
    recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
    gettimeofday({1209899083, 754993}, NULL) = 0
    select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
    recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
    gettimeofday({1209899083, 755837}, NULL) = 0
    select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
    recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
    gettimeofday({1209899083, 756681}, NULL) = 0
    select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
    recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
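
A minimal Python sketch of how the trace above could be used to estimate how fast the loop spins. It assumes that each gettimeofday() call marks the start of one pass through the loop; the names `STRACE_SAMPLE` and `iteration_gaps_us` are made up for illustration. strace itself slows the traced process, so the untraced rate is probably higher than what this reports.

```python
import re

# Lines copied from the strace sample above.
STRACE_SAMPLE = """\
gettimeofday({1209899083, 753014}, NULL) = 0
select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
gettimeofday({1209899083, 753969}, NULL) = 0
select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
gettimeofday({1209899083, 754993}, NULL) = 0
select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
gettimeofday({1209899083, 755837}, NULL) = 0
select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
recvmsg(0, 0xbfe58bc8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
gettimeofday({1209899083, 756681}, NULL) = 0
"""

# Captures the seconds and microseconds that strace prints for gettimeofday().
GETTIMEOFDAY = re.compile(r"gettimeofday\(\{(\d+), (\d+)\}")


def iteration_gaps_us(trace: str) -> list[int]:
    """Return the microsecond gaps between consecutive gettimeofday() calls."""
    stamps = [int(sec) * 1_000_000 + int(usec)
              for sec, usec in GETTIMEOFDAY.findall(trace)]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


gaps = iteration_gaps_us(STRACE_SAMPLE)
mean_gap = sum(gaps) / len(gaps)
print(f"gaps between passes (us): {gaps}")
print(f"mean gap: {mean_gap:.0f} us, about {1_000_000 / mean_gap:.0f} passes per second")
```

On this sample every gap is close to one millisecond. The select() call uses a zero timeout and never blocks, which fits a process that is spinning rather than waiting for work.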


    Here is a sample of zabbix_server.log:

    administrator@MONITORING1:~$ tail /tmp/zabbix_server.log
    2050:20080504:060636 Expression [{18698}>0] cannot be evaluated [Unable to get value for functionid [18698]]
    2050:20080504:060636 Expression [{15579}>0] cannot be evaluated [Unable to get value for functionid [15579]]
    2050:20080504:060636 Expression [{12566}>0] cannot be evaluated [Unable to get value for functionid [12566]]
    2051:20080504:060637 Expression [{12387}>0] cannot be evaluated [Unable to get value for functionid [12387]]
    2051:20080504:060637 Expression [{15400}>100] cannot be evaluated [Unable to get value for functionid [15400]]
    2051:20080504:060637 Expression [{12567}>150000] cannot be evaluated [Unable to get value for functionid [12567]]
    2051:20080504:060637 Expression [{15580}>0] cannot be evaluated [Unable to get value for functionid [15580]]
    2051:20080504:060637 Expression [{13227}>0] cannot be evaluated [Unable to get value for functionid [13227]]
    2051:20080504:060638 Expression [{18642}>100] cannot be evaluated [Unable to get value for functionid [18642]]
    2051:20080504:060638 Expression [{13407}>150000] cannot be evaluated [Unable to get value for functionid [13407]]
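
To see which triggers are affected, the function IDs could be pulled out of the log and then matched against the trigger definitions in the Zabbix database. The following is a rough Python sketch. The regular expression is inferred only from the ten log lines above, not from any Zabbix documentation, and the names `LOG_SAMPLE` and `parse_failures` are hypothetical.

```python
import re
from collections import Counter

# Lines copied from the zabbix_server.log sample above.
LOG_SAMPLE = """\
2050:20080504:060636 Expression [{18698}>0] cannot be evaluated [Unable to get value for functionid [18698]]
2050:20080504:060636 Expression [{15579}>0] cannot be evaluated [Unable to get value for functionid [15579]]
2050:20080504:060636 Expression [{12566}>0] cannot be evaluated [Unable to get value for functionid [12566]]
2051:20080504:060637 Expression [{12387}>0] cannot be evaluated [Unable to get value for functionid [12387]]
2051:20080504:060637 Expression [{15400}>100] cannot be evaluated [Unable to get value for functionid [15400]]
2051:20080504:060637 Expression [{12567}>150000] cannot be evaluated [Unable to get value for functionid [12567]]
2051:20080504:060637 Expression [{15580}>0] cannot be evaluated [Unable to get value for functionid [15580]]
2051:20080504:060637 Expression [{13227}>0] cannot be evaluated [Unable to get value for functionid [13227]]
2051:20080504:060638 Expression [{18642}>100] cannot be evaluated [Unable to get value for functionid [18642]]
2051:20080504:060638 Expression [{13407}>150000] cannot be evaluated [Unable to get value for functionid [13407]]
"""

# Assumed layout: <pid>:<yyyymmdd>:<hhmmss> Expression [...] cannot be evaluated [...]
FAILURE = re.compile(
    r"^\s*(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6}) "
    r"Expression \[(?P<expr>.+?)\] cannot be evaluated "
    r"\[Unable to get value for functionid \[(?P<fid>\d+)\]\]"
)


def parse_failures(log: str) -> list[tuple[int, int]]:
    """Return (pid, functionid) for every 'cannot be evaluated' line."""
    failures = []
    for line in log.splitlines():
        match = FAILURE.match(line)
        if match:
            failures.append((int(match["pid"]), int(match["fid"])))
    return failures


failures = parse_failures(LOG_SAMPLE)
function_ids = sorted({fid for _, fid in failures})
failures_by_pid = Counter(pid for pid, _ in failures)

print("function IDs without a value:", function_ids)
print("failures per logging process:", dict(failures_by_pid))
```

Note that the PIDs writing these messages (2050 and 2051) are not the process stuck at 98% CPU (2068), so the log lines look like a symptom of the missing data rather than output from the spinning process itself.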
  • Robert Wagnon
    Member
    • Jan 2008
    • 47

    #2
    Moved to Troubleshooting

    I've asked this question in Troubleshooting. Please respond in that Forum, thank you.

    • Alexei
      Founder, CEO
      Zabbix Certified Trainer
      Zabbix Certified Specialist
      Zabbix Certified Professional
      • Sep 2004
      • 5654

      #3
      The issue is fixed in the very latest code of 1.4.x. Please wait for 1.4.6.
      Alexei Vladishev
      Creator of Zabbix, Product manager
      New York | Tokyo | Riga

      • Robert Wagnon
        Member
        • Jan 2008
        • 47

        #4
        Apparent cause of problem

        The problem improved when I removed all SNMP hosts. I believe we have an SNMP device with a poorly written SNMP implementation. A Ricoh printer apparently generates SNMP responses that hang Zabbix. Since they surely purchased their SNMP subsystem from someone else, this is likely to affect other SNMP-capable devices.

        Perhaps additional validation of SNMP responses and better error trapping would help address the root cause.

        • Robert Wagnon
          Member
          • Jan 2008
          • 47

          #5
          Alexis solution

          Thanks Alexei!

          I'm sure this will help us as we expand our monitoring base. We're going to touch all sorts of SNMP devices eventually.

          -----

          Sorry, the editor won't let me correct the title.

          • mknowles
            Junior Member
            • Oct 2008
            • 1

            #6
            After a few hours I get high cpu usage on zabbix_server

            [root@localhost ~]# zabbix_server --version
            ZABBIX Server (daemon) v1.6 (18 September 2008)
            Compilation time: Oct 28 2008 16:41:06
            [root@localhost ~]#



            select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
            recvmsg(0, 0xbff6c0f8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
            gettimeofday({1225345146, 89530}, NULL) = 0
            select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
            recvmsg(0, 0xbff6c0f8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
            gettimeofday({1225345146, 90326}, NULL) = 0
            select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
            recvmsg(0, 0xbff6c0f8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
            gettimeofday({1225345146, 92333}, NULL) = 0
            select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
            recvmsg(0, 0xbff6c0f8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
            gettimeofday({1225345146, 93074}, NULL) = 0
            select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
            recvmsg(0, 0xbff6c0f8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
            gettimeofday({1225345146, 93812}, NULL) = 0
            select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
            recvmsg(0, 0xbff6c0f8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
            gettimeofday({1225345146, 94615}, NULL) = 0
            select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
            recvmsg(0, 0xbff6c0f8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
            gettimeofday({1225345146, 96152}, NULL) = 0
            select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
            recvmsg(0, 0xbff6c0f8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
            gettimeofday({1225345146, 96895}, NULL) = 0
            select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
            recvmsg(0, 0xbff6c0f8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
            gettimeofday({1225345146, 98773}, NULL) = 0
            select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
            recvmsg(0, 0xbff6c0f8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
            gettimeofday({1225345146, 99803}, NULL) = 0
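
The trace above shows the same signature as the one from v1.4.5 in the first post: select() on descriptor 0 with a zero timeout reports it readable, and the recvmsg() that follows fails with ENOTSOCK. A small Python sketch for counting such cycles in a saved strace capture might look like this. The function name `count_spin_cycles` and the sample variable are made up for illustration, and the patterns only match the exact call shapes seen in this thread.

```python
import re

# First six lines of the trace above.
TRACE_SAMPLE = """\
select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
recvmsg(0, 0xbff6c0f8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
gettimeofday({1225345146, 89530}, NULL) = 0
select(1, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
recvmsg(0, 0xbff6c0f8, 0) = -1 ENOTSOCK (Socket operation on non-socket)
gettimeofday({1225345146, 90326}, NULL) = 0
"""

# select() on fd 0 with a zero timeout that reports one ready descriptor.
SELECT_READY = re.compile(r"^select\(1, \[0\], NULL, NULL, \{0, 0\}\) = 1\b")
# recvmsg() on fd 0 failing because fd 0 is not a socket.
RECV_ENOTSOCK = re.compile(r"^recvmsg\(0, 0x[0-9a-f]+, 0\) = -1 ENOTSOCK\b")


def count_spin_cycles(trace: str) -> int:
    """Count select-ready lines that are immediately followed by an ENOTSOCK recvmsg."""
    lines = [line.strip() for line in trace.splitlines() if line.strip()]
    cycles = 0
    for previous, current in zip(lines, lines[1:]):
        if SELECT_READY.match(previous) and RECV_ENOTSOCK.match(current):
            cycles += 1
    return cycles


cycles = count_spin_cycles(TRACE_SAMPLE)
print(f"select/recvmsg ENOTSOCK cycles found: {cycles}")
```

A capture from a healthy process would be expected to show few or no such cycles, so a high count over a short capture is one way to confirm that a given PID is in this state before restarting it.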

            • yurtesen
              Senior Member
              • Aug 2008
              • 130

              #7
              I am seeing exactly the same issue on 1.6.1. Could this have resurfaced somehow?

              • Robert Wagnon
                Member
                • Jan 2008
                • 47

                #8
                Problem reappeared

                This reappeared for us, even though we are no longer monitoring the known incompatible device (the Ricoh printer). We made the following changes all at once and the problem disappeared:

                Monitored 25% less stuff. This isn't really a good option to recommend.
                Added more memory to the machine.
                Moved to 3 disk RAID5 storage.
                Upgraded to 1.6.2.

                Personally, I expect this to reappear as we continue to scale out. There isn't sufficient debugging support to identify the problem.

                We are very appreciative of Zabbix. I wish we could help contribute more to the resolution of this problem.
                Last edited by Robert Wagnon; 27-01-2009, 18:53.

                • yurtesen
                  Senior Member
                  • Aug 2008
                  • 130

                  #9
                  Originally posted by Robert Wagnon
                  Personally, I expect this to reappear as we continue to scale out. There isn't sufficient debugging support to identify the problem.
                  If the Zabbix developers tell me what information they need, I will happily provide it. So far the problem occurs quite randomly as far as I can see. It hasn't reappeared since my previous post, at least.
