Zabbix generating high CPU/database load

  • jbayer
    Junior Member
    • Aug 2010
    • 1

    #1

    Zabbix generating high CPU/database load

    Hi,

    I have a Zabbix 1.8.2 installation on a CentOS 5.5 system.
    The database is PostgreSQL.


    I have it set up to monitor two websites, and to do discovery on a couple of networks.



    I came in this morning and saw that Zabbix is putting a very high load on the server, with very high I/O. I turned on statement logging on the database and see that Zabbix is generating between 5 and 20 queries a second. The database I/O keeps the disk at a constant transfer rate of between 12 and 20 megabytes/second.

    I'm assuming there is a parameter I can change, but I have no idea which one, since I'm just learning Zabbix. I've pasted a few entries from the PostgreSQL log below.

    Any ideas?

    Thanks in advance


    JBB


    LOG: statement: select distinct t.triggerid,t.expression,t.description,t.url,t.comments,t.status,t.value,t.priority,t.type,t.error,f.itemid from triggers t,functions f,items i where i.status not in (3) and i.itemid=f.itemid and t.status=0 and f.triggerid=t.triggerid and f.itemid in (22563,22461,22453,22454,22455,22516,22517,22518,22549,22519,22550,22520,22521,22431,22522)
    LOG: statement: select distinct i.itemid,i.key_,h.host,h.port,i.delay,i.description,i.type,h.useip,h.ip,i.history,i.lastvalue,i.prevvalue,i.hostid,i.value_type,i.delta,i.prevorgvalue,i.lastclock,i.units,i.multiplier,i.formula,i.status,i.valuemapid,h.dns,i.trends,i.lastlogsize,i.data_type,i.mtime from hosts h,items i, functions f where h.hostid=i.hostid and h.status=0 and i.status=0 and f.function in ('nodata','date','dayofweek','time','now') and i.itemid=f.itemid and (h.maintenance_status=0 or h.maintenance_type=0) and h.hostid between 000000000000000 and 099999999999999
    LOG: statement: begin;
    LOG: statement: select dh.dhostid,dh.status,dh.lastup,dh.lastdown from dhosts dh,dservices ds where ds.dhostid=dh.dhostid and dh.druleid=2 and ds.ip='10.11.1.170' order by dh.dhostid
    LOG: statement: commit;
    LOG: statement: select t.httptestid,t.name,t.applicationid,t.nextcheck,t.status,t.delay,t.macros,t.agent,t.authentication,t.http_user,t.http_password from httptest t,applications a,hosts h where t.applicationid=a.applicationid and a.hostid=h.hostid and t.nextcheck<=1280769151 and mod(t.httptestid,1)=0 and t.status=0 and h.status=0 and (h.maintenance_status=0 or h.maintenance_type=0) and t.httptestid between 000000000000000 and 099999999999999
    LOG: statement: select escalationid,actionid,triggerid,eventid,r_eventid,esc_step,status from escalations where status in (0,1) and nextcheck<=1280769151 and escalationid between 000000000000000 and 099999999999999
    LOG: statement: select count(*),min(t.nextcheck) from httptest t,applications a,hosts h where t.applicationid=a.applicationid and a.hostid=h.hostid and mod(t.httptestid,1)=0 and t.status=0 and h.status=0 and (h.maintenance_status=0 or h.maintenance_type=0) and t.httptestid between 000000000000000 and 099999999999999
  • richlv
    Senior Member
    Zabbix Certified Trainer
    Zabbix Certified Specialist
    Zabbix Certified Professional
    • Oct 2005
    • 3112

    #2
    5 to 20 queries per second should not be an issue.
    How did you measure the disk usage rate?
    How much of that was reads, and how much was writes?
    Is that a virtual machine, by any chance?
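
One way to answer the reads-versus-writes question on Linux is to sample /proc/diskstats twice and compare the sector counters. The following is only a minimal sketch: the device name `sda`, the 5-second interval, and the helper names are assumptions for illustration, not details from the original post.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text, device):
    """Return (sectors_read, sectors_written) for one device."""
    for line in text.splitlines():
        fields = line.split()
        # Layout: major minor name reads merged sectors_read ms_reading
        #         writes merged sectors_written ...
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[5]), int(fields[9])
    raise ValueError("device not found: " + device)


def rates_mb_per_s(before, after, seconds):
    """Turn two (read, written) sector samples into (read, write) MB/s."""
    read_mb = (after[0] - before[0]) * SECTOR_BYTES / 1e6 / seconds
    write_mb = (after[1] - before[1]) * SECTOR_BYTES / 1e6 / seconds
    return read_mb, write_mb


def sample(device="sda", seconds=5):
    """Measure one interval on a live system (device name is an assumption)."""
    with open("/proc/diskstats") as f:
        before = parse_diskstats(f.read(), device)
    time.sleep(seconds)
    with open("/proc/diskstats") as f:
        after = parse_diskstats(f.read(), device)
    return rates_mb_per_s(before, after, seconds)
```

Calling `sample("sda", 5)` on the database host returns a `(read, write)` pair in MB/s. If nearly all of the traffic is writes, that points at history inserts and checkpoints rather than at the select statements in the log.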

    • makini
      Member
      • Jul 2006
      • 59

      #3
      Similar issues with 1.8.3 after upgrade from 1.8.2

      Hi,

      After the upgrade to 1.8.3 from 1.8.2 we started experiencing similar spikes in CPU and IO load...

      Our setup is larger though:
      Number of hosts (monitored/not monitored/templates)          162    134 / 21 / 7
      Number of items (monitored/disabled/not supported)           4642   4345 / 292 / 5
      Number of triggers (enabled/disabled)[problem/unknown/ok]    3187   2940 / 247 [11 / 19 / 2910]
      Number of users (online)                                     38     2
      Required server performance, new values per second           35.81  -

      The database handles around 180 queries per second (it's on MySQL); most of those are in the "Sleep" command state. The spikes in IO and CPU usage can be seen here:

      avg-cpu:    %user    %nice  %system  %iowait   %steal    %idle
                  38.00     0.50    15.00    28.50     0.00    18.00
      Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
      sda          0.00    10.89     0.00   480.20     0.00  4356.44     9.07   117.69   246.82     2.07    99.31

      avg-cpu:    %user    %nice  %system  %iowait   %steal    %idle
                   4.48     0.00     5.47    64.18     0.00    25.87
      Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
      sda          0.00    12.00     0.00   441.00     0.00  3800.00     8.62   131.26   287.71     2.27   100.20

      avg-cpu:    %user    %nice  %system  %iowait   %steal    %idle
                   3.00     0.00     5.00    20.50     0.00    71.50
      Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
      sda          0.00     7.00     0.00   190.00     0.00  1552.00     8.17    58.88   172.49     2.33    44.30

      avg-cpu:    %user    %nice  %system  %iowait   %steal    %idle
                   6.00     0.50     5.50    67.00     0.00    21.00
      Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
      sda          0.00    15.00     0.00   478.00     0.00  4456.00     9.32   124.70   246.28     2.10   100.20

      avg-cpu:    %user    %nice  %system  %iowait   %steal    %idle
                   7.46     1.49     6.47    61.69     0.00    22.89
      Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
      sda          0.00    31.68     0.99   336.63     7.92  3057.43     9.08    94.00   351.18     2.65    89.60


      The 1.8.2 (release) version did not cause such load spikes on the database...
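
To put the iostat samples above into more familiar units, the sector counts can be converted to megabytes and kilobytes. This is a rough sketch that assumes iostat's usual 512-byte sectors; the function names are made up for illustration, and the figures are copied from the five samples in this post.

```python
SECTOR_BYTES = 512  # iostat reports rsec/s, wsec/s and avgrq-sz in 512-byte sectors

# (w/s, wsec/s, avgrq-sz) copied from the five iostat samples in the post
samples = [
    (480.20, 4356.44, 9.07),
    (441.00, 3800.00, 8.62),
    (190.00, 1552.00, 8.17),
    (478.00, 4456.00, 9.32),
    (336.63, 3057.43, 9.08),
]


def write_mb_per_s(wsec_per_s):
    """Sectors written per second -> megabytes per second."""
    return wsec_per_s * SECTOR_BYTES / 1e6


def avg_request_kb(avgrq_sz):
    """Average request size in sectors -> kilobytes."""
    return avgrq_sz * SECTOR_BYTES / 1024.0


for writes, wsec, avgrq in samples:
    print("%.0f writes/s, %.2f MB/s, %.1f KB per request"
          % (writes, write_mb_per_s(wsec), avg_request_kb(avgrq)))
```

Read this way, the disk reaches close to 100 %util while moving only a couple of megabytes per second in requests of roughly 4 to 5 KB. That suggests the bottleneck is the number of small writes rather than raw throughput, although the samples alone do not show what is issuing them.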


      • magawake
        Junior Member
        • Aug 2010
        • 24

        #4
        Same exact problem,


        LOG: statement: select distinct t.triggerid,t.expression,t.description,t.url,t.comments,t.status,t.value,t.priority,t.type,t.error,f.itemid from triggers t,functions f,items i where i.status not in (3) and i.itemid=f.itemid and t.status=0 and f.triggerid=t.triggerid and f.itemid in (22563,22461,22453,22454,22455,22516,22517,22518,22549,22519,22550,22520,22521,22431,22522)
        LOG: statement: select distinct i.itemid,i.key_,h.host,h.port,i.delay,i.description,i.type,h.useip,h.ip,i.history,i.lastvalue,i.prevvalue,i.hostid,i.value_type,i.delta,i.prevorgvalue,i.lastclock,i.units,i.multiplier,i.formula,i.status,i.valuemapid,h.dns,i.trends,i.lastlogsize,i.data_type,i.mtime from hosts h,items i, functions f where h.hostid=i.hostid and h.status=0 and i.status=0 and f.function in ('nodata','date','dayofweek','time','now') and i.itemid=f.itemid and (h.maintenance_status=0 or h.maintenance_type=0) and h.hostid between 000000000000000 and 099999999999999
        LOG: statement: begin;
        LOG: statement: select dh.dhostid,dh.status,dh.lastup,dh.lastdown from dhosts dh,dservices ds where ds.dhostid=dh.dhostid and dh.druleid=2 and ds.ip='10.11.1.170' order by dh.dhostid
        LOG: statement: commit;
        LOG: statement: select t.httptestid,t.name,t.applicationid,t.nextcheck,t.status,t.delay,t.macros,t.agent,t.authentication,t.http_user,t.http_password from httptest t,applications a,hosts h where t.applicationid=a.applicationid and a.hostid=h.hostid and t.nextcheck<=1280769151 and mod(t.httptestid,1)=0 and t.status=0 and h.status=0 and (h.maintenance_status=0 or h.maintenance_type=0) and t.httptestid between 000000000000000 and 099999999999999
        LOG: statement: select escalationid,actionid,triggerid,eventid,r_eventid,esc_step,status from escalations where status in (0,1) and nextcheck<=1280769151 and escalationid between 000000000000000 and 099999999999999
        LOG: statement: select count(*),min(t.nextcheck) from httptest t,applications a,hosts h where t.applicationid=a.applicationid and a.hostid=h.hostid and mod(t.httptestid,1)=0 and t.status=0 and h.status=0 and (h.maintenance_status=0 or h.maintenance_type=0) and t.httptestid between 000000000000000 and 099999999999999
        This query is knocking my DB server down.

        The DB server has 32 GB of memory and 8 cores. I have about 4 of these selects running.
        Everything is monitored via SNMP, with 400 hosts.

        I think we can optimize this query by indexing the proper fields...
        Last edited by magawake; 16-03-2011, 20:09.


        • magawake
          Junior Member
          • Aug 2010
          • 24

          #5
          I fixed the problem by tuning PostgreSQL.
          I followed this wiki page: http://wiki.postgresql.org/wiki/Tuni...tgreSQL_Server

          Things are much faster now...
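
The post does not say which settings were changed. As a rough sketch only, the commonly quoted starting points for a dedicated database host (about a quarter of RAM for `shared_buffers`, about half for `effective_cache_size`) can be worked out like this. The function name and the exact fractions are assumptions for illustration, not values reported in this thread.

```python
def suggest_settings(ram_gb):
    """Rule-of-thumb starting points for postgresql.conf; not measured values."""
    ram_mb = ram_gb * 1024
    return {
        # Assumption: ~25% of RAM on a dedicated database server
        "shared_buffers": "%dMB" % (ram_mb // 4),
        # Assumption: ~50% of RAM as a conservative planner hint
        "effective_cache_size": "%dMB" % (ram_mb // 2),
    }


# 32 GB matches the database server described earlier in the thread
for name, value in sorted(suggest_settings(32).items()):
    print("%s = %s" % (name, value))
```

These are only starting values to benchmark against. A large `shared_buffers` may also require raising the kernel's shared memory limit on older systems, and write-heavy workloads usually need the checkpoint settings looked at as well.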
          Last edited by magawake; 16-03-2011, 20:10.
