Zabbix server taking very long to restart ~ 1.5 to 2 hrs

  • prakhar
    Junior Member
    • Jan 2013
    • 8

    #1

    Zabbix server taking very long to restart ~ 1.5 to 2 hrs

    Hi,
    I am monitoring around 450 nodes. At the start everything was running fine, but the Zabbix UI gradually (over two weeks) became very slow.
    I tried to restart the Zabbix server after changing some configuration parameters (increased trappers and pollers), but Zabbix took a long time to come up.
    The Zabbix API responses also take very long.

    I have load tested Zabbix at 2700 to 3000 nvps, but there I was generating load from only 10 to 15 servers. Now nvps is around 700 but there are 450 hosts, and Zabbix is not able to handle it.

    Setup details:
    OS: Linux 2.6.32-358.2.1.el6
    Database: PostgreSQL
    Zabbix: Zabbix server v2.0.9 (revision 39085)
    Both DB and server on the same machine.

    CPU cores: 24
    RAM: 96 GB

    **************************************
    top output

    top - 10:28:58 up 8 days, 23:03, 1 user, load average: 512.98, 607.09, 649.69
    Tasks: 3440 total, 59 running, 3381 sleeping, 0 stopped, 0 zombie
    Cpu(s): 93.8%us, 2.1%sy, 0.0%ni, 4.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
    Mem: 99022656k total, 97327092k used, 1695564k free, 426468k buffers
    Swap: 20972848k total, 78320k used, 20894528k free, 67705676k cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    14744 postgres 20 0 10.0g 537m 530m S 2.9 0.6 22:08.43 postmaster
    14973 postgres 20 0 10.0g 541m 532m S 2.7 0.6 21:57.05 postmaster
    54173 admin 20 0 17692 3988 1004 R 2.7 0.0 0:00.52 top
    14059 postgres 20 0 10.0g 540m 532m S 2.5 0.6 21:59.06 postmaster
    14175 postgres 20 0 10.0g 539m 531m S 2.5 0.6 21:57.19 postmaster
    14248 postgres 20 0 10.0g 540m 533m S 2.5 0.6 22:06.66 postmaster
    14251 postgres 20 0 10.0g 309m 304m S 2.5 0.3 21:51.41 postmaster
    14296 postgres 20 0 10.0g 321m 315m S 2.5 0.3 21:50.79 postmaster
    14316 postgres 20 0 10.0g 539m 530m S 2.5 0.6 22:00.16 postmaster
    14510 postgres 20 0 10.0g 537m 530m S 2.5 0.6 22:19.80 postmaster


    **************************************
    These are from zabbix_server.log on restart.

    30302:20150120:083102.798 query [txnlev:0] [select alert_history,event_history,refresh_unsupported,discovery_groupid,snmptrap_logging,severity_name_0,severity_name_1,severity_name_2,severity_name_3,severity_name_4,severity_name_5 from config where 1=1 and configid between 0 and 99999999999999]
    30302:20150120:083102.798 query [txnlev:0] [select i.itemid,i.hostid,h.proxy_hostid,i.type,i.data_type,i.value_type,i.key_,i.snmp_community,i.snmp_oid,i.port,i.snmpv3_securityname,i.snmpv3_securitylevel,i.snmpv3_authpassphrase,i.snmpv3_privpassphrase,i.ipmi_sensor,i.delay,i.delay_flex,i.trapper_hosts,i.logtimefmt,i.params,i.status,i.authtype,i.username,i.password,i.publickey,i.privatekey,i.flags,i.interfaceid,i.lastclock from items i,hosts h where i.hostid=h.hostid and h.status in (0) and i.status in (0,3) and i.itemid between 0 and 99999999999999]
    30302:20150120:083114.923 query [txnlev:0] [select distinct t.triggerid,t.description,t.expression,t.error,t.priority,t.type,t.value,t.value_flags from hosts h,items i,functions f,triggers t where h.hostid=i.hostid and i.itemid=f.itemid and f.triggerid=t.triggerid and h.status in (0) and i.status in (0,3) and t.status in (0) and t.flags not in (2) and h.hostid between 0 and 99999999999999]

    The Zabbix server runs these queries each time it restarts, and this one in particular takes very long to execute:
    30302:20150120:083114.923 query [txnlev:0] [select distinct t.triggerid,t.description,t.expression,t.error,t.priority,t.type,t.value,t.value_flags from hosts h,items i,functions f,triggers t where h.hostid=i.hostid and i.itemid=f.itemid and f.triggerid=t.triggerid and h.status in (0) and i.status in (0,3) and t.status in (0) and t.flags not in (2) and h.hostid between 0 and 99999999999999]



    Why have the DB queries become so slow, and how can I avoid such a situation?
  • Colttt
    Senior Member
    Zabbix Certified Specialist
    • Mar 2009
    • 878

    #2
    Maybe it's the housekeeper process?

    Did you tune your zabbix_server.conf and the postgresql config?
    Debian-User

    Sorry for my bad english


    • kloczek
      Senior Member
      • Jun 2006
      • 1771

      #3
      Originally posted by prakhar
      top - 10:28:58 up 8 days, 23:03, 1 user, load average: 512.98, 607.09, 649.69
      Tasks: 3440 total, 59 running, 3381 sleeping, 0 stopped, 0 zombie
      Cpu(s): 93.8%us, 2.1%sy, 0.0%ni, 4.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      Mem: 99022656k total, 97327092k used, 1695564k free, 426468k buffers
      Swap: 20972848k total, 78320k used, 20894528k free, 67705676k cached
      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
      14744 postgres 20 0 10.0g 537m 530m S 2.9 0.6 22:08.43 postmaster
      14973 postgres 20 0 10.0g 541m 532m S 2.7 0.6 21:57.05 postmaster
      54173 admin 20 0 17692 3988 1004 R 2.7 0.0 0:00.52 top
      14059 postgres 20 0 10.0g 540m 532m S 2.5 0.6 21:59.06 postmaster
      14175 postgres 20 0 10.0g 539m 531m S 2.5 0.6 21:57.19 postmaster
      14248 postgres 20 0 10.0g 540m 533m S 2.5 0.6 22:06.66 postmaster
      14251 postgres 20 0 10.0g 309m 304m S 2.5 0.3 21:51.41 postmaster
      14296 postgres 20 0 10.0g 321m 315m S 2.5 0.3 21:50.79 postmaster
      14316 postgres 20 0 10.0g 539m 530m S 2.5 0.6 22:00.16 postmaster
      14510 postgres 20 0 10.0g 537m 530m S 2.5 0.6 22:19.80 postmaster
      96GB RAM, 10GB for postgresql, only 6-7GB left for the buffer cache, and 500-600 processes/threads in the running queue. Are you sure that only postgresql is running on this host?
      Is the Zabbix server on the same host?
      If yes, probably most of your items are passive items (which is the first warning bell that you should start moving away from passive monitoring), and such a long running queue is caused by 500-600 pollers waiting to receive data from the monitored hosts. Isn't it? What is your StartPollers value?
      Are you using partitioned history*/trends* tables?

      With about 450 nvps the daily volume of new data should be only around 2-6 GB .. 96 GB of RAM is overkill in this case (but that is not an issue).

      You must have a few overlapping configuration issues causing such pathological results.
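
      A quick way to answer the partitioning question from the DB side is to list the history*/trends* relations and their on-disk sizes; a minimal sketch, assuming the stock Zabbix schema (any daily partitions would show up as extra history_*/trends_* tables):
      Code:
      -- list all history*/trends* tables with their total on-disk size
      SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
      FROM pg_class
      WHERE relkind = 'r'
        AND (relname LIKE 'history%' OR relname LIKE 'trends%')
      ORDER BY pg_total_relation_size(oid) DESC;
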
      http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
      https://kloczek.wordpress.com/
      zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
      My zabbix templates https://github.com/kloczek/zabbix-templates


      • prakhar
        Junior Member
        • Jan 2013
        • 8

        #4
        Yes, the setup is running both postgres and zabbix_server.
        10.0g is the VIRT memory for each postmaster process, which includes
        /*postgresql.conf*/
        shared_buffers = 9832MB
        The actual RES memory per process is ~450 MB.

        Most of the items are Zabbix active in my case.
        The differences between my load-test setup (running at ~3000 nvps) and this setup are:
        1. This setup has ~120000 monitored items, of which ~70000 are Zabbix trapper items.
        2. In the 3000 nvps setup I had only 15 hosts, but here I have around 450 hosts.
        3. In the 3000 nvps setup I had no Zabbix trapper items, but in this setup I have around 70k of them.

        Question: Though Zabbix trapper items are not polled, is there a possibility that they increase Zabbix table sizes, resulting in slower queries?
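
        For reference, the item-type split can be checked straight from the DB; a minimal sketch, assuming the standard Zabbix 2.0 schema (item type 2 = Zabbix trapper, type 7 = Zabbix agent (active)):
        Code:
        -- count items per type to confirm how many trapper items exist
        SELECT type, count(*) AS item_count
        FROM items
        GROUP BY type
        ORDER BY item_count DESC;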

        I have not partitioned history*/trends* tables.

        /*zabbix_server.conf*/
        StartTrappers=40
        StartPollers=40
        All the cache settings are at the maximum allowed.


        @Colttt: I have disabled the housekeeper process.
        Yes, I have tuned Zabbix and Postgres for performance. I can provide other configuration details of postgresql and zabbix_server if you want.


        • Colttt
          Senior Member
          Zabbix Certified Specialist
          • Mar 2009
          • 878

          #5
          How many DB syncers do you have?
          Debian-User

          Sorry for my bad english


          • prakhar
            Junior Member
            • Jan 2013
            • 8

            #6
            StartDBSyncers=16


            • kloczek
              Senior Member
              • Jun 2006
              • 1771

              #7
              Originally posted by prakhar
              Yes, the setup is running both postgres and zabbix_server.
              10.0g is the VIRT memory for each postmaster process, which includes
              /*postgresql.conf*/
              shared_buffers = 9832MB
              The actual RES memory per process is ~450 MB.
              That seems like a huge waste of memory.
              My DB backend for Zabbix now runs on Solaris on zfs, handling 1.5k nvps on a host with only 32 GB of memory, where at the moment the ARC (zfs Adaptive Replacement Cache) uses only
              Code:
              $ kstat zfs:0:arcstats:size | grep size | awk '{printf "%2dMB\n",  $2/1024/1024+0.5}'
              10102MB
              and I still have about 10 GB of unused memory.
              On this host all volumes use the maximum recordsize (1 MB) and lzjb compression, and I now have far fewer IOs than I had on Linux.
              Code:
              $ zpool iostat 2
                             capacity     operations    bandwidth
              pool        alloc   free   read  write   read  write
              ----------  -----  -----  -----  -----  -----  -----
              rpool        102G  83.7G      2    143   624K  24.6M
              rpool        102G  83.7G      7    121  60.7K  17.4M
              rpool        102G  83.8G      0    216    767  27.2M
              rpool        102G  83.8G      0    156      0  21.4M
              rpool        102G  83.8G      2    476   642K   111M
              rpool        102G  83.8G      0    188      0  18.3M
              rpool        102G  83.8G      0    128  16.7K  18.2M
              rpool        102G  83.7G      0    179    255  19.5M
              rpool        102G  83.7G      0    328  16.7K  51.8M
              rpool        102G  83.7G      0    412      0  78.7M
              rpool        102G  83.7G      0    162  3.50K  21.7M
              rpool        102G  83.7G      2    245  1.47M  36.5M
              ^C
              The zpool used by mysql contains only one pair of SSDs. On Linux the same mysql 5.5 was doing about 1.2-1.7k IO/s, which is why it was necessary to move to SSDs. Effectively, after migrating to Solaris it should be possible to go back to working on the old spindles :P

              Code:
              # zfs get compression,recordsize,compressratio,referenced rpool/VARSHARE/mysql
              NAME                  PROPERTY       VALUE  SOURCE
              rpool/VARSHARE/mysql  compression    lzjb   local
              rpool/VARSHARE/mysql  recordsize     1M     local
              rpool/VARSHARE/mysql  compressratio  2.68x  -
              rpool/VARSHARE/mysql  referenced     89.1G  -
              Giving the DB backend more cache than the amount of data you store daily is usually wrong (with partitioned history* tables rotated every day it is easy to find out how much is actually needed here).
              I found that the zfs ARC works better than the mysql innodb cache, so mysql has only innodb_buffer_pool_size=5GB.
              Even doing on-the-fly compression/decompression, this host, running only mysql, has just 8-14% CPU time usage (on the next promotion of the slave DB to master I'm going to start experimenting with gzip compression), which is (strangely) about 10% lower than the same hardware running mysql 5.5 on Linux.

              Most of the items are Zabbix active in my case.
              The differences between my load-test setup (running at ~3000 nvps) and this setup are:
              1. This setup has ~120000 monitored items, of which ~70000 are Zabbix trapper items.
              2. In the 3000 nvps setup I had only 15 hosts, but here I have around 450 hosts.
              3. In the 3000 nvps setup I had no Zabbix trapper items, but in this setup I have around 70k of them.

              Question: Though Zabbix trapper items are not polled, is there a possibility that they increase Zabbix table sizes, resulting in slower queries?
              You can treat these items almost like active items. Why? Because with passive items a poller thread connects to the monitored host and waits until the requested monitoring data is sampled and sent back in the reply (which may sometimes take a couple of seconds).
              With active item monitoring and trapper monitoring, the proxy or server threads move into the running queue only to establish connectivity and instantly receive the monitoring data, then move back to the pool of threads waiting to be used again.

              I have not partitioned history*/trends* tables.
              So it seems this is now your biggest problem, and it should be the top priority on your ToDo list.
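
              For PostgreSQL 9.x there is no declarative partitioning yet, so the usual approach is table inheritance plus a routing trigger. A minimal, illustrative sketch for one daily history partition (the partitions schema, the child table name and the rotation scripting are assumptions, not a drop-in solution):
              Code:
              -- create a schema to hold the daily child tables
              CREATE SCHEMA IF NOT EXISTS partitions;

              -- one child table per day, constrained on the clock column
              -- (1423094400 / 1423180800 are the UTC epoch bounds of 2015-02-05)
              CREATE TABLE partitions.history_2015_02_05 (
                  CHECK (clock >= 1423094400 AND clock < 1423180800)
              ) INHERITS (public.history);

              CREATE INDEX ON partitions.history_2015_02_05 (itemid, clock);

              -- trigger function that routes new rows into the matching daily child
              CREATE OR REPLACE FUNCTION public.history_insert_trigger() RETURNS trigger AS $$
              DECLARE
                  child text;
              BEGIN
                  child := 'partitions.history_' || to_char(to_timestamp(NEW.clock), 'YYYY_MM_DD');
                  EXECUTE 'INSERT INTO ' || child || ' SELECT ($1).*' USING NEW;
                  RETURN NULL;  -- the row is stored only in the child table
              END;
              $$ LANGUAGE plpgsql;

              CREATE TRIGGER history_partition_insert
                  BEFORE INSERT ON public.history
                  FOR EACH ROW EXECUTE PROCEDURE public.history_insert_trigger();
              With constraint_exclusion = partition set in postgresql.conf, the planner can skip the days a query does not touch, and removing old data becomes a cheap DROP TABLE of the oldest child instead of huge DELETEs.
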

              /*zabbix_server.conf*/
              StartTrappers=40
              StartPollers=40
              All the cache settings are at the maximum allowed.
              I have no passive items and more than 99% of all items are monitored over proxies, so my server settings are:

              Code:
              # cat /etc/zabbix/zabbix_server/Start*
              StartDBSyncers=10
              StartDiscoverers=1
              StartHTTPPollers=1
              StartPingers=1
              StartPollers=1
              StartProxyPollers=15
              StartTrappers=1
              For example, the settings of my biggest proxy (it monitors almost half of my hosts) are:
              Code:
              # cat /etc/zabbix/zabbix_proxy/Start*
              StartHTTPPollers=5
              StartPingers=30
              StartPollers=10
              StartTrappers=100
              With almost all items monitored over proxies (except a couple of internal checks and a few other items) I have no stress on server restart, even if it is necessary to schedule a slightly longer server downtime. In such an architecture all monitoring data is still collected by the proxies. I also found that even running the server and a proxy on the same host, with all items monitored over the proxy, reduces IO pressure .. simply because the server digests and stores monitoring data in bigger batches than when the same hosts are monitored by the server directly.
              The relevant settings of my biggest proxy (which monitors almost half of the hosts) and of the server:

              Code:
              # cat /etc/zabbix/zabbix_server/ProxyDataFrequency; cat /etc/zabbix/zabbix_proxy/DataSenderFrequency
              ProxyDataFrequency=10
              DataSenderFrequency=10
              (I have a mixture of passive and active proxies)

              A proxy and the server can work on one physical host, but I'm using two hosts. Usually the server runs on the first and a proxy runs on the second, with its own small mysql DB backend (holding the last 6h of monitoring data). Neither uses the host IPs; each has its own dedicated per-service address. A manual failover takes a few seconds (below the sync period between server and proxy), so I can easily, for example, schedule a reboot of one of these hosts without affecting monitoring.

              At the moment everything is prepared to put the above under a cluster hood (it will be Oracle cluster on Solaris), so after that all operations will be even easier and more predictable.

              With the above architecture, on a server restart (still 2.2.8) the initial check of all ~50k triggers takes less than 2-5 s (I have 116k monitored items atm), so a restart of the server still stays below the sync period between server and proxies.

              @Colttt: I have disabled the housekeeper process.
              Yes, I have tuned Zabbix and Postgres for performance. I can provide other configuration details of postgresql and zabbix_server if you want.
              To be honest with you .. in my private opinion, using postgresql as the DB backend for a typical warehouse DB like the Zabbix DB is a little overkill.
              Mysql, as the simpler engine, will IMO theoretically always more or less win against postgresql under such a workload.
              However, if you know postgresql better, do not move to another engine.
              Last edited by kloczek; 23-01-2015, 12:41.
              http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
              https://kloczek.wordpress.com/
              zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
              My zabbix templates https://github.com/kloczek/zabbix-templates


              • kloczek
                Senior Member
                • Jun 2006
                • 1771

                #8
                I've been asked to share some data about CPU usage in my case on Linux and Solaris. Here is a graph with 3 months of data: on the left side is the CPU usage on Linux. After it there is a gap, when the master DB role was on the promoted slave (and I had time to reinstall everything and run a couple of tests), and on the right side is the current CPU usage on Solaris. As I wrote, I'm using zfs lzjb compression, so theoretically the CPU usage on Solaris should now be higher .. but it isn't.
                Most of the %sys time on Solaris is consumed by the compression/decompression threads.
                Attached Files
                Last edited by kloczek; 23-01-2015, 14:53.
                http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
                https://kloczek.wordpress.com/
                zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
                My zabbix templates https://github.com/kloczek/zabbix-templates


                • Colttt
                  Senior Member
                  Zabbix Certified Specialist
                  • Mar 2009
                  • 878

                  #9
                  Originally posted by prakhar
                  StartDBSyncers=16
                  Please decrease your syncers to 8.

                  Zabbix Server has a configuration setting called StartDBSyncers. By default, this value is set to 4. This may seem like a conservative setting, but increasing the value can do more harm than good – I’ll try to explain why.
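
                  One way to check whether extra syncers are actually doing work or just piling up contention is to look at the connection states on the DB side; a small sketch for PostgreSQL 9.3 (which still has the boolean waiting column in pg_stat_activity; the Zabbix database is assumed to be named zabbix):
                  Code:
                  -- how many Zabbix connections are active vs. idle, and how many wait on locks
                  SELECT state, waiting, count(*)
                  FROM pg_stat_activity
                  WHERE datname = 'zabbix'
                  GROUP BY state, waiting
                  ORDER BY count(*) DESC;
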
                  Debian-User

                  Sorry for my bad english


                  • prakhar
                    Junior Member
                    • Jan 2013
                    • 8

                    #10
                    Colttt: thanks for the DBSyncers input. It did help.

                    But the overall performance is still not as expected.

                    kloczek: So the priority now is to partition the tables; here are the on-disk sizes of some of the tables.

                    postgres=# SELECT pg_size_pretty(pg_database_size('zabbix'));
                    pg_size_pretty
                    ----------------
                    57 GB
                    (1 row)

                    zabbix=# select pg_size_pretty(pg_total_relation_size('items'));
                    pg_size_pretty
                    ----------------
                    34 GB
                    (1 row)

                    zabbix=# select pg_size_pretty(pg_total_relation_size('history'));
                    pg_size_pretty
                    ----------------
                    8000 MB
                    (1 row)

                    zabbix=# select pg_size_pretty(pg_total_relation_size('trends'));
                    pg_size_pretty
                    ----------------
                    173 MB
                    (1 row)

                    zabbix=# select pg_size_pretty(pg_total_relation_size('hosts'));
                    pg_size_pretty
                    ----------------
                    52 MB
                    (1 row)

                    As I can see, the history table is around 8 GB, but my main concern is the "items" table: it is 34 GB. Any query involving the items table takes a lot of time.

                    zabbix=# explain analyze select * from items;
                    QUERY PLAN
                    -----------------------------------------------------------------------------------------------------------------------
                    Seq Scan on items (cost=0.00..3431875.54 rows=9995754 width=2497) (actual time=4.007..136778.522 rows=40714 loops=1)
                    Total runtime: 136781.698 ms


                    1. Why has my items table bloated to such a huge size? Is this a normal size for an items table with around 50k items and 25 days of data?

                    2. What approach can I take to partition the items table?

                    3. Can anyone help me with Zabbix query optimizations? (DB: postgresql 9.3)
                    Last edited by prakhar; 05-02-2015, 05:48.


                    • jan.garaj
                      Senior Member
                      Zabbix Certified Specialist
                      • Jan 2010
                      • 506

                      #11
                      Your questions should be:

                      1.) Why does my items table have such a huge size?
                      Did you run the vacuum command? (see the sketch after this list)

                      Did you use the LLD feature before? (maybe old discovered items are still there)

                      2.) Why is the load 500+?
                      Please post:
                      - all last week's graphs of the Zabbix server (Zabbix performance graphs, CPU load/util, IOPs, network, Postgresql stats, ...)
                      - full output from the commands:
                      ps -ef
                      iostat -xk 10 10
                      mpstat -P ALL 10 2
                      - zabbix server config
                      - zabbix server log (errors)
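
                      Regarding the vacuum question above, a minimal sketch of how one could first check the items table for dead rows and then reclaim the space (VACUUM FULL rewrites the table and takes an exclusive lock, so it belongs in a maintenance window with the Zabbix server stopped):
                      Code:
                      -- dead vs. live tuples and the last (auto)vacuum runs for the items table
                      SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
                      FROM pg_stat_user_tables
                      WHERE relname = 'items';

                      -- rewrite the table to shrink it, then refresh planner statistics
                      VACUUM FULL VERBOSE items;
                      ANALYZE items;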
                      Devops Monitoring Expert advice: Dockerize/automate/monitor all the things.
                      My DevOps stack: Docker / Kubernetes / Mesos / ECS / Terraform / Elasticsearch / Zabbix / Grafana / Puppet / Ansible / Vagrant

