Zabbix 3.2 is extremely slow with 20000 monitored hosts.

  • bjornskau
    Junior Member
    • Apr 2018
    • 17

    #1

    Zabbix 3.2 is extremely slow with 20000 monitored hosts.

    Hi guys!
    I haven't used Zabbix in such a huge enterprise before, so I am a bit confused about its speed. Can anybody give me some advice?

    I use a three-node MariaDB Galera Cluster; each node is a Hyper-V VM located on a separate Hyper-V server and has 72 GB RAM, 16 vCPUs and a dynamically allocated VHDX. The configuration is identical on each node and reads as follows:

    cat /etc/my.cnf
    #
    # This group is read both by the client and the server
    # use it for options that affect everything
    #
    [client-server]

    [mysqld]
    general_log_file = /var/log/mysqld.log
    general_log = 1

    log-error = /var/log/mysqld.error.log



    wsrep_on=ON
    wsrep_node_name=r61zabbixdb02
    wsrep_node_address="<ip-node-a>"
    wsrep_provider=/usr/lib64/galera/libgalera_smm.so
    wsrep_provider_options="gcache.size=10G;gcache.recover=yes"
    wsrep_cluster_name="zabbix"
    wsrep_cluster_address="gcomm://<ip-node-a>,<ip-node-b>,<ip-node-c>"
    wsrep_sst_method=rsync

    max_connections=3000
    datadir=/var/lib/mysql
    socket=/var/lib/mysql/mysql.sock
    user=mysql
    binlog_format=ROW
    default_storage_engine=innodb
    innodb_autoinc_lock_mode=2
    innodb_flush_log_at_trx_commit=0
    innodb_buffer_pool_size=49152M
    innodb_buffer_pool_instances=16
    innodb_old_blocks_time=1000
    query_cache_size = 128M
    sync_binlog=0

    slow_query_log = 1
    slow_query_log_file = /var/lib/mysql/slow_queries.log
    long_query_time = 0.05
    log-queries-not-using-indexes = 1
    max_allowed_packet=64M
    character_set_server=utf8
    collation-server=utf8_bin
    init_connect="SET NAMES utf8 collate utf8_bin"

    [mysqld_safe]
    log-error = /var/log/mysqld.error.log
    pid-file=/var/run/mysqld/mysqld.pid


    #
    # include all files from the config directory
    #

    The Zabbix server is also a Hyper-V VM, and its zabbix_server.conf is as follows:

    StartPollers=300
    StartPollersUnreachable=100
    StartDiscoverers=250
    StartPingers=5
    CacheUpdateFrequency=150
    StartTrappers=300
    StartEscalators=10
    StartDBSyncers=100
    HistoryCacheSize=512M
    HistoryIndexCacheSize=512M
    CacheSize=6G
    ValueCacheSize=1G
    ListenPort=10051

    LogFile=/var/log/zabbix/zabbix_server.log
    LogFileSize=0
    DebugLevel=3
    PidFile=/var/run/zabbix/zabbix_server.pid
    DBHost=<floating ip of haproxy cluster>
    DBName=zabbix
    DBUser=zabbix
    DBPassword=aqNt7BFNT29OrMIYDb4F
    DBPort=3306
    SNMPTrapperFile=/var/log/snmptrap/snmptrap.log
    ListenIP=<zabbix-server-ip>
    HousekeepingFrequency=1
    MaxHousekeeperDelete=100000
    TrendCacheSize=1024M
    Timeout=4
    AlertScriptsPath=/usr/lib/zabbix/alertscripts
    ExternalScripts=/usr/lib/zabbix/externalscripts
    LogSlowQueries=3000
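
    With LogSlowQueries=3000 set as above, a rough indication of whether the DB is holding the server back is how often slow queries show up in the server log (path taken from LogFile above; the exact message wording may differ between versions):

    # count slow-query warnings logged by the Zabbix server
    grep -c 'slow query' /var/log/zabbix/zabbix_server.log
    # show the most recent ones to see which statements are affected
    grep 'slow query' /var/log/zabbix/zabbix_server.log | tail -n 20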



    The Zabbix server connects to the MariaDB cluster via HAProxy, which runs on a two-node Pacemaker cluster that also hosts the Zabbix frontend. Each node of the HAProxy cluster has 16 GB RAM and 8 vCPUs.
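
    For completeness, the HAProxy part is just a plain TCP proxy in front of the three Galera nodes, roughly along the lines of the sketch below (node names, IPs and check options here are placeholders, not the exact configuration):

    # haproxy.cfg (sketch)
    listen galera
        bind <floating ip of haproxy cluster>:3306
        mode tcp
        balance leastconn
        option tcpka
        # only one node takes writes at a time (the "primary" mentioned below);
        # the other two are backups
        server <node-a> <ip-node-a>:3306 check
        server <node-b> <ip-node-b>:3306 check backup
        server <node-c> <ip-node-c>:3306 check backup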






  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    1) The number of monitored hosts has nothing to do with Zabbix performance. The only relevant factor is NVPS, i.e. the effective bandwidth of the monitoring data. You may have 1000 hosts with one metric per host or one host with 1000 metrics; from a performance point of view those two cases are equal.
    2) Using a horizontally scaled active-active DB backend has its own performance impact, and such a setup will not scale beyond a certain data flow into the DB backend.
    3) Look at the IO statistics of your DB backend hosts (see the iostat example below).
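
    If nothing is collecting that yet, plain iostat from the sysstat package gives a quick first impression, for example:

    # extended per-device statistics, kB units, 20-second interval, 3 samples;
    # the r/s, w/s (IO/s) and %util columns are the interesting ones
    iostat -dxk 20 3
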
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates


    • bjornskau
      Junior Member
      • Apr 2018
      • 17

      #3
      Originally posted by kloczek
      1) The number of monitored hosts has nothing to do with Zabbix performance. The only relevant factor is NVPS, i.e. the effective bandwidth of the monitoring data. You may have 1000 hosts with one metric per host or one host with 1000 metrics; from a performance point of view those two cases are equal.
      2) Using a horizontally scaled active-active DB backend has its own performance impact, and such a setup will not scale beyond a certain data flow into the DB backend.
      3) Look at the IO statistics of your DB backend hosts.


      My NVPS is 28827.97.
      Here is the output of pt-diskstats for db02 (which acts as primary at the time of testing):
      pt-diskstats --interval=20 --iterations=3

      #ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
      14.6 sda 0.0 0.0 0.0 0% 0.0 0.0 108.0 6.7 0.7 6% 0.0 0.3 3% 0 108.0 0.0 0.2
      14.6 sda3 0.0 0.0 0.0 0% 0.0 0.0 47.9 15.1 0.7 13% 0.0 0.4 1% 0 47.9 0.1 0.3
      14.6 dm-0 0.0 0.0 0.0 0% 0.0 0.0 112.2 6.4 0.7 0% 0.0 0.3 3% 0 112.2 0.1 0.2

      20.0 sda 0.0 0.0 0.0 0% 0.0 0.0 167.2 7.2 1.2 4% 0.1 0.3 4% 0 167.2 0.1 0.2
      20.0 sda3 0.0 0.0 0.0 0% 0.0 0.0 74.6 16.0 1.2 9% 0.0 0.4 2% 0 74.6 0.1 0.3
      20.0 dm-0 0.0 0.0 0.0 0% 0.0 0.0 171.8 7.0 1.2 0% 0.1 0.3 4% 0 171.8 0.1 0.2

      20.0 sda 0.4 512.0 0.2 0% 0.0 19.4 493.8 3.7 1.8 2% 0.1 0.3 11% 0 494.2 0.0 0.2
      20.0 sda3 0.4 512.0 0.2 0% 0.0 19.4 192.1 9.6 1.8 5% 0.1 0.3 6% 0 192.5 0.1 0.3
      20.0 dm-0 0.4 512.0 0.2 0% 0.0 19.5 500.4 3.7 1.8 0% 0.1 0.3 11% 0 500.8 0.1 0.2

      db01 (which acts as secondary):
      #ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
      11.9 sda 0.1 16.0 0.0 0% 0.0 8.0 748.6 2.4 1.8 0% 0.3 0.3 17% 0 748.7 0.1 0.2
      11.9 sda2 0.0 0.0 0.0 0% 0.0 0.0 0.2 4.0 0.0 0% 0.0 0.5 0% 0 0.2 0.0 0.5
      11.9 sda3 0.1 16.0 0.0 0% 0.0 8.0 179.2 10.2 1.8 1% 0.1 0.4 5% 0 179.3 0.1 0.3
      11.9 dm-0 0.1 16.0 0.0 0% 0.0 8.0 462.0 4.0 1.8 0% 0.2 0.4 16% 0 462.1 0.1 0.4

      20.0 sda 0.0 0.0 0.0 0% 0.0 0.0 744.9 2.2 1.6 0% 0.2 0.3 16% 0 744.9 0.1 0.2
      20.0 sda2 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0
      20.0 sda3 0.0 0.0 0.0 0% 0.0 0.0 175.7 9.2 1.6 0% 0.1 0.4 5% 0 175.7 0.1 0.3
      20.0 dm-0 0.0 0.0 0.0 0% 0.0 0.0 456.8 3.5 1.6 0% 0.2 0.4 16% 0 456.8 0.0 0.3

      20.0 sda 0.0 0.0 0.0 0% 0.0 0.0 696.3 2.8 1.9 0% 0.2 0.3 15% 0 696.3 0.1 0.2
      20.0 sda2 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0
      20.0 sda3 0.0 0.0 0.0 0% 0.0 0.0 169.2 11.5 1.9 0% 0.1 0.4 5% 0 169.2 0.1 0.3
      20.0 dm-0 0.0 0.0 0.0 0% 0.0 0.0 429.3 4.5 1.9 0% 0.2 0.4 15% 0 429.3 0.1


      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #4
        "IO statistics"-> number of IO/s (read and write)
        Do you have disks IO monitoring on these hosts?


        • bjornskau
          Junior Member
          • Apr 2018
          • 17

          #5
          Originally posted by kloczek
          "IO statistics"-> number of IO/s (read and write)
          Do you have disks IO monitoring on these hosts?
          Actually, no.


          • kloczek
            Senior Member
            • Jun 2006
            • 1771

            #6
            Originally posted by bjornskau
            Actually, no.
            I usually describe this kind of situation as "the shoemaker without shoes".
            If you are using the standard OOTB Zabbix templates and your DB backend runs on Linux, it is (kind of) not your fault, as those templates do not provide IO statistics.
            If you want, you can use my "OS Linux" template.
            It has an additional "DSK:" LLD rule which adds four item prototypes:

            {#DISK}::read::bytes vfs.dev.read[{#DISK},sectors]
            {#DISK}::read::IOs vfs.dev.read[{#DISK},operations]
            {#DISK}::write::bytes vfs.dev.write[{#DISK},sectors]
            {#DISK}::write::IOs vfs.dev.write[{#DISK},operations]

            two graph prototypes, DSK::{#DISK}::bytes and DSK::{#DISK}::IOs,
            and one screen, "DSK", which presents all the graphs together.
            Feel free to use it, criticise it and/or contribute.
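
            The prototypes use the standard agent keys, so they can be tested directly against any agent before importing the template, for example (host and device are placeholders):

            # raw read/write operation counters for one device, straight from the agent
            zabbix_get -s <db-host> -k 'vfs.dev.read[sda,operations]'
            zabbix_get -s <db-host> -k 'vfs.dev.write[sda,operations]'
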
            Last edited by kloczek; 13-04-2018, 10:36.

