450 max monitored hosts with Zabbix 3.4 - ongoing problem for 6 months

  • mellis
    Senior Member
    • Oct 2017
    • 145

    #1

    Afternoon,
    I am having problems getting the number of monitored hosts over 450 with version 3.4.7.
    Currently I have 454 hosts with 44000 items, requiring 353 new values per second.
    I run on two servers: one with 8GB RAM and 4 vCPUs, which hosts zabbix-server and the Zabbix GUI web server; the other, with 40GB RAM and 4 vCPUs, runs a MySQL 5.7.22 database.

    My problem seems to be that after an hour or less all the hosts go to an unreachable state. I do not see errors in zabbix_server.log, and if I query the database directly they are still available. I do see some MySQL errors:

    2018-06-12T15:21:54.221486Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4914ms. The settings might not be optimal. (flushed=1111 and evicted=0, during the time.)
    2018-06-12T15:35:06.444614Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4742ms. The settings might not be optimal. (flushed=2016 and evicted=0, during the time.)
    2018-06-12T15:41:22.550261Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4450ms. The settings might not be optimal. (flushed=1612 and evicted=0, during the time.)
    2018-06-12T15:50:48.065072Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4800ms. The settings might not be optimal. (flushed=1923 and evicted=0, during the time.)
    2018-06-12T15:56:36.151414Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 6321ms. The settings might not be optimal. (flushed=3368 and evicted=0, during the time.)

    If I restart zabbix-server, these clear up again for 30 minutes to an hour.

    Attached are some graphs of the Zabbix processes.

    zabbix_server.conf
    StartPollers=464
    # StartIPMIPollers=0
    StartPreprocessors=36
    StartPollersUnreachable=48
    StartTrappers=56
    # StartPingers=1
    # StartDiscoverers=1
    # StartHTTPPollers=1
    StartTimers=16
    StartEscalators=10
    # JavaGatewayPort=10052
    # StartJavaPollers=0
    # VMwareFrequency=60
    # VMwarePerfFrequency=60
    # VMwareCacheSize=8M
    # VMwareTimeout=10
    SNMPTrapperFile=/var/log/snmptrap/snmptrap.log
    # StartSNMPTrapper=0
    # ListenIP=127.0.0.1
    # HousekeepingFrequency=1
    MaxHousekeeperDelete=100000
    # SenderFrequency=30
    CacheSize=512M
    CacheUpdateFrequency=900
    StartDBSyncers=54
    HistoryCacheSize=1024M
    HistoryIndexCacheSize=128M
    TrendCacheSize=128M
    ValueCacheSize=256M
    Timeout=30
    # TrapperTimeout=300
    # UnreachablePeriod=45
    # UnavailableDelay=60
    # UnreachableDelay=15
    AlertScriptsPath=/usr/lib/zabbix/alertscripts
    ExternalScripts=/usr/lib/zabbix/externalscripts

    my.cnf

    show_compatibility_56 = ON
    performance_schema
    innodb_data_home_dir=/home/mysql
    innodb_flush_log_at_trx_commit = 2
    innodb_file_per_table = 1
    innodb_buffer_pool_instances = 23
    innodb_buffer_pool_size = 22G
    innodb_page_cleaners=64
    innodb_flush_method = O_DIRECT
    innodb_doublewrite = 0
    innodb_checksum_algorithm=NONE
    innodb_read_io_threads = 64
    innodb_write_io_threads = 64
    tmp_table_size = 42M
    max_heap_table_size = 42M
    max_allowed_packet = 1024M
    thread_cache_size = 14
    join_buffer_size = 2048M
    sort_buffer_size = 4M
    read_rnd_buffer_size = 4M
    query_cache_size = 0
    query_cache_type = 0
    query_cache_limit = 2M
    max_connections = 724
    wait_timeout = 9600
    interactive_timeout = 9600
    datadir=/home/mysql
    socket=/var/lib/mysql/mysql.sock
    # Disabling symbolic-links is recommended to prevent assorted security risks
    symbolic-links=0
    #logging stuff
    log-error=/var/log/mysqld.log
    slow_query_log = 1
    slow_query_log_file = /var/log/slow_quiery.log
    pid-file=/var/run/mysqld/mysqld.pid


  • kernbug
    Senior Member
    • Feb 2013
    • 330

    #2
    Hello, Mellis

    You have over-tuned the Zabbix server configuration file for a very small NVPS value. Mutual locking among the unneeded processes will cripple your performance.

    Make a backup of the current configuration file and try this:
    Code:
    ### zabbix_server.conf
    StartPollers=30
    # StartIPMIPollers=0
    StartPreprocessors=10
    StartPollersUnreachable=10
    StartTrappers=10
    # StartPingers=1
    # StartDiscoverers=1
    # StartHTTPPollers=1
    StartTimers=16
    StartEscalators=10
    
    ## Yes, start with a small number:
    StartDBSyncers=4
    Also run this in MySQL: SET optimizer_switch='index_condition_pushdown=off';
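
To put the two configurations side by side, here is a minimal Python sketch that totals only the process counts explicitly set in the two zabbix_server.conf versions posted in this thread and compares them with `max_connections` from the posted my.cnf. It assumes, as a rough upper bound, one database connection per server process; not every process type necessarily holds one, and processes left at their commented-out defaults are not counted.

```python
# Process counts copied from the two zabbix_server.conf versions in this thread.
ORIGINAL = {
    "StartPollers": 464,
    "StartPreprocessors": 36,
    "StartPollersUnreachable": 48,
    "StartTrappers": 56,
    "StartTimers": 16,
    "StartEscalators": 10,
    "StartDBSyncers": 54,
}
SUGGESTED = {
    "StartPollers": 30,
    "StartPreprocessors": 10,
    "StartPollersUnreachable": 10,
    "StartTrappers": 10,
    "StartTimers": 16,
    "StartEscalators": 10,
    "StartDBSyncers": 4,
}
MAX_CONNECTIONS = 724  # max_connections from the posted my.cnf


def total_processes(conf):
    """Sum the explicitly configured process counts."""
    return sum(conf.values())


def connection_headroom(conf, max_connections=MAX_CONNECTIONS):
    """Connections left over, ASSUMING one DB connection per process."""
    return max_connections - total_processes(conf)


for name, conf in (("original", ORIGINAL), ("suggested", SUGGESTED)):
    print(name, total_processes(conf), connection_headroom(conf))
```

Under that assumption the original settings leave little room below `max_connections`, while the suggested ones leave most of it unused.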


    • kloczek
      Senior Member
      • Jun 2006
      • 1771

      #3
      Originally posted by kernbug
      Also run this in MySQL: SET optimizer_switch='index_condition_pushdown=off';
      Nope. Sorry to say this, but you are wrong.
      In this case it looks like it is time to stop plowing through DB storage with DELETE queries and to introduce at least partitioned history* tables.
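
As a hedged illustration of the partitioning idea, the sketch below generates a MySQL `ALTER TABLE ... PARTITION BY RANGE` statement with one partition per day. It assumes the table has an integer `clock` column holding Unix time and that every unique key on the table includes `clock`. The helper name `daily_partition_ddl`, the partition naming scheme, and the UTC day boundaries are illustrative choices, not something prescribed in this thread.

```python
from datetime import date, datetime, timedelta, timezone


def daily_partition_ddl(table, first_day, days):
    """Build an ALTER TABLE statement with one RANGE partition per day on `clock`."""
    parts = []
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        upper = day + timedelta(days=1)
        # Each partition holds rows with clock below the next midnight (UTC).
        bound = int(
            datetime(upper.year, upper.month, upper.day, tzinfo=timezone.utc).timestamp()
        )
        parts.append(f"  PARTITION p{day:%Y%m%d} VALUES LESS THAN ({bound})")
    return (
        f"ALTER TABLE {table} PARTITION BY RANGE (clock) (\n"
        + ",\n".join(parts)
        + "\n);"
    )


ddl = daily_partition_ddl("history", date(2018, 6, 12), 3)
print(ddl)
```

With partitions in place, old data can be removed by dropping whole partitions instead of running DELETE queries. Such a statement rebuilds the table, so it is worth trying on a copy of the database first.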

      http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
      https://kloczek.wordpress.com/
      zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
      My zabbix templates https://github.com/kloczek/zabbix-templates


      • kernbug
        Senior Member
        • Feb 2013
        • 330

        #4
        Originally posted by kloczek

        Nope. Sorry to say this, but you are wrong.
        In this case it looks like it is time to stop plowing through DB storage with DELETE queries and to introduce at least partitioned history* tables.

        With the history syncer at 10% max? OK, in any case partitioning is the right way to go for a growing instance.


        • kloczek
          Senior Member
          • Jun 2006
          • 1771

          #5
          Originally posted by kernbug


          With the history syncer at 10% max? OK, in any case partitioning is the right way to go for a growing instance.
          In this case, what zabbix[wcache,history,used] shows is more important.

          • mellis
            Senior Member
            • Oct 2017
            • 145

            #6
            I made this change and the poller is running at 100%. The restart did recover all of the hosts.

            The problem has gone away, but the poller is stuck at 100%. Is this a problem?


            • kernbug
              Senior Member
              • Feb 2013
              • 330

              #7
              Originally posted by mellis
              I made this change and the poller is running at 100%
              Hello

              Partitioning or config changes?


              • mellis
                Senior Member
                • Oct 2017
                • 145

                #8
                I made the configuration changes and the problem cleared up. I did increase the pollers a bit to get them off 100%. I am still seeing the error in the mysql.log. I plan to do the partitioning this week when I get a couple of hours free; I want to back up the database first.

                Thanks to all who replied about the issue.


                • kernbug
                  Senior Member
                  • Feb 2013
                  • 330

                  #9
                  Originally posted by mellis
                  I made the configuration changes and the problem cleared up. I did increase the pollers a bit to get them off 100%. I am still seeing the error in the mysql.log. I plan to do the partitioning this week when I get a couple of hours free; I want to back up the database first.

                  Thanks to all who replied about the issue.
                  Please note that the page cleaner messages point to a MySQL misconfiguration or a hardware performance problem:
                  Code:
                  2018-06-12T15:21:54.221486Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4914ms. The settings might not be optimal. (flushed=1111 and evicted=0, during the time.)

                  https://stackoverflow.com/questions/...-loop-took-xxx
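
To see how far the page cleaner falls behind its 1000 ms target, a small Python sketch can pull the numbers out of the log lines posted at the top of this thread. The regular expression is written against that exact message wording, and the pages-per-second figure is only a rough estimate that assumes the flushed pages were written evenly over the reported loop time.

```python
import re

# The five page_cleaner notes quoted in the first post.
LOG_LINES = [
    "2018-06-12T15:21:54.221486Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4914ms. The settings might not be optimal. (flushed=1111 and evicted=0, during the time.)",
    "2018-06-12T15:35:06.444614Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4742ms. The settings might not be optimal. (flushed=2016 and evicted=0, during the time.)",
    "2018-06-12T15:41:22.550261Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4450ms. The settings might not be optimal. (flushed=1612 and evicted=0, during the time.)",
    "2018-06-12T15:50:48.065072Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4800ms. The settings might not be optimal. (flushed=1923 and evicted=0, during the time.)",
    "2018-06-12T15:56:36.151414Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 6321ms. The settings might not be optimal. (flushed=3368 and evicted=0, during the time.)",
]

PATTERN = re.compile(
    r"intended loop took (\d+)ms.*\(flushed=(\d+) and evicted=(\d+)"
)


def parse_page_cleaner(lines):
    """Return (took_ms, flushed, evicted) for every matching log line."""
    rows = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            rows.append(tuple(int(group) for group in match.groups()))
    return rows


rows = parse_page_cleaner(LOG_LINES)
for took_ms, flushed, evicted in rows:
    # Rough flush rate: pages written divided by the loop duration in seconds.
    rate = flushed / (took_ms / 1000)
    print(f"{took_ms} ms for {flushed} pages, about {rate:.0f} pages/s")
```

Every loop in this sample took more than four times the intended 1000 ms.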
