Zabbix server to migrate from VM standalone to physical box
  • tamas-p
    Junior Member
    • Aug 2018
    • 15

    #1

    Zabbix server to migrate from VM standalone to physical box

    Hi,

    I was asked to move a few-years-old Zabbix server instance from a VM to a dedicated physical box.
    However, I am seeing major performance issues now that I have moved over all the data and config.
    I have taken many steps and tried many different things, but I cannot get my head around where the issue is.

    The source system:
    CentOS 6.5, Zabbix 2.2.2, MySQL Server 5.1.69.
    The VM has 4 CPU cores and 16 GB RAM assigned. Storage is on a dedicated SAN.
    It monitors about 650 hosts with a bit more than 45,000 items, 90-95% of them SNMP-based.
    CPU and memory are 100% maxed out, but there is no swapping.
    The installation works about fine at the moment, but when we wanted to add more items (not just port status but traffic metrics on all ports of the switches, which would mean an extra 10,000 items), the system became almost unusably slow on the web interface. Collection still works, with just some increase in the queue waits, but the web interface takes 5-10 seconds to load on each click, and the auto-refresh of pages is delayed.

    The first target system:
    CentOS 7.4, Zabbix 2.2.23, MariaDB Server 5.5.56.
    2x 4-core CPU + HT, 16 GB RAM installed, RAID 1 SSD storage.

    Other config I have tried, with the very same result:
    CentOS 7.4, Zabbix 2.2.23, MariaDB Server 5.5.56.
    2x 8-core CPU + HT, 192 GB RAM installed, RAID 5 enterprise HDDs.

    Scenario 1:
    I perform a clean install on the target system: install the packages from the standard CentOS repos and Zabbix from the Zabbix repo, as described on the website.
    I dump the zabbix database on the source (90 GB), copy over all the config files (Zabbix server/agent, external scripts, my.cnf), change only the values referring to the server address (rename), import the SQL data, and start up the Zabbix server on the target system. Roughly, the transfer was done as sketched below.
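
    The hostnames, paths, and exact flags here are placeholders, not my real ones:

    # On the source VM: consistent dump of the ~90 GB zabbix DB
    mysqldump --single-transaction zabbix | gzip > /tmp/zabbix.sql.gz

    # Copy the dump and the config files over to the target box
    scp /tmp/zabbix.sql.gz /etc/zabbix/zabbix_server.conf /etc/my.cnf newbox:/root/

    # On the target: recreate the DB, import, then start the server
    mysql -e "CREATE DATABASE zabbix CHARACTER SET utf8 COLLATE utf8_bin;"
    gunzip < /root/zabbix.sql.gz | mysql zabbix
    systemctl start zabbix-server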
    The transfer is seemingly successful and everything works: I can log in with a user created on the source and can see all the hosts and templates.
    However, on both HW configs, the 10min queue constantly holds about 5000-7000 items, which is not the case on the VM source. Most of the items do eventually update, but on many important graphs I see only intermittent values (once every half hour, or even once every 4 hours) where the collection interval should be 30 sec, or at least under a minute.
    At the same time, I also see a lot of unreachable/timeout entries in the logs for many hosts, which is definitely not true, as the same hosts work fine from the VM source and also via other means of monitoring.

    Scenario 2:
    The target machine is installed with Zabbix 3.4 LTS or 4.0.1. I move the monitoring over via exports from the source Zabbix web interface (templates, hosts, etc.), with no SQL-level dumping or anything like that.
    Again, I have tried this on both HW configs, and as soon as the number of monitored items reaches the level of 1000, I start to see items in the 10min queue.

    Scenario 3:
    I install the big machine as the Zabbix server and SQL database server, and the small machine becomes a proxy and web frontend, with a local proxy data store caching 24 hours of data.
    The result is the same as in scenario 2...

    It is also obviously similar in all cases that the HW boxes use minimal CPU (none of the cores goes over 10% utilisation) and memory usage never really goes over 10 GB, as opposed to the VM, where 4 cores are at a constant 100%, and memory likewise.
    In the VM process list I can see the pollers and the other Zabbix processes all the time, and almost all of them have something to do, but on the physical boxes I see only intermittent activity on them, regardless of how many instances I use. (I have tried 20 to 100; the VM runs with 50.)

    I have spent a vast amount of time looking at the logs, but apart from the "became unreachable" and "became reachable" messages I don't see any reason for it. Just timeouts, which are obviously not real.

    I wrote a small test script which polls a host that very commonly shows as unreachable on the target system: I issued 100 snmpwalks at 1 sec frequency, and only 3 of them came back in 3 sec; the rest came back in under 1 sec, even though the walk was fairly big, covering about 8 different metrics for each of the 36 HDDs in a NAS. (For clarity: the test was run on both physical boxes.) The script was essentially the loop below.
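
    The host name, community string, and OID here are placeholders for my real ones:

    # 100 timed snmpwalks at 1-second intervals against the "unreachable" host
    for i in $(seq 1 100); do
        /usr/bin/time -f "walk $i: %e s" snmpwalk -v2c -c public nas01 HOST-RESOURCES-MIB::hrStorageTable > /dev/null
        sleep 1
    done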

    Scenario 4:
    I reinstalled the full system from scratch on the bigger box: installed Zabbix 4 again and started to manually re-create the templates for 2 routers and a NAS system with 5 nodes. When I finished with the templates and added all the hosts, I already started to see some items in the 10min queue, even though there are only 7 hosts and 1300 items to monitor. And 2 NAS hosts still report timeouts and have not created all their discovery items after some 24 hours.
    The CPU usage on the machine is basically zero, and after about 36 hours of running, memory is just below 3 GB.
    I have spent about 2 days trying to optimise the SQL side in case that was the problem, but it cannot be: with such a low amount of data, an out-of-the-box SQL install would perform fine on ANY machine...

    Also, as an addition: disk IO on the server in Scenario 4 averages 50 KB/s read and around 100 KB/s write, and peak traffic on the network interface is around 400 Kb/s, including the terminal traffic I'm working over...
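
    For reference, numbers like these can be read with the usual sysstat tools, nothing Zabbix-specific:

    # Extended per-device IO statistics, 5-second intervals
    iostat -x 5
    # Per-interface network throughput, 5-second intervals
    sar -n DEV 5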

    So at this point, I have no more ideas about where the bottleneck in the system is, or why a 10-times-stronger physical box cannot even reach the performance of a low-end VM...

    Could anyone help me, or just give some ideas where I should look?

    Thanks





  • tamas-p
    Junior Member
    • Aug 2018
    • 15

    #2
    ## Zabbix server config in Scenario 4 ##
    LogFile=/var/log/zabbix/zabbix_server.log
    LogFileSize=100
    LogSlowQueries=3000

    # Range: 0-5
    # Default: 3
    DebugLevel=3


    PidFile=/var/run/zabbix/zabbix_server.pid


    ## MySQL ##
    # DBHost=localhost
    # DBPort=3306
    DBName=zabbix
    DBUser=zabbix

    StartPollers=30
    # StartIPMIPollers=0
    StartPollersUnreachable=3
    # StartTrappers=5
    # StartPingers=1
    StartDiscoverers=3
    # StartHTTPPollers=1
    StartTimers=3

    # HousekeepingFrequency=1
    MaxHousekeeperDelete=5000

    # SenderFrequency=30

    CacheSize=256M
    # CacheUpdateFrequency=60
    StartDBSyncers=8
    HistoryCacheSize=128M
    TrendCacheSize=64M
    #HistoryTextCacheSize=128M
    ValueCacheSize=256M

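    # The values below govern item check timeouts and host unreachability handling.
    # For reference, the Zabbix defaults are Timeout=3, UnreachablePeriod=45,
    # UnavailableDelay=60, UnreachableDelay=15 (all in seconds).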
    Timeout=20
    # TrapperTimeout=300
    UnreachablePeriod=30
    UnavailableDelay=120
    UnreachableDelay=60


    • tamas-p
      Junior Member
      • Aug 2018
      • 15

      #3
      [mysqld]
      #user=mysql

      #skip-name-resolve=1

      #table_open_cache=64
      #
      query_cache_size=128M
      query_cache_limit=8M
      #
      sort_buffer_size=16M
      join_buffer_size=32M
      read_buffer_size=32M
      max_connections=100

      key_buffer_size=256M
      thread_cache_size=8

      # innodb
      innodb=on
      innodb_file_per_table=1
      innodb_flush_method=O_DIRECT
      innodb_buffer_pool_instances=4
      innodb_buffer_pool_size=4G
      innodb_log_file_size=48M


      log_slow_queries=1
      slow_query_log_file=/var/log/mariadb/slow_query.log


      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #4
        Originally posted by tamas-p
        query_cache_size=128M
        query_cache_limit=8M
        You can change the above to the minimal values allowed by the mysqld configuration.
        Because almost all selects operate on a constantly moving time window of data, queries get ~100% cache misses, so spending any memory on query cache results is a waste of memory.
        This effect can be observed with my Service MySQL template, which implements query cache effectiveness monitoring.
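
        The effect is also visible directly in the standard MySQL status counters, for example:

        # Query cache hits vs. inserts (an insert is a miss that then got cached)
        mysql -e "SHOW GLOBAL STATUS LIKE 'Qcache%';"
        # Total selects, for computing the hit ratio
        mysql -e "SHOW GLOBAL STATUS LIKE 'Com_select';"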
        http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
        https://kloczek.wordpress.com/
        zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
        My zabbix templates https://github.com/kloczek/zabbix-templates


        • tamas-p
          Junior Member
          • Aug 2018
          • 15

          #5
          Originally posted by kloczek

          You can change the above to the minimal values allowed by the mysqld configuration.
          Because almost all selects operate on a constantly moving time window of data, queries get ~100% cache misses, so spending any memory on query cache results is a waste of memory.
          This effect can be observed with my Service MySQL template, which implements query cache effectiveness monitoring.
          Hi,

          This particular setting is the same on the source VM instance. It is not a problem to change it, though; the big machine I'm using as the physical box has 200 GB RAM, but I have never seen memory usage go over 10-and-a-bit GB.
          I will try your templates to see if they uncover something.

          Thanks



          • kloczek
            Senior Member
            • Jun 2006
            • 1771

            #6
            Originally posted by tamas-p
            This particular setting is the same on the source VM instance. It is not a problem to change it, though; the big machine I'm using as the physical box has 200 GB RAM, but I have never seen memory usage go over 10-and-a-bit GB.
            I will try your templates to see if they uncover something.
            If you are still using innodb_buffer_pool_size=4G, your DB will be using only 4 GB plus some page cache memory.
            Just increase innodb_buffer_pool_size and start observing MySQL's internal metrics about memory usage.
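
            For example (the 128G figure is only an illustration for a 192 GB box, not a sizing recommendation):

            # /etc/my.cnf - give InnoDB most of the RAM on a dedicated DB host
            innodb_buffer_pool_size=128G

            # After a restart, watch the pool actually fill up:
            mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';"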
            http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
            https://kloczek.wordpress.com/
            zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
            My zabbix templates https://github.com/kloczek/zabbix-templates


            • tamas-p
              Junior Member
              • Aug 2018
              • 15

              #7
              OK, when I tried to add your SQL templates, I had to face the fact that they are not really compatible with my MariaDB setup. So I decided to rebuild the big machine with MySQL Community Server 5.7 (I used dump and import for the data).
              The same issue is present. In addition, for some reason the SNMP MIB path was broken: snmpwalk claimed the MIB path is /usr/share/snmp/mibs but still wasn't using any of the MIBs in there, regardless of file permissions, and Zabbix likewise claimed the MIB files weren't there. In previous installations this was never an issue. I got around it by manually creating /etc/snmp/snmp.conf and adding the mibdirs option, as below.
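
              A minimal sketch of the file; the mibs line assumes you want every installed MIB loaded:

              # /etc/snmp/snmp.conf
              mibdirs /usr/share/snmp/mibs
              mibs +ALL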

              MySQL doesn't make a difference over MariaDB, as the same problem exists.

              This is what I monitor at the moment on the "big" hardware (lots of CPU, lots of RAM, no system resource in use at all); these are the frontend status figures:
              Number of hosts (enabled / disabled / templates): 94 (9 / 0 / 85)
              Number of items (enabled / disabled / not supported): 1677 (1671 / 0 / 6)
              Number of triggers (enabled / disabled [problem / ok]): 569 (569 / 0 [1 / 568])

              But the 60s CPU performance poll still only gets an actual value once every 30-90 minutes, completely intermittently; sometimes it works fine for a couple of minutes, but mostly it just hangs. At the same time there is no queue pile-up at all on the system, and as I mentioned, CPU and memory use on the server is below 10%, or even below 5%...
              The logs continuously complain that SNMP to the hosts timed out. But at the same time, if I run that shell script from the Zabbix server to query a big walk every second, it works fine for 100 iterations...

              I still have no idea why Zabbix itself cannot keep up with the polls here...
