Zabbix not writing data to PostgreSQL DB, queue growing

  • user.zabbix
    Junior Member
    • Feb 2020
    • 25

    #1

    Zabbix not writing data to PostgreSQL DB, queue growing

    Hello!
    We have a problem with a growing queue on our Zabbix server.
    OS: Red Hat Enterprise Linux Server release 7.7
    DB: postgresql11 + timescaledb-postgresql-11-1.6.0
    Zabbix 4.2.8
    RAM: 64 GB
    CPU: 10 cores
    Disk: SSD array

    dd if=/dev/zero of=./testfile bs=1G count=40 oflag=direct
    40+0 records in
    40+0 records out
    42949672960 bytes (43 GB) copied, 33.1436 s, 1.3 GB/s
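
    (A side note on this benchmark: dd with bs=1G and oflag=direct measures sequential streaming bandwidth, while a database commit path is bounded by per-commit flush latency rather than bandwidth. A minimal sketch of the distinction, using a hypothetical 1 ms flush time that is an assumption, not a measurement:)

    ```python
    # Sequential bandwidth does not bound a database's commit rate.
    # The 1 ms flush latency below is a hypothetical figure, not a measurement.
    seq_bandwidth_gbs = 1.3         # from the dd run above (GB/s, sequential)
    flush_latency_ms = 1.0          # hypothetical per-commit flush time
    commits_per_sec = 1000 / flush_latency_ms
    print(commits_per_sec)          # 1000.0 synchronous commits/s, regardless of bandwidth
    ```

    So a fast dd result does not by itself rule out a flush-bound write path.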

    Zabbix writes data to the DB very slowly.

    But iostat and top show no problem with disk or CPU.


    We don't have any errors in zabbix_server.log.

    Our Zabbix config:
    grep -v "^#" /etc/zabbix/zabbix_server.conf|grep -v "^[[:space:]]*$"
    LogFile=/var/log/zabbix/zabbix_server.log
    LogFileSize=256
    PidFile=/var/run/zabbix/zabbix_server.pid
    SocketDir=/var/run/zabbix
    DBName=zabbix
    DBUser=zabbix
    DBPassword=-----------
    HistoryStorageDateIndex=1
    StartPollers=120
    StartPreprocessors=32
    StartPollersUnreachable=64
    StartPingers=8
    StartDiscoverers=64
    StartTimers=4
    StartEscalators=4
    StartAlerters=4
    SNMPTrapperFile=/var/log/snmptrap/snmptrap.log
    HousekeepingFrequency=12
    CacheSize=1512M
    StartDBSyncers=16
    HistoryCacheSize=256M
    HistoryIndexCacheSize=128M
    TrendCacheSize=256M
    ValueCacheSize=256M
    Timeout=4
    AlertScriptsPath=/opt/zabbix/zabbix_scripts/alertscripts
    ExternalScripts=/usr/lib/zabbix/externalscripts
    LogSlowQueries=3000
    ProxyConfigFrequency=300
    ProxyDataFrequency=300
    StartLLDProcessors=16
    StatsAllowedIP=127.0.0.1
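
    For scale, the shared-memory caches in this config sum as follows (this is Zabbix server shared memory, separate from PostgreSQL's shared_buffers; a quick sanity check, not a tuning recommendation):

    ```python
    # Shared-memory cache parameters from the config above, in MB.
    caches_mb = {
        "CacheSize": 1512,
        "HistoryCacheSize": 256,
        "HistoryIndexCacheSize": 128,
        "TrendCacheSize": 256,
        "ValueCacheSize": 256,
    }
    total_mb = sum(caches_mb.values())
    print(total_mb)  # 2408 MB -- modest next to the 64 GB of RAM
    ```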


    Last edited by user.zabbix; 20-02-2020, 13:25.
  • user.zabbix
    Junior Member
    • Feb 2020
    • 25

    #2
    Upgraded Zabbix to 4.4.5 :-(, same result...
    Last edited by user.zabbix; 20-02-2020, 16:36.


    • user.zabbix
      Junior Member
      • Feb 2020
      • 25

      #3
      I tested the DB; the DB works fine:
      pgbench -c 10 -j 2 -t 10000 zabbix
      starting vacuum...end.
      transaction type: <builtin: TPC-B (sort of)>
      scaling factor: 50
      query mode: simple
      number of clients: 10
      number of threads: 2
      number of transactions per client: 10000
      number of transactions actually processed: 100000/100000
      latency average = 0.952 ms
      tps = 10501.078070 (including connections establishing)
      tps = 10504.435852 (excluding connections establishing)

      pgbench -i -s 50 zabbix
      dropping old tables...
      creating tables...
      generating data...
      5000000 of 5000000 tuples (100%) done (elapsed 6.26 s, remaining 0.00 s)
      vacuuming...
      creating primary keys...
      done.
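
      One way to read these pgbench numbers: in a closed-loop benchmark, average latency ≈ clients / tps (Little's law), so the report is internally consistent and confirms sub-millisecond DB round trips:

      ```python
      # pgbench reports latency ~= clients / tps for a closed-loop run.
      clients = 10
      tps = 10501.078070  # "including connections establishing" figure above
      latency_ms = clients / tps * 1000
      print(round(latency_ms, 3))  # 0.952, matching the reported average
      ```

      That suggests raw DB transaction throughput is not the obvious bottleneck here.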


      • user.zabbix
        Junior Member
        • Feb 2020
        • 25

        #4

        There are messages in the system:
        Zabbix unreachable poller processes more than 75% busy
        Zabbix poller processes more than 75% busy
        More than 100 items having missing data for more than 10 minutes


        • tim.mooney
          Senior Member
          • Dec 2012
          • 1427

          #5
          You did a great job providing info about your install. I wish every question posed on these forums provided as much useful information as you have!

          How many hosts and more importantly, how many items are you monitoring? What is your server's "new values per second" (NVPS)? You can find # of items and NVPS in Monitoring->Dashboard, in the "widget" for Zabbix System Details.

          The numbers for several of your Zabbix processes have been increased by a large amount from the defaults. I'm talking about specifically

          StartPollers=120
          StartPreprocessors=32
          StartPollersUnreachable=64
          StartDiscoverers=64

          Out of curiosity, were these settings increased after consulting with e.g. Zabbix professional services or Zabbix support, or perhaps after the Zabbix template applied to your server suggested increasing some of these values? Or were they just set to large values when the system was initially configured, since you have a server with lots of available resources? I'm just trying to understand how your site arrived at those settings.

          Beyond the Zabbix settings, the main question I would have is whether any of the PostgreSQL performance tuning tools or suggestions have been applied to your database? The default config for PostgreSQL is generally OK for a variety of uses, but specific workloads can benefit greatly from careful tuning. If your environment is as large as I'm imagining it is, it may be very necessary to do some pgsql tuning. The pgsql wiki for performance tuning has lots of documentation (not all of it current, which makes it a bit more challenging), and the tools it mentions to help you analyze your config and suggest tuning changes are probably a good place to start.

          Bottlenecks and performance problems in a complex system are some of the most challenging problems to solve. I hope you'll post updates as you continue to diagnose and work on this problem. Probably the best advice I can give is to "make changes carefully". Even if you identify a bunch of things you want to change, I wouldn't change them all at once. Change settings one at a time or in small groups, and then allow enough time to determine whether that one change or set of changes had much impact.


          • user.zabbix
            Junior Member
            • Feb 2020
            • 25

            #6
            Foreword:
            before building this system, we had experience building a similar system for 1500 hosts with NVPS ~534 (avg CPU load 30%, peak CPU load 70%).

            That system was built on similar hardware and OS.

            We took the Zabbix and Postgres configs from the previous system,
            added 200 hosts, and now have NVPS ~671.

            Postgres was tuned by timescaledb-tune (with small manual tuning) on both systems.

            grep -v "^#" postgresql.conf|cut -d "#" -f1|grep -v "^[[:space:]]*$"
            listen_addresses='*'
            max_connections = 512
            shared_buffers = 7978MB
            work_mem = 10212kB
            maintenance_work_mem = 2047MB
            dynamic_shared_memory_type = posix
            effective_io_concurrency = 200
            max_worker_processes = 19
            max_parallel_workers_per_gather = 4
            max_parallel_workers = 8
            synchronous_commit = off
            wal_buffers = 32MB
            wal_writer_delay = 2000ms
            max_wal_size = 8GB
            min_wal_size = 4GB
            checkpoint_completion_target = 0.9
            random_page_cost = 1.1
            effective_cache_size = 23936MB
            default_statistics_target = 500
            log_destination = 'stderr'
            logging_collector = on
            log_directory = 'log'
            log_filename = 'postgresql-%a.log'
            log_truncate_on_rotation = on
            log_rotation_age = 1d
            log_rotation_size = 0
            log_error_verbosity = verbose
            log_line_prefix = '%m [%p] '
            log_timezone =---------
            autovacuum = off
            autovacuum_max_workers = 10
            autovacuum_naptime = 10
            datestyle = 'iso, mdy'
            timezone = ------------
            lc_messages = 'en_US.UTF8'
            lc_monetary = 'en_US.UTF8'
            lc_numeric = 'en_US.UTF8'
            lc_time = 'en_US.UTF8'
            default_text_search_config = 'pg_catalog.english'
            shared_preload_libraries = 'timescaledb'
            max_locks_per_transaction = 256
            timescaledb.max_background_workers = 8
            timescaledb.last_tuned = '2019-12-11T15:09:26+02:00'
            timescaledb.last_tuned_version = '0.7.0'

            We turned off synchronous_commit
            and increased wal_buffers.
            Last edited by user.zabbix; 22-02-2020, 14:57.


            • user.zabbix
              Junior Member
              • Feb 2020
              • 25

              #7
              The problem is we cannot see the bottleneck.
              We see huge queues, and "Utilization of unreachable poller data collector processes" at 100%.

              But top and pg_top show no process with high CPU load or waits :-(

              iostat shows no queue on the disks.

              As an experiment I left only one poller, but the queues grow and grow :-(



              • tim.mooney
                tim.mooney commented
                Editing a comment
                I would think that one poller would be too small for your environment, but I guess my hypothesis was that 120 pollers is much too large, and the pollers themselves are running into resource contention issues. That's just pure speculation on my part, though. I haven't run Zabbix with NVPS in the 500-700 range, so I don't have any estimate for what "normal" is for those settings. The Zabbix "Sizing" docs don't go into enough detail for situations like this either, unfortunately.

                If I were in your situation, and I was going to experiment, I would probably divide each of your StartPollers, StartPreprocessors, StartPollersUnreachable, and StartDiscoverers by a factor of 8 to 10, just to see if that makes a difference.

                Alternately, you may want to spend some time looking through the Large Environments section of the forums. Maybe you can find examples of what other people with 500-1000 NVPS are using for those settings? Keep in mind that if they're using a different database backend or if they're using traditional spinning disks, rather than your SSDs, their values still may not be best for your environment, but at least it would give you a range that has worked for other sites with about the same NVPS.
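
                The factor-of-8 reduction suggested above works out to roughly the following (a sketch; the exact divisor is a guess to be validated by observation, not a recommendation from the thread):

                ```python
                # Trial values from dividing the posted settings by 8 (the low
                # end of the suggested 8-10x reduction); purely illustrative.
                current = {
                    "StartPollers": 120,
                    "StartPreprocessors": 32,
                    "StartPollersUnreachable": 64,
                    "StartDiscoverers": 64,
                }
                trial = {name: max(1, value // 8) for name, value in current.items()}
                print(trial)
                # {'StartPollers': 15, 'StartPreprocessors': 4,
                #  'StartPollersUnreachable': 8, 'StartDiscoverers': 8}
                ```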
            • user.zabbix
              Junior Member
              • Feb 2020
              • 25

              #8
              I checked Postgres; no locks found.


              • tim.mooney
                Senior Member
                • Dec 2012
                • 1427

                #9
                Thanks for providing the additional information about the PostgreSQL tuning/settings too.

                With NVPS as large as yours, you may want to re-ask your question in the Zabbix for Large Environments area of the forums. I don't know for certain, but there may be people that watch that section of the forum and have experience scaling Zabbix to this size with PostgreSQL + TimeScale.

                I will watch this thread with interest, and if you post in the Large Environments area of the forum I'll watch that one too, but I don't think I have any other help I can offer. My environment is thankfully smaller, and so far scaling Zabbix hasn't been an issue for me.


                • Hamardaban
                  Senior Member
                  Zabbix Certified Specialist, Zabbix Certified Professional
                  • May 2019
                  • 2713

                  #10
                  You have a large SNMP queue... And I don't see the StartSNMPTrapper parameter in your config... Is it not listed by mistake, or is it really missing?


                  • user.zabbix
                    Junior Member
                    • Feb 2020
                    • 25

                    #11
                    We use only pollers, because we actively check our network equipment.
                    We have 400 devices with 48 ports per device and 12 metrics per port.
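
                    A rough back-of-the-envelope from these figures and the NVPS ~671 reported earlier (assumes all items are active; the arithmetic is illustrative, not from the thread):

                    ```python
                    # Item count implied by the network gear described above.
                    devices, ports_per_device, metrics_per_port = 400, 48, 12
                    items = devices * ports_per_device * metrics_per_port
                    print(items)  # 230400

                    # Average update interval implied by the reported ~671 NVPS.
                    nvps_reported = 671
                    implied_interval_s = items / nvps_reported
                    print(round(implied_interval_s))  # 343 seconds, if all items are polled
                    ```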


                    • Hamardaban
                      Senior Member
                      Zabbix Certified Specialist, Zabbix Certified Professional
                      • May 2019
                      • 2713

                      #12
                      Sorry, I made a mistake with "trapper". Of course, you are polling the devices, and traps have nothing to do with it... Have you tried changing the feature for using bulk queries? Do the polled devices themselves have enough resources to process requests?


                      • user.zabbix
                        Junior Member
                        • Feb 2020
                        • 25

                        #13
                        Today I tested my database again and got tps = 15828.479091. Why doesn't Zabbix write at this speed? :-(
                        createdb -O postgres -E Unicode -T template0 example
                        pgbench -i -s 500 example
                        pgbench -c 200 -j 200 -t 10000 example
                        result:
                        starting vacuum...end.
                        transaction type: <builtin: TPC-B (sort of)>
                        scaling factor: 500
                        query mode: simple
                        number of clients: 200
                        number of threads: 200
                        number of transactions per client: 10000
                        number of transactions actually processed: 2000000/2000000
                        latency average = 12.644 ms
                        tps = 15817.919698 (including connections establishing)
                        tps = 15828.479091 (excluding connections establishing)
                        -----------------------------------------------------------------------------------
                        pgbench -c 100 -j 100 -t 10000 example
                        starting vacuum...end.
                        transaction type: <builtin: TPC-B (sort of)>
                        scaling factor: 500
                        query mode: simple
                        number of clients: 100
                        number of threads: 100
                        number of transactions per client: 10000
                        number of transactions actually processed: 1000000/1000000
                        latency average = 5.925 ms
                        tps = 16876.988435 (including connections establishing)
                        tps = 16885.724589 (excluding connections establishing)
                        -----------------------------------------------------------------------------------
                        pgbench -c 50 -j 50 -t 10000 example
                        starting vacuum...end.
                        transaction type: <builtin: TPC-B (sort of)>
                        scaling factor: 500
                        query mode: simple
                        number of clients: 50
                        number of threads: 50
                        number of transactions per client: 10000
                        number of transactions actually processed: 500000/500000
                        latency average = 3.504 ms
                        tps = 14271.320883 (including connections establishing)
                        tps = 14275.799162 (excluding connections establishing)

                        Last edited by user.zabbix; 25-02-2020, 10:58.


                        • user.zabbix
                          Junior Member
                          • Feb 2020
                          • 25

                          #14
                          "Do the polled devices themselves have enough resources to process requests"
                          Yes, the equipment works properly; CPU load on the equipment is less than 20% :-[


                          • user.zabbix
                            Junior Member
                            • Feb 2020
                            • 25

                            #15
                            "Have you tried changing the feature for using bulk queries?"


                            What do you mean?

                            "Since Zabbix 2.2.3 Zabbix server and proxy daemons query SNMP devices for multiple values in a single request"

