Ad Widget

Collapse

Zabbix server locks up when starts syncing history after housekeeper

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • Guntis
    Junior Member
    • Mar 2019
    • 18

    #1

    Zabbix server locks up when starts syncing history after housekeeper

    Hi!

    The issue is that every one in a while Zabbix server stops processing any items (not at constant time). When checking zabbix_server.log I can see lines:
    31024:20201125:140556.193 slow query: 703.504010 sec, "insert into history_uint (itemid,clock,ns,value) values (64913,1606305249,237080617,1),... very long query ....
    31025:20201125:140556.331 slow query: 704.191528 sec, "insert into history_uint (itemid,clock,ns,value) values (134288,1606305249,358225838,1),... very long query ....

    At this point CPU Load is around 80%, Available memory 2.7G out of 4GB.

    after around 5-20 minutes it finishes it's hard work and things get back to normal.

    I am unable to determine the cause for this issue, that started appearing recently after upgrading from Zabbix 4.4 to 5.2. Maybe problem is caused due to additional hosts added lately.

    Database is MariaDB-10.5.8 running on the same host as Zabbix server. Zabbix server itself monitors only few web scenarios and ping hosts. Mosly receives all data from 7 proxies.

    Related zabbix_server.conf values:
    # HistoryCacheSize=16M
    # HistoryIndexCacheSize=4M
    ValueCacheSize=500M

    Number of hosts (enabled 288)
    Number of items (enabled 19537/disabled 757/not supported 285)
    Number of triggers (enabled 7145)
    Required server performance, new values per second 256.52

    Please see screenshots attached for detailed information.
    As can been seen from the graphs, it looks like housekeeper starts up, goes to 100% busy and right after that Zabbix history write cache gradually fills up to 100% and causes the lock up.

    Click image for larger version

Name:	Screenshot 2020-11-25 at 15.04.49.png
Views:	6903
Size:	53.7 KB
ID:	413912

    I would be glad to get any advice on which direction to look in to. Maybe there are some database optimizations that should be done? Or perhaps server resources are not enough for this environment?
    Attached Files
  • emiliano
    Junior Member
    Zabbix Certified Specialist
    • Aug 2013
    • 1

    #2
    Hi,
    we have the same problem after updating zabbix from 4.4 to 5.0.5 and mariadb from 5.6 to 10.4.17.
    The problem repeats exactly every 3 days and 4 hours and mariadb stops each time with the following log:
    Code:
    Nov 21 17:34:06 zabbix mysqld[1110]: 2020-11-21 17:34:06 0 [Warning] InnoDB: A long semaphore wait:
    ...
    Nov 21 17:51:08 zabbix mysqld[1110]: 2020-11-21 17:51:08 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
    And here are the logs before mariadb's suicide:
    Code:
    2710:20201121:173127.446 slow query: 51.385802 sec, "update hosts set lastaccess=1605976236 where
    2707:20201121:173133.453 slow query: 51.354793 sec, "update hosts set lastaccess=1605976242 where
    2708:20201121:173139.458 slow query: 51.327168 sec, "update hosts set lastaccess=1605976248 where
    2701:20201121:173140.458 slow query: 51.049356 sec, "update hosts set snmp_disable_until=160597630
    2710:20201121:173218.480 slow query: 51.022463 sec, "update hosts set lastaccess=1605976248 where
    2707:20201121:173224.481 slow query: 51.018207 sec, "update hosts set lastaccess=1605976293 where
    2708:20201121:173230.484 slow query: 51.002428 sec, "update hosts set lastaccess=1605976293 where
    2710:20201121:173309.498 slow query: 50.949643 sec, "update hosts set lastaccess=1605976338 where
    2707:20201121:173319.500 slow query: 50.984205 sec, "update hosts set lastaccess=1605976348 where
    2708:20201121:173329.505 slow query: 51.005927 sec, "update hosts set lastaccess=1605976358 where
    2710:20201121:173404.522 slow query: 50.942265 sec, "update hosts set lastaccess=1605976358 where
    2707:20201121:173410.704 slow query: 51.172837 sec, "update hosts set lastaccess=1605976399 where
    2708:20201121:173420.707 slow query: 51.185864 sec, "update hosts set lastaccess=1605976409 where
    2710:20201121:173507.717 slow query: 51.165387 sec, "update hosts set lastaccess=1605976456 where
    2745:20201121:173514.810 slow query: 279.696648 sec, "update item_discovery set lastcheck=16059762
    2706:20201121:173514.810 slow query: 265.402609 sec, "update hosts set snmp_disable_until=16059763
    2699:20201121:173514.811 slow query: 255.396068 sec, "update hosts set snmp_disable_until=16059763
    2704:20201121:173514.811 slow query: 255.395362 sec, "update hosts set snmp_disable_until=16059763
    2698:20201121:173514.811 slow query: 269.404951 sec, "update hosts set snmp_disable_until=16059763
    2697:20201121:173514.812 slow query: 255.395408 sec, "update hosts set snmp_disable_until=16059763
    2703:20201121:173514.812 slow query: 254.387680 sec, "update hosts set snmp_disable_until=16059763
    2700:20201121:173514.812 slow query: 255.388029 sec, "update hosts set snmp_disable_until=16059763
    2702:20201121:173514.813 slow query: 255.396638 sec, "update hosts set snmp_disable_until=16059763
    2744:20201121:173514.813 slow query: 281.704440 sec, "update item_discovery set lastcheck=16059762
    2701:20201121:173514.813 slow query: 210.348793 sec, "update hosts set snmp_disable_until=16059763
    2705:20201121:173514.815 slow query: 255.398856 sec, "update hosts set snmp_disable_until=16059763
    2708:20201121:173514.815 slow query: 40.246506 sec, "update hosts set lastaccess=1605976474 where
    2623:20201121:173514.818 slow query: 308.723137 sec, "insert into history_uint (itemid,clock,ns,va
    2711:20201121:173514.820 slow query: 284.797283 sec, "update hosts set lastaccess=1605976230 where
    2707:20201121:173514.821 slow query: 49.147088 sec, "update hosts set lastaccess=1605976458 where
    2617:20201121:173514.822 slow query: 274.210491 sec, "update httptest set nextcheck=1605976360 whe
    2613:20201121:173514.822 slow query: 278.178298 sec, "update httptest set nextcheck=1605976296 whe
    2609:20201121:173514.822 slow query: 234.396091 sec, "update httptest set nextcheck=1605976340 whe
    2709:20201121:173514.822 slow query: 12.685613 sec, "update hosts set lastaccess=1605976502 where
    2610:20201121:173514.823 slow query: 264.448318 sec, "update httptest set nextcheck=1605976370 whe
    2612:20201121:173514.823 slow query: 276.887393 sec, "update httptest set nextcheck=1605976297 whe
    2615:20201121:173514.824 slow query: 263.286283 sec, "update httptest set nextcheck=1605976311 whe
    2710:20201121:173514.826 slow query: 6.041417 sec, "update hosts set lastaccess=1605976502 where h
    2611:20201121:173514.827 slow query: 271.602884 sec, "update httptest set nextcheck=1605976543 whe
    2614:20201121:173514.828 slow query: 279.457556 sec, "update httptest set nextcheck=1605976295 whe
    2624:20201121:173514.886 slow query: 268.347342 sec, "insert into alerts (alertid,actionid,eventid
    2607:20201121:173514.909 slow query: 220.588564 sec, "delete from problem where r_clock<>0 and r_c
    2622:20201121:173514.930 slow query: 307.679881 sec, "insert into history_uint (itemid,clock,ns,va
    2619:20201121:173515.040 slow query: 308.993701 sec, "insert into history_uint (itemid,clock,ns,va
    2607:20201121:173557.231 slow query: 39.393544 sec, "delete from history_uint where itemid=123061
    2607:20201121:173625.324 slow query: 28.092473 sec, "delete from history_uint where itemid=123060
    2709:20201121:173700.731 slow query: 51.552254 sec, "update hosts set lastaccess=1605976569 where
    2707:20201121:173706.732 slow query: 51.518776 sec, "update hosts set lastaccess=1605976575 where
    2710:20201121:173712.733 slow query: 51.485032 sec, "update hosts set lastaccess=1605976581 where
    2711:20201121:173718.734 slow query: 51.449613 sec, "update hosts set lastaccess=1605976587 where
    2709:20201121:173751.740 slow query: 51.002264 sec, "update hosts set lastaccess=1605976587 where
    2707:20201121:173757.741 slow query: 51.004262 sec, "update hosts set lastaccess=1605976626 where
    2711:20201121:173813.745 slow query: 50.993096 sec, "update hosts set lastaccess=1605976642 where
    2710:20201121:173825.747 slow query: 51.226312 sec, "update hosts set lastaccess=1605976644 where
    2707:20201121:173857.762 slow query: 51.695798 sec, "update hosts set lastaccess=1605976686 where
    2711:20201121:173908.765 slow query: 51.010108 sec, "update hosts set lastaccess=1605976697 where
    2710:20201121:173920.760 slow query: 50.883668 sec, "update hosts set lastaccess=1605976709 where
    2707:20201121:173948.761 slow query: 50.980386 sec, "update hosts set lastaccess=1605976709 where
    2711:20201121:173959.763 slow query: 50.983482 sec, "update hosts set lastaccess=1605976748 where
    2710:20201121:174015.766 slow query: 50.853976 sec, "update hosts set lastaccess=1605976748 where
    2707:20201121:174043.769 slow query: 50.970429 sec, "update hosts set lastaccess=1605976764 where
    2711:20201121:174050.767 slow query: 50.963282 sec, "update hosts set lastaccess=1605976799 where
    2710:20201121:174106.769 slow query: 51.002169 sec, "update hosts set lastaccess=1605976799 where
    2711:20201121:174141.773 slow query: 50.784993 sec, "update hosts set lastaccess=1605976850 where
    2707:20201121:174154.774 slow query: 51.412999 sec, "update hosts set lastaccess=1605976851 where
    2710:20201121:174201.775 slow query: 50.884476 sec, "update hosts set lastaccess=1605976866 where
    2711:20201121:174232.778 slow query: 50.955957 sec, "update hosts set lastaccess=1605976901 where
    2707:20201121:174245.780 slow query: 50.985191 sec, "update hosts set lastaccess=1605976914 where
    2710:20201121:174254.781 slow query: 51.177475 sec, "update hosts set lastaccess=1605976914 where
    2711:20201121:174327.786 slow query: 50.991281 sec, "update hosts set lastaccess=1605976923 where
    2707:20201121:174345.796 slow query: 51.492528 sec, "update hosts set lastaccess=1605976974 where
    2710:20201121:174356.787 slow query: 51.725690 sec, "update hosts set lastaccess=1605976974 where
    2711:20201121:174429.792 slow query: 51.700635 sec, "update hosts set lastaccess=1605977018 where
    2707:20201121:174448.795 slow query: 50.814887 sec, "update hosts set lastaccess=1605977018 where
    2710:20201121:174455.796 slow query: 50.963721 sec, "update hosts set lastaccess=1605977044 where
    2711:20201121:174524.800 slow query: 51.002339 sec, "update hosts set lastaccess=1605977044 where
    2707:20201121:174539.801 slow query: 50.804170 sec, "update hosts set lastaccess=1605977088 where
    2710:20201121:174546.803 slow query: 51.005672 sec, "update hosts set lastaccess=1605977088 where
    2711:20201121:174615.809 slow query: 50.969866 sec, "update hosts set lastaccess=1605977124 where
    2707:20201121:174638.811 slow query: 50.857491 sec, "update hosts set lastaccess=1605977124 where
    2710:20201121:174645.812 slow query: 50.989810 sec, "update hosts set lastaccess=1605977147 where
    2711:20201121:174715.817 slow query: 50.822845 sec, "update hosts set lastaccess=1605977176 where
    2710:20201121:174736.821 slow query: 50.972487 sec, "update hosts set lastaccess=1605977205 where
    2707:20201121:174746.822 slow query: 51.279763 sec, "update hosts set lastaccess=1605977215 where
    2711:20201121:174810.826 slow query: 50.953701 sec, "update hosts set lastaccess=1605977215 where
    2710:20201121:174831.830 slow query: 50.974593 sec, "update hosts set lastaccess=1605977239 where
    2707:20201121:174837.831 slow query: 51.000960 sec, "update hosts set lastaccess=1605977266 where
    2711:20201121:174905.835 slow query: 50.981940 sec, "update hosts set lastaccess=1605977266 where
    2710:20201121:174934.840 slow query: 50.872737 sec, "update hosts set lastaccess=1605977323 where
    2707:20201121:174940.843 slow query: 51.220452 sec, "update hosts set lastaccess=1605977329 where
    2711:20201121:174956.847 slow query: 50.989090 sec, "update hosts set lastaccess=1605977329 where
    2710:20201121:175025.848 slow query: 50.991834 sec, "update hosts set lastaccess=1605977374 where
    2707:20201121:175031.849 slow query: 50.955785 sec, "update hosts set lastaccess=1605977374 where
    2610:20201121:175109.262 slow query: 893.309287 sec, "update httptest set nextcheck=1605976635 whe
    2624:20201121:175109.263 slow query: 893.405524 sec, "insert into alerts (alertid,actionid,eventid
    2619:20201121:175109.263 slow query: 952.597638 sec, "[COLOR=#d35400]insert into history_uint[/COLOR] (itemid,clock,ns,va
    2679:20201121:175109.263 slow query: 237.390040 sec, "update hosts set errors_from=1605977231,disa
    2611:20201121:175109.263 slow query: 894.416493 sec, "update httptest set nextcheck=1605976634 whe
    2705:20201121:175109.264 slow query: 886.395501 sec, "update hosts set snmp_disable_until=16059766
    2703:20201121:175109.265 slow query: 886.408132 sec, "update hosts set snmp_disable_until=16059766
    2698:20201121:175109.265 slow query: 886.394021 sec, "update hosts set snmp_disable_until=16059766
    2697:20201121:175109.336 slow query: 886.393064 sec, "update hosts set snmp_disable_until=16059766
    2706:20201121:175109.336 slow query: 886.480356 sec, "update hosts set snmp_disable_until=16059766
    2699:20201121:175109.336 slow query: 886.399441 sec, "update hosts set snmp_disable_until=16059766
    2700:20201121:175109.340 slow query: 886.474635 sec, "update hosts set snmp_disable_until=16059766
    2745:20201121:175109.340 slow query: 908.271874 sec, "update item_discovery set lastcheck=16059765
    2614:20201121:175109.340 slow query: 895.224497 sec, "update httptest set nextcheck=1605976634 whe
    2621:20201121:175109.341 slow query: 1262.306755 sec, "[COLOR=#d35400]insert into history_uint[/COLOR] (itemid,clock,ns,v
    2612:20201121:175109.341 slow query: 894.341787 sec, "update httptest set nextcheck=1605976634 whe
    2615:20201121:175109.342 slow query: 894.664855 sec, "update httptest set nextcheck=1605976634 whe
    2617:20201121:175109.351 slow query: 893.716039 sec, "update httptest set nextcheck=1605976635 whe
    2623:20201121:175109.352 slow query: 952.742680 sec, "[COLOR=#d35400]insert into history_uint[/COLOR] (itemid,clock,ns,va
    2622:20201121:175109.394 slow query: 952.758291 sec, "[COLOR=#d35400]insert into history_uint[/COLOR] (itemid,clock,ns,va
    2744:20201121:175109.394 slow query: 906.534411 sec, "update item_discovery set lastcheck=16059765
    2702:20201121:175109.395 slow query: 886.530080 sec, "update hosts set snmp_disable_until=16059766
    2709:20201121:175109.395 slow query: 797.534607 sec, "update hosts set lastaccess=1605976671 where
    2707:20201121:175109.402 slow query: 33.199059 sec, "update hosts set lastaccess=1605977436 where
    2710:20201121:175109.408 slow query: 27.171134 sec, "update hosts set lastaccess=1605977442 where
    2711:20201121:175109.413 slow query: 39.366580 sec, "update hosts set lastaccess=1605977430 where
    2704:20201121:175109.417 slow query: 886.551725 sec, "update hosts set snmp_disable_until=16059766
    2708:20201121:175109.422 slow query: 906.277752 sec, "update hosts set lastaccess=1605976563 where
    2701:20201121:175109.423 slow query: 886.557412 sec, "update hosts set snmp_disable_until=16059766
    2609:20201121:175109.424 slow query: 895.358801 sec, "update httptest set nextcheck=1605976634 whe
    2613:20201121:175109.424 slow query: 894.542305 sec, "update httptest set nextcheck=1605976634 whe
    We are investigating why the three of four red "insert into history_uint" are performed by zabbix at the same time, but we do not understand what is going on.

    Comment

    • nickcn
      Junior Member
      • Nov 2020
      • 2

      #3
      Hello,

      Have you guys found any solution for this?
      I have the same problem with Zabbix 4.0.27 and MariaDB 10.3.27.
      I believe the problem started a couple of MariaDB updates before.

      This is what's happening:

      Click image for larger version  Name:	12-06-20_23_28_34.jpg Views:	0 Size:	266.1 KB ID:	414686

      I have also found out that during this all my proxies show messages similar to the one below:

      Code:
      1190:20201206:223128.989 active check configuration update from [127.0.0.1:10051] started to fail (ZBX_TCP_READ() timed out)
      1190:20201206:224038.086 active check configuration update from [127.0.0.1:10051] is working again
      Last edited by nickcn; 06-12-2020, 23:39.

      Comment

      • Guntis
        Junior Member
        • Mar 2019
        • 18

        #4
        I highly recommend watching this video to better understand Zabbix performance issues and how to read those performance charts - https://www.youtube.com/watch?v=EIz3HBfMTHo

        I tried tuning MySQL performance parameters, but it didn't help much. Doing queries on `history_uint` very long time. I tried lowering history period to 2 weeks for all items and even manually deleting older entries from that table, but it took endless hours.
        I was planning to enable MySQL partitioning on these tables (which hopefully would improve performance on this talble), but server was low on free disk space (apparently enabling partitioning rqeuires copying all the data to temporary db).

        Yet after all this effort I decided to go the destructive way - dropping the `history_uint` table and creating a new one. Since I don't care that much about historical values this turned out to be great solution at least for now it looks like.
        Anyway I plan to enable partitioning to prevent this issue in future.

        Comment

        • martinm_76
          Member
          • Mar 2015
          • 57

          #5
          I seem to have the same problem. MariaDB 10.4.17. Was on 10.4.13 previously, but it leaked memory. Now I have this issue in stead. Can those inserts be split up or something? I am not a fan of MariaDB intentionally crashing itself every few days, seemingly due to giant insert statements (other things may be involved,, but that certainnly seems the most prominent).

          Comment

          • Guntis
            Junior Member
            • Mar 2019
            • 18

            #6
            Well I can confirm that my instance has been running perfectly smooth since I set up partitioning and disabled internal housekeeping for history and trends tables.
            The real issue was the size of history_uint table, causing queries to execute extremely slowly and likely pile up.
            Since now that table contains only data for a single day (each day data is on separate partition (table)), it's way faster to do queries on it.
            Also while doing maintenance I migrated all data from MariaDB to a new MySQL 8 server, no issues with that either, running default config. Just had to disable binary log to prevent running out of disk space
            One day of history data for me is around 2.5GB with 307 NVPS.

            Comment

            • Tariq Hasan
              Junior Member
              • Nov 2018
              • 27

              #7
              Hi
              I have the same issue - I think.

              [root@zabbix-server]# mysql -V
              mysql Ver 15.1 Distrib 10.3.27-MariaDB, for Linux (x86_64) using readline 5.1

              root@zabbix-server]# yum list installed| grep zab
              fping.x86_64 3.16-1.el8 @zabbix-non-supported
              zabbix-agent.x86_64 5.0.4-1.el8 @zabbix
              zabbix-apache-conf.noarch 5.0.4-1.el8 @zabbix
              zabbix-release.noarch 5.0-1.el8 @System
              zabbix-server-mysql.x86_64 5.0.4-1.el8 @zabbix
              zabbix-web.noarch 5.0.4-1.el8 @zabbix
              zabbix-web-deps.x86_64 5.0.4-1.el8 @zabbix
              zabbix-web-mysql.noarch 5.0.4-1.el8 zabbix

              Extract from zabbix-server.log from last message storm:

              1638:20210420:115914.109 slow query: 6.650204 sec, "insert into trends_uint (itemid,clock,num,value_min,value_avg,value_max) values (59718,1618912800,59,0,0,0),(81988,1618912800,59,0 ,0,0),(47285,1618912800,59,0,0,0),(65293,161891280 0,19,214960,1214280,10163336),(39195,1618912800,59 ,0,0,0),(75001,1618912800,60,1073741819904,1073741 819904,1073741819904),(50035,1618912800,59,0,0,0), (46453,1618912800,59,0,0,0),(69228,1618912800,59,0 ,0,0),(61883,1618912800,59,0,0,0),(74534,161891280 0,59,0,0,0),(80149,1618912800,59,0,0,0),(79501,161 8912800,60,20485120,20485120,20485120),(47105,1618 912800,59,0,0,0),(73783,1618912800,59,0,0,0),(5709 1,1618912800,59,6018338816,7299691329,13174972416) ,(78288,1618912800,59,960880640,960880640,96088064 0),(86285,1618912800,59,0,0,0),(53970,1618912800,5 9,0,0,0),(41396,1618912800,59,0,0,0),(34517,161891 2800,59,16042635264,16042635264,16042635264),(5594 6,1618912800,59,0,0,0),(73899,1618912800,59,0,0,0) ,(83585,1618912800,59,0,0,0),(34211,1618912800,20, 0,0,0),(86258,1618912800,59,5381480448,5425795557, 5517275136),(36350,1618912800,59,4231622656,423162 2656,4231622656),(43913,1618912800,59,0,0,0),(8667 4,1618912800,59,0,0,0),(39042,1618912800,59,0,0,0) ,(39162,1618912800,59,0,0,0),(55384,1618912800,59, 0,0,0),(56083,1618912800,59,0,0,0),(53309,16189128 00,59,0,0,0),(47345,1618912800,59,0,0,0),(88916,16 18912800,59,0,0,0),(49905,1618912800,59,1607047168 0,16070471680,16070471680),(43090,1618912800,59,82 74931712,8274931712,8274931712),(53338,1618912800, 59,0,0,0),(35200,1618912800,20,0,0,0),(42403,16189 12800,59,61355687936,61356309556,61356789760),(815 52,1618912800,59,5274324992,5293828130,5323448320) ,(85865,1618912800,59,0,0,0),(54282,1618912800,59, 0,0,0),(37365,1618912800,59,0,0,0),(85731,16189128 00,59,0,0,0),(79153,1618912800,59,135801856,135801 856,135801856),(72364,1618912800,59,137079500800,1 37079500800,137079500800),(78381,1618912800,59,379 051008,379051008,379051008),(78723,1618912800,59,1 7754112,17 ... and on and on .....

              Graphs show as attached:

              Usually discoverer process or discoverer data collector causing issues, as the timings coincide.

              Thanks

              Tariq
              Attached Files

              Comment

              • jeev4
                Junior Member
                • Oct 2024
                • 1

                #8
                I've having the exact same issue after I upgraded from zabbix 6.0 LTS to zabbix 7 LTS and upgrading to 11.5.2-MariaDB

                Comment

                Working...