Zabbix Box Unstable After Platform Migration/Upgrade to 6.0

  • alexw-z
    Member
    • Dec 2021
    • 36

    #1

    I've got an issue whereby after migration and upgrade of my Zabbix install the server process is repeatedly crashing 8-24 hours after being started.

    The old box was Ubuntu 16 and had previously been upgraded in place to 4.4.10. At the start of this week the DB and relevant config files were migrated to an Alma 8 box running a clean install of 4.4.10, which was then upgraded to 6.0.

    This was tested extensively in a Sandbox environment and also on Pre-Prod without encountering any problems. It's not a massive installation: we monitor perhaps 800-1000 servers, and the box has resource to spare. The new box has 6 vCPUs and 60 GB of RAM, which should be plenty.

    The following error gets logged at the point the processes hang:

    61610:20230223:143112.196 Got signal [signal:15(SIGTERM),sender_pid:936418,sender_uid:0, reason:0]. Exiting ...
    61611:20230223:143112.854 HA manager has been paused
    61690:20230223:143112.859 syncing history data in progress...
    61897:20230223:143114.011 cannot write to IPC socket: Broken pipe
    61897:20230223:143114.011 cannot send data to availability manager service
    61905:20230223:143114.043 cannot write to IPC socket: Broken pipe
    61905:20230223:143114.043 cannot send data to availability manager service
    61910:20230223:143114.081 cannot write to IPC socket: Broken pipe
    61910:20230223:143114.081 cannot send data to availability manager service
    61918:20230223:143116.095 cannot write to IPC socket: Broken pipe
    61918:20230223:143116.095 cannot send data to availability manager service


    From checking previous forum posts there is a suggestion that the above error can be caused by processes hitting an Open Files ulimit. I've increased the ulimit for the Zabbix user from 1024 to 4096 in /etc/security/limits.conf and this looks to have taken effect.

    [root@dca-zabbix zabbix]# su - zabbix -c 'ulimit -aHS' -s '/bin/bash' | grep open
    open files (-n) 4096

    [root@dca-zabbix zabbix]# cat /proc/945164/limits | grep open    # 945164 is the current zabbix_server parent PID
    Max open files 4096 4096 files
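
For anyone wanting to confirm whether the open-files limit is actually being approached, here is a rough sketch that compares each process's current descriptor count with its soft limit. The function name `fd_usage` is made up for illustration, and the example loop assumes the server processes run as the `zabbix` user.

```shell
#!/usr/bin/env bash
# Print "<pid> <open fds> <soft limit>" for one process.
# Reads /proc directly, so it needs permission to inspect the target PID
# (run as root or as the process owner).
fd_usage() {
    local pid="$1"
    local used limit
    # Each entry in /proc/<pid>/fd is one open descriptor.
    used=$(ls "/proc/${pid}/fd" 2>/dev/null | wc -l | tr -d ' ')
    # The fourth field of the "Max open files" line is the soft limit.
    limit=$(awk '/^Max open files/ {print $4}' "/proc/${pid}/limits")
    echo "${pid} ${used} ${limit}"
}

# Example (assumption: the server runs as user "zabbix"):
#   for pid in $(pgrep -u zabbix zabbix_server); do fd_usage "$pid"; done
```

If the second column stays well below the third for every process over the hours leading up to a hang, the ulimit is unlikely to be the cause.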


    It seems unusual to be brushing up against ulimit problems in this day and age, however, and the old box also has the standard Linux ulimit of 1024, so I'm not convinced this is the smoking gun.

    MariaDB is also reporting some of the following errors.

    Feb 23 14:59:39 dca-zabbix mariadbd[5051]: 2023-02-23 14:59:39 179917 [Warning] Aborted connection 179917 to db: 'zabbix' user: 'zabbix' host: 'localhost' (Got timeout reading communication packets)
    Feb 23 15:00:29 dca-zabbix mariadbd[5051]: 2023-02-23 15:00:29 179916 [Warning] Aborted connection 179916 to db: 'zabbix' user: 'zabbix' host: 'localhost' (Got timeout reading communication packets)


    I've increased the max_allowed_packet variable in MariaDB from its previous default (carried over from the old box) of 16M to 256M, which appears to have greatly lessened the frequency of these errors, although there is still the odd burst. These are, however, also present on the Pre-Prod box and haven't crashed it, so this may well be a red herring too.

    The DB was mysqldumped from MariaDB 10.2 and imported into MariaDB 10.5 on the new box. I've attached a file containing my current zabbix_server.conf and MariaDB server.cnf files.

    Can anybody please assist with where to take troubleshooting next?

    Many thanks.
  • alexw-z
    Member
    • Dec 2021
    • 36

    #2
    For anybody else reading this in future: I eventually traced this back to the history cache filling up. Even with the updated 6.0 template for monitoring the Zabbix server itself applied, the history write cache never went above 60% on the monitoring graphs, which is largely why this took so long to identify.

    I was monitoring the queue via the GUI and I spotted that the Preprocessing Manager had failed shortly before everything else hung. From there, I selectively increased the debug log levels on that PID (zabbix_server -R log_level_increase=xxxx) and found a "History cache is full" error.
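
Once the log level has been raised, something like the following could be used to see which PIDs are logging the cache-full message. This is only a sketch: `cache_full_by_pid` is a made-up helper name, the log path in the example is an assumption (use whatever LogFile points to), and it relies on the `PID:date:time message` line layout seen in the log excerpt in post #1.

```shell
#!/usr/bin/env bash
# Print "<pid> <count>" for every process that logged
# "History cache is full" in the given Zabbix server log.
cache_full_by_pid() {
    local logfile="$1"
    awk -F: '
        /History cache is full/ {
            pid = $1
            gsub(/[[:space:]]/, "", pid)   # the PID column may be space-padded
            n[pid]++
        }
        END { for (p in n) print p, n[p] }
    ' "$logfile" | sort -n
}

# Example (assumed log location):
#   cache_full_by_pid /var/log/zabbix/zabbix_server.log
```

Remember to drop the log level again afterwards (`log_level_decrease`), as debug logging grows the log file quickly.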

    I carried over a HistoryCacheSize of 64MB from 4.4, and it would seem that this needed to be at least 512MB post-upgrade (I stepped from 256M to 1GB). From subsequent investigation I believe the default HistoryCacheSize has increased with 6.0, but I've not seen any upgrade tutorial that tells you to review this setting during an upgrade.
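
As a pre-upgrade sanity check, a sketch along these lines could flag a carried-over value that is below whatever floor you decide on. The helper names are invented for illustration, the config path in the example is an assumption, and the 512M floor is just the figure from this thread, not an official recommendation.

```shell
#!/usr/bin/env bash
# Convert a size such as 64M, 1G or 131072 to bytes.
size_to_bytes() {
    local v="$1"
    case "$v" in
        *[Kk]) echo $(( ${v%?} * 1024 )) ;;
        *[Mm]) echo $(( ${v%?} * 1024 * 1024 )) ;;
        *[Gg]) echo $(( ${v%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$v" ;;
    esac
}

# Print the last uncommented HistoryCacheSize value in a config file,
# or nothing if the parameter is not set there.
history_cache_size() {
    local conf="$1"
    sed -n '/^[[:space:]]*HistoryCacheSize[[:space:]]*=/{s/^[^=]*=[[:space:]]*//;s/[[:space:]]*$//;p;}' "$conf" | tail -n 1
}

# Example (assumed path and floor):
#   cur=$(history_cache_size /etc/zabbix/zabbix_server.conf)
#   if [ -n "$cur" ] && [ "$(size_to_bytes "$cur")" -lt "$(size_to_bytes 512M)" ]; then
#       echo "HistoryCacheSize=$cur looks small for this install"
#   fi
```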
    Last edited by alexw-z; 09-03-2023, 20:50.
