After migrating and upgrading my Zabbix install, the server process is repeatedly crashing 8-24 hours after being started.
The old box was Ubuntu 16 and had previously been upgraded in place to Zabbix 4.4.10. At the start of this week the DB and relevant config files were migrated to an Alma 8 box running a clean install of 4.4.10, which was then upgraded to 6.0.
This was tested extensively in a Sandbox environment and also on Pre-Prod without encountering any problems. It's not a massive installation: we monitor perhaps 800-1000 servers, and the box has resources to spare. The new box has 6 vCPUs and 60 GB of RAM, which should be plenty.
The following error gets logged at the point the processes hang:
61610:20230223:143112.196 Got signal [signal:15(SIGTERM),sender_pid:936418,sender_uid:0, reason:0]. Exiting ...
61611:20230223:143112.854 HA manager has been paused
61690:20230223:143112.859 syncing history data in progress...
61897:20230223:143114.011 cannot write to IPC socket: Broken pipe
61897:20230223:143114.011 cannot send data to availability manager service
61905:20230223:143114.043 cannot write to IPC socket: Broken pipe
61905:20230223:143114.043 cannot send data to availability manager service
61910:20230223:143114.081 cannot write to IPC socket: Broken pipe
61910:20230223:143114.081 cannot send data to availability manager service
61918:20230223:143116.095 cannot write to IPC socket: Broken pipe
61918:20230223:143116.095 cannot send data to availability manager service
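The first line of that excerpt records who sent the SIGTERM (sender_pid and sender_uid), so it may be worth pulling those fields out for every occurrence. Below is a minimal sketch; `signal_senders` is a hypothetical helper name, and the log path in the usage line is an assumption, not something confirmed above.

```shell
# Hypothetical helper: for each "Got signal" line in a Zabbix server log,
# print the timestamp, signal name, sender PID and sender UID.
# Reads the files given as arguments, or stdin if none are given.
signal_senders() {
    sed -n 's/^ *[0-9]*:\([0-9]*:[0-9.]*\) Got signal \[signal:\([0-9]*\)(\([A-Z0-9]*\)),sender_pid:\([0-9]*\),sender_uid:\([0-9]*\).*/\1 \3 pid=\4 uid=\5/p' "$@"
}

# Usage (log path is an assumption; adjust to the LogFile set in zabbix_server.conf):
#   signal_senders /var/log/zabbix/zabbix_server.log
```

If the sender is still running, `ps -p <pid> -o pid,ppid,user,lstart,args` shows what it is. A sender_uid of 0 means the signal came from a root-owned process.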
From checking previous forum posts there is a suggestion that the above error can be caused by processes hitting an Open Files ulimit. I've increased the ulimit for the Zabbix user from 1024 to 4096 in /etc/security/limits.conf and this looks to have taken effect.
[root@dca-zabbix zabbix]# su - zabbix -c 'ulimit -aHS' -s '/bin/bash' | grep open
open files (-n) 4096
[root@dca-zabbix zabbix]# cat /proc/945164/limits | grep open   # 945164 is the current zabbix_server parent PID
Max open files 4096 4096 files
It seems unusual to be brushing up against ulimit problems in this day and age, however, and the old box also has the standard Linux ulimit of 1024, so I'm not convinced this is the smoking gun.
MariaDB is also reporting warnings such as the following:
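One way to test the ulimit theory directly is to compare each zabbix_server process's open descriptor count with its soft limit while the server is running. This is a rough sketch, assuming a Linux /proc filesystem and root access; `fd_usage` is a hypothetical helper name.

```shell
# Hypothetical helper: print the number of open file descriptors and the
# soft "Max open files" limit for a given PID, read from /proc.
fd_usage() {
    pid="$1"
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits" 2>/dev/null)
    echo "$pid open=$open soft_limit=$limit"
}

# Report every running zabbix_server process (prints nothing if none are found).
for pid in $(pgrep -x zabbix_server 2>/dev/null); do
    fd_usage "$pid"
done
```

If the open counts stay well below the limit right up to a crash, the open-files limit is unlikely to be the cause.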
Feb 23 14:59:39 dca-zabbix mariadbd[5051]: 2023-02-23 14:59:39 179917 [Warning] Aborted connection 179917 to db: 'zabbix' user: 'zabbix' host: 'localhost' (Got timeout reading communication packets)
Feb 23 15:00:29 dca-zabbix mariadbd[5051]: 2023-02-23 15:00:29 179916 [Warning] Aborted connection 179916 to db: 'zabbix' user: 'zabbix' host: 'localhost' (Got timeout reading communication packets)
I've increased the max_allowed_packet variable in MariaDB from its previous value (carried over from the old box) of 16M to 256M, which appears to have greatly lessened the frequency of these warnings, although there is still the odd burst. These are, however, also present on the Pre-Prod box and haven't crashed it, so I'm not certain this isn't a red herring as well.
The DB was mysqldumped from MariaDB 10.2 and imported into MariaDB 10.5 on the new box. I've attached a file containing my current zabbix_server.conf and MariaDB server.cnf files.
Can anybody please assist with where to take troubleshooting next?
Many thanks.