Ad Widget

Collapse

Zabbix Server Failing Frequently – Database Errors and Slow Queries

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • commandLineConsul
    Junior Member
    • Jan 2025
    • 2

    #1

    Zabbix Server Failing Frequently – Database Errors and Slow Queries

    Hello all,

    I would like to get your help and opinions on the following problem:
    Our Zabbix server suddenly started critically failing, and since then, it has become a regular occurrence. Initially, the GUI wouldn't load and was stuck on "Loading." After a while, the browser displayed the message:

    Database error: No such file or directory.

    Another time, the error message was:

    Database error: SQL statement execution has failed
    "UPDATE sessions SET lastaccess='....' WHERE sessionid='.....'"

    From the server logs, we see errors like these:

    Code:
    4138:20250116:144301.926 slow query: 609.183770 sec, "update ha_node set lastaccess=unix_timestamp() where ha_nodeid='......'"
    4138:20250116:144301.926 [Z3005] query failed: [2006] Server has gone away [rollback;]
    4212:20250116:144301.927 [Z3001] connection to database 'zabbix' failed: [2013] Lost connection to server at 'handshake: reading initial communication packet', system error: 104
    4212:20250116:144301.927 database is down: reconnecting in 10 seconds
    4151:20250116:144301.927 slow query: 1010.602710 sec, "delete from history where itemid=116116 and clock<1736399235"
    We tried increasing the parameters and improving the performance of our MariaDB instance (see updated config below).

    However, we cannot find much useful information in the logs. One common point is that before the service starts failing, we see a lot of slow query logs related to the history_uint table, such as:

    Code:
    657173:20250128:182651.191 slow query: 10905.365435 sec, "insert into history_uint (itemid,clock,ns,value) values (42245,1738074305,815327355,8323534848),(43805,173 8074305,8163...
    ...
    ...
    But we generally can't pinpoint the root cause.

    There is plenty of free disk space, RAM, and CPU available. The issue seems database-related, but it originates from Zabbix. When it gets stuck, only kill -9 works (and not always).

    When the Zabbix process is stuck, we can still access the database, but it does not react to any Zabbix-related queries (e.g., SELECT statements). However, it does respond to internal database commands like SHOW PROCESSLIST;.

    Environment:

    OS: Debian Linux 12.9
    Zabbix Server Version: 7.0.9
    MariaDB Version: 10.11.6
    Hosts: 164
    Items: 29,397
    Triggers: 13,524
    New Values per Second: 444

    MariaDB Configuration:

    Code:
    [mysqld]
    pid-file = /run/mysqld/mysqld.pid
    basedir = /usr
    bind-address = 127.0.0.1
    key_buffer_size = 10M
    max_allowed_packet = 256M
    myisam_recover_options = BACKUP
    max_connections = 600
    query_cache_limit = 1M
    query_cache_size = 256M
    tmp_table_size = 256M
    max_heap_table_size = 256M
    performance_schema = OFF
    log_error = /var/log/mysql/error.log
    log_slow_query_file = /var/log/mysql/mariadb-slow.log
    log_slow_query_time = 10
    
    character-set-server = utf8mb4
    collation-server = utf8mb4_general_ci
    
    innodb_buffer_pool_size = 9G
    innodb_read_io_threads = 8
    innodb_write_io_threads = 8
    innodb_io_capacity = 2000
    innodb_file_per_table = ON
    innodb_default_row_format = DYNAMIC
    innodb_lock_wait_timeout = 100
    innodb_flush_method = O_DIRECT
    innodb_flush_log_at_trx_commit = 0
    Zabbix Server Configuration:

    Code:
    LogFile=/var/log/zabbix/zabbix_server.log
    LogFileSize=0
    DebugLevel=3
    PidFile=/run/zabbix/zabbix_server.pid
    SocketDir=/run/zabbix
    DBHost=localhost
    DBName=xxxxxx
    DBUser=xxxxxx
    DBPassword=xxxxxxxx
    DBSocket=/run/mysqld/mysqld.sock
    SNMPTrapperFile=/var/log/snmptrap/snmptrap.log
    HousekeepingFrequency=1
    MaxHousekeeperDelete=50000
    CacheSize=512M
    CacheUpdateFrequency=10
    HistoryCacheSize=1024M
    HistoryIndexCacheSize=512M
    TrendCacheSize=64M
    ValueCacheSize=256M
    Timeout=4
    FpingLocation=/usr/bin/fping
    Fping6Location=/usr/bin/fping6
    LogSlowQueries=3000
    StartLLDProcessors=1
    StatsAllowedIP=127.0.0.1
    EnableGlobalScripts=0
    Does anyone have any insights or ideas on what might be causing these failures?

    Any advice would be greatly appreciated!
  • commandLineConsul
    Junior Member
    • Jan 2025
    • 2

    #2
    Thanks, tried it but there isn't really anything problematic there.
    I have found similar cases during my research of this topic and there is no definitive solution. It is a really strange problem. And the source of all seem to be the db. We'll be migrating to psql.

    Comment

    • PavelZ
      Senior Member
      • Dec 2024
      • 162

      #3

      innodb_buffer_pool_size = 9G
      innodb_flush_log_at_trx_commit = 0
      These are the most important settings. And it looks like you've already tried to adjust them.


      query_cache_size = 256M
      This is usually a lot for an application like Zabbix. I suggest 16M.​


      Zabbix works with the database primitively enough that it is possible to fix something with magic settings. I suggest using a script for mysql partitioning like this https://github.com/OpensourceICTSolu...titioning-perl
      You will definitely get rid of the housekeeper requests that we see here in the data you provided.​
      Here is query of Housekeeper process:
      4151:20250116:144301.927 slow query: 1010.602710 sec, "delete from history where itemid=116116 and clock<1736399235"
      Usually it influences everything else.

      Comment

      Working...