I recently upgraded Zabbix to 5.0 (from 4.0) and as such needed to upgrade MariaDB from 5.5 (Default for CentOS 7 base repo) to 10.2.32 (from the MariaDB repo).
Since upgrading we've had a number of performance-related problems, but I've not been able to find anything useful in zabbix_server.log or mariadb.log that points to the cause.
Problems include:
- Seeing the 'Zabbix server is not running' error message on the web front end (I've checked that zabbix_server processes are running, no errors logged to the server log file)
- Having some (or occasionally all) proxies report that they're unable to connect to the Zabbix Server due to TCP errors: "cannot send heartbeat message to server at "zabbix.daraco.com.au": ZBX_TCP_READ() timed out"
- I'm also seeing occasional SQL write error messages dumped as errors on the web GUI in various locations
- Periods, recurring every hour and lasting about 10-15 minutes, where the history write cache fills up (this sometimes correlates with a big jump in the Zabbix queue to 100K items; see attached graph)
These issues were not present before we did the version upgrade.
To address the problem I've attempted the following:
- Increasing the allocated physical RAM on the host by 4GB (since doing this I've seen my load averages decrease by about 30%, and typical CPU I/O wait times have decreased a little too)
- Increasing the allocated RAM setting on innodb_buffer_pool_size from 8G to 10G
- Investigated changing the innodb_io_capacity setting from the default of 200, but as I understand it this is being deprecated in MariaDB 10.5, so I'm not sure how effective it would be in MariaDB 10.2
- Increased the following Zabbix server parameters:
-- CacheSize from 384M to 512M
-- HistoryCacheSize from 24M to 32M
The 'Value Cache effectiveness' graph shows relatively few misses, so I think I've got the caching parameters set correctly.
Some basic details about the server:
- 6 cores
- 24GB RAM (as it's running a few other things as well as Zabbix/MariaDB)
- MySQL data store is close to 800GB (history_uint table size is 263GB, events about 180GB). /var/lib/mysql has its own dedicated volume.
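To put the sizes above in proportion, here is a rough sketch of the arithmetic using only the figures quoted in this post. The helper name `percent` is made up for illustration, and the calculation ignores the difference between decimal and binary gigabytes, so treat the results as ballpark ratios rather than tuning advice.

```python
# Figures taken from the post; all values are approximate, in GB.
BUFFER_POOL_GB = 10     # innodb_buffer_pool_size after the increase
TOTAL_RAM_GB = 24       # RAM on the host (shared with other services)
DATA_STORE_GB = 800     # approximate size of the MySQL data store
HISTORY_UINT_GB = 263   # history_uint table
EVENTS_GB = 180         # events table


def percent(part, whole):
    """Return part/whole as a percentage, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


ratios = {
    "buffer pool as share of host RAM": percent(BUFFER_POOL_GB, TOTAL_RAM_GB),
    "buffer pool as share of data store": percent(BUFFER_POOL_GB, DATA_STORE_GB),
    "buffer pool as share of history_uint": percent(BUFFER_POOL_GB, HISTORY_UINT_GB),
    "buffer pool as share of events": percent(BUFFER_POOL_GB, EVENTS_GB),
}

for label, value in ratios.items():
    print(f"{label}: {value}%")
```

The point is only that the buffer pool can hold a small fraction of either large table, so any query that has to scan a big part of events will mostly be reading from disk.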
If I do a show processlist at the mysql CLI, I see results including this:
| Id | User | Host | db | Command | Time | State | Info | Progress |
| 2474 | zabbix_server | localhost | zabbix_prod | Query | 5801 | Sending data | SELECT DISTINCT e.eventid,e.clock,e.ns,e.objectid,e.acknowledged,e r1.r_eventid FROM events e LEFT JO | 0.000 |
| 2475 | zabbix_server | localhost | zabbix_prod | Query | 5801 | Sending data | SELECT DISTINCT e.eventid,e.clock,e.ns,e.objectid,e.acknowledged,e r1.r_eventid FROM events e LEFT JO | 0.000 |
Execution times of over 5801 seconds (i.e. more than an hour and a half) aren't ideal, so I imagine something is wrong there that is causing performance bottlenecks, but I'm not sure how this compares to other Zabbix instances.
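For anyone wanting to pick out long-running queries from pasted processlist output, here is a minimal sketch. It assumes the pipe-delimited layout shown above with nine columns, and it will mis-split any row whose Info text itself contains a `|`. The function name `long_running`, the 60-second threshold, and the second sample row (a sleeping connection) are hypothetical and only there for illustration.

```python
from datetime import timedelta

# First row is modelled on the output quoted above (Info shortened);
# the second row is a made-up idle connection for contrast.
SAMPLE = """\
| 2474 | zabbix_server | localhost | zabbix_prod | Query | 5801 | Sending data | SELECT DISTINCT e.eventid ... | 0.000 |
| 2480 | zabbix_server | localhost | zabbix_prod | Sleep | 3 | | NULL | 0.000 |
"""


def long_running(text, threshold_seconds=60):
    """Return (id, elapsed, info) for Query rows at or above the threshold."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        # Drop the empty fields produced by the leading and trailing pipes.
        cells = [c.strip() for c in line.split("|")[1:-1]]
        if len(cells) != 9 or not cells[0].isdigit():
            continue  # header row or unexpected layout
        conn_id, command, seconds, info = int(cells[0]), cells[4], int(cells[5]), cells[7]
        if command == "Query" and seconds >= threshold_seconds:
            hits.append((conn_id, timedelta(seconds=seconds), info))
    return hits


for conn_id, elapsed, info in long_running(SAMPLE):
    print(f"connection {conn_id} has been running for {elapsed}: {info}")
```

Printing a `timedelta` gives the elapsed time as hours, minutes and seconds, which is easier to read than the raw Time column.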
TL;DR:
I'm wondering whether there were significant changes to MariaDB between 5.5 and 10.2 that I need to tweak the configuration for, or whether there are things I can do to the Zabbix configuration to improve performance.
At this stage, I expect that my MariaDB config will need tweaking, but I'm not really sure where to start, given things were going quite well prior to the upgrade. I'm basing this assumption on the third problem listed above (the SQL write errors). However, since I'm not seeing it reproduced by any other process or task, I'm unsure how well founded it is.