I'm having a problem where my history syncer processes go to 100% at random times. When this occurs, they seem to get into a state where they can't catch up, the history cache fills, and at some point they either eventually recover or I have to kill the Zabbix process entirely. I have everything tuned as optimally as possible (everything runs on its own server, plenty of hardware, MySQL tuned, etc.). The large tables are partitioned, and housekeeping is disabled for history/trends. I don't believe it's housekeeping, because when this occurs the housekeeper has not run. I immediately see slow queries, both SELECTs and UPDATEs, in zabbix_server.log. Are there any debugging steps I can take to gather more information from Zabbix to see what may be causing this? I am on release 3.4.7.
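One debugging step that may help (a sketch, assuming a stock 3.4.7 binary and the default config path) is to raise the log level of just the history syncer processes via runtime control while the spike is happening, then drop it back afterwards:

# raise the debug level for all history syncer processes (can be repeated to go higher)
zabbix_server -c /etc/zabbix/zabbix_server.conf -R log_level_increase="history syncer"

# ...reproduce the problem and watch /var/log/zabbix/zabbix_server.log...

# revert when done
zabbix_server -c /etc/zabbix/zabbix_server.conf -R log_level_decrease="history syncer"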
History Syncer 100%
How many history syncer processes are you running? You may need more. However, if you're seeing slow queries, you may have a DB performance issue. How large are your history_* tables? What kind of resources does your DB server have?
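If they aren't already graphed, the Zabbix internal items (included in the stock Zabbix server template) are a quick way to see whether the syncers or the history write cache are saturating; a sketch of the relevant keys:

zabbix[process,history syncer,avg,busy]    (average % of time the history syncers are busy)
zabbix[wcache,history,pfree]               (% free space in the history write cache)
zabbix[wcache,index,pfree]                 (% free space in the history index cache)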
Each of these servers runs within VMware vSphere on Ubuntu 16.04.3 LTS.

ZABBIX SERVER
4 CPU
32 GB RAM

Status of Zabbix:
Zabbix server is running: Yes (wtc2zasv01:10051)
Number of hosts (enabled/disabled/templates): 979 (886 / 4 / 89)
Number of items (enabled/disabled/not supported): 382546 (382072 / 252 / 222)
Number of triggers (enabled/disabled [problem/ok]): 172036 (171839 / 197 [150 / 171689])
Number of users (online): 11 (2)
Required server performance, new values per second: 2062.4
root@wtc2zasv01:/etc/zabbix# cat zabbix_server.conf
LogFile=/var/log/zabbix/zabbix_server.log
LogFileSize=0
PidFile=/var/run/zabbix/zabbix_server.pid
SocketDir=/var/run/zabbix
DBHost = wtc2zadb02
DBName=zabbix
DBUser=zabbix
DBPassword = ########
StartPollers = 10
StartPollersUnreachable = 10
StartPingers = 10
StartDiscoverers = 10
SNMPTrapperFile=/var/log/snmptrap/snmptrap.log
CacheSize = 2G
HistoryCacheSize = 256M
HistoryIndexCacheSize = 256M
TrendCacheSize = 256M
ValueCacheSize = 2G
Timeout=4
AlertScriptsPath=/usr/lib/zabbix/alertscripts
ExternalScripts=/usr/lib/zabbix/externalscripts
FpingLocation=/usr/bin/fping
Fping6Location=/usr/bin/fping6
#LogSlowQueries=3000
LogSlowQueries=10000
StartProxyPollers=20
ProxyConfigFrequency = 300
ProxyDataFrequency = 1
StartDBSyncers = 24
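As a rough way to see live what a pegged syncer is doing (a sketch; assumes the Linux procps tools and the default zabbix_server binary name), the processes publish their current activity in their process titles:

watch -n 2 'ps -o pid,pcpu,stat,cmd -C zabbix_server | grep "history syncer"'

The title typically shows how many values were processed in the last cycle and whether the process is currently syncing or idle, which helps tell a CPU-bound syncer from one stuck waiting on the DB.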
DB SERVER
12 CPU
64 GB RAM
DB is replicating to a hot standby using Master-Master replication
Housekeeper is disabled for History and Trends
All large tables are partitioned (history daily, trends monthly)
/var/lib/mysql - is on an EMC VMax storage SAN with 2TB of disk allocated
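When the slowdown hits, a couple of standard MySQL snapshots taken on this master (not the standby) can show whether the syncers' statements are waiting on I/O, locks, or InnoDB flushing; a sketch:

-- what every connection is doing at that moment
SHOW FULL PROCESSLIST;

-- check the SEMAPHORES, TRANSACTIONS and LOG sections, and the
-- "History list length" value, while the syncers are stuck
SHOW ENGINE INNODB STATUS\G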
mysqld.cnf
[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0
[mysqld]
#
# * Basic Settings
#
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
skip-name-resolve
#
# Instead of skip-networking the default is now to listen only on
# localhost which is more compatible and is not less secure.
#bind-address = 127.0.0.1
#
# * Fine Tuning
#
key_buffer_size = 16M
max_allowed_packet = 64M
thread_stack = 192K
thread_cache_size = 8
# This replaces the startup script and checks MyISAM tables if needed
# the first time they are touched
myisam-recover-options = BACKUP
#max_connections = 100
#table_cache = 64
#thread_concurrency = 10
#
# * Query Cache Configuration
#
query_cache_limit = 1M
query_cache_size = 0
query_cache_type = 0
#
# * Logging and Replication
#
# Both location gets rotated by the cronjob.
# Be aware that this log type is a performance killer.
# As of 5.1 you can enable the log at runtime!
#general_log_file = /var/log/mysql/mysql.log
#general_log = 1
#
# Error log - should be very few entries.
#
log_error = /var/log/mysql/error.log
#
# Here you can see queries with especially long duration
#log_slow_queries = /var/log/mysql/mysql-slow.log
#long_query_time = 2
#log-queries-not-using-indexes
#
# The following can be used as easy to replay backup logs or for replication.
# note: if you are setting up a replication slave, see README.Debian about
# other settings you may need to change.
#server-id = 1
#log_bin = /var/log/mysql/mysql-bin.log
expire_logs_days = 10
max_binlog_size = 100M
#binlog_do_db = include_database_name
#binlog_ignore_db = include_database_name
#
# * InnoDB
#
# InnoDB is enabled by default with a 10MB datafile in /var/lib/mysql/.
# Read the manual for more InnoDB related options. There are many!
innodb_buffer_pool_size = 48G
innodb_buffer_pool_instances = 16
innodb_log_file_size=8G
innodb_lru_scan_depth=256
#innodb_io_capacity = 2000
innodb_io_capacity = 10000
#
# * Security Features
#
# Read the manual, too, if you want chroot!
# chroot = /var/lib/mysql/
#
# For generating SSL certificates I recommend the OpenSSL GUI "tinyca".
#
# ssl-ca=/etc/mysql/cacert.pem
# ssl-cert=/etc/mysql/server-cert.pem
# ssl-key=/etc/mysql/server-key.pem
max_connections = 500
optimizer_switch = 'index_condition_pushdown=off'
server_id = 2
log-bin="mysql-bin"
binlog-do-db=zabbix
binlog-ignore-db=information_schema
binlog-ignore-db=mysql
replicate-ignore-db=test
replicate-ignore-db=information_schema
replicate-ignore-db=mysql
relay-log="mysql-relay-log"
auto-increment-increment = 2
auto-increment-offset = 2
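One note on the commented-out slow query section above: on the MySQL 5.7 that ships with Ubuntu 16.04 the variable is slow_query_log (log_slow_queries is the old name), and it can be enabled at runtime during an incident without a restart; a sketch:

SET GLOBAL slow_query_log_file = '/var/log/mysql/mysql-slow.log';
SET GLOBAL long_query_time = 1;   -- seconds; low enough to catch the syncers' UPDATEs
SET GLOBAL slow_query_log = 'ON';
-- turn it back off afterwards:
-- SET GLOBAL slow_query_log = 'OFF';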
I'm definitely looking at the DB as a bottleneck, but I'm trying to determine exactly what the history syncers are doing when this problem occurs. I've logged query activity during this time, and SELECTs and UPDATEs become horribly slow (I can take an UPDATE that is logged as a slow query, run it on my standby DB, and it runs instantly). However, I would think that if the DB were overrun, the problem would persist after Zabbix was shut down and restarted. I sometimes have to kill the Zabbix parent process completely to kick it free (losing all the data in the history cache), and after a restart the proxies all start pushing the data they have been collecting and storing locally back to the Zabbix server. During that backlog flush the syncers (and the DB) actually keep up fine, and the queue settles itself out within about 5-10 minutes (and the rest of the system seems OK).
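To check whether those slow UPDATEs are blocked rather than merely slow (which would fit the observation that the same statement runs instantly on the standby), a snapshot of open InnoDB transactions on the master may help; a sketch, assuming MySQL 5.6/5.7 where this information_schema table exists:

-- open transactions, oldest first; long-lived ones holding row locks are suspects
SELECT trx_id, trx_state, trx_started, trx_rows_locked, trx_query
FROM information_schema.innodb_trx
ORDER BY trx_started;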
Also, my config syncer process constantly runs at around 35%, and nothing I've tried seems to change that. Not sure if that is related?
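The same kind of internal item can confirm that figure for the configuration syncer (a sketch; the process type name assumes the 3.4 naming):

zabbix[process,configuration syncer,avg,busy]    (average % of time the configuration syncer is busy)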