We have about 25,000 active items and need roughly 60 checks/second to keep up, which normally runs fine. But when we hit connectivity problems, as we did today, the server can't connect and the queue naturally climbs; this time it reached 18,000.
When connectivity came back, though, the queue didn't go down, even though the DB was showing over 675 updates/second and 1,400 reads/second; the DB was I/O-saturated on log syncs. Relaxing the log syncs (MySQL's sync-on-commit set to 0) raised the transaction rate to 5,000/second, and the queue drained in less than a minute.
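For reference, the change amounts to something like the sketch below. It is only an illustration, assuming the "sync on commit" knob is InnoDB's innodb_flush_log_at_trx_commit (with sync_binlog as the binary-log counterpart); PyMySQL and the credentials are placeholders, not a record of the exact commands used:

```python
# Illustrative sketch only: relax per-commit log syncs at runtime.
# Assumptions: "sync on commit" = innodb_flush_log_at_trx_commit,
# binlog syncs = sync_binlog; credentials are placeholders and the
# account needs privileges for SET GLOBAL.
import pymysql

conn = pymysql.connect(host="localhost", user="admin",
                       password="secret", database="zabbix")
try:
    with conn.cursor() as cur:
        # 0 = do not flush the InnoDB log to disk on every commit
        # (it is flushed roughly once per second instead)
        cur.execute("SET GLOBAL innodb_flush_log_at_trx_commit = 0")
        # 0 = let the OS decide when the binary log is synced to disk
        cur.execute("SET GLOBAL sync_binlog = 0")
finally:
    conn.close()
```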
My question is two-fold:
1) The queue is a bit fuzzy to me, but I assume it is the number of items past their scheduled check time. If that's true, and the server was checking far more than the 60/second needed to keep up (it was doing 1,200/sec), why did the queue keep rising? I assume that when an item is checked, its next check time is set in the future (now + interval), not at the last scheduled check time plus the interval; see the first sketch after question 2.
2) One of the great 1.8 performance improvements seems to be batched updates: at 60 updates/second I normally see only 1-2 log syncs per second, indicating Zabbix performs many updates/inserts and then commits. But during this high-queue period, 675 updates/sec gave me 675 DB syncs/sec (actually 1,300, counting both the transaction log and the binlog). Does the server run out of RAM, or is there some other load factor that causes it to switch to one update/insert per transaction?
If so, we lose the performance benefit and a high queue can never be worked off, since the system gets much slower when the server does one update per transaction instead of dozens; see the second sketch below.
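To make question 1 concrete, here is a toy model of the two rescheduling policies I have in mind. It is my own sketch, not Zabbix code; the 600-second outage is an assumed figure, and the item count, average interval, and check rate are the numbers above:

```python
import heapq, random

def drain_time(policy, items=25_000, interval=420, checks_per_sec=1_200,
               outage=600, horizon=600):
    """Toy scheduler: every item starts overdue, as if after an
    `outage`-second connectivity loss. Returns the number of seconds
    until no item is past its check time (the 'queue' is empty)."""
    random.seed(1)
    heap = [(random.uniform(-outage, 0.0), i) for i in range(items)]  # (due_time, item)
    heapq.heapify(heap)
    for now in range(horizon):
        for _ in range(checks_per_sec):          # this second's polling budget
            if heap[0][0] > now:
                break                            # nothing overdue right now
            due, item = heapq.heappop(heap)
            if policy == "now+interval":
                heapq.heappush(heap, (now + interval, item))   # relative to now
            else:                                # "last_scheduled+interval"
                heapq.heappush(heap, (due + interval, item))   # relative to old slot
        if heap[0][0] > now:                     # queue is empty
            return now
    return None                                  # never drained within the horizon

for policy in ("now+interval", "last_scheduled+interval"):
    print(policy, "-> queue empty after", drain_time(policy), "seconds")
```

Under "now + interval" each overdue item needs exactly one check to get back on schedule, while "last scheduled + interval" can leave an item still overdue after a check, so the backlog takes longer to clear.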
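And a second sketch for question 2: a rough ceiling calculation, assuming every commit has to wait for one log flush when sync-on-commit is enabled, and that a row update costs on the order of 50 microseconds of non-I/O work (both assumptions mine; the ~675 flushes/second is the rate observed above):

```python
def max_updates_per_sec(rows_per_commit, fsync_per_sec=675, row_cost_us=50):
    """Rough ceiling on update throughput when every commit waits for a
    synchronous log flush. Assumptions (mine, not measured): the disk
    sustains ~675 flushes/s and each row costs ~50 us of CPU/buffer work."""
    per_txn_seconds = 1.0 / fsync_per_sec + rows_per_commit * row_cost_us / 1e6
    return rows_per_commit / per_txn_seconds

for n in (1, 10, 100, 1000):
    print(f"{n:>4} rows/commit -> ~{max_updates_per_sec(n):,.0f} updates/sec")
```

With one row per commit the ceiling sits just under the flush rate, which matches the ~675 updates/second I saw; with dozens or hundreds of rows per commit it climbs into the thousands, which matches the 1-2 syncs/second I normally see at 60 updates/second.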
I hope this is clear. We are a heavy Zabbix user on our way to being one of the world's largest, at 10x and 100x our current size, so we really need to understand these dynamics when we're at thousands of updates/second and 10,000 hosts.