Hmm, another question: have you taken a look at the Postgres performance statistics? Have you activated the slow query log? Have you checked the Zabbix internal monitoring?
Zabbix Postgresql Locking
-
Yes, Yes, and Yes
We have the slow query log on. That is where I can see the problems with deadlocks and share locks. I can see the DML happening that results in the deadlock as well. In terms of Zabbix internal monitoring, I watch it like a hawk, and again I have made what I believe are the proper adjustments to accommodate the number of items and hosts we need to monitor. I have turned the DB syncers back to 2 to see if that reduces the deadlock issue.
-
Receiving same error also
Hello:
My system, a Zabbix/pgsql/DRBD cluster, is also receiving similar output
in /var/log/messages as follows:
postgres[28915]: [140-198] #011update items set lastclock=1361938976,prevorgvalue='32345155',prevvalue=lastvalue,lastvalue='0' where itemid=37142;
(not the same itemid every time). The system is logging so many of these that syslog rate-limits the messages. It's usually happening about every 60 seconds.
I do notice in the Administration -> Queue menu on the Zabbix dashboard that the queue has a few requests queued for over an hour.
A Google search on the PostgreSQL mailing list archives turned up this: http://www.postgresql.org/message-id...mail.gmail.com
Essentially, Jeff Janes suggested the error messages may be a symptom of deadlock. He states: "The problem you are seeing is with deadlocks, not with sharelocks. (The fact that transaction waits are implemented as ShareLocks is an internal implementation, and perhaps the error message should be changed so as not to leak that to the user.) The problem is not with contention in general, but just the special type of contention that leads to cyclic dependencies, i.e. deadlocks." He also suggests: "One possibility is that they are bunching unrelated real-world transactions into fewer database transactions in an attempt to improve 'group commit' performance. In which case, they should probably not do that and instead turn synchronous commit off."
My required server performance is only about 130 new values per second, with roughly 7000 items being monitored. Dell R710 servers should be able to handle it.
I tried doubling my DBSyncers to 8, but that didn't help, nor did reducing them to two. I also received checkpoint hint messages in the log; however, increasing checkpoint_segments from the default of 3 to 10 helped alleviate that error message.
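For reference, that change amounts to a single line in postgresql.conf (10 is just what worked for this write load, not a general recommendation):
Code:
# postgresql.conf -- fewer forced checkpoints under heavy write load
checkpoint_segments = 10     # default is 3; each segment is 16MB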
Did you find a solution to your problem ?
Thanks
John Matchet
-
Postgresql -- Performance
We ended up migrating everything to MySQL. I was reluctant to go down this path as I still prefer PostgreSQL. However, you are correct: deadlocks are a problem. There are several issues with the Zabbix architecture that I would like to point out.
First, as you describe, Zabbix groups a large number of updates into a single transaction. With row-level locking, this can result in a deadlock if two transactions want to update the same record, and it causes entire transactions to fail. In our case this translated into holes in the graphs.
Additionally, if you look at the ids table, which stores the "next id" for various tables, my understanding is that Zabbix pounds on this table trying to obtain the next identifier, which can result in problems as well. I have seen issues even with MySQL where I receive the error "Something impossible has just happened". Looking through the source code, I believe this is a failure to update the ids table with the next value due to a deadlock.
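To illustrate the "next id" contention (this is a rough sketch of the pattern, not the exact SQL Zabbix runs, and the column names are approximate): every writer that needs a new primary key first bumps a shared counter row, so all insert paths serialize on, and can deadlock around, that single row.
Code:
-- hypothetical sketch of the shared-counter pattern described above
BEGIN;
UPDATE ids SET nextid = nextid + 1
 WHERE table_name = 'graphs_items' AND field_name = 'gitemid';   -- takes a row lock held until COMMIT
SELECT nextid FROM ids
 WHERE table_name = 'graphs_items' AND field_name = 'gitemid';
-- use the returned value as the new gitemid, insert the row, then release the lock
COMMIT;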
I would think that either reducing the number of DB syncers or reducing the number of updates per transaction might help.
MySQL/InnoDB has options that allow for less strict locking rules, which seems to perform better at the cost of read consistency.
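As a sketch of what I mean (illustrative settings only, not a tuned configuration; test before relying on them): running InnoDB at READ COMMITTED isolation reduces the gap/next-key locking you get at the default REPEATABLE READ, and relaxing the per-commit log flush trades a little durability for throughput.
Code:
# my.cnf [mysqld] -- illustrative only
transaction-isolation = READ-COMMITTED
innodb_flush_log_at_trx_commit = 2   # flush the InnoDB log roughly once per second instead of at every commit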
I think the bottom line is that the Zabbix developers have spent more time optimizing for MySQL.
If you figure out how to optimize PostgreSQL for Zabbix, I would be very interested in knowing how you did it.
-
Bug Report
There are several bug reports already that revolve around this issue, and honestly, it is not what I would consider a bug, but more of an architectural design constraint/flaw. For example, storing the next ID in the ids table is a point of contention: you are relying on the database to maintain consistency, but this falls apart in larger installations, where MySQL/Oracle/PostgreSQL get tied up with deadlocks. There needs to be another mechanism, possibly some kind of in-memory data structure, possibly pre-allocating key spaces (see the sketch below), or alternatively using a very large string that is guaranteed to be unique over a very large range of values.
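A rough idea of what pre-allocating key space could look like (purely illustrative; nothing like this exists in Zabbix today): each syncer reserves a block of identifiers in one statement and then hands them out from memory, touching the shared row once per thousand inserts instead of once per insert.
Code:
-- reserve a block of 1000 ids in a single round trip (PostgreSQL syntax)
UPDATE ids SET nextid = nextid + 1000
 WHERE table_name = 'graphs_items' AND field_name = 'gitemid'
 RETURNING nextid;
-- the caller now owns ids (nextid - 999) .. nextid and can assign them without touching the table again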
Additionally, there is the problem of concurrent updates of large datasets. I think there should be one process that performs updates, or alternatively a process that coordinates updates across multiple update processes, to ensure that the same record is not being updated in more than one transaction at a time. Obviously a single update process would eventually be a bottleneck, but having some process coordinate/broker updates for multiple update processes could scale. (A simpler mitigation is sketched below.)
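Short of a full update broker, a common way to prevent the cyclic waits themselves is simply to have every bulk-update transaction touch rows in the same order, e.g. sorted by primary key. This is a generic deadlock-avoidance technique, not something Zabbix does today; the itemids below are taken from the logs earlier in this thread.
Code:
-- if every transaction issues its updates in ascending itemid order,
-- two transactions may still block each other briefly but can no longer form a cycle
UPDATE items SET lastclock = 1366600993, lastvalue = '0.410000'     WHERE itemid = 18468;
UPDATE items SET lastclock = 1366600994, lastvalue = '47565.480536' WHERE itemid = 22474;
UPDATE items SET lastclock = 1366600994, lastvalue = '2470.644372'  WHERE itemid = 23794;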
Also, the changes I am proposing would open up the possibility of sharding databases, and maybe even the evolution of a Zabbix cluster environment, which would make this product superior to anything out there.
Finally, the web interface needs to be optimized for handling larger numbers of hosts, but let's take one thing at a time.
-
you mean something like this: https://support.zabbix.com/browse/ZBXNEXT-1006 ??
if yes, you should vote for it
Debian-User
Sorry for my bad english
-
same deadlock problem
I just migrated to Zabbix 2.0.5 on PostgreSQL 9.2.4 on a Dell R620 with 16 cores, 64G RAM, 8x15k SAS in RAID 10.
Required server performance is 1,117.95 new values per second, with 30,784 items across 540 hosts.
I'm simply trying to copy the CPU Utilization graph from the Linux OS Template and I trigger this sharelock contention.
I migrated to PostgreSQL to take advantage of table partitioning. I'm a little dismayed to hit something so foreboding right off the bat.
This is the postgres log as I attempt to copy the graph.
I haven't seen holes in graphs yet, but I'm sure that's because I just migrated.
Code:
2013-04-21 20:21:42 PDT LOG: database system is ready to accept connections
2013-04-21 20:23:16 PDT LOG: process 13626 still waiting for ShareLock on transaction 91172106 after 1000.072 ms
2013-04-21 20:23:16 PDT STATEMENT: update items set lastclock=1366600994,lastns=122711294,prevorgvalue='34664311100.000000',prevvalue=lastvalue,lastvalue='2470.644372' where itemid=23794; update items set lastclock=1366600994,lastns=141492835,prevorgvalue='429154405.000000',prevvalue=lastvalue,lastvalue='0.000000' where itemid=23799; ... 520 similar lines omitted ...
2013-04-21 20:23:16 PDT LOG: process 13620 still waiting for ShareLock on transaction 91172106 after 1000.064 ms
2013-04-21 20:23:16 PDT STATEMENT: update items set lastclock=1366600993,lastns=954673783,prevvalue=lastvalue,lastvalue='0.410000' where itemid=18468; update items set lastclock=1366600994,lastns=23826726,prevorgvalue='144543918159.000000',prevvalue=lastvalue,lastvalue='47565.480536' where itemid=22474; ... 1000 similar lines omitted ...
2013-04-21 20:23:18 PDT LOG: process 13932 detected deadlock while waiting for ShareLock on transaction 91172109 after 1000.076 ms
2013-04-21 20:23:18 PDT CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."items" x WHERE "itemid" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x"
2013-04-21 20:23:18 PDT STATEMENT: INSERT INTO graphs_items (gitemid,graphid,itemid,drawtype,sortorder,color,yaxisside,calc_fnc,type) VALUES ('24648','7385','82633','1','0','FF5555','0','2','0')
2013-04-21 20:23:18 PDT ERROR: deadlock detected
2013-04-21 20:23:18 PDT DETAIL: Process 13932 waits for ShareLock on transaction 91172109; blocked by process 13620. Process 13620 waits for ShareLock on transaction 91172106; blocked by process 13932.
Process 13932: INSERT INTO graphs_items (gitemid,graphid,itemid,drawtype,sortorder,color,yaxisside,calc_fnc,type) VALUES ('24648','7385','82633','1','0','FF5555','0','2','0')
Process 13620: update items set lastclock=1366600993,lastns=954673783,prevvalue=lastvalue,lastvalue='0.410000' where itemid=18468; update items set lastclock=1366600994,lastns=23826726,prevorgvalue='144543918159.000000',prevvalue=lastvalue,lastvalue='47565.480536' where itemid=22474;
Process 13932: INSERT INTO graphs_items (gitemid,graphid,itemid,drawtype,sortorder,color,yaxisside,calc_fnc,type) VALUES ('24648','7385','82633','1','0','FF5555','0','2','0')
Process 13620: update items set lastclock=1366600993,lastns=954673783,prevvalue=lastvalue,lastvalue='0.410000' where itemid=18468; update items set lastclock=1366600994,lastns=23826726,prevorgvalue='144543918159.000000',prevvalue=lastvalue,lastvalue='47565.480536' where itemid=22474; update items set lastclock=1366600994,lastns=31786623,prevvalue=lastvalue,lastvalue='0' where itemid=22514; update items set lastclock=1366600993,lastns=937878588,prevvalue=lastvalue,lastvalue='0.000000' where itemid=23823; update items set lastclock=1366600993,lastns=847308377,prevvalue=lastvalue,lastvalue='4553986048' where itemid=23863; update items set lastclock=1366600993,lastns=964036356,prevorgvalue='127891560.000000',prevvalue=lastvalue,lastvalue='0.000000' where itemid=23898; update items set lastclock=1366600994,lastns=38800778,prevvalue=lastvalue,lastvalue='1.930000' where itemid=23924; update items set lastclock=1366600993,lastns=952434581,prevorgvalue='545842338028.000000',prevvalue=lastvalue,lastvalue='20067.682372' where itemid=
2013-04-21 20:23:18 PDT HINT: See server log for query details.
2013-04-21 20:23:18 PDT CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."items" x WHERE "itemid" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x"
2013-04-21 20:23:18 PDT STATEMENT: INSERT INTO graphs_items (gitemid,graphid,itemid,drawtype,sortorder,color,yaxisside,calc_fnc,type) VALUES ('24648','7385','82633','1','0','FF5555','0','2','0')
2013-04-21 20:23:18 PDT LOG: process 13626 acquired ShareLock on transaction 91172106 after 3436.356 ms
2013-04-21 20:23:18 PDT STATEMENT: update items set lastclock=1366600994,lastns=122711294,prevorgvalue='34664311100.000000',prevvalue=lastvalue,lastvalue='2470.644372' where itemid=23794; update items set lastclock=1366600994,lastns=141492835,prevorgvalue='429154405.000000',prevvalue=lastvalue,lastvalue='0.000000' where itemid=23799; ... 520 similar lines omitted ...
Is there any hope of a fix in the works?
Perhaps I could handle the housekeeping process and MySQL on this new server... I would hate to have to migrate again.
-
The default appears to be 4, so 4.
History syncer process is 8% busy. History write cache is 99.78% free, trend write cache is 66.52% free. CPU idle time is 98% and iowait time is 0.02%.
Here are my changes from the stock zabbix_server.conf.
Code:
StartPollers=10
StartIPMIPollers=1
StartPollersUnreachable=10
StartTrappers=20
StartPingers=15
StartDiscoverers=10
DisableHousekeeping=1
CacheSize=32M
HistoryCacheSize=12M
TrendCacheSize=8M
Timeout=15
Here are my changes to the stock postgresql.conf.
Code:
shared_buffers = 32GB                # min 128kB
temp_buffers = 128MB                 # min 800kB
work_mem = 128MB                     # min 64kB
maintenance_work_mem = 128MB         # min 1MB
checkpoint_segments = 20             # in logfile segments, min 1, 16MB each
effective_cache_size = 48GB          # this is conservative given we have 64G
default_statistics_target = 1000     # range 1-10000
log_min_duration_statement = 1000    # -1 is disabled, 0 logs all statements
log_line_prefix = '%t '              # special values:
log_lock_waits = on                  # log lock waits >= deadlock_timeout
deadlock_timeout = 2s
By the way, my test now is not to copy the CPU Utilization graph but to try to recreate it. If I add a 5th item and click save, I'll get this in the browser, with all the other details on the server side posted previously. Adding just 4 items has worked so far.
I changed my DBSyncers to 2, restarted zabbix-server, and tried the test again. Same result. As expected, the history syncer is now 16.4% busy.
Code:
pg_query(): Query failed: ERROR: deadlock detected
DETAIL: Process 8760 waits for ShareLock on transaction 99812014; blocked by process 8620.
Process 8620 waits for ShareLock on transaction 99811015; blocked by process 8760.
HINT: See server log for query details.
CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."items" x WHERE "itemid" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x" [include/db.inc.php:511]
Error in query [INSERT INTO graphs_items (gitemid,graphid,itemid,drawtype,sortorder,color,yaxisside,calc_fnc,type) VALUES ('24074','6989','82633','0','0','FF5555','0','2','0')] [ERROR: deadlock detected
DETAIL: Process 8760 waits for ShareLock on transaction 99812014; blocked by process 8620.
Process 8620 waits for ShareLock on transaction 99811015; blocked by process 8760.
HINT: See server log for query details.
CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."items" x WHERE "itemid" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x"]
SQL statement execution has failed "INSERT INTO graphs_items (gitemid,graphid,itemid,drawtype,sortorder,color,yaxisside,calc_fnc,type) VALUES ('24074','6989','82633','0','0','FF5555','0','2','0')".
-
Zabbix - Locking
Zabbix actually has problems with lock contention in the database that prevent it from scaling. Databases are designed to ensure data consistency and, as a result, implement various algorithms to ensure that no two transactions can update a row simultaneously. Fundamentally, this is where Zabbix runs into problems, as you have noticed with lock contention (share locks in PostgreSQL).
Whether on MySQL or PostgreSQL, eventually you will hit a limit in terms of how far Zabbix can scale. We moved to MySQL, but looking at the InnoDB engine status I can see that the mutexes and spin waits are off the charts. Problem areas for Zabbix:
1. The ids table storing the next unique identifier for each table is a bad idea. I see the error "Something impossible just happened" a lot, so it is not uncommon.
2. Somewhere in the write cache, Zabbix needs to look at how history data is being inserted and, rather than depending on the database to figure out what to do, sort and organize the updates so that the same row is not being updated in more than one transaction at a time. Something like serializing the updates.
But this does not answer your question of how to fix the problem. If you can eliminate the housekeeper, that will help. If you can relax some of the database's consistency controls, at the risk of losing some data, you are better off.
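Concretely, on PostgreSQL the relaxation I have in mind looks something like the line below (a sketch only; synchronous_commit = off is the same knob Jeff Janes mentioned earlier in this thread, and it risks losing the last few committed transactions on a crash rather than corrupting data):
Code:
# postgresql.conf -- trade a little durability for write throughput
synchronous_commit = off    # commits return before the WAL is flushed to disk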
-
helpful bug fix in 2.0.6?
In my case, the housekeeper is already disabled.
This looks encouraging. This fix is in 2.0.6.
* https://support.zabbix.com/browse/ZBX-5225
Fortunately it looks like I'm not seeing these deadlocks frequently. They seem to only coincide with my manipulation of templates. I'm using Discovery, but haven't used Low Level Discovery yet.
I did raise the deadlock_timeout in Postgres from 1s to 2s, but I don't think that accounts for anything yet.
Code:
# grep 'deadlock detected' postgresql-Sun.log | wc -l
55
# grep 'deadlock detected' postgresql-Mon.log | wc -l
0
-
Templates
We actually had to turn off the Zabbix server process when performing mass updates of hosts and templates due to deadlock conditions. This is because the server process and the frontend compete for resources. Not a big deal if you are using proxies, since they can cache information, but it's not an ideal solution either.
Thanks,
Randall