Zabbix Postgresql Locking

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • rsvancara
    Member
    • Jul 2012
    • 42

    #1

    Zabbix Postgresql Locking

    We frequently see ShareLock problems with Zabbix when it updates the items table. Unlike other bugs reported in Jira, where this happens on the ids table, we are seeing it on the items table as a result of concurrent updates and PostgreSQL's row-level locking.

    The errors we see are:

    2012-11-01 00:14:29 PDT zabbix zabbix 127.0.0.1 - DETAIL: Process 5985 waits for ShareLock on transaction 48349150; blocked by process 5983.
    Process 5983 waits for ShareLock on transaction 48349166; blocked by process 5985.
    Process 5985: update items set lastclock=1351754060,lastns=334498000,prevvalue=lastvalue,lastvalue='1351754060' where itemid=23420;
    update items set lastclock=1351754061,lastns=339616000,prevvalue=lastvalue,lastvalue='99.967760' where itemid=23421;
    update items set lastclock=1351754066,lastns=667765507,prevvalue=lastvalue,lastvalue='0.137500' where itemid=23897;
    update items set lastclock=1351754066,lastns=669010892,prevorgvalue='4662521788565',prevvalue=lastvalue,lastvalue='77696' where itemid=23957;
    update items set lastclock=1351754058,lastns=314361000,prevvalue=lastvalue,lastvalue='478419185664' where itemid=24798;
    update items set lastclock=1351754059,lastns=323575000,prevvalue=lastvalue,lastvalue='0.000000' where itemid=24799;
    update items set lastclock=1351754060,lastns=334898000,prevvalue=lastvalue,lastvalue='0.016120' where itemid=24800;
    update items set lastclock=1351754061,lastns=343300000,prevvalue=lastvalue,lastvalue='0.003640' where itemid=24801;
    update items set lastclock=1351754058,lastns=317769000,prevorg
    Process 5983: update items set lastclock=1351754065,lastns=641967535,prevorgvalue='29892503160',prevvalue=lastvalue,lastvalue='7791' where itemid=23776;
    update items set lastclock=1351754065,lastns=644268407,prevvalue=lastvalue,lastvalue='0.145000' where itemid=23896;
    update items set lastclock=1351754064,lastns=216544563,prevorgvalue='0',prevvalue=lastvalue,lastvalue='0' where itemid=23955;
    update items set lastclock=1351754065,lastns=642674637,prevorgvalue='9239423731381',prevvalue=lastvalue,lastvalue='7635408' where itemid=23956;
    update items set lastclock=1351754053,lastns=273681000,prevvalue=lastvalue,lastvalue='87.648513' where itemid=24073;
    update items set lastclock=1351754053,lastns=282197000,prevvalue=lastvalue,lastvalue='495' where itemid=24733;
    update items set lastclock=1351754054,lastns=284275000,prevvalue=lastvalue,lastvalue='0.000000' where itemid=24734;
    update items set lastclock=1351754053,lastns=278317000,prevvalue=lastvalue,lastvalue='96.814452' where itemid=24793;
    update items set lastclock=1351754054
    2012-11-01 00:14:29 PDT zabbix zabbix 127.0.0.1 - HINT: See server log for query details.

    As of this morning, we have seen this 585 times; transactions appear to be stepping on top of each other. This has been an issue since Zabbix 2.0.1, when we started using Zabbix in a large university environment. It is causing a performance bottleneck, as Postgres must wait for one transaction to finish before committing another.

    Yesterday we had 10,056 ShareLock instances.

    Perhaps 32 StartDBSyncers is too high? Any recommendations? Does anyone on this forum really know Zabbix? I only ask because it seems like people do not get much help here, and the answers seem more like guesses.
  • Colttt
    Senior Member
    Zabbix Certified Specialist
    • Mar 2009
    • 878

    #2
    Hello!

    Maybe you can create a bug report for this?

    Otherwise, you can use pg_pool to manage the connections; some other people use it with Zabbix (f.koch?) and it works fine.
    Debian-User

    Sorry for my bad english


    • rsvancara
      Member
      • Jul 2012
      • 42

      #3
      Created a Bug Report

      I created a bug report. However, from what I can see of the way Zabbix is designed, it will not work well in a large-scale, enterprise environment on PostgreSQL, due to the way PostgreSQL implements locking and the way Zabbix brutally shoves updates into the items table. This is not PostgreSQL's fault; it is the software engineer(s) not recognizing the limitations of the database platform and providing the proper optimizations to support it.

      If you are interested, check out dbcache.c and look at how it creates a single large transaction of roughly 1000 update statements against the items table. Now imagine a scenario where two or more of these transactions touch the same itemid: you get lock contention, and the database must figure out what to do. In some cases there are deadlocks, and the database must perform an expensive check to detect that condition, which ultimately leads to other problems.

      At least that has been my observation.
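
      To make the pattern concrete, here is a minimal sketch (hypothetical itemids, run in two concurrent psql sessions) of how two batched syncer transactions that update the same rows in a different order end up waiting on each other, exactly as in the ShareLock log above:

      Code:
      -- Session A:
      BEGIN;
      UPDATE items SET lastclock = 0 WHERE itemid = 23420;  -- A takes a row lock on 23420

      -- ...meanwhile, Session B:
      BEGIN;
      UPDATE items SET lastclock = 0 WHERE itemid = 23421;  -- B takes a row lock on 23421

      -- Session A:
      UPDATE items SET lastclock = 0 WHERE itemid = 23421;  -- A blocks, waiting on B's transaction
      -- Session B:
      UPDATE items SET lastclock = 0 WHERE itemid = 23420;  -- B blocks on A: deadlock detected,
                                                            -- one transaction is rolled back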


      • rsvancara
        Member
        • Jul 2012
        • 42

        #4
        pg_pool

        I do not see how this helps, as Zabbix's processes hold their connections open, so a pool is essentially worthless unless it can perform caching of some kind or transparently fix the lock contention by serializing the inserts.


        • f.koch
          Member
          Zabbix Certified Specialist
          • Feb 2010
          • 85

          #5
          Originally posted by rsvancara
          Perhaps 32 StartDBSyncers is too high? Any recommendations? Does anyone on this forum really know Zabbix? I only ask because it seems like people do not get much help here, and the answers seem more like guesses.
          Hi, this is your problem. Set the DBSyncers back to the default (comment the setting out); you don't need more syncers. The high syncer setting is the cause of the ShareLock errors. (It's not the root cause, but with the default you won't hit it.)
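
          For reference, this is the relevant line in zabbix_server.conf; leaving it commented out gives you the default of 4 syncers (a sketch of the stock configuration file):

          Code:
          ### Option: StartDBSyncers
          # Number of database syncer (history syncer) processes.
          # Commented out = use the default of 4.
          # StartDBSyncers=4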

          Why did you set the DB syncers to 32?

          regards flo
          Last edited by f.koch; 02-11-2012, 10:00.


          • Colttt
            Senior Member
            Zabbix Certified Specialist
            • Mar 2009
            • 878

            #6
            Can you please post your bug report (number)?
            Debian-User

            Sorry for my bad english


            • rsvancara
              Member
              • Jul 2012
              • 42

              #7
              StartDBSyncers

              In order to keep up with the workload: we were seeing ShareLock and deadlock problems even at 6 StartDBSyncers. I can see how this would be problematic, but the setting was increased in an effort to improve throughput, which clearly is not working.


              • f.koch
                Member
                Zabbix Certified Specialist
                • Feb 2010
                • 85

                #8
                Originally posted by rsvancara
                In order to keep up with the workload: we were seeing ShareLock and deadlock problems even at 6 StartDBSyncers. I can see how this would be problematic, but the setting was increased in an effort to improve throughput, which clearly is not working.
                Hm, do you know the internal process monitoring? Take a look; if there is a problem with the history syncer processes (that's the name for the DB syncers), you will see it there.
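
                For example, an internal item along these lines (a sketch; key syntax per Zabbix's internal checks) returns the average percentage of time the history syncer processes are busy:

                Code:
                zabbix[process,history syncer,avg,busy]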

                Can you give some info about your system? DB size, new values per second, monitored hosts?

                Which throughput would you like to increase?


                • rsvancara
                  Member
                  • Jul 2012
                  • 42

                  #9
                  System Specifications

                  Code:
                  Zabbix server is running	Yes	its-zabbix-01.it.wsu.edu:10051
                  Number of hosts (monitored/not monitored/templates)	3830	3743 / 29 / 58
                  Number of items (monitored/disabled/not supported)	256323	241129 / 2622 / 12572
                  Number of triggers (enabled/disabled)[problem/unknown/ok]	197027	197011 / 16  [125 / 0 / 196886]
                  Number of users (online)	66	12
                  Required server performance, new values per second	2159.63	 -
                  Database Server/Zabbix Server (Physical Hardware)
                  Memory: 96GB RAM
                  CPU: 12 Core
                  Disks: 6 x 15K RPM drives (300GB each), RAID 1+0

                  Two Proxy Servers
                  32GB RAM each
                  8 CPU Cores
                  Mirrored 10K RPM, 300GB drives.

                  I can see large load spikes that correlate with the db syncers running.


                  • rsvancara
                    Member
                    • Jul 2012
                    • 42

                    #10
                    Additional Information

                    We have the history tables partitioned and the housekeeper is turned off.
                    Additionally, I have to manually vacuum the items table; otherwise it grows to over 44GB for just over 200,000 records. Perhaps tuning back the number of syncer processes will help.
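
                    In the meantime, the manual vacuum and some per-table autovacuum tuning can be done in plain SQL (a sketch using standard PostgreSQL storage parameters; the thresholds are only illustrative):

                    Code:
                    -- Reclaim dead tuples by hand (plain VACUUM does not block updates;
                    -- VACUUM FULL would shrink the file but takes an exclusive lock).
                    VACUUM ANALYZE items;

                    -- Make autovacuum fire much more aggressively on this one hot table,
                    -- e.g. after roughly 1% of its ~200,000 rows have been updated.
                    ALTER TABLE items SET (autovacuum_vacuum_scale_factor = 0.01,
                                           autovacuum_vacuum_threshold = 200);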


                    • f.koch
                      Member
                      Zabbix Certified Specialist
                      • Feb 2010
                      • 85

                      #11
                      Originally posted by rsvancara
                      Code:
                      Zabbix server is running	Yes	its-zabbix-01.it.wsu.edu:10051
                      Number of hosts (monitored/not monitored/templates)	3830	3743 / 29 / 58
                      Number of items (monitored/disabled/not supported)	256323	241129 / 2622 / 12572
                      Number of triggers (enabled/disabled)[problem/unknown/ok]	197027	197011 / 16  [125 / 0 / 196886]
                      Number of users (online)	66	12
                      Required server performance, new values per second	2159.63	 -
                      Database Server/Zabbix Server (Physical Hardware)
                      Memory: 96GB RAM
                      CPU: 12 Core
                      Disks: 6 x 15K RPM drives (300GB each), RAID 1+0

                      Two Proxy Servers
                      32GB RAM each
                      8 CPU Cores
                      Mirrored 10K RPM, 300GB drives.

                      I can see large load spikes that correlate with the db syncers running.
                      Load spikes? You mean system load? Have you tuned your PostgreSQL configuration?
                      In such a big environment, you should run the Zabbix server and the database on different hardware.

                      How big is your database?


                      • rsvancara
                        Member
                        • Jul 2012
                        • 42

                        #12
                        Database Server

                        We have tuned the database server per the PostgreSQL documentation. Here are our modifications to the default postgresql.conf file:

                        Code:
                        listen_addresses = '*'
                        max_connections = 300 # (change requires restart)
                        shared_buffers = 40GB # min 128kB # WSUIT
                        work_mem = 1024MB # WSU IT
                        maintenance_work_mem = 512MB # WSU IT
                        effective_io_concurrency = 4
                        synchronous_commit = off # immediate fsync at commit # WSUIT
                        wal_buffers = 1024MB
                        checkpoint_segments = 512 # in logfile segments, min 1, 16MB each # WSUIT
                        checkpoint_completion_target = 0.9 # checkpoint target duration, 0.0 - 1.0 # WSUIT
                        effective_cache_size = 48GB # WSUIT
                        logging_collector = on # Enable capturing of stderr and csvlog
                        log_directory = '/var/log/pg_log' # directory where log files are written, WSUIT
                        log_filename = 'postgresql-%a.log' # log file name pattern,
                        log_truncate_on_rotation = on # If on, an existing log file of the
                        log_rotation_size = 100MB # Automatic rotation of logfiles will
                        log_min_error_statement = error # values in order of decreasing detail:
                        log_min_duration_statement = 1000 # WSUIT -- Log statements running over one second
                        log_line_prefix = '%t %u %d %h - ' # special values:
                        log_lock_waits = on # WSUIT - Turn on logging waits
                        autovacuum_max_workers = 6 # WSUIT -- Increase these for our larger tables
                        datestyle = 'iso, mdy'
                        lc_messages = 'en_US.UTF-8' # locale for system error message
                        lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
                        lc_numeric = 'en_US.UTF-8' # locale for number formatting
                        lc_time = 'en_US.UTF-8' # locale for time formatting

                        postgres=# select version();
                        version
                        --------------------------------------------------------------------------------------------------------------
                        PostgreSQL 9.1.6 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit


                        We are unable to run the database on separate hardware at this time. In terms of load, yes, I am referring to high system CPU utilization; but I/O is at 0%, which seems indicative of something waiting on something else.

                        The database cluster occupies about 250GB of disk space.
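
                        One way we confirm that backends are waiting on locks rather than on I/O is to query the statistics views while the syncers run (a sketch; the column names are for the 9.1 catalogs, where the backend pid is still called procpid):

                        Code:
                        -- Blocked backends and what they are running (PostgreSQL 9.1;
                        -- 9.2 renamed procpid/current_query to pid/query).
                        SELECT procpid, waiting, current_query
                          FROM pg_stat_activity
                         WHERE waiting;

                        -- Locks that have been requested but not yet granted.
                        SELECT locktype, transactionid, pid, mode
                          FROM pg_locks
                         WHERE NOT granted;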


                        • f.koch
                          Member
                          Zabbix Certified Specialist
                          • Feb 2010
                          • 85

                          #13
                          OK, so you have high CPU load but not much I/O wait when the DB syncers are running. So where is the problem?


                          • rsvancara
                            Member
                            • Jul 2012
                            • 42

                            #14
                            Good Question

                            All I know is that a parallel implementation running MySQL on a system with 16GB of RAM is outperforming Postgres and Zabbix while monitoring the same number of items, though not as many hosts.

                            Our queues are backed up, and I cannot figure out how to push more data through Zabbix and Postgres. Additionally, the high CPU load is a result of misconfigured syncers; I do understand that. But normally our system sits mostly idle, yet the queue remains backed up, and our data is full of holes.

                            If you have ideas, let me know; otherwise, I have looked into everything, including running the server in debug mode and using strace.

                            We are actively converting everything over to MySQL now. The only issue is going to be the housekeeper, but I think a simple DELETE ... WHERE statement, as sketched below, will clean up the history tables and allow us to turn off the housekeeper.
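
                            For what it's worth, the cleanup I have in mind is along these lines (a sketch in MySQL syntax, since that is where we are headed; assumes the standard Zabbix schema, where clock is a Unix epoch, with an illustrative 30-day retention):

                            Code:
                            -- Purge history rows older than 30 days; repeat for history_uint,
                            -- history_str, history_text and history_log.
                            DELETE FROM history
                             WHERE clock < UNIX_TIMESTAMP(NOW() - INTERVAL 30 DAY);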


                            • f.koch
                              Member
                              Zabbix Certified Specialist
                              • Feb 2010
                              • 85

                              #15
                              Have you considered asking Zabbix support for help? Do you have a support contract? With such a big installation it makes sense.

                              I don't think this is a problem we can solve just by posting information in the forum.
                              Last edited by f.koch; 02-11-2012, 21:34.

