'ids' table very busy

  • Markus
    Member
    • Jan 2006
    • 39

    #1

    'ids' table very busy

    Hi

    I just installed a Zabbix/PostgreSQL system and noticed that there are very frequent updates of one row in the 'ids' table. The effect is that even though I am only monitoring a single 24-port Cisco switch, the CPU load on my Zabbix server is permanently around 2. I turned on PostgreSQL's statistics collection and found out that the 'ids' table is being updated by far the most frequently. In fact it's so bad that the normal data collection is being delayed.
    I am using SVN checkout 4239.


    zabbix=# select usename,current_query from pg_stat_activity;
    usename | current_query
    ----------+------------------------------------------------------------------------------------------------
    zabbix | <IDLE>
    zabbix | update ids set nextid=nextid+1 where nodeid=0 and table_name='events' and field_name='eventid'
    zabbix | <IDLE>
    zabbix | update ids set nextid=nextid+1 where nodeid=0 and table_name='events' and field_name='eventid'
    postgres | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | update ids set nextid=nextid+1 where nodeid=0 and table_name='events' and field_name='eventid'
    zabbix | update ids set nextid=nextid+1 where nodeid=0 and table_name='events' and field_name='eventid'
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | select nextid from ids where nodeid=0 and table_name='events' and field_name='eventid'
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    (27 rows)




    zabbix=# select * from pg_stat_user_tables order by n_tup_upd desc limit 10;
    relid | schemaname | relname | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del
    -------+------------+----------------+----------+--------------+----------+---------------+-----------+-----------+-----------
    16448 | public | ids | 4 | 40 | 117204 | 117204 | 0 | 39070 | 0
    16982 | public | profiles | 0 | 0 | 592 | 556 | 0 | 146 | 0
    16898 | public | items | 4 | 6736 | 4907 | 11841 | 0 | 63 | 0
    17105 | public | triggers | 17 | 18394 | 1856 | 1856 | 0 | 48 | 0
    17059 | public | sessions | 38 | 112 | 0 | 0 | 1 | 18 | 0
    16771 | public | functions | 90 | 97380 | 4774 | 4774 | 0 | 4 | 0
    17049 | public | services_links | 0 | 0 | 0 | 0 | 0 | 0 | 0
    17164 | public | valuemaps | 0 | 0 | 0 | 0 | 0 | 0 | 0
    16425 | public | dhosts | 0 | 0 | 0 | 0 | 0 | 0 | 0
    16647 | public | history_uint | 0 | 0 | 6731 | 0 | 0 | 0 | 0
    (10 rows)


    Markus
  • Alexei
    Founder, CEO
    Zabbix Certified Trainer
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Sep 2004
    • 5654

    #2
    The SQL statements are supposed to be executed only before adding a new event. Do you really have so many events added every second?!

    Note that PostgreSQL is highly inefficient at doing many updates to a single table; it just creates new tuples instead of updating old data. Did you run vacuum, btw?
    Alexei Vladishev
    Creator of Zabbix, Product manager
    New York | Tokyo | Riga
    My Twitter


    • Markus
      Member
      • Jan 2006
      • 39

      #3
      Wow, that was a quick reply; I am still investigating.

      To give some more numbers: rows from this table are being fetched about 1800(!) times per minute, and there are about 600 update operations per minute.

      - After I set all hosts to 'not monitored', restarted Zabbix and PostgreSQL, and ran VACUUM, the problem went away.
      - Then I started monitoring one host (Cisco switch) again and the problem reappeared immediately.
      - Next I deleted the Cisco switch, re-added it, and got the same problem.
      - Now I have disabled the Cisco switch and enabled the Zabbix server host, and get the same problem again.

      The events table overall has only 275 rows.

      Markus


      Originally posted by Alexei
      The SQL statements are supposed to be executed only before adding a new event. Do you really have so many events added every second?!

      Note that PostgreSQL is highly inefficient at doing many updates to a single table; it just creates new tuples instead of updating old data. Did you run vacuum, btw?


      • Alexei
        Founder, CEO
        Zabbix Certified Trainer
        Zabbix Certified Specialist, Zabbix Certified Professional
        • Sep 2004
        • 5654

        #4
        Can you please set DebugLevel to 4 in the server's configuration file, restart ZABBIX, run it for 5 minutes, and send the full zabbix_server.log (gzipped!) to a l e x @ z a b b i x . c o m

        Thanks.


        • ekenberg
          Junior Member
          • Mar 2006
          • 21

          #5
          Any updates on this problem? I just upgraded from 1.1.4 to 1.4.1 and MySQL is getting totally swamped with queries:

          Code:
          update ids set nextid=nextid+1
            WHERE nodeid=0 and table_name='events' and field_name='eventid'
          Load on the server is over 50, it's absolutely crazy.


          • ekenberg
            Junior Member
            • Mar 2006
            • 21

            #6
            Update: after perhaps an hour or so (while I was digging in the source and adding debug logging), it suddenly stopped. I don't know why; could it be that some value needed to be reached (counting from zero) after the DB upgrade from 1.1 to 1.4?


            • ekenberg
              Junior Member
              • Mar 2006
              • 21

              #7
              The same bug hit again after restarting the server - load is now > 50 with around 100 threads doing
              Code:
              update ids 
                set nextid=nextid+1
                WHERE nodeid=0 and table_name='events' and field_name='eventid'
              and
              Code:
              select nextid
                FROM ids
                WHERE nodeid=0 and table_name='events' and field_name='eventid'
              Please note that this is happening on MySQL, so it's not specific to Postgres.
              I'm really hoping somebody can understand and fix this bug!


              • ekenberg
                Junior Member
                • Mar 2006
                • 21

                #8
                It's happening again, this is crazy!
                After rebooting the server for a hardware change, it's now again doing these (almost) endless updates to nextid. Load is > 50 since an hour back.

                Please, can't somebody help!


                • ekenberg
                  Junior Member
                  • Mar 2006
                  • 21

                  #9
                  BUG + PATCH: Race condition

                  There is a serious race condition causing a potentially eternal loop in zabbix-1.4.1/src/libs/zbxdbhigh/db.c in function DBget_maxid()

                  When starting zabbix_server, processes get stuck in:
                  Code:
                  #9  0x080721f6 in zbx_db_vexecute (
                      fmt=0x80a3da0 "update ids set nextid=nextid+1 where nodeid=%d and table_name='%s' and field_name='%s'",
                      args=0xbf6852c4 "") at db.c:370
                  #10 0x0806f5a6 in DBexecute (
                      fmt=0x80a3da0 "update ids set nextid=nextid+1 where nodeid=%d and table_name='%s' and field_name='%s'") at db.c:174
                  #11 0x08071e86 in DBget_maxid (table=0x809259f "events", field=0x8092597 "eventid") at db.c:1686
                  #12 0x08053523 in process_event (event=0xbf685380) at events.c:160
                  #13 0x0806fcd9 in DBupdate_trigger_value (trigger=0xbf685510, new_value=0, now=1185572669, reason=0x0) at db.c:533
                  #14 0x08053e7b in update_triggers (itemid=21426) at functions.c:306
                  #15 0x08059d20 in get_values () at poller.c:377
                  #16 0x0805a29e in main_poller_loop (type=0, num=7) at poller.c:526
                  #17 0x08055e16 in MAIN_ZABBIX_ENTRY () at server.c:700
                  #18 0x08066d9c in daemon_start (allow_root=0) at daemon.c:169
                  #19 0x08055aab in main (argc=1, argv=0xbf6873f4) at server.c:566
                  This is what happens in db.c's DBget_maxid():

                  A: DBget_maxid first does "select nextid from ids where nodeid=0 and table_name='events' and field_name='eventid'"

                  B: It finds a row, so it goes on to increment the same nextid by 1: "update ids set nextid=nextid+1 where nodeid=0 and table_name='events' and field_name='eventid'"

                  C: Then it does the select again: "select nextid from ids where nodeid=0 and table_name='events' and field_name='eventid'"

                  D: It now compares the results from A and C (ret1 and ret2): "if (ret1+1 == ret2) found = SUCCEED;" Under contention this is rarely true, so the loop goes back to the top again.

                  Between B and C, waiting parallel processes also update the counter, making ret2 - ret1 > 1. The problem manifests itself when starting the server, when a lot of threads rush to the same place in the code, and the looping continues until each thread by chance gets to do B and C without another thread updating in between.

                  I solved this for now by locking the table ids between B and C:
                  Code:
                  DBexecute("lock tables ids write");
                    /* do B and C */ 
                  DBexecute("unlock tables");
                  Here's the patch to add table locking:
                  Code:
                  --- db.c~       Fri Jun 29 22:50:26 2007
                  +++ db.c        Sat Jul 28 00:25:51 2007
                  @@ -1683,6 +1683,7 @@
                                          ZBX_STR2UINT64(ret1, row[0]);
                                          DBfree_result(result);
                  
                  +                      DBexecute("lock tables ids write");
                                          DBexecute("update ids set nextid=nextid+1 where nodeid=%d and table_name='%s' and field_name='%s'",
                                                  CONFIG_NODEID,
                                                  table,
                  @@ -1692,6 +1693,7 @@
                                                  CONFIG_NODEID,
                                                  table,
                                                  field);
                  +                      DBexecute("unlock tables");
                                          row = DBfetch(result);
                                          if(!row || DBis_null(row[0])==SUCCEED)
                                          {


                  • Alexei
                    Founder, CEO
                    Zabbix Certified Trainer
                    Zabbix Certified Specialist, Zabbix Certified Professional
                    • Sep 2004
                    • 5654

                    #10
                    Thanks for the fix. It is for MySQL and I'm not sure if it works unchanged for PostgreSQL, Oracle and SQLite.

                    Anyway, please let me know which storage engine you are using, InnoDB or MyISAM? The operations are supposed to be executed within a single transaction, so I do not quite understand why we have a "collision" here (provided we use PostgreSQL or MySQL InnoDB). Perhaps this is related to the SQL isolation level...

                    The other strange thing is that I do not see this happening in our test environment, regardless of configuration/size of database/number of hosts/whatever. I've been trying to reproduce this problem on a MySQL/InnoDB system, no luck so far.


                    • ekenberg
                      Junior Member
                      • Mar 2006
                      • 21

                      #11
                      I'm using MySQL with MyISAM. The original poster of this thread was using Postgres, so it seems that it is not unique to MyISAM.

                      Postgres uses a different syntax for locking, so I guess my quick fix is not compatible.


                      AFAIK, SQLite doesn't even have a LOCK statement.

                      You might check for the storage engine, but this would require you to *know* that the bug only hits MyISAM or Postgres:

                      Code:
                      // at startup:
                      IF using_postgres() THEN __db_use_lock_pgsql = TRUE;
                      ELSEIF using_mysql_with_myisam() THEN __db_use_lock_mysql = TRUE;

                      // in db.c DBget_maxid():
                      IF __db_use_lock_pgsql THEN "LOCK TABLE ids IN EXCLUSIVE MODE..."
                      ELSEIF __db_use_lock_mysql THEN "(un)lock tables..."
                      I admit it's a mess and I don't know enough to tell you exactly what to do with each DB-vendor. But it is a real and serious problem and should be dealt with.

                      More info on my setup:

                      Dual Intel(R) Xeon(TM) CPU 3.00GHz (i.e. SMP with four CPUs visible to Linux), 3 GB memory. Is your test system using SMP (increased concurrency)?

                      # Status of ZABBIX:
                      • Number of hosts (monitored/not monitored/templates/deleted) 184(155/20/9/0)
                      • Number of items (monitored/disabled/not supported)[trapper] 3340(3325/9/6)[135]
                      • Number of triggers (enabled/disabled)[true/unknown/false] 3939(3933/6)[7/0/3926]
                      • Number of events 13924
                      • Number of alerts 9758


                      # zabbix_server.conf:
                      StartPollers=60
                      StartPollersUnreachable=20
                      StartTrappers=60
                      StartPingers=20
                      StartHTTPPollers=60


                      • Alexei
                        Founder, CEO
                        Zabbix Certified Trainer
                        Zabbix Certified Specialist, Zabbix Certified Professional
                        • Sep 2004
                        • 5654

                        #12
                        Thanks for the additional details. This issue seems to be critical enough to be fixed in 1.4.2.


                        • alj
                          Senior Member
                          • Aug 2006
                          • 188

                          #13
                          Originally posted by Alexei
                          Thanks for the additional details. This issue seems to be critical enough to be fixed in 1.4.2.
                          If this bug doesn't affect InnoDB, it would be nice to have the choice not to apply the fix for InnoDB, because global table locking (where it is not needed) might have a serious performance hit (maybe the daemon can find out during startup whether the table is MyISAM and needs locking?).

                          IMO there is no point in using MyISAM anyway, because it has very bad write performance compared to InnoDB.


                          • mlange
                            Member
                            • Sep 2008
                            • 78

                            #14
                            We experienced a similar problem today when we moved our database to another host. Activating the server led to a full queue (15,000 delayed items) and the server only executed statements like "select nextid from ids where table_name='events'". The only way to get useful monitoring again was to activate every host step by step. Is this a known bug in the implementation? Why does the server get stuck when starting with a lot of active hosts/items/triggers?

                            Some numbers:
                            Number of hosts (monitored/not monitored/templates) 452 108 / 29 / 315
                            Number of items (monitored/disabled/not supported) 19311 16464 / 1378 / 1469
                            Number of triggers (enabled/disabled)[true/unknown/false] 16230 13025 / 3205 [11 / 30 / 12984]
                            Required server performance, new values per second 94.6798 -

                            The problem occurred on 1.6.3, 1.6.4 and 1.6.5.

                            Thanks
                            Marek
                            Last edited by mlange; 03-07-2009, 21:22. Reason: added version


                            • sid
                              Member
                              • Jul 2006
                              • 44

                              #15
                              Does anyone have any updates to this thread?

                              We just moved to 1.6.2 and noticed that there are too many updates running against that table, causing significant impact.

                              The disable and re-enable option is not sustainable as it eventually reaches a breaking point.

                              We have about 500 hosts monitored, with more to come eventually.

                              Thanks for any input.
                              Sid

