'ids' table very busy

  • Markus
    Member
    • Jan 2006
    • 39

    #1

    'ids' table very busy

    Hi

    I just installed a Zabbix/PostgreSQL system and noticed that there are very frequent updates of one row in the 'ids' table. The effect is that even though I am only monitoring a single 24-port Cisco switch, the CPU load on my Zabbix server is permanently around 2. I turned on PostgreSQL's statistics collection and found out that the 'ids' table is being updated by far the most frequently. In fact it's so bad that the normal data collection is being delayed.
    I am using SVN checkout 4239.


    zabbix=# select usename,current_query from pg_stat_activity;
    usename | current_query
    ----------+------------------------------------------------------------------------------------------------
    zabbix | <IDLE>
    zabbix | update ids set nextid=nextid+1 where nodeid=0 and table_name='events' and field_name='eventid'
    zabbix | <IDLE>
    zabbix | update ids set nextid=nextid+1 where nodeid=0 and table_name='events' and field_name='eventid'
    postgres | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | update ids set nextid=nextid+1 where nodeid=0 and table_name='events' and field_name='eventid'
    zabbix | update ids set nextid=nextid+1 where nodeid=0 and table_name='events' and field_name='eventid'
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | select nextid from ids where nodeid=0 and table_name='events' and field_name='eventid'
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    zabbix | <IDLE>
    (27 rows)




    zabbix=# select * from pg_stat_user_tables order by n_tup_upd desc limit 10;
    relid | schemaname | relname | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del
    -------+------------+----------------+----------+--------------+----------+---------------+-----------+-----------+-----------
    16448 | public | ids | 4 | 40 | 117204 | 117204 | 0 | 39070 | 0
    16982 | public | profiles | 0 | 0 | 592 | 556 | 0 | 146 | 0
    16898 | public | items | 4 | 6736 | 4907 | 11841 | 0 | 63 | 0
    17105 | public | triggers | 17 | 18394 | 1856 | 1856 | 0 | 48 | 0
    17059 | public | sessions | 38 | 112 | 0 | 0 | 1 | 18 | 0
    16771 | public | functions | 90 | 97380 | 4774 | 4774 | 0 | 4 | 0
    17049 | public | services_links | 0 | 0 | 0 | 0 | 0 | 0 | 0
    17164 | public | valuemaps | 0 | 0 | 0 | 0 | 0 | 0 | 0
    16425 | public | dhosts | 0 | 0 | 0 | 0 | 0 | 0 | 0
    16647 | public | history_uint | 0 | 0 | 6731 | 0 | 0 | 0 | 0
    (10 rows)


    Markus
  • Alexei
    Founder, CEO
    Zabbix Certified Trainer
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Sep 2004
    • 5654

    #2
    The SQL statements are supposed to be executed only before adding a new event. Do you really have so many events added every second?!

    Note that PostgreSQL is highly inefficient at doing many updates to a single table; it just creates new tuples instead of updating old data. Did you run vacuum, btw?
    Alexei Vladishev
    Creator of Zabbix, Product manager
    New York | Tokyo | Riga
    My Twitter


    • Markus
      Member
      • Jan 2006
      • 39

      #3
      Wow, that was a quick reply; I am still investigating.

      To give some more numbers: rows from this table are being fetched about 1800(!) times per minute, and there are about 600 update operations per minute.

      - After I set all hosts to 'not monitored', restarted Zabbix and PostgreSQL, and ran VACUUM, the problem went away.
      - Then I started monitoring one host (Cisco switch) again and the problem reappeared immediately.
      - Next I deleted the Cisco switch, re-added it, and got the same problem.
      - Now I have disabled the Cisco switch and enabled the Zabbix server host, and get the same problem again.

      The events table overall has only 275 rows.

      Markus


      Originally posted by Alexei
      The SQL statements are supposed to be executed only before adding a new event. Do you really have so many events added every second?!

      Note that PostgreSQL is highly inefficient at doing many updates to a single table; it just creates new tuples instead of updating old data. Did you run vacuum, btw?


      • Alexei
        Founder, CEO
        Zabbix Certified Trainer
        Zabbix Certified Specialist, Zabbix Certified Professional
        • Sep 2004
        • 5654

        #4
        Can you please set DebugLevel to 4 in the server's configuration file, restart ZABBIX, run it for 5 minutes, and send the full zabbix_server.log (gzipped!) to a l e x @ z a b b i x . c o m

        Thanks.


        • ekenberg
          Junior Member
          • Mar 2006
          • 21

          #5
          Any updates on this problem? I just upgraded from 1.1.4 to 1.4.1 and MySQL is getting totally swamped with queries:

          Code:
          update ids set nextid=nextid+1
            WHERE nodeid=0 and table_name='events' and field_name='eventid'
          Load on the server is over 50, it's absolutely crazy.


          • ekenberg
            Junior Member
            • Mar 2006
            • 21

            #6
            Update: after perhaps an hour or so (while I was digging in the source and adding debug logging), it suddenly stopped. I don't know why; could it be that some value needed to be reached (counting from zero) after the DB upgrade from 1.1 to 1.4?


            • ekenberg
              Junior Member
              • Mar 2006
              • 21

              #7
              The same bug hit again after restarting the server - load is now > 50 with around 100 threads doing
              Code:
              update ids 
                set nextid=nextid+1
                WHERE nodeid=0 and table_name='events' and field_name='eventid'
              and
              Code:
              select nextid
                FROM ids
                WHERE nodeid=0 and table_name='events' and field_name='eventid'
              Please note that this is happening on MySQL, so it's not specific to Postgres.
              I'm really hoping somebody can understand and fix this bug!


              • ekenberg
                Junior Member
                • Mar 2006
                • 21

                #8
                It's happening again, this is crazy!
                After rebooting the server for a hardware change, it's now again doing these (almost) endless updates to nextid. Load is > 50 since an hour back.

                Please, can't somebody help!


                • ekenberg
                  Junior Member
                  • Mar 2006
                  • 21

                  #9
                  BUG + PATCH: Race condition

                  There is a serious race condition causing a potentially eternal loop in zabbix-1.4.1/src/libs/zbxdbhigh/db.c in function DBget_maxid()

                  When starting zabbix_server, processes get stuck in:
                  Code:
                  #9  0x080721f6 in zbx_db_vexecute (
                      fmt=0x80a3da0 "update ids set nextid=nextid+1 where nodeid=%d and table_name='%s' and field_name='%s'",
                      args=0xbf6852c4 "") at db.c:370
                  #10 0x0806f5a6 in DBexecute (
                      fmt=0x80a3da0 "update ids set nextid=nextid+1 where nodeid=%d and table_name='%s' and field_name='%s'") at db.c:174
                  #11 0x08071e86 in DBget_maxid (table=0x809259f "events", field=0x8092597 "eventid") at db.c:1686
                  #12 0x08053523 in process_event (event=0xbf685380) at events.c:160
                  #13 0x0806fcd9 in DBupdate_trigger_value (trigger=0xbf685510, new_value=0, now=1185572669, reason=0x0) at db.c:533
                  #14 0x08053e7b in update_triggers (itemid=21426) at functions.c:306
                  #15 0x08059d20 in get_values () at poller.c:377
                  #16 0x0805a29e in main_poller_loop (type=0, num=7) at poller.c:526
                  #17 0x08055e16 in MAIN_ZABBIX_ENTRY () at server.c:700
                  #18 0x08066d9c in daemon_start (allow_root=0) at daemon.c:169
                  #19 0x08055aab in main (argc=1, argv=0xbf6873f4) at server.c:566
                  This is what happens in db.c's DBget_maxid():

                  A: DBget_maxid first does "select nextid from ids where nodeid=0 and table_name='events' and field_name='eventid'"

                  B: It finds a row, so it goes on to increment the same nextid by 1: "update ids set nextid=nextid+1 where nodeid=0 and table_name='events' and field_name='eventid'"

                  C: Then it does the select again: "select nextid from ids where nodeid=0 and table_name='events' and field_name='eventid'"

                  D: It now compares the results from A and C (ret1 and ret2): "if (ret1+1 == ret2) found = SUCCEED;" Under contention this is rarely true, so the loop goes back to the top again.

                  Between B and C, waiting parallel processes also update the counter, making ret2 - ret1 > 1. The problem manifests itself when starting the server, when a lot of threads rush to the same place in the code, and the looping continues until each thread by chance gets to do B and C without another thread updating in between.

                  I solved this for now by locking the table ids between B and C:
                  Code:
                  DBexecute("lock tables ids write");
                    /* do B and C */ 
                  DBexecute("unlock tables");
                  Here's the patch to add table locking:
                  Code:
                  --- db.c~       Fri Jun 29 22:50:26 2007
                  +++ db.c        Sat Jul 28 00:25:51 2007
                  @@ -1683,6 +1683,7 @@
                                          ZBX_STR2UINT64(ret1, row[0]);
                                          DBfree_result(result);
                  
                  +                      DBexecute("lock tables ids write");
                                          DBexecute("update ids set nextid=nextid+1 where nodeid=%d and table_name='%s' and field_name='%s'",
                                                  CONFIG_NODEID,
                                                  table,
                  @@ -1692,6 +1693,7 @@
                                                  CONFIG_NODEID,
                                                  table,
                                                  field);
                  +                      DBexecute("unlock tables");
                                          row = DBfetch(result);
                                          if(!row || DBis_null(row[0])==SUCCEED)
                                          {


                  • Alexei
                    Founder, CEO
                    Zabbix Certified Trainer
                    Zabbix Certified Specialist, Zabbix Certified Professional
                    • Sep 2004
                    • 5654

                    #10
                    Thanks for the fix. It is for MySQL and I'm not sure if it works unchanged for PostgreSQL, Oracle and SQLite.

                    Anyway, please let me know which storage engine you are using, InnoDB or MyISAM? The operations are supposed to be executed within a single transaction, so I do not quite understand why we have a "collision" here (provided we use PostgreSQL or MySQL InnoDB). Perhaps this is related to the SQL isolation level...

                    The other strange thing is that I do not see this happening in our test environment, regardless of configuration/size of database/number of hosts/whatever. I've been trying to reproduce this problem on a MySQL/InnoDB system, no luck so far.


                    • ekenberg
                      Junior Member
                      • Mar 2006
                      • 21

                      #11
                      I'm using MySQL with MyISAM. The original poster of this thread was using Postgres, so it seems that it is not unique to MyISAM.

                      Postgres uses a different syntax for locking, so I guess my quick fix is not compatible.


                      AFAIK, SQLite doesn't even have a LOCK statement.

                      You might check for the storage engine, but this would require you to *know* that the bug only hits MyISAM or Postgres:

                      Code:
                      // at startup:
                      IF using_postgres() THEN __db_use_lock_pgsql = TRUE;
                      ELSEIF using_mysql_with_myisam() THEN __db_use_lock_mysql = TRUE;

                      // in db.c DBget_maxid():
                      IF __db_use_lock_pgsql THEN "LOCK TABLE ids IN EXCLUSIVE MODE..."
                      ELSEIF __db_use_lock_mysql THEN "(un)lock tables..."
                      I admit it's a mess and I don't know enough to tell you exactly what to do with each DB-vendor. But it is a real and serious problem and should be dealt with.

                      More info on my setup:

                      Dual Intel(R) Xeon(TM) CPU 3.00GHz (i.e. SMP with four CPUs visible to Linux), 3 GB memory. Is your test system using SMP (increased concurrency)?

                      # Status of ZABBIX:
                      • Number of hosts (monitored/not monitored/templates/deleted) 184(155/20/9/0)
                      • Number of items (monitored/disabled/not supported)[trapper] 3340(3325/9/6)[135]
                      • Number of triggers (enabled/disabled)[true/unknown/false] 3939(3933/6)[7/0/3926]
                      • Number of events 13924
                      • Number of alerts 9758


                      # zabbix_server.conf:
                      StartPollers=60
                      StartPollersUnreachable=20
                      StartTrappers=60
                      StartPingers=20
                      StartHTTPPollers=60


                      • Alexei
                        Founder, CEO
                        Zabbix Certified Trainer
                        Zabbix Certified Specialist, Zabbix Certified Professional
                        • Sep 2004
                        • 5654

                        #12
                        Thanks for the additional details. This issue seems to be critical enough to be fixed in 1.4.2.


                        • alj
                          Senior Member
                          • Aug 2006
                          • 188

                          #13
                          Originally posted by Alexei
                          Thanks for the additional details. This issue seems to be critical enough to be fixed in 1.4.2.
                          If this bug doesn't affect InnoDB, it would be nice to have the choice not to apply the fix for InnoDB, because global table locking (where it is not needed) might have a serious performance hit (maybe the daemon can find out during startup whether the table is MyISAM and needs locking?).

                          IMO there is no point in using MyISAM anyway, because it has very bad write performance compared to InnoDB.


                          • mlange
                            Member
                            • Sep 2008
                            • 78

                            #14
                            We experienced a similar problem today when we moved our database to another host. Activating the server led to a full queue (15,000 delayed items) and the server only executed statements like "select nextid from ids where table_name='events'". The only way to get useful monitoring again was to activate every host step by step. Is this a known bug in the implementation? Why does the server get stuck when starting with a lot of active hosts/items/triggers?

                            Some numbers:
                            Number of hosts (monitored/not monitored/templates) 452 108 / 29 / 315
                            Number of items (monitored/disabled/not supported) 19311 16464 / 1378 / 1469
                            Number of triggers (enabled/disabled)[true/unknown/false] 16230 13025 / 3205 [11 / 30 / 12984]
                            Required server performance, new values per second 94.6798 -

                            The problem occurred on 1.6.3, 1.6.4 and 1.6.5.

                            Thanks
                            Marek
                            Last edited by mlange; 03-07-2009, 21:22. Reason: added version


                            • sid
                              Member
                              • Jul 2006
                              • 44

                              #15
                              Does anyone have any updates to this thread?

                              We just moved to 1.6.2 and noticed that there are too many updates running against that table, causing significant impact.

                              The disable and re-enable option is not sustainable as it eventually reaches a breaking point.

                              We have about 500 hosts monitored, with more to come eventually.

                              Thanks for any input.
                              Sid

