Bug report: Master zabbix server crashes on receipt of invalid string data

  • erozen
    Junior Member
    Zabbix Certified Specialist
    • Apr 2007
    • 18

    #1

    Bug report: Master zabbix server crashes on receipt of invalid string data

    I'd post this in the proper place, but I don't have a support contract. I figured you'd still want to know about it - feel free to delete this thread once it's in your issue tracking software.

    Version: 1.4.5 (both servers, and node)

    Setup: 2 zabbix servers in a 2-node hierarchy, at least one host being monitored by the secondary server.

    Symptoms: Primary zabbix server crashes, seemingly randomly. Happens approximately once per minute.

    Cause: The primary server is receiving history_str data from the secondary for an item that no longer exists.

    Repro: Build a 2-tier server setup, and have a node reporting some string data to the slave. Delete the item, but time it such that the item doesn't exist in either server's items table, while there is still data for it in the secondary's history_str table that hasn't yet been replicated. Upon the replication attempt, the primary will crash.
  • Niels
    Senior Member
    • May 2007
    • 239

    #2
    I've also had my server crash on faulty data from a node. I made a post about it which was ignored.


    • vinny
      Senior Member
      • Jan 2008
      • 145

      #3
      I have seen this behaviour too...
      the problem is that the zabbix server process doesn't seem to check or cast values before updating the db...
      so when the db returns an error, the server crashes

      For me, I have to truncate the history tables sometimes...
      -------
      Zabbix 1.8.3, 1200+ Hosts, 40 000+ Items...zabbix's everywhere


      • ptietjens
        Junior Member
        • May 2008
        • 5

        #4
        Ditto

        I just discovered this little bug in the 1.4.5 code. It took me some time to figure out why the server was suddenly stopping when receiving node data. Not something that fills me with confidence going forward with a server meant to ADD reliability to a network.


        • erozen
          Junior Member
          Zabbix Certified Specialist
          • Apr 2007
          • 18

          #5
          Something else that's always bothered me - if one thread dies (for whatever reason), the WHOLE server/agent shuts down! What's the rationale for this? Surely, it would make sense to attempt to restart that thread, and if the thread keeps dying, shut it down, notify the admin, and keep the rest running? I don't want my monitoring system turning off at 3am because a string was one byte longer than expected....


          In the same thread - if an item doesn't return a valid value, the server stops monitoring it. So, if I'm tweaking my config and temporarily break something - it's no longer being monitored until I turn it back on! On a couple of occasions, we've had major problems and downtime, and our ops team has asked 'why didn't we get an alert?' - it's rather embarrassing to have to say 'Our monitoring system turned that metric off at some point in the past, for some unknown reason'


          • cstackpole
            Senior Member
            Zabbix Certified Specialist
            • Oct 2006
            • 225

            #6
            Originally posted by Niels
            I made a post about it which was ignored.
            Unfortunately there are too many of those ignored posts.

            Originally posted by erozen
            Something else that's always bothered me - if one thread dies (for whatever reason), the WHOLE server/agent shuts down! What's the rationale for this? Surely, it would make sense to attempt to restart that thread, and if the thread keeps dying, shut it down, notify the admin, and keep the rest running? I don't want my monitoring system turning off at 3am because a string was one byte longer than expected....


            In the same thread - if an item doesn't return a valid value, the server stops monitoring it. So, if I'm tweaking my config and temporarily break something - it's no longer being monitored until I turn it back on! On a couple of occasions, we've had major problems and downtime, and our ops team has asked 'why didn't we get an alert?' - it's rather embarrassing to have to say 'Our monitoring system turned that metric off at some point in the past, for some unknown reason'
            THIS! I cry a bit on the inside whenever I have to say 'Our monitoring system turned that metric off at some point in the past, for some unknown reason'. Of course I do my best to figure out the why and fix it, but it is hard to catch them when you have ~70 items on each of ~30 hosts. I am currently fighting this problem as one trigger (just a single trigger, on one host that is no different than any of the other hosts and triggers) turns itself off randomly about 3 times a month. Inevitably I get an angry phone call within an hour of it turning off (it is a very chatty trigger and people know when it stops). I still don't have a proper answer for them. If this wasn't a chatty trigger I probably wouldn't ever know, which leads me to worry about the triggers that aren't chatty. Are they still running right? This is the sole reason I am building my own regression tests, which I will use ~once a month to test as many triggers as I can.

            I am of the personal opinion that the reliability should be priority #1 and that it should take an act of God (or at very least root) to take down the zabbix server. I am seeing improved stability and performance on a regular basis so I am happy at the moment.


            • erozen
              Junior Member
              Zabbix Certified Specialist
              • Apr 2007
              • 18

              #7
              I should explain my workaround/fix a little better.

              On both (all) zabbix databases, you need to run the following sql:
              Code:
              DELETE FROM history_str WHERE itemid NOT IN (SELECT itemid FROM items);
              DELETE FROM history_uint WHERE itemid NOT IN (SELECT itemid FROM items);
              DELETE FROM history_str_sync WHERE itemid NOT IN (SELECT itemid FROM items);
              DELETE FROM history_uint_sync WHERE itemid NOT IN (SELECT itemid FROM items);
              DELETE FROM history_sync WHERE itemid NOT IN (SELECT itemid FROM items);
              DELETE FROM history_log WHERE itemid NOT IN (SELECT itemid FROM items);
              DELETE FROM history WHERE itemid NOT IN (SELECT itemid FROM items);
              This will clear out all the orphan history data. In theory it should be 0 rows for every table, but in practice you'll have a lot. I had about 250,000 in my string table - so this is good housekeeping, even if nothing is wrong!

              Zabbix chaps:
              • Is it worth doing this sort of thing in the housekeeper process?
              • Are there any other SQL integrity checks like this I can do to clean out bad data?
              • Foreign key constraints and cascading deletes would really help here.
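              To illustrate the foreign key point: on an InnoDB schema, a constraint along these lines (just a sketch - the constraint name is my invention, and I believe the stock history tables aren't InnoDB) would make the database drop orphans by itself:
              Code:
              ALTER TABLE history_str
                ADD CONSTRAINT fk_history_str_itemid
                FOREIGN KEY (itemid) REFERENCES items (itemid)
                ON DELETE CASCADE;
              With that in place, deleting an item would remove its history_str rows at the same time, and there'd be nothing left to orphan.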


              Now, the only problem with those DELETE statements is that they're really slow. Especially if you have lots of items, and especially if you use mysql, which optimises "IN ()" really badly. You can help with the last one by rewriting the query to use a join instead of a subselect - but this is much easier to mistype or otherwise get wrong, so I don't recommend it. If you need to though, the syntax is:
              Code:
              DELETE h FROM history h LEFT JOIN items i ON h.itemid = i.itemid WHERE i.itemid IS NULL;

              Depending on the DBMS, you will also lock tables - so your ability to monitor servers is diminished while you do this.
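              If the locking worries you, another option (an untested sketch, MySQL syntax, using the slow-but-safe IN () form, since mysql doesn't allow LIMIT on a multi-table delete) is to delete in small batches, so each statement only holds its locks briefly:
              Code:
              -- run repeatedly until it reports 0 rows affected
              DELETE FROM history
              WHERE itemid NOT IN (SELECT itemid FROM items)
              LIMIT 10000;
              You'd repeat that for each history table. Monitoring still slows down, but in short bursts rather than one long outage.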


              • Niels
                Senior Member
                • May 2007
                • 239

                #8
                Here's an example of the two methods:
                Code:
                mysql> SELECT COUNT(*) FROM history_uint WHERE itemid NOT IN (SELECT itemid FROM items);
                +----------+
                | count(*) |
                +----------+
                |  4721997 |
                +----------+
                1 row in set (9 min 34.17 sec)
                
                mysql> SELECT COUNT(*) FROM history_uint LEFT JOIN items ON history_uint.itemid=items.itemid WHERE items.itemid IS NULL;
                +----------+
                | COUNT(*) |
                +----------+
                |  4722488 |
                +----------+
                1 row in set (1 min 24.25 sec)
                The numbers don't add up, so maybe the two queries don't match.


                • erozen
                  Junior Member
                  Zabbix Certified Specialist
                  • Apr 2007
                  • 18

                  #9
                  Interesting.

                  I wonder if there are still new values going in there? The fact that the numbers differ by only 500 out of 4 million I find... suspicious. :-) What does a vanilla count(*) give you?

                  I ran it against my database, and got this:
                  Code:
                  mysql> SELECT COUNT(*) FROM history_uint WHERE itemid NOT IN (SELECT itemid FROM items);
                  +----------+
                  | COUNT(*) |
                  +----------+
                  |   114938 |
                  +----------+
                  1 row in set (12 min 26.51 sec)
                  
                  mysql> SELECT COUNT(*) FROM history_uint LEFT JOIN items ON history_uint.itemid=items.itemid WHERE items.itemid IS NULL;
                  
                  +----------+
                  | COUNT(*) |
                  +----------+
                  |   114938 |
                  +----------+
                  1 row in set (3 min 49.86 sec)
                  
                  mysql> SELECT COUNT(*) FROM history_uint;
                  +----------+
                  | COUNT(*) |
                  +----------+
                  | 15683505 |
                  +----------+
                  1 row in set (3 min 21.01 sec)
                  
                  mysql>

                  Could you try either adding a datetime restriction (don't forget, the housekeeper runs as well, so it needs a lower and an upper bound), or wrapping both queries in a 'LOCK TABLE/UNLOCK TABLE' or 'BEGIN WORK/ROLLBACK WORK', depending on your engine?
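                  For the LOCK TABLE option, I mean something like this (a sketch for MyISAM-style explicit locks - both tables have to be locked, since mysql only lets you touch locked tables inside the block):
                  Code:
                  LOCK TABLES history_uint READ, items READ;
                  SELECT COUNT(*) FROM history_uint WHERE itemid NOT IN (SELECT itemid FROM items);
                  SELECT COUNT(*) FROM history_uint LEFT JOIN items ON history_uint.itemid=items.itemid WHERE items.itemid IS NULL;
                  UNLOCK TABLES;
                  That way nothing can insert rows between the two counts, so they should agree.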


                  As a side note - I'm actually rather upset that I have 115,000 bad rows, only a matter of weeks since I cleaned them out! I don't *think* I've changed any items since I spring-cleaned, but....


                  • Niels
                    Senior Member
                    • May 2007
                    • 239

                    #10
                    I made a simple restriction on the clock field:
                    Code:
                    mysql> select count(*) FROM history_uint WHERE clock<1213177652 AND itemid NOT IN (SELECT itemid FROM items);
                    +----------+
                    | count(*) |
                    +----------+
                    |  4724668 |
                    +----------+
                    1 row in set (4 min 37.80 sec)
                    
                    mysql> SELECT COUNT(*) FROM history_uint LEFT JOIN items ON history_uint.itemid=items.itemid WHERE history_uint.clock<1213177652 AND items.itemid IS NULL;
                    +----------+
                    | COUNT(*) |
                    +----------+
                    |  4724668 |
                    +----------+
                    1 row in set (1 min 27.73 sec)
                    This would seem to indicate that faulty data are still being added, but we'll know that for sure when I remove the currently orphaned data.


                    • Niels
                      Senior Member
                      • May 2007
                      • 239

                      #11
                      OK, I've been looking through the history tables, and it looks like I'll be able to remove "millions and millions" of records. That's certainly enticing, I just need to build up some courage before I go through with it...


                      • erozen
                        Junior Member
                        Zabbix Certified Specialist
                        • Apr 2007
                        • 18

                        #12
                        I think a little mysqldump might be in order beforehand... :-)
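                        Or, if a full dump is too big, you could stash just the rows you're about to delete (a sketch - the _orphans table name is my invention):
                        Code:
                        CREATE TABLE history_uint_orphans AS
                          SELECT h.*
                          FROM history_uint h
                          LEFT JOIN items i ON h.itemid = i.itemid
                          WHERE i.itemid IS NULL;
                        If anything goes wrong you can INSERT...SELECT them straight back; otherwise just drop the table once you're happy.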

                        I would be very interested to see some input from the zabbix guys on this - they may have some good ideas on other ways to deal with it, non-obvious dangers, potential for serious damage, performance etc etc


                        Good luck with your expunging - let us know how it works out once you've plucked up the courage (I'd be tempted by the Dutch variety myself)!


                        • qix
                          Senior Member
                          Zabbix Certified Specialist
                          Zabbix Certified Professional
                          • Oct 2006
                          • 423

                          #13
                          Could this issue be related?

                          With kind regards,

                          Raymond


                          • erozen
                            Junior Member
                            Zabbix Certified Specialist
                            • Apr 2007
                            • 18

                            #14
                            Wow, that's a curly one. It could well be related - sounds like the kind of thing you could spend weeks trying to debug...

                            Are you able to reproduce it at all?


                            • qix
                              Senior Member
                              Zabbix Certified Specialist
                              Zabbix Certified Professional
                              • Oct 2006
                              • 423

                              #15
                              Nope, I can't anymore.

                              The whole distributed nodes thing was way too buggy for my taste, so I dropped the whole concept. I reinstalled my nodes as two standalone servers.

                              I think I'll try distributed nodes again when 1.6 is released.
                              I will probably go for the 'ZABBIX proxy' setup, where a 'dumb' box collects data and sends it to the central server.
                              With kind regards,

                              Raymond
