**bbrendon** · 06-03-2008, 18:57

This may be related.

http://www.zabbix.com/forum/showthread.php?t=9013

It used to happen about once a week. It has now happened for the past two nights. It always happens late at night, but not at the same time.

**xs-** · 07-03-2008, 17:03

I have the same problem. The zabbix server processes still run (some at 80%-100% cpu usage), all nodata triggers fire, and it seems almost no data is received.
I am using the postgresql backend. I have tried to search for problems in logfiles but nothing so far.

Setup:
Master node with 3 remote (distributed)
DB server: 2x dual core xeon, 4G ram
Zabbix server: 1x dualcore xeon, 3G ram (also runs webfrontend)
Master server has about 550 monitored hosts, nodes send data for +-150 hosts
Our main problem is the disk utilization on the database server, caused by a lack of disk spindles. We only have 2 disks in mirror mode.

One thing we noticed is that the database server had a load of 8+.
We've been busy trying to increase performance on the database server, with some success: load is now 2.5-4 with much better response / mem usage.
So far (fingers crossed) it seems to have helped, but I'll wait until Monday before I start to cheer.

For those who are interested in the postgresql tweaks
postgresql.conf:
- work_mem = 4MB # don't know if this helped
- sync = off # don't know if this helped
- checkpoint_segments=6 # this helped!
- enable_seqscan = off # i think this helped!

Also remount postgresql's database partition with the noatime option:
mount -o remount,noatime <mountpoint>

The last thing we did was lower the number of zabbix_server processes:
StartPollers=3
StartPollersUnreachable=1
StartTrappers=3
StartPingers=1
StartDiscoverers=0 # we don't use discovery
StartHTTPPollers=1
This way you will have fewer concurrent connections and thus fewer concurrent queries, which will shorten the queues.

The following is also interesting to see.
Modify / uncomment the following line in postgresql.conf:
log_min_duration_statement = 2000
All queries with an execution time over 2 secs will show up in the logfile. The above changes should lower the amount of queries that show up.
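
To get a quick overview of what ends up in the logfile, something like the sketch below could help. It assumes the usual "duration: ... ms  statement: ..." layout PostgreSQL writes for slow statements; the function name and the sample log lines are made up for illustration, so adjust to your own log_line_prefix.

```python
import re

# Matches the "duration: <ms> ms  statement: <sql>" part of a log line.
DURATION_RE = re.compile(r"duration: (\d+(?:\.\d+)?) ms\s+statement: (.*)")


def slow_statements(lines, threshold_ms=2000.0):
    """Return (duration_ms, statement) pairs at or above threshold_ms, slowest first."""
    hits = []
    for line in lines:
        match = DURATION_RE.search(line)
        if match and float(match.group(1)) >= threshold_ms:
            hits.append((float(match.group(1)), match.group(2).strip()))
    return sorted(hits, reverse=True)


# Hypothetical sample lines, not taken from a real server.
sample = [
    "2008-03-07 02:13:05 CET LOG:  duration: 5321.114 ms  statement: SELECT * FROM history WHERE itemid=1",
    "2008-03-07 02:13:09 CET LOG:  duration: 2100.500 ms  statement: UPDATE items SET lastvalue=1",
    "2008-03-07 02:13:10 CET LOG:  checkpoints are occurring too frequently",
]

for duration, statement in slow_statements(sample):
    print(f"{duration:10.3f} ms  {statement}")
```

For a real logfile, pass the open file object instead of the sample list.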

Hope this helps

**bbrendon** · 08-03-2008, 09:30

I'm running mysql. If we're having the same problem, at least we know it's not the database portion.

**xs-** · 10-03-2008, 11:21

Ok, so after the above changes, and reindexing the database, the db server load did drop considerably but . . . . the problem still remains, although it happens less often.
So it's indeed not db related.

To sum things up:
- Not database type / speed related (so far)
- All nodata triggers fire during the problem, which tells us:
--- Trapper doesn't receive data from active checks
--- The trigger evaluation and alert scripts execution still work
- Normal polls, i.e. snmp checks still work (need to confirm this tho)
- between 1 and 4 processes (different each time) will show up in top using 50%-100% cpu usage. 0.5% is normal.

So it seems the issue lies with the trapper portion of the zabbix_server.

@developers
Any known issues, does the above ring a bell?

**Alexei** · 10-03-2008, 14:13

Originally posted by xs-

- between 1 and 4 processes (different each time) will show up in top using 50%-100% cpu usage. 0.5% is normal.

It looks very much like a problem we fixed in pre-1.4.5. Under some circumstances, on connection loss, a ZABBIX trapper process may go into an endless loop doing the accept() system call.

You may run strace -p <process PID> to see what the 100% CPU process is actually doing.

**xs-** · 11-03-2008, 11:02

Hi,

Ok, so I waited until the next occurrence of the problem (of course it wouldn't trigger for a loong time). This is the strace output:

-----------------------8<-----------------------------------
accept(4, 0x7fff3b811770, [6129680576417890320]) = -1 EBADF (Bad file descriptor)
read(4, 0x7fff3b8117d8, 5) = -1 EBADF (Bad file descriptor)
accept(4, 0x7fff3b811770, [6129680576417890320]) = -1 EBADF (Bad file descriptor)
read(4, 0x7fff3b8117d8, 5) = -1 EBADF (Bad file descriptor)
accept(4, 0x7fff3b811770, [6129680576417890320]) = -1 EBADF (Bad file descriptor)
read(4, 0x7fff3b8117d8, 5) = -1 EBADF (Bad file descriptor)
accept(4, 0x7fff3b811770, [6129680576417890320]) = -1 EBADF (Bad file descriptor)
read(4, 0x7fff3b8117d8, 5) = -1 EBADF (Bad file descriptor)
accept(4, 0x7fff3b811770, [6129680576417890320]) = -1 EBADF (Bad file descriptor)
read(4, 0x7fff3b8117d8, 5) = -1 EBADF (Bad file descriptor)
-----------------------8<-----------------------------------
You get the idea . . . .
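
For anyone wondering why this eats 100% cpu: accept() on a bad descriptor fails immediately instead of blocking, so a loop that just retries never sleeps. The sketch below is NOT the zabbix_server code or its fix, only a small Python illustration of the pattern. The names (FATAL_ERRNOS, accept_loop) are made up, and treating EBADF as fatal is my assumption about a sensible way out.

```python
import errno
import socket

# Errors after which retrying accept() on the same descriptor cannot succeed.
FATAL_ERRNOS = {errno.EBADF, errno.ENOTSOCK}


def accept_loop(listener, max_iterations):
    """Call accept() until a fatal error occurs; return how many calls were made."""
    calls = 0
    while calls < max_iterations:
        calls += 1
        try:
            conn, _addr = listener.accept()
        except OSError as exc:
            if exc.errno in FATAL_ERRNOS:
                break  # descriptor is gone, stop instead of spinning
            continue  # transient error, try again
        conn.close()
    return calls


# Simulate the broken state: the listening socket has already been closed.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.close()

calls = accept_loop(listener, max_iterations=1000)
print(calls)
```

Without the FATAL_ERRNOS check the loop would run all 1000 iterations, each one failing instantly, which matches the repeating accept()/EBADF pattern in the strace output.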

Alexei, is this the issue you are talking about?

And the most important question, how production-ready is 1.4.5-pre (yes yes the -pre kinda answers it but still . . .)

Thanks!

**bbrendon** · 12-03-2008, 19:48

You should have searched the forums

http://www.zabbix.com/forum/showthread.php?t=8703

I didn't do anything but lead you to water... I'll have beer instead. heh.

**xs-** · 13-03-2008, 12:16

Yeah I tried of course, but apparently didn't search for the correct words

Problem's fixed with 1.4.5-pre (used nightly release from website->dev)

Thanks!

**bbrendon** · 13-03-2008, 18:56

Originally posted by xs-

Yeah I tried of course, but apparently didn't search for the correct words

Problem's fixed with 1.4.5-pre (used nightly release from website->dev)

Thanks!

Use google to search. I find it works much better.

**bbrendon** · 19-03-2008, 06:36

This issue seems to occur when the database is very busy.

The times when data is no longer collected are close to the times of the slow queries listed in MySQL's slow query log!
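
A rough way to check this kind of correlation is sketched below. The timestamps are invented examples (not from my logs), and the 5 minute window and the function name are arbitrary choices; you would fill the lists from the slow query log and from wherever you note the collection gaps.

```python
from datetime import datetime, timedelta


def outages_near_slow_queries(outages, slow_queries, window=timedelta(minutes=5)):
    """Return the outage times that have at least one slow query within `window`."""
    return [
        outage
        for outage in outages
        if any(abs(outage - query) <= window for query in slow_queries)
    ]


# Invented example timestamps.
outages = [datetime(2008, 3, 18, 2, 10), datetime(2008, 3, 18, 14, 0)]
slow_queries = [datetime(2008, 3, 18, 2, 8), datetime(2008, 3, 18, 9, 30)]

matched = outages_near_slow_queries(outages, slow_queries)
print(f"{len(matched)} of {len(outages)} outages had a slow query nearby")
```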

**bbrendon** · 19-03-2008, 19:59

The title of this thread isn't accurate. Please move the discussion to this thread:

http://www.zabbix.com/forum/showthread.php?t=8769

Zabbix 1.4.4 malfunctions/dies, nothing found in logs yet
