Zabbix 1.1a9 - agentd-crashes


raa
21-05-2005, 06:42
hi,

I am testing 1.1a9 and have found some problems in zabbix_agentd, maybe pointer mistakes?

Systems: Debian woody, Debian testing
I have configured most of the items as 'active'.

1) crash after deactivating an active check:
010036:20050520:230733 Active check [uffers]] is not supported. Disabled.
010026:20050520:230833 One child process died. Exiting ...
010033:20050520:230833 Got signal. Exiting ...

It seems there is a pointer problem, because the 'b' of 'buffers' is missing in this message.

2) same problem here, but without a crash of agentd:
016523:20050521:054147 Active check [wblk]] is not supported. Disabled.
016523:20050521:060748 Active check [_wio]] is not supported. Disabled.
016523:20050521:060948 Active check [ared]] is not supported. Disabled.
016523:20050521:061150 Active check [_rio]] is not supported. Disabled.
016523:20050521:061348 Active check [_wio]] is not supported. Disabled.
016523:20050521:061848 Active check [_wio]] is not supported. Disabled.
016523:20050521:062348 Active check [_wio]] is not supported. Disabled.

3) One time the logfile was filled with more than 10000 of these messages:
007821:20050520:210023 No sleeping
007821:20050520:210023 No sleeping
007821:20050520:210023 No sleeping

agentd was using 80% CPU time at that point (until I killed it).
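To put a number on a flood like the one in 3), the repeats can be counted per process and per second. This is only a minimal sketch: the line layout (PID:YYYYMMDD:HHMMSS message) is inferred from the excerpts above, and the function name and threshold are made up for illustration, not part of Zabbix.

```python
import re
from collections import Counter

# Assumed line layout, inferred from the excerpts: PID:YYYYMMDD:HHMMSS message
LINE_RE = re.compile(r"^(\d+):(\d{8}):(\d{6}) (.*)$")


def busiest_seconds(lines, message="No sleeping", threshold=3):
    """Return {(pid, date, time): count} for every second in which one
    process logged `message` at least `threshold` times."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group(4) == message:
            counts[(m.group(1), m.group(2), m.group(3))] += 1
    return {key: n for key, n in counts.items() if n >= threshold}


sample = [
    "007821:20050520:210023 No sleeping",
    "007821:20050520:210023 No sleeping",
    "007821:20050520:210023 No sleeping",
]
print(busiest_seconds(sample))
```

Many identical entries from one PID inside a single second would be consistent with a loop that never sleeps, which fits the high CPU usage.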


I have been using older versions of Zabbix for about 2 years; it seems 1.1 will be a great new version!
many thanks, andi

Alexei
21-05-2005, 07:56
I have been trying to replicate this on my test system today, no luck so far. Are you sure both the server and the agent are 1.1alpha9?

Please, may I ask you to run zabbix_agentd with more debug information? Thanks.

raa
21-05-2005, 08:10
I have been trying to replicate this on my test system today, no luck so far. Are you sure both the server and the agent are 1.1alpha9?
Yes!


Please, may I ask you to run zabbix_agentd with more debug information? Thanks.
Yes of course.

I had an idea: I compiled the agentd on my Debian testing machine, but my clients are Debian woody and I did not use --enable-static.
Now I have set up 2 tests: one client running an agentd built with --enable-static,
and one running a locally compiled version of agentd.

I'll report when I have some results...

andi

raa
21-05-2005, 08:27
OK, it happened faster than I thought:

023884:20050521:081429 Active check [wblk]] is not supported. Disabled.
023884:20050521:081632 Active check [lk]] is not supported. Disabled.
023884:20050521:081828 Active check [_wio]] is not supported. Disabled.
023884:20050521:082132 Active check [lk]] is not supported. Disabled.
023884:20050521:082232 One child process died. Exiting ...
023884:20050521:082232 Got signal. Exiting ...

The bug appears on both clients...
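To compare how much of each key is lost, the logged key text can be pulled out of the "is not supported" lines. This is a rough sketch that assumes the log line layout shown in the excerpts; the helper names are invented here, and "io[disk_wblk]" is used only as an example of what an undamaged key might look like.

```python
import re

# Assumed layout: PID:YYYYMMDD:HHMMSS Active check [KEY] is not supported. Disabled.
DISABLED_RE = re.compile(
    r"^(\d+):(\d{8}):(\d{6}) Active check \[(.*)\] is not supported\. Disabled\.$"
)


def disabled_keys(lines):
    """Return the key text exactly as logged for each disabled active check."""
    keys = []
    for line in lines:
        m = DISABLED_RE.match(line.strip())
        if m:
            keys.append(m.group(4))
    return keys


def looks_truncated(key):
    """Heuristic: a key that ends in ']' but has no '[' has lost its start."""
    return key.endswith("]") and "[" not in key


sample = [
    "023884:20050521:081429 Active check [wblk]] is not supported. Disabled.",
    "023884:20050521:081632 Active check [lk]] is not supported. Disabled.",
    "023884:20050521:081828 Active check [_wio]] is not supported. Disabled.",
    "023884:20050521:082232 One child process died. Exiting ...",
]
for key in disabled_keys(sample):
    print(repr(key), looks_truncated(key))
```

In the lines above the logged keys are "wblk]", "lk]" and "_wio]", so the amount cut off at the start is not constant, while the end of each key is intact.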

andi

Alexei
21-05-2005, 08:39
Please replace your src/zabbix_agent/active.c with the attached file (latest from CVS) and report the results. This should fix the high CPU usage issue, at least.

Thank you!

raa
21-05-2005, 10:09
Please replace your src/zabbix_agent/active.c with the attached file (latest from CVS) and report the results. This should fix the high CPU usage issue, at least.

Thank you!
Done, but sorry, the agentd crashed again:

024578:20050521:090238 One child process died. Exiting ...
024580:20050521:090238 Got signal. Exiting ...
024587:20050521:090238 Got signal. Exiting ...
024586:20050521:090238 Got signal. Exiting ...

and

025238:20050521:092704 Active check [_wio]] is not supported. Disabled.
025228:20050521:092903 One child process died. Exiting ...
025229:20050521:092903 Got signal. Exiting ...

The high CPU usage issue happened only once, last night, so I can't currently verify whether this version fixes it.

andi

Wolfgang
21-05-2005, 12:13
@raa
Please note that Debian woody uses an older version of libc than Debian testing does. So they are incompatible and you must compile statically.

raa
21-05-2005, 12:22
@raa
Please note that Debian woody uses an older version of libc than Debian testing does. So they are incompatible and you must compile statically.
Yes, of course; that is why I compiled the agent with --enable-static, which didn't solve the problem.
I also compiled the agentd directly on the client machine, with the same result.
On the other hand, I also have agentd running on the same host as the server (testing), and the same error occurs there.

andi

Alexei
22-05-2005, 04:05
I still cannot reproduce it. Hmm... interesting.

raa
22-05-2005, 16:36
I still cannot reproduce it. Hmm... interesting.
No problem, it takes 30 minutes at most until the next crash here...

But I have new facts:
On my zabbix_server host the agentd crashed only once.
Also, the message "Active check [...]] is not supported. Disabled." appears there very rarely.
This is the only host where the communication is not tunneled through ssh.
All other hosts are connected via ssh tunnels.

After I changed all items to "Zabbix agent", there was no crash for over 7 hours.

Then I changed all items to "Zabbix agent active" and immediately got this message (more than 100000 times, the logfile rotated several times):
003035:20050522:xxxxxx No sleeping
zabbix_agentd uses about 97% CPU time and crashes after a few minutes.

Alexei, can I send you an SQL dump of my configuration, and also some logfiles if they are useful?

andi

Alexei
22-05-2005, 19:31
Then I changed all items to "Zabbix agent active" and immediately got this message (more than 100000 times, the logfile rotated several times):
003035:20050522:xxxxxx No sleeping
I don't get it. The message may appear if and only if DebugLevel is set to 4, but I think you have it set to 3. So, my guess is that your active.c is outdated or something, however it's quite hard to believe.

On my test system it works perfectly, no crashes at all with agent running for days with delay 1 second for all active items, weird...

raa
16-06-2005, 09:19
Hi, sorry for my long delay...
I don't get it. The message may appear if and only if DebugLevel is set to 4, but I think you have it set to 3. So, my guess is that your active.c is outdated or something, however it's quite hard to believe.

On my test system it works perfectly, no crashes at all with agent running for days with delay 1 second for all active items, weird...
The bug only appeared when the connection was made through an ssh tunnel.
Every direct connection worked.
Unfortunately most of my clients are firewalled, so I need ssh tunnels.

Today I tested the new version 1.1alpha10:
Whatever you changed, it was good work.
The bug seems to be fixed :)

regards, andi