I've read a few messages in the forum about how the clients die when any given item is switched from non-active to active. That does seem to happen to me.
However, my problem is that they continue to die periodically afterwards, which is a major issue. Restarting the server and clients does not fix it.
Usually one of the children dies, and the whole client goes away, shortly after puking on an active check. Many times the logged name of the active check is significantly corrupted (for example, in Example #1 below the key is supposed to be system[hostname], not em[hostname]). At the same time as the client dies, the server says "Can't ignore signal CHLD, forcing to default." Any ideas? I'm pretty lost.
Example #1:
000504:20050709:120203 Active check [em[hostname]] is not supported. Disabled.
000498:20050709:130329 One child process died. Exiting ...
000499:20050709:130329 Got signal. Exiting ...
000500:20050709:130329 Got signal. Exiting ...
000501:20050709:130329 Got signal. Exiting ...
000503:20050709:130329 Got signal. Exiting ...
000502:20050709:130329 Got signal. Exiting ...
Example #2:
009962:20050709:214928 In delete_all_metrics()
009962:20050709:214928 Parsed [diskfree[/logs]:60:0]
009962:20050709:214928 Key [diskfree[/logs]]
009962:20050709:214928 Refresh [60]
009962:20050709:214928 Lastlogsize [0]
009962:20050709:214928 In add check [diskfree[/logs]]
009962:20050709:214928 Parsed [0]
009962:20050709:214928 Key [0]
009962:20050709:214928 Refresh [(null)]
009962:20050709:214928 Lastlogsize [(null)]
009951:20050709:214928 One child process died. Exiting ...
009954:20050709:214928 Got signal. Exiting ...
009958:20050709:214928 Got signal. Exiting ...
009955:20050709:214928 Got signal. Exiting ...
009956:20050709:214928 Got signal. Exiting ...
009959:20050709:214928 Got signal. Exiting ...
009953:20050709:214928 Got signal. Exiting ...
009961:20050709:214928 Got signal. Exiting ...
009960:20050709:214928 Got signal. Exiting ...
009957:20050709:214928 Got signal. Exiting ...
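To narrow down which child actually dies, here is a rough sketch of how I'd go through the client log. It assumes every line has the PID:YYYYMMDD:HHMMSS layout seen in the examples above, and that the parent and the surviving children always end with an "Exiting ..." message; the function names (parse_line, suspect_pids) are my own, not anything from the client itself.

```python
import re

# Assumed layout, based only on the excerpts above: PID:YYYYMMDD:HHMMSS message
LINE_RE = re.compile(r"^(\d+):(\d{8}):(\d{6}) (.*)$")


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    pid, date, time, message = match.groups()
    return {"pid": pid, "date": date, "time": time, "message": message}


def suspect_pids(lines):
    """Return PIDs (in order of first appearance) that logged something
    but never logged an 'Exiting ...' message."""
    seen = []
    exited = set()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip headings such as "Example #1:" and blank lines
        if record["pid"] not in seen:
            seen.append(record["pid"])
        if record["message"].endswith("Exiting ..."):
            exited.add(record["pid"])
    return [pid for pid in seen if pid not in exited]
```

Run over the two excerpts above, this should single out 000504 and 009962, which are the processes that logged the corrupted or (null) active check and then never logged an exit.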