**NOB** · 18-04-2007, 14:25

Hi,

To me it seems likely that the user parameter check
is the culprit, because at least two pipes are involved.

My (wild) guess is:

The agent got a SIGPIPE and, at that time, corrupted the socket fd or set it to
something invalid, but it still tries to send the answer back
to the server (every second), or something along those lines.

Regards,

Norbert.

**Farzad FARID** · 18-04-2007, 15:37

Originally posted by NOB

Hi,
My (wild) guess is:
[...]
The agent got a SIGPIPE and, at that time, corrupted the socket fd or set it to
something invalid, but it still tries to send the answer back
to the server (every second), or something along those lines.

Norbert.

Hi, this is close to what I supposed, thanks for your input.

But, as I just used the sample UserParameters provided in the default zabbix_agentd.conf, the use of one or more pipes is a common case. So I'm surprised this SIGPIPE error has not been reported before.

By the way, I should add that both the server and the agent are compiled and running on a 64-bit version of Red Hat Fedora Core 5.

Regards.

**NOB** · 18-04-2007, 17:03

Hi,

I've seen this error now in my zabbix_server.log file, too.
It looks like this can happen if you kill (stop) an agent while a
command is executed. I had to stop two agents for an upgrade
and the SIGPIPE error message appeared at that time on the server.

Norbert.

**Farzad FARID** · 18-04-2007, 17:16

Originally posted by NOB

Hi,

I've seen this error now in my zabbix_server.log file, too.
It looks like this can happen if you kill (stop) an agent while a
command is executed. I had to stop two agents for an upgrade
and the SIGPIPE error message appeared at that time on the server.

Norbert.

So there are actually two potential problems:

- One on the server side, apparently triggered by the interruption of an agent during the execution of a command.
- One on the agent side. For me this happens without any agent restart, just after (or during) the execution of a UserParameter that uses multiple pipes.

I hope this information will be useful to our dear Zabbix developers.

Regards

**dwoodruff** · 02-05-2007, 02:36

I am having a similar problem on v1.3.5 and SLES 10.

Here is what the log reports:
6321:20070501:192732 Got SIGPIPE. Where it came from???
6325:20070501:192732 Got SIGPIPE. Where it came from???
6319:20070501:193332 Got SIGPIPE. Where it came from???
6325:20070501:194415 Too many consecutive errors on accept() call.
6321:20070501:194415 Too many consecutive errors on accept() call.
6319:20070501:195015 Too many consecutive errors on accept() call.

I am also using UserParameter checks.

Thanks,
Donnie

**Farzad FARID** · 03-05-2007, 18:40

Hi,

I'm still investigating this problem, present in version 1.3.6 rev 4084.

For me it only happens on a 64-bit Fedora Core 5 agent, not on a 32-bit Fedora Core 5 one, although the two agents use the same template and the same set of items. Does anyone have this problem on a 32-bit architecture too?

Furthermore, I have the impression that the SIGPIPE is not related to the execution of UserParameters; rather, the socket used by the agent to send information back to the server is accidentally shut down by either the server or the agent.

I tried to follow the logic of the tcp_* routines, but didn't find anything suspicious. The only things I am sure of are:

- The problem happens with passive checks only, because it involves the "listener socket". It happens inside a Listener process.
- The SIGPIPE signal is raised in the zbx_tcp_send_ext function (in src/libs/zbxcomms) when the agent tries to write data (in the /* Write header */ step, I think).
- Once a Listener process has received a SIGPIPE, it becomes unusable: its TCP socket is dead.
- After a while, all my Listener processes get the signal. From then on the agent no longer answers any request.
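
To make the mechanism in the list above concrete, here is a minimal sketch in C of what happens when a process writes to a stream socket whose peer has already closed the connection. It is not Zabbix code: `send_reply` and `reply_to_closed_peer` are hypothetical names, and a `socketpair` stands in for the agent/server TCP connection. With SIGPIPE ignored, the failed write reports `EPIPE` instead of delivering a signal, so the caller can simply drop that connection.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not a Zabbix function): write a reply to a socket
 * whose peer may already have hung up.
 * Returns 0 on success, or the errno value of the failed write. */
static int send_reply(int fd, const char *reply)
{
    size_t left;

    assert(reply != NULL);
    left = strlen(reply);

    while (left > 0) {
        ssize_t n = write(fd, reply, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before sending anything: retry */
            return errno;   /* EPIPE means the peer closed the connection */
        }
        reply += n;
        left -= (size_t)n;
    }
    return 0;
}

/* Simulates a listener answering a peer that has already given up.
 * Returns the errno reported by the failed write, or -1 if the
 * socket pair could not be created. */
static int reply_to_closed_peer(void)
{
    int sv[2];
    int err;

    /* Without this, the write below would deliver SIGPIPE, whose
     * default action terminates the process. */
    signal(SIGPIPE, SIG_IGN);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    close(sv[1]);                    /* the "server" side goes away */
    err = send_reply(sv[0], "1\n");  /* the "agent" still tries to answer */
    close(sv[0]);
    return err;
}
```

Calling `reply_to_closed_peer()` is expected to return `EPIPE`. Whether the agent should ignore SIGPIPE globally or handle it per connection is a design choice this sketch does not settle.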

But I can't find why or where exactly the socket gets closed. I suspected a misuse of "tcp_unaccept", but that doesn't seem to be the case. On the server side, the only "suspicious" messages logged (log level = 3) at the exact time of the SIGPIPE are:

Code:

 21549:20070503:121454 Timeout while answering request
 21545:20070503:121620 Timeout while answering request
############### START HERE #########################
 21547:20070503:122118 Timeout while answering request
 21547:20070503:122118 Get value from agent failed. Error: ZBX_TCP_READ() failed [Interrupted system call]
 21547:20070503:122118 Host [srv05bc2]: first network error, wait for 15 seconds
################ END HERE ###########################
 21551:20070503:122151 Timeout while answering request
 21553:20070503:122523 Timeout while answering request
 21574:20070503:122543 Executing housekeeper

For the time being I have converted all the items to Active checks, and the problem has not recurred so far.

Regards

**elp** · 23-05-2007, 17:37

Got SIGPIPE

Hi,

This occurs because the version of the server was different from the version of agentd.

[]s ELP

Originally posted by dwoodruff

I am having a similar problem on v1.3.5 and SLES 10.

Here is what the log reports:
6321:20070501:192732 Got SIGPIPE. Where it came from???
6325:20070501:192732 Got SIGPIPE. Where it came from???
6319:20070501:193332 Got SIGPIPE. Where it came from???
6325:20070501:194415 Too many consecutive errors on accept() call.
6321:20070501:194415 Too many consecutive errors on accept() call.
6319:20070501:195015 Too many consecutive errors on accept() call.

I am also using UserParameter checks.

Thanks,
Donnie

**Farzad FARID** · 23-05-2007, 17:40

Hi

Originally posted by elp

Hi,

This occurs because the version of the server was different from the version of agentd.

[]s ELP

I don't think that's the only reason. On my platform both server and agent are at exactly the same version.

Regards

1.3.5 agent "Got SIGPIPE" and then continuous Listener errors
