1.3.5 agent "Got SIGPIPE" and then continuous Listener errors

  • Farzad FARID
    Member
    • Apr 2007
    • 79

    #1

    1.3.5 agent "Got SIGPIPE" and then continuous Listener errors

    Hi,

    I'm running Zabbix 1.3.5 (server+agent) on MySQL 5.0.

    On my single test agent, I added some local UserParameter checks (the mysql.* ones of the example config file). Log level is set to 5 (DEBUG).

    Soon after (re)starting the agent, I get these messages in the agent log file:

    Code:
     25862:20070418:123915 Processing request.
     25862:20070418:123915 In check_security()
     25862:20070418:123915 Requested [mysql.uptime]
     25863:20070418:123915 Before
     25863:20070418:123915 Run remote command [mysqladmin -uzabbix -pzabbix status|cut -f2 -d":"|cut -f1 -d"T"] Result [8] [ 88820  ]
     25863:20070418:123915 Sending back [ 88820  ]
     25863:20070418:123915 Got SIGPIPE. Where it came from???
     25863:20070418:123915 Process listener error: Connection from [10.0.1.85] accepted. Allowed servers [10.0.1.85]
     25863:20070418:123915 Listener error: accept() failed [Bad file descriptor]
     25862:20070418:123915 Before
     25862:20070418:123915 Run remote command [mysqladmin -uzabbix -pzabbix status|cut -f2 -d":"|cut -f1 -d"T"] Result [8] [ 88820  ]
     25862:20070418:123915 Sending back [ 88820  ]
     25863:20070418:123916 Listener error: accept() failed [Bad file descriptor]
    After this, the agent seems to run normally, but it continuously logs an error message, one per second:
    Code:
     25863:20070418:124533 Listener error: accept() failed [Bad file descriptor]
    I guess the Listener error and the SIGPIPE are related, but I don't know what causes the SIGPIPE. Maybe the UserParameter check which got triggered just before the SIGPIPE is responsible?
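
    For reference, here is a small sketch of what the two `cut` stages in that UserParameter do to the `mysqladmin status` output. The sample status line is an assumption (only the uptime value 88820 comes from the log above), and `extract_uptime` is a hypothetical helper name, not anything from Zabbix.

```python
def extract_uptime(status_line):
    """Mimic: cut -f2 -d":" | cut -f1 -d"T" on one line of text."""
    after_colon = status_line.split(":")[1]   # second ':'-separated field
    return after_colon.split("T")[0]          # first 'T'-separated field


if __name__ == "__main__":
    # Assumed shape of a `mysqladmin status` line; values are illustrative.
    sample = "Uptime: 88820  Threads: 1  Questions: 42"
    value = extract_uptime(sample)
    print(repr(value), len(value))
```

    With that sample the result keeps its surrounding spaces and is 8 characters long, which is consistent with the "Result [8] [ 88820  ]" entry in the log.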

    Regards,
  • NOB
    Senior Member
    Zabbix Certified Specialist
    • Mar 2007
    • 469

    #2
    Hi,

    To me it seems probable that the UserParameter check
    is the culprit, because at least two pipes are involved.

    My (wild) guess is:

    The agent got a SIGPIPE, corrupted the socket fd at that time or set it to
    something invalid, but still tries to send the answer back
    to the server (every second), or something along those lines.

    Regards,

    Norbert.
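
    As background on where a SIGPIPE usually comes from: writing to a socket whose other end has already been closed fails with EPIPE, and the kernel also raises SIGPIPE unless the signal is ignored. The sketch below reproduces only that general behaviour with a local socket pair. It is not Zabbix code, and `write_to_closed_peer` is a hypothetical name.

```python
import errno
import socket


def write_to_closed_peer():
    """Send on a socket whose peer has gone away; return the errno seen."""
    # Python ignores SIGPIPE by default, so the failure shows up as an
    # OSError (EPIPE) instead of a signal delivered to the process.
    ours, peer = socket.socketpair()
    peer.close()                       # the other end disappears
    try:
        ours.sendall(b" 88820  \n")    # try to send the value back
    except OSError as exc:
        return exc.errno
    finally:
        ours.close()
    return None


if __name__ == "__main__":
    print(errno.errorcode.get(write_to_closed_peer()))
```

    A C daemon that installs a SIGPIPE handler would see the signal at the same point, which would fit a "Got SIGPIPE" message appearing right after "Sending back".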

    • Farzad FARID
      Member
      • Apr 2007
      • 79

      #3
      Originally posted by NOB
      Hi,
      My (wild) guess is:
      [...]
      The agent got a SIGPIPE, corrupted the socket fd at that time or set it to
      something invalid, but still tries to send the answer back
      to the server (every second), or something along those lines.

      Norbert.
      Hi, this is close to what I supposed, thanks for your input.

      But, as I just used the sample UserParameters provided in the default zabbix_agentd.conf, the use of one or more pipes is a common case. So I'm surprised this SIGPIPE error has not been reported before.

      By the way, I must add that both server and agent are compiled and running on a 64-bit version of Red Hat Fedora Core 5.

      Regards.

      • NOB
        Senior Member
        Zabbix Certified Specialist
        • Mar 2007
        • 469

        #4
        Hi,

        I've seen this error now in my zabbix_server.log file, too.
        It looks like this can happen if you kill (stop) an agent while a
        command is executed. I had to stop two agents for an upgrade
        and the SIGPIPE error message appeared at that time on the server.

        Norbert.

        • Farzad FARID
          Member
          • Apr 2007
          • 79

          #5
          Originally posted by NOB
          Hi,

          I've seen this error now in my zabbix_server.log file, too.
          It looks like this can happen if you kill (stop) an agent while a
          command is executed. I had to stop two agents for an upgrade
          and the SIGPIPE error message appeared at that time on the server.

          Norbert.
          So there are actually two potential problems:
          • One on the server side, apparently triggered by the interruption of an agent during the execution of a command.
          • One on the agent side. This happens for me without any agent restarting, but just after (or during) the execution of a UserParameter involving the use of multiple pipes.


          I hope this information will be useful to our dear Zabbix developers.
          Regards

          • dwoodruff
            Junior Member
            • Mar 2007
            • 7

            #6
            I am having a similar problem on v1.3.5 and SLES 10.

            Here is what the log reports:
            6321:20070501:192732 Got SIGPIPE. Where it came from???
            6325:20070501:192732 Got SIGPIPE. Where it came from???
            6319:20070501:193332 Got SIGPIPE. Where it came from???
            6325:20070501:194415 Too many consecutive errors on accept() call.
            6321:20070501:194415 Too many consecutive errors on accept() call.
            6319:20070501:195015 Too many consecutive errors on accept() call.

            I am also using UserParameter checks.
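
            In case it helps others compare logs, here is a rough sketch that counts these two messages per process ID. It assumes the `PID:YYYYMMDD:HHMMSS message` layout seen in the excerpts in this thread. `tally` and the regular expression are my own hypothetical helpers, not part of Zabbix.

```python
import re
from collections import Counter

# Assumed line layout, taken from the log excerpts: PID:YYYYMMDD:HHMMSS message
LINE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\s+(.*)$")


def tally(lines):
    """Return (SIGPIPE count per PID, accept() error count per PID)."""
    sigpipe, accept_errors = Counter(), Counter()
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue                      # skip anything that is not a log line
        pid, message = match.group(1), match.group(4)
        if message.startswith("Got SIGPIPE"):
            sigpipe[pid] += 1
        elif "accept()" in message:
            accept_errors[pid] += 1
    return sigpipe, accept_errors


if __name__ == "__main__":
    sample = [
        "6321:20070501:192732 Got SIGPIPE. Where it came from???",
        "6321:20070501:194415 Too many consecutive errors on accept() call.",
    ]
    print(tally(sample))
```

            On the six lines above, each of the three PIDs shows one SIGPIPE followed later by one accept() error.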

            Thanks,
            Donnie
            Last edited by dwoodruff; 02-05-2007, 02:38.

            • Farzad FARID
              Member
              • Apr 2007
              • 79

              #7
              Hi,

              I'm still investigating this problem, present in version 1.3.6 rev 4084.

              For me it only happens on a 64-bit Fedora Core 5 agent, not on a 32-bit Fedora Core 5 one, although the two agents are running the same template and the same set of items. Does anyone have this problem on a 32-bit architecture too?

              Furthermore, I have the impression that the SIGPIPE is not related to the execution of UserParameters; rather, the socket used by the agent to send back information to the server is accidentally shut down by either the server or the agent.

              I tried to follow the logic of the tcp_* routines, but didn't find anything suspicious. The only things I am sure of are:
              • The problem happens with passive checks only, because it involves the "listener socket". It happens inside a Listener process.
              • The SIGPIPE signal is raised in the zbx_tcp_send_ext function (in src/libs/zbxcomms), when the agent tries to write data (in /* Write header */ I think).
              • Once a Listener process has received a SIGPIPE, it becomes unusable; its TCP socket is dead.
              • After a while, all my Listener processes get the signal. At that point the agent no longer answers any request.
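
              The "dead listener" symptom can be reproduced in isolation: once a listening descriptor has been closed, every later accept() on it fails with EBADF ("Bad file descriptor"), however often it is retried. The sketch below only demonstrates that general behaviour. Whether the agent really closes its listening socket after the SIGPIPE is a guess, and `accept_after_close` is a hypothetical name.

```python
import errno
import socket


def accept_after_close(attempts=3):
    """Close a listening socket, then keep calling accept() on it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))    # any free local port
    listener.listen(1)
    listener.close()                   # assumed bug: the listening fd is closed
    seen = []
    for _ in range(attempts):          # stands in for the once-per-second retry
        try:
            listener.accept()
        except OSError as exc:
            seen.append(exc.errno)
    return seen


if __name__ == "__main__":
    print([errno.errorcode.get(code) for code in accept_after_close()])
```

              Every attempt reports the same error, which matches a Listener that logs "accept() failed [Bad file descriptor]" on each retry and never recovers.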


              But I can't find why or where exactly the socket gets closed. I suspected a misuse of "tcp_unaccept", but that doesn't seem to be the case. On the server side, the only "suspicious" messages logged (log level = 3) at the exact time of the SIGPIPE are:

              Code:
               21549:20070503:121454 Timeout while answering request
               21545:20070503:121620 Timeout while answering request
              ############### START HERE #########################
               21547:20070503:122118 Timeout while answering request
               21547:20070503:122118 Get value from agent failed. Error: ZBX_TCP_READ() failed [Interrupted system call]
               21547:20070503:122118 Host [srv05bc2]: first network error, wait for 15 seconds
              ################ END HERE ###########################
               21551:20070503:122151 Timeout while answering request
               21553:20070503:122523 Timeout while answering request
               21574:20070503:122543 Executing housekeeper
              For the time being I have converted all the items to use Active checks, and the problem has not been triggered since.

              Regards

              • elp
                Junior Member
                • Aug 2006
                • 4

                #8
                Got SIGPIPE

                Hi,

                This occurs because the version of the server was different from the version of agentd.

                []s ELP

                Originally posted by dwoodruff
                I am having a similar problem on v1.3.5 and SLES 10.

                Here is what the log reports:
                6321:20070501:192732 Got SIGPIPE. Where it came from???
                6325:20070501:192732 Got SIGPIPE. Where it came from???
                6319:20070501:193332 Got SIGPIPE. Where it came from???
                6325:20070501:194415 Too many consecutive errors on accept() call.
                6321:20070501:194415 Too many consecutive errors on accept() call.
                6319:20070501:195015 Too many consecutive errors on accept() call.

                I am also using UserParameter checks.

                Thanks,
                Donnie

                • Farzad FARID
                  Member
                  • Apr 2007
                  • 79

                  #9
                  Hi

                  Originally posted by elp
                  Hi,

                  This occurs because the version of the server was different from the version of agentd.

                  []s ELP
                  I don't think it's the only reason. On my platform both server and agent are at exactly the same version.

                  Regards
