Zabbix server fails with "Got SIGPIPE. Where it came from???"

  • Emir Imamagic
    Member
    • Mar 2008
    • 67

    #1

    Zabbix server fails with "Got SIGPIPE. Where it came from???"

    Hello,

    We're running Zabbix 1.6 (SVN rev. 6204). The underlying system is CentOS 5.2, with a PostgreSQL 8.3 database backend, on a machine with 16 GB of RAM and two quad-core Opteron processors. We're monitoring around 450 machines with ~40000 items and ~4000 triggers (new values per second: 230). Agents are mainly version 1.4.5.

    A few times a day (~5 on average) the Zabbix server log reports these messages:
    Got SIGPIPE. Where it came from???
    Error while sending list of active checks
    but nothing bad happens.

    However, on two occasions the number of these messages increased until the Zabbix server finally stopped receiving any results from the agents. On the agent side we see the following messages:
    Timeout while answering request
    Getting list of active checks failed. Will retry after 60 seconds

    Since we use a nodata trigger to raise an alert when a Zabbix agent is down, this issue is causing a lot of false positives. Could anyone give us a clue as to where these SIGPIPEs are coming from and how to avoid them?

    Thanks in advance,
    emir
  • richlv
    Senior Member
    Zabbix Certified Trainer
    Zabbix Certified Specialist
    Zabbix Certified Professional
    • Oct 2005
    • 3112

    #2
    we're seeing those as well, but only on single machine so far - and in agent log instead of server log.
    the machine actually is zabbix server itself.
    we could try finding out common factors - do you have user parameters defined for that host ?
    do you see those messages in serverlog only ? what about agent logs ?
    Zabbix 3.0 Network Monitoring book

    • Crazy Marty
      Member
      • Sep 2007
      • 75

      #3
      SIGPIPE is raised on a write operation to a closed (or otherwise broken) pipe -- well, it used to be only on pipes, but now that pipes & sockets share a lot of code in many Unix-/Linux-/Posix-like systems, it applies to sockets, too. So it means that a socket has been closed (by the reader) by the time data is written to it (by the writer).

      It would probably be wise for the code to arrange to catch SIGPIPE, and at least try to report on just which socket (unexpectedly) went away.
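
A minimal sketch of that idea, in Python rather than Zabbix's C, and not taken from the Zabbix source. The helper name `send_with_report` and the `label` argument are made up for illustration. It shows a write to a socket whose reader has already closed, and an error message that names the connection that went away.

```python
import socket


def send_with_report(sock: socket.socket, payload: bytes, label: str) -> str:
    """Send payload; on a broken pipe, say which connection went away.

    `label` is a hypothetical caller-supplied name for the connection.
    """
    try:
        sock.sendall(payload)
        return f"sent {len(payload)} bytes to {label}"
    except BrokenPipeError as exc:
        # CPython ignores SIGPIPE at startup, so the failed write surfaces
        # as EPIPE (BrokenPipeError) instead of terminating the process.
        return f"broken pipe while writing to {label}: {exc.strerror}"


if __name__ == "__main__":
    # A local socket pair stands in for the server/agent connection.
    writer, reader = socket.socketpair()
    reader.close()  # the reader goes away before the writer sends anything
    print(send_with_report(writer, b"list of active checks", "agent-01"))
    writer.close()
```

In a C program the equivalent would be to ignore or handle SIGPIPE and check the write call for EPIPE, logging the peer at that point.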

      • richlv
        Senior Member
        Zabbix Certified Trainer
        Zabbix Certified Specialist
        Zabbix Certified Professional
        • Oct 2005
        • 3112

        #4
        thanks for the info on sockets, didn't know that. yes, as in many cases, improved error message would help a lot in debugging
        Zabbix 3.0 Network Monitoring book

        • Emir Imamagic
          Member
          • Mar 2008
          • 67

          #5
          Originally posted by richlv
          we're seeing those as well, but only on single machine so far - and in agent log instead of server log.
          the machine actually is zabbix server itself.
          we could try finding out common factors - do you have user parameters defined for that host ?
          do you see those messages in serverlog only ? what about agent logs ?
          We see these only in the server logs. I read about the SIGPIPE signal, but I don't understand which part of the Zabbix server is using pipes, especially since we're using a PostgreSQL database on a different machine via a TCP connection.

          • Emir Imamagic
            Member
            • Mar 2008
            • 67

            #6
            Now I see that there is an open bug for this issue:

            A possibly interesting point here is that both of us are using a PostgreSQL database.

            I left an additional comment there as well, because there haven't been any comments from the developers and this is causing our infrastructure a lot of problems.

            • Emir Imamagic
              Member
              • Mar 2008
              • 67

              #7
              One obvious thing we forgot to try is changing RefreshActiveChecks from its default value. Based on the debug logs, it seems that the SIGPIPEs strike when the DB is under heavy load and the server doesn't manage to answer the agent's request for active checks in a timely manner.
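
The suspected sequence (the agent times out and closes its end, then the server writes a late reply) can be reproduced in a few lines. This is only an illustrative sketch on a local socket pair, not Zabbix code, and the function name and timeout value are invented for the example.

```python
import socket


def slow_reply_after_client_timeout(client_timeout: float = 0.05) -> str:
    """Reproduce the suspected sequence on a local socket pair."""
    server, client = socket.socketpair()
    try:
        client.settimeout(client_timeout)
        try:
            client.recv(1024)  # the "agent" waits for the list of checks
        except socket.timeout:
            client.close()     # it gives up and closes its end
        try:
            server.sendall(b"list of active checks")  # late "server" reply
            return "reply delivered"
        except BrokenPipeError:
            return "broken pipe: peer closed before the reply was written"
    finally:
        server.close()
        client.close()


if __name__ == "__main__":
    print(slow_reply_after_client_timeout())
```

On a real TCP connection the first write after the peer has closed may still appear to succeed, with the broken pipe showing up only on a later write, so the timing in production can look less tidy than in this sketch.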

              In our setup we used the default value (60s), which creates significant load on the server with our number of machines and items. We raised the value to the maximum, 3600, and are hoping for the best. Too bad this value can't be configured on the server side.
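
As a rough back-of-the-envelope check of that load, assuming every one of the ~450 hosts runs an agent that requests its active-check list exactly once per RefreshActiveChecks interval (the thread does not say how many hosts actually use active checks), the average request rate can be estimated like this:

```python
def list_requests_per_second(hosts: int, refresh_seconds: int) -> float:
    """Average active-check list requests per second across all agents.

    Assumes one request per host per refresh interval (an assumption,
    not a measured figure).
    """
    return hosts / refresh_seconds


if __name__ == "__main__":
    for refresh in (60, 3600):
        rate = list_requests_per_second(450, refresh)
        print(f"RefreshActiveChecks={refresh}: {rate:.3f} requests/s")
```

Under those assumptions the change from 60 to 3600 takes the server from about 7.5 list requests per second down to 0.125, each of which involves database queries for that host's items.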

              One thing scares me a bit now: when the agent didn't get the list, it tried again after 60s. Does that 60s come from RefreshActiveChecks? Does that mean that in our setup the agent will wait 3600s before requesting the list again?

              Thanks,
              emir

              • richlv
                Senior Member
                Zabbix Certified Trainer
                Zabbix Certified Specialist
                Zabbix Certified Professional
                • Oct 2005
                • 3112

                #8
                ...which couldn't be our problem, as the agent does not have any active checks assigned (they are explicitly disabled).
                could it be that agent and server problems with this error are different ?
                Zabbix 3.0 Network Monitoring book

                • Emir Imamagic
                  Member
                  • Mar 2008
                  • 67

                  #9
                  Originally posted by richlv
                  ...which couldn't be our problem, as the agent does not have any active checks assigned (they are explicitly disabled).
                  could it be that agent and server problems with this error are different ?
                  I would say yes. If you increase the debug level, can you at least figure out in which part the SIGPIPE occurs?

                  • richlv
                    Senior Member
                    Zabbix Certified Trainer
                    Zabbix Certified Specialist
                    Zabbix Certified Professional
                    • Oct 2005
                    • 3112

                    #10
                    with debug level 4 it shows :

                    Code:
                     11157:20090204:115145 Before
                     11157:20090204:115145 Run remote command [/home/zabbix/bin/hpacucliwrapper controller cache 0 ] Result [1] [1]...
                     11157:20090204:115145 Sending back [1]
                     11157:20090204:115145 Got SIGPIPE. Where it came from???
                     11157:20090204:115145 Process listener error: ZBX_TCP_WRITE() failed [Broken pipe]
                     11158:20090204:115147 Before
                     11158:20090204:115147 Run remote command [/home/zabbix/bin/hpacucliwrapper array 0 B ] Result [1] [1]...
                     11158:20090204:115147 Sending back [1]
                     11158:20090204:115147 Got SIGPIPE. Where it came from???
                     11158:20090204:115147 Process listener error: ZBX_TCP_WRITE() failed [Broken pipe]
                    (both server and client are 1.4.6).
                    so it seems like agent succeeded in its operations, but sending data to server somehow errored out (though data itself is delivered ok).
                    Zabbix 3.0 Network Monitoring book
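
For longer logs, pairing each SIGPIPE line with the last command the same process ran can be automated. The sketch below assumes the debug-level-4 line layout shown in the excerpt above (`PID:YYYYMMDD:HHMMSS message`) and the "Run remote command [...] Result" wording; other Zabbix versions may log differently, and the function name is made up for this example.

```python
import re

# PID:date:time followed by the message, as in the excerpt above.
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\s+(.*)$")
# Captures the command text between the brackets, without trailing spaces.
COMMAND_RE = re.compile(r"^Run remote command \[(.*?)\s*\] Result")


def commands_before_sigpipe(log_lines):
    """Return (pid, timestamp, command) for each SIGPIPE line.

    `command` is the last remote command logged by the same PID,
    or None if that process logged no command before the SIGPIPE.
    """
    last_command = {}
    hits = []
    for line in log_lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not follow the assumed layout
        pid, date, time, message = match.groups()
        command = COMMAND_RE.match(message)
        if command:
            last_command[pid] = command.group(1)
        elif message.startswith("Got SIGPIPE"):
            hits.append((pid, f"{date} {time}", last_command.get(pid)))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python sigpipe_commands.py < zabbix_agentd.log
    for pid, stamp, command in commands_before_sigpipe(sys.stdin):
        print(pid, stamp, command)
```

If the same few commands keep appearing, slow user parameters answering after the server has already given up on the connection would be one thing to check.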

                    • Emir Imamagic
                      Member
                      • Mar 2008
                      • 67

                      #11
                      Originally posted by Emir Imamagic
                      One thing scares me a bit now: when the agent didn't get the list, it tried again after 60s. Does that 60s come from RefreshActiveChecks? Does that mean that in our setup the agent will wait 3600s before requesting the list again?
                      To answer myself, since no one else bothers: the answer is no. In case of failure, the agent will query for active checks again after 60s:
                      Getting list of active checks failed. Will retry after 60 seconds

                      cheers,
                      emir
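
The behaviour described in this post can be written down as a tiny sketch. It is a model of what the log message implies, not the agent's actual code; the function and parameter names are invented, and the 60-second retry is taken from the log line quoted above.

```python
def next_request_delay(
    last_request_ok: bool,
    refresh_active_checks: int = 3600,
    retry_delay: int = 60,
) -> int:
    """Seconds until the agent next asks the server for active checks.

    A successful request is followed by the configured refresh interval;
    a failed one is retried after the fixed retry delay.
    """
    return refresh_active_checks if last_request_ok else retry_delay


if __name__ == "__main__":
    print(next_request_delay(last_request_ok=True))
    print(next_request_delay(last_request_ok=False))
```

So raising RefreshActiveChecks to 3600 only stretches the interval after a successful fetch; a failed fetch is still retried a minute later.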

                      • Emir Imamagic
                        Member
                        • Mar 2008
                        • 67

                        #12
                        Originally posted by richlv
                        (both server and client are 1.4.6).
                        so it seems like agent succeeded in its operations, but sending data to server somehow errored out (though data itself is delivered ok).
                        Sorry, I don't have a clue. I'm quite sure that your problem is different from what we're facing. I suggest you open a bug, but unfortunately it seems the Zabbix crew is not keen on solving these problems lately.

                        Cheers,
                        emir

                        • Emir Imamagic
                          Member
                          • Mar 2008
                          • 67

                          #13
                          I see that ZBX-311, regarding problems with the PostgreSQL database, has been solved. Can we expect that this issue (https://support.zabbix.com/browse/ZBX-518) will be resolved in 1.6.3?

                          Thanks,
                          emir

                          • bennett.lain
                            Junior Member
                            • Mar 2010
                            • 3

                            #14
                            Was this ever fixed?!

                            I'm using 1.4 with no real ability to upgrade right now.

                            This just started happening to us in the last couple of weeks.

                            This IS causing issues: we are losing some of the data our system is trying to collect.

                            Is there something I can do on my end to fix this WITHOUT upgrading?
                            I'd like to know what might have caused this, because I'm at a loss.
