Zabbix_proxy stopping with

  • iaono
    Junior Member
    • May 2012
    • 11

    #1

    Zabbix_proxy stopping with

    Hello,

    I am getting an error that causes the zabbix proxy to shut down. Any help to point me in the right direction would be greatly appreciated. =)

    The following output is what I see in the logs at the time it shuts down:

    4858:20120518:141150.209 /usr/sbin/fping: [2] No such file or directory
    4858:20120518:141250.270 /usr/sbin/fping: [2] No such file or directory
    4857:20120518:141342.181 Sending list of active checks to [172.29.25.69] failed: host [c25n69] not found
    4858:20120518:141425.793 Got signal [signal:15(SIGTERM),sender_pid:6613,sender_uid:0,reason:0]. Exiting ...
    4862:20120518:141425.793 Got signal [signal:15(SIGTERM),sender_pid:6613,sender_uid:0,reason:0]. Exiting ...
    4851:20120518:141425.793 Got signal [signal:15(SIGTERM),sender_pid:6613,sender_uid:0,reason:0]. Exiting ...
    4859:20120518:141425.793 Got signal [signal:15(SIGTERM),sender_pid:6613,sender_uid:0,reason:0]. Exiting ...
    4856:20120518:141425.794 Got signal [signal:15(SIGTERM),sender_pid:6613,sender_uid:0,reason:0]. Exiting ...
    4822:20120518:141425.794 One child process died (PID:4851,exitcode/signal:19). Exiting ...
    4861:20120518:141425.794 Got signal [signal:15(SIGTERM),sender_pid:6613,sender_uid:0,reason:0]. Exiting ...
    4857:20120518:141425.794 Got signal [signal:15(SIGTERM),sender_pid:6613,sender_uid:0,reason:0]. Exiting ...
    4860:20120518:141425.794 Got signal [signal:15(SIGTERM),sender_pid:6613,sender_uid:0,reason:0]. Exiting ...
    4863:20120518:141425.794 Got signal [signal:15(SIGTERM),sender_pid:6613,sender_uid:0,reason:0]. Exiting ...
    4855:20120518:141425.794 Got signal [signal:15(SIGTERM),sender_pid:6613,sender_uid:0,reason:0]. Exiting ...
    4854:20120518:141425.794 Got signal [signal:15(SIGTERM),sender_pid:6613,sender_uid:0,reason:0]. Exiting ...
    4853:20120518:141425.794 Got signal [signal:15(SIGTERM),sender_pid:6613,sender_uid:0,reason:0]. Exiting ...
    4852:20120518:141425.795 Got signal [signal:15(SIGTERM),sender_pid:6613,sender_uid:0,reason:0]. Exiting ...
    4822:20120518:141427.799 syncing history data...
    4822:20120518:141427.800 syncing history data done
    4822:20120518:141427.800 syncing trends data...
    4822:20120518:141427.800 syncing trends data done
    4822:20120518:141427.800 Zabbix Proxy stopped. Zabbix 1.8.10 (revision 24303).
    Last edited by iaono; 23-05-2012, 00:43.
  • Jason
    Senior Member
    • Nov 2007
    • 430

    #2
    Those logs imply the process is being killed off with kill -TERM <process num>

    Also it looks like fping isn't installed or hasn't been configured correctly.
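
    For example, you could check whether fping is actually there and, if it lives somewhere other than the path in the log, point the proxy at it with FpingLocation in zabbix_proxy.conf (the paths below are just an illustration for your distro):

    # check where fping really is
    which fping
    ls -l /usr/sbin/fping

    # if it is installed elsewhere (e.g. /usr/bin/fping), set it in zabbix_proxy.conf:
    # FpingLocation=/usr/bin/fping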

    • iaono
      Junior Member
      • May 2012
      • 11

      #3
      Does the Zabbix proxy have a feature that automatically kills its own processes for various reasons (timeouts, etc.)?

      There is enough memory free and no one is killing the processes manually.

      • Jason
        Senior Member
        • Nov 2007
        • 430

        #4
        Could something such as SELinux or the like be blocking it?

        • iaono
          Junior Member
          • May 2012
          • 11

          #5
          Ah, sorry about that! Someone was killing the Zabbix processes because all of the nodes located under it gave false node-down alerts; it wasn't Zabbix killing its own processes.

          About 10 proxies are still not sending data to the server.
          - The triggers for those clusters say all nodes are down (for a day).
          - Administration -> DM shows that those proxies have been contacted within the last minute.
          - Housekeeper sessions on the proxies jump from cleaning ~2-5k to ~15-30k records (the offline buffer is set to keep data for 1 hour; config excerpt below).
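
          For reference, the proxy buffer settings we are running look roughly like this (quoting from memory, so treat it as a sketch rather than the exact file):

          # zabbix_proxy.conf (excerpt)
          ProxyLocalBuffer=0     # hours to keep data locally after it has been sent to the server
          ProxyOfflineBuffer=1   # hours to keep data when the server cannot be reached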

          • Jason
            Senior Member
            • Nov 2007
            • 430

            #6
            Check the time period at which the proxy updates its config from the Zabbix server... I've got mine set at 300s as the default seems way too long for me.
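
            If it helps, this is what I mean in zabbix_proxy.conf (only relevant for an active proxy; 300 is just the value I happen to use, the default is much longer):

            # zabbix_proxy.conf (excerpt)
            ConfigFrequency=300    # seconds between configuration fetches from the server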

            Also, is your proxy in active or passive mode, and are the agents in active or passive mode?

            • iaono
              Junior Member
              • May 2012
              • 11

              #7
              Both are in active mode.

              Making it check the configs more often worked! The server is seeing them for the first two I tried it on. Trying it on the rest =)

              • iaono
                Junior Member
                • May 2012
                • 11

                #8
                Restarting one of the other proxies without changing the config fetch time also made it start talking again (I don't know why this didn't work yesterday). I did change the remaining ones to 300 seconds and will monitor whether those clusters produce fewer or no false alerts.

                Just curious (if this is the case): what kind of configuration would it need to fetch from the server so often that it may stop responding if it doesn't get it?

                Thanks for all the help!

                • iaono
                  Junior Member
                  • May 2012
                  • 11

                  #9
                  Are there any settings to get more information out of log level 3? Using level 4 just fills up the disk very quickly.
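
                  For reference, the logging settings we have in zabbix_proxy.conf right now are roughly the following (the size value is approximate):

                  # zabbix_proxy.conf (excerpt)
                  DebugLevel=3     # 0-4; level 4 logs every step but fills the disk quickly
                  LogFileSize=10   # rotate the log once it reaches 10 MB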

                  • mschlegel
                    Member
                    • Oct 2008
                    • 40

                    #10
                    Howdy - I'm a coworker of Iaono.

                    Here's an excerpt of the proxy log surrounding the time where this particular proxy went 'AWOL' as far as the host data is concerned:


                    13490:20120606:141859.317 Executing housekeeper
                    13490:20120606:141859.346 Deleted 3726 records from history [0.023302 seconds]
                    13482:20120606:142020.931 Received configuration data from server. Datalen 121123
                    13490:20120606:151903.021 Executing housekeeper
                    13490:20120606:151903.043 Deleted 2747 records from history [0.016245 seconds]
                    13482:20120606:152024.861 Received configuration data from server. Datalen 121123
                    13490:20120606:161906.701 Executing housekeeper
                    13490:20120606:161906.729 Deleted 3508 records from history [0.021740 seconds]
                    13482:20120606:162028.820 Received configuration data from server. Datalen 121123
                    13490:20120606:171910.378 Executing housekeeper
                    13490:20120606:171910.434 Deleted 7687 records from history [0.050006 seconds]
                    13482:20120606:172032.767 Received configuration data from server. Datalen 121123
                    13490:20120606:181914.056 Executing housekeeper
                    13490:20120606:181914.151 Deleted 13969 records from history [0.089039 seconds]
                    13482:20120606:182036.668 Received configuration data from server. Datalen 121123
                    13490:20120606:191917.776 Executing housekeeper
                    13490:20120606:191917.903 Deleted 18628 records from history [0.120648 seconds]
                    13482:20120606:192040.556 Received configuration data from server. Datalen 121123
                    13490:20120606:201921.531 Executing housekeeper
                    13490:20120606:201921.672 Deleted 20988 records from history [0.135352 seconds]
                    13482:20120606:202044.686 Received configuration data from server. Datalen 121123


                    I wouldn't expect the config fetch time to change much of anything except the responsiveness to configuration changes for hosts. As you can see, the configuration for these hosts does not change around the time of these issues.

                    As far as I've been able to tell, the values are still getting from the proxy to the server; they are just coming through at a very inconsistent pace once this happens, typically delayed by around 15-20 minutes.

                    Any idea what to make of the drastic change in the housekeeper behavior? The one very consistent pattern with this proxy is a small drop in records deleted, followed by a massive climb. It seems to settle out at 5-6 times the normal rate of deleted records per hour.

                    We have thus far not found any trigger that causes this behavior, which does make it difficult to troubleshoot. Is there perhaps a way to turn up some verbosity on only the proxy->server communication logging without logging every agent contact?

                    All of our proxies are currently in 'Active' mode.
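
                    One thing we plan to try next is checking how much data is sitting in the proxy's local proxy_history table and how old it is; a rough check (assuming a MySQL backend, and "zabbix_proxy" here is just our database name) would be:

                    mysql zabbix_proxy -e "SELECT COUNT(*) FROM proxy_history;"
                    mysql zabbix_proxy -e "SELECT FROM_UNIXTIME(MIN(clock)), FROM_UNIXTIME(MAX(clock)) FROM proxy_history;"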

                    • iaono
                      Junior Member
                      • May 2012
                      • 11

                      #11
                      We turned on debugging on a smaller cluster of servers, and from the zbx_recv_response() output I can see that it cannot receive the response from the server.

                      Here is the debug log that occurs between the successes shown below: http://www.x-iss.com/tmp/proxy_log_snipet.txt

                      Grep of the zbx_recv_response function:
                      28723:20120826:134636.161 zbx_recv_response() '{
                      "response":"success"}'
                      28723:20120826:134636.161 End of zbx_recv_response():SUCCEED
                      28723:20120826:134636.161 End of put_data_to_server():SUCCEED
                      --
                      28723:20120826:134729.285 In zbx_recv_response()
                      28730:20120826:134730.473 In get_values()
                      --
                      28724:20120826:134823.029 zbx_recv_response() ''
                      28724:20120826:134823.029 End of zbx_recv_response():FAIL
                      28724:20120826:134823.029 End of put_data_to_server():FAIL
                      --
                      28723:20120826:134826.163 End of zbx_recv_response():NETWORK_ERROR
                      28723:20120826:134826.163 End of put_data_to_server():SUCCEED
                      --
                      28724:20120826:134828.549 In zbx_recv_response()
                      28731:20120826:134828.711 Trapper got [{
                      --
                      28723:20120826:134926.283 In zbx_recv_response()
                      28729:20120826:134926.304 In get_values()
                      --
                      28723:20120826:135026.165 End of zbx_recv_response():NETWORK_ERROR
                      28723:20120826:135026.165 End of put_data_to_server():SUCCEED
                      --
                      28723:20120826:135029.281 In zbx_recv_response()
                      28730:20120826:135030.491 In get_values()
                      --
                      28723:20120826:135126.166 End of zbx_recv_response():NETWORK_ERROR
                      28723:20120826:135126.166 End of put_data_to_server():SUCCEED
                      --
                      28723:20120826:135129.279 In zbx_recv_response()
                      28726:20120826:135129.317 In get_values()
                      --
                      28723:20120826:135129.513 zbx_recv_response() '{
                      "response":"success"}'
                      28723:20120826:135129.513 End of zbx_recv_response():SUCCEED
                      28723:20120826:135129.514 End of put_data_to_server():SUCCEED
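
                      To get a rough idea of how often this fails, we have just been grepping the debug log and doing a basic connectivity test against the server's trapper port (10051 by default; the hostname and log path below are placeholders for our setup):

                      grep -c 'End of put_data_to_server():FAIL' /var/log/zabbix/zabbix_proxy.log
                      grep -c 'End of zbx_recv_response():NETWORK_ERROR' /var/log/zabbix/zabbix_proxy.log
                      telnet zabbix-server 10051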
