Ad Widget

**Jason** · 23-05-2012, 12:35

Those logs imply the process is being killed off with kill -TERM <process num>

Also it looks like fping isn't installed or hasn't been configured correctly.

**iaono** · 23-05-2012, 15:25

Does the zabbix proxy have a feature which automatically kills its own process for various reasons? (timeout, etc.)

There is enough memory free and no one is killing the processes manually.

**Jason** · 23-05-2012, 15:27

Not something such as SELinux or the like blocking it?

**iaono** · 23-05-2012, 16:16

Ah, sorry about that! Someone was killing the zabbix processes because all of the nodes located under it gave false node down alerts, it wasn't zabbix killing it's own processes.

About 10 proxies still are not sending data to the server.
- The triggers for those clusters say all nodes are down (for a day).
- Administration -> DM shows that those proxies have been contacted within the last minute.
- Housekeeper sessions on the proxies jump from cleaning ~2-5k to ~15-30k records. (OfflineBuffer database is set to keep data for 1 hour)

**Jason** · 23-05-2012, 16:19

Check the time period that the proxy updates its config from zabbix server... I've got mine set at 300s as the default seems way to long for me.

Also is your proxy in active or passive mode and are the agents in active or passive mode

**iaono** · 23-05-2012, 17:20

Both are in active.

Making it check the configs more often worked! The server is seeing them for the first two I tried it on. Trying it on the rest =)

**iaono** · 23-05-2012, 20:28

Restarting one of the other proxies without changing the config fetch time also made it start talking again (don't know why this didn't work yesterday). I did change the remaining ones to 300 and will monitor to test if those clusters have less or no false alerts.

Just curious (if this is the case): what kind of configs would it need to get from the server so often, because it may stop responding if it does not get them?

Thanks for all the help!

**iaono** · 24-05-2012, 14:39

Is there any settings to get more information out of log level 3? Using level 4 just fills up the disk very quickly.

**mschlegel** · 11-06-2012, 22:16

Howdy - I'm a coworker of Iaono.

Here's an excerpt of the proxy log surrounding the time where this particular proxy went 'AWOL' as far as the host data is concerned:

13490:20120606:141859.317 Executing housekeeper
13490:20120606:141859.346 Deleted 3726 records from history [0.023302 seconds]
13482:20120606:142020.931 Received configuration data from server. Datalen 121123
13490:20120606:151903.021 Executing housekeeper
13490:20120606:151903.043 Deleted 2747 records from history [0.016245 seconds]
13482:20120606:152024.861 Received configuration data from server. Datalen 121123
13490:20120606:161906.701 Executing housekeeper
13490:20120606:161906.729 Deleted 3508 records from history [0.021740 seconds]
13482:20120606:162028.820 Received configuration data from server. Datalen 121123
13490:20120606:171910.378 Executing housekeeper
13490:20120606:171910.434 Deleted 7687 records from history [0.050006 seconds]
13482:20120606:172032.767 Received configuration data from server. Datalen 121123
13490:20120606:181914.056 Executing housekeeper
13490:20120606:181914.151 Deleted 13969 records from history [0.089039 seconds]
13482:20120606:182036.668 Received configuration data from server. Datalen 121123
13490:20120606:191917.776 Executing housekeeper
13490:20120606:191917.903 Deleted 18628 records from history [0.120648 seconds]
13482:20120606:192040.556 Received configuration data from server. Datalen 121123
13490:20120606:201921.531 Executing housekeeper
13490:20120606:201921.672 Deleted 20988 records from history [0.135352 seconds]
13482:20120606:202044.686 Received configuration data from server. Datalen 121123

I wouldn't expect the config fetch time to really change much of anything except for the responsiveness to changes in configuration for hosts. As you can see, the configuration for these hosts does not change surrounding these issues.

As far as I've been able to tell, the values are still getting from the proxy to the server, they are just coming through on a very inconsistent pace once this happens - typically delayed somewhere around 15-20 minutes or so.

Any idea what to make of the drastic change in the housekeeper behavior? The one very consistent behavior with this proxy behavior is the pattern of a small drop in records deleted, followed by a massive climb. It seems to settle out between 5-6 times the normal rate of deleted records every hour.

We have thus far not found any trigger that causes this behavior, which does make it difficult to troubleshoot. Is there perhaps a way to turn up some verbosity on only the proxy->server communication logging without logging every agent contact?

All of our proxies are currently in 'Active' mode.

**iaono** · 29-08-2012, 20:45

We turned on debugging on a smaller cluster of servers and I see that it cannot recieve the data from the server by the zbx_recv_response function output.

Here is the debug log that occur between the successes shown below: http://www.x-iss.com/tmp/proxy_log_snipet.txt

Grep of the zbx_recv_response function:

28723:20120826:134636.161 zbx_recv_response() '{
"response":"success"}'
28723:20120826:134636.161 End of zbx_recv_response():SUCCEED
28723:20120826:134636.161 End of put_data_to_server():SUCCEED
--
28723:20120826:134729.285 In zbx_recv_response()
28730:20120826:134730.473 In get_values()
--
28724:20120826:134823.029 zbx_recv_response() ''
28724:20120826:134823.029 End of zbx_recv_response():FAIL
28724:20120826:134823.029 End of put_data_to_server():FAIL
--
28723:20120826:134826.163 End of zbx_recv_response():NETWORK_ERROR
28723:20120826:134826.163 End of put_data_to_server():SUCCEED
--
28724:20120826:134828.549 In zbx_recv_response()
28731:20120826:134828.711 Trapper got [{
--
28723:20120826:134926.283 In zbx_recv_response()
28729:20120826:134926.304 In get_values()
--
28723:20120826:135026.165 End of zbx_recv_response():NETWORK_ERROR
28723:20120826:135026.165 End of put_data_to_server():SUCCEED
--
28723:20120826:135029.281 In zbx_recv_response()
28730:20120826:135030.491 In get_values()
--
28723:20120826:135126.166 End of zbx_recv_response():NETWORK_ERROR
28723:20120826:135126.166 End of put_data_to_server():SUCCEED
--
28723:20120826:135129.279 In zbx_recv_response()
28726:20120826:135129.317 In get_values()
--
28723:20120826:135129.513 zbx_recv_response() '{
"response":"success"}'
28723:20120826:135129.513 End of zbx_recv_response():SUCCEED
28723:20120826:135129.514 End of put_data_to_server():SUCCEED

Ad Widget

Zabbix_proxy stopping with

Zabbix_proxy stopping with

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment