A lot of trouble with Zabbix 1.1beta7


valain
02-03-2006, 14:55
Hello,

I have installed zabbix_server and zabbix_agentd on a local server where everything works very nicely.

I have then installed zabbix_agentd on a second, remote server that I want to monitor. This remote server is one of our main production machines and works well (i.e. no network trouble whatsoever).

In my zabbix_server.log file I see a lot of errors with that host. Here's a short extract (real server name and IP address hidden):


016135:20060302:142804 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [system.cpu.load[,avg1]]
016135:20060302:142804 Assuming that agent dropped connection because of access permissions
016135:20060302:142804 Started network errors for [SERVERNAME]
016135:20060302:142804 Host [SERVERNAME]: another network error, wait for 5 seconds
016135:20060302:143112 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [net.tcp.service[pop]]
016135:20060302:143112 Assuming that agent dropped connection because of access permissions
016135:20060302:143112 Host [SERVERNAME] will be checked after [60] seconds
016135:20060302:143203 Enabling host [SERVERNAME]
016135:20060302:143238 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [vfs.fs.size[/tmp,total]]
016135:20060302:143238 Assuming that agent dropped connection because of access permissions
016135:20060302:143238 Started network errors for [SERVERNAME]
016135:20060302:143238 Host [SERVERNAME]: another network error, wait for 5 seconds
016135:20060302:143309 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [net.tcp.service[smtp]]
016135:20060302:143309 Assuming that agent dropped connection because of access permissions
016135:20060302:143309 Host [SERVERNAME] will be checked after [60] seconds
016135:20060302:143335 Enabling host [SERVERNAME]
016135:20060302:143339 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [net.tcp.service[imap]]
016135:20060302:143339 Assuming that agent dropped connection because of access permissions
016135:20060302:143339 Started network errors for [SERVERNAME]
016135:20060302:143339 Host [SERVERNAME]: another network error, wait for 5 seconds
016135:20060302:143444 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [system.cpu.load[,avg15]]
016135:20060302:143444 Assuming that agent dropped connection because of access permissions
016135:20060302:143444 Host [SERVERNAME] will be checked after [60] seconds
016135:20060302:143540 Enabling host [SERVERNAME]
016135:20060302:144002 Timeout while receiving data from [SERVERNAME]
016135:20060302:144002 Getting value of [system.users.num] from host [SERVERNAME] failed
016135:20060302:144002 The value is not stored in database.
016135:20060302:144012 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [net.tcp.port[A.B.C.D,10050]]
016135:20060302:144012 Assuming that agent dropped connection because of access permissions
016135:20060302:144012 Started network errors for [SERVERNAME]
016135:20060302:144012 Host [SERVERNAME]: another network error, wait for 5 seconds
016135:20060302:144052 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [vfs.fs.size[/home,used]]
016135:20060302:144052 Assuming that agent dropped connection because of access permissions
016135:20060302:144052 Host [SERVERNAME] will be checked after [60] seconds
016135:20060302:144143 Enabling host [SERVERNAME]
016135:20060302:144233 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [vm.memory.size[free]]
016135:20060302:144233 Assuming that agent dropped connection because of access permissions
016135:20060302:144233 Started network errors for [SERVERNAME]
016135:20060302:144233 Host [SERVERNAME]: another network error, wait for 5 seconds

Where can I start looking for the source of these errors? They look as if the remote zabbix_agent sometimes doesn't answer correctly. Strangely enough, the parameters that fail are never the same: sometimes it's vm.memory.XXX, then tcp.XXX, then user-defined parameters... and a couple of seconds later, they all work again.
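To check whether the failures really are spread evenly over all items, the "Got empty string" lines can be tallied per host and item key. This is only a rough sketch: the regular expression is modelled on the log extract above and may need adjusting for other versions, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Modelled on lines from the extract above, e.g.
# 016135:20060302:142804 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [system.cpu.load[,avg1]]
EMPTY_RE = re.compile(
    r"^\d+:\d{8}:\d{6} Got empty string from \[(.+?)\] IP \[.+?\] Parameter \[(.+)\]$"
)

def summarize_empty_replies(lines):
    """Count 'Got empty string' errors per (host, item key)."""
    counts = Counter()
    for line in lines:
        match = EMPTY_RE.match(line.strip())
        if match:
            host, key = match.groups()
            counts[(host, key)] += 1
    return counts

# Typical use (the log path depends on your own zabbix_server.conf):
#   with open("zabbix_server.log") as fh:
#       for (host, key), n in summarize_empty_replies(fh).most_common():
#           print(n, host, key)
```

If one or two keys dominate the tally, the problem is probably tied to those items; a flat distribution points more towards the connection or the agent itself.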

I also notice that the "status" of this host changes from 0 to 2 and back to 0 every 15 seconds or so. Sometimes it stays "alive" for a couple of minutes before the status value starts to change every couple of seconds again.
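One way to take zabbix_server out of the picture is to query the agent by hand from the server machine and see how often it answers. The sketch below assumes that the agent accepts a plain item key followed by a newline on its listening port and replies with the plain value before closing the connection; that is an assumption about this agent version, not something confirmed in this thread. The function name and return values are invented for illustration.

```python
import socket

def probe_agent(host, key, port=10050, timeout=3.0):
    """Send one item key to an agent and classify the reply.

    Returns ("ok", value), ("empty", "") or ("error", description).
    Assumes a plain-text "key\\n" request and a plain-text reply.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(key.encode("ascii") + b"\n")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # peer closed the connection
                    break
                chunks.append(data)
    except OSError as exc:  # covers timeouts and refused connections
        return ("error", str(exc))
    reply = b"".join(chunks).decode("ascii", errors="replace").strip()
    return ("ok", reply) if reply else ("empty", "")
```

Calling this in a loop for a key such as system.cpu.load[,avg1] and counting the "empty" and "error" results should show whether the agent itself drops requests intermittently, independent of the server's own scheduling.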

Ideas?

valain
02-03-2006, 15:29
Hi,

Something more I have noticed on the host on which the agent seems to have trouble:

* when I "killall zabbix_agentd" and restart it, I have to try anywhere between 1 and 10 times before it starts correctly, and in between I have to manually delete the .pid file;

* I have set the debug level to 4 for that agentd, but the /tmp/zabbix_agentd.log file stays completely empty?

Help? :-)

bytesize
03-03-2006, 17:13
Hi,

I'm seeing exactly the same errors with beta7:

016135:20060302:143339 Host [SERVERNAME]: another network error, wait for 5 seconds

When this error occurs, the trigger is often displayed as grey in the overview page. Even restarting the Zabbix server doesn't clear the status.

John

bytesize
04-03-2006, 10:49
Hi Valain,

I have managed to resolve both these problems. To stop the network errors, I changed the following parameters in the zabbix_server.conf file:

Server=10
StartSuckers=24
StartTrappers=16

I then dropped all the data from the database, and re-imported it. To make this task easier, I keep all my host information in a CSV file which can be imported using the bulkloader tool. This step may be unnecessary after you have changed the parameters above and restarted Zabbix, but I like to make sure all my data is consistent after upgrading.

The problem with killing processes is that Zabbix does not shut down cleanly, so the pid file is left behind. You should download the init scripts from the Cookbook forum, and use those to start/stop/restart Zabbix cleanly.
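If you do end up with a leftover pid file, a small helper can check whether the process it names is still alive before deleting it, rather than removing the file blindly. This is just a sketch for POSIX systems; the function names are made up, and the pid file path is whatever your own agent configuration uses.

```python
import os

def pid_is_running(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True

def remove_stale_pidfile(path):
    """Delete the pid file if the process it names is gone.

    Returns True if a stale file was removed, False otherwise.
    """
    try:
        with open(path) as fh:
            pid = int(fh.read().strip())
    except FileNotFoundError:
        return False
    except ValueError:
        pid = None  # empty or garbled file: treat it as stale
    if pid is not None and pid > 0 and pid_is_running(pid):
        return False  # daemon is still alive, leave the file alone
    os.remove(path)
    return True
```

Running remove_stale_pidfile on the agent's pid file before starting the daemon avoids the manual delete step, and it refuses to touch the file while the daemon is still running.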

I hope this helps!

John