Hello,
I have installed zabbix_server and zabbix_agentd on a local server where everything works very nicely.
I then installed zabbix_agentd on a second, remote server that I want to monitor. This remote server is one of our main production machines and works well (i.e. no network trouble whatsoever).
In my zabbix_server.log file I see a lot of errors for that host. Here's a short extract (real server name and IP address hidden):
016135:20060302:142804 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [system.cpu.load[,avg1]]
016135:20060302:142804 Assuming that agent dropped connection because of access permissions
016135:20060302:142804 Started network errors for [SERVERNAME]
016135:20060302:142804 Host [SERVERNAME]: another network error, wait for 5 seconds
016135:20060302:143112 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [net.tcp.service[pop]]
016135:20060302:143112 Assuming that agent dropped connection because of access permissions
016135:20060302:143112 Host [SERVERNAME] will be checked after [60] seconds
016135:20060302:143203 Enabling host [SERVERNAME]
016135:20060302:143238 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [vfs.fs.size[/tmp,total]]
016135:20060302:143238 Assuming that agent dropped connection because of access permissions
016135:20060302:143238 Started network errors for [SERVERNAME]
016135:20060302:143238 Host [SERVERNAME]: another network error, wait for 5 seconds
016135:20060302:143309 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [net.tcp.service[smtp]]
016135:20060302:143309 Assuming that agent dropped connection because of access permissions
016135:20060302:143309 Host [SERVERNAME] will be checked after [60] seconds
016135:20060302:143335 Enabling host [SERVERNAME]
016135:20060302:143339 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [net.tcp.service[imap]]
016135:20060302:143339 Assuming that agent dropped connection because of access permissions
016135:20060302:143339 Started network errors for [SERVERNAME]
016135:20060302:143339 Host [SERVERNAME]: another network error, wait for 5 seconds
016135:20060302:143444 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [system.cpu.load[,avg15]]
016135:20060302:143444 Assuming that agent dropped connection because of access permissions
016135:20060302:143444 Host [SERVERNAME] will be checked after [60] seconds
016135:20060302:143540 Enabling host [SERVERNAME]
016135:20060302:144002 Timeout while receiving data from [SERVERNAME]
016135:20060302:144002 Getting value of [system.users.num] from host [SERVERNAME] failed
016135:20060302:144002 The value is not stored in database.
016135:20060302:144012 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [net.tcp.port[A.B.C.D,10050]]
016135:20060302:144012 Assuming that agent dropped connection because of access permissions
016135:20060302:144012 Started network errors for [SERVERNAME]
016135:20060302:144012 Host [SERVERNAME]: another network error, wait for 5 seconds
016135:20060302:144052 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [vfs.fs.size[/home,used]]
016135:20060302:144052 Assuming that agent dropped connection because of access permissions
016135:20060302:144052 Host [SERVERNAME] will be checked after [60] seconds
016135:20060302:144143 Enabling host [SERVERNAME]
016135:20060302:144233 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [vm.memory.size[free]]
016135:20060302:144233 Assuming that agent dropped connection because of access permissions
016135:20060302:144233 Started network errors for [SERVERNAME]
016135:20060302:144233 Host [SERVERNAME]: another network error, wait for 5 seconds
Where can I start looking for the source of these errors? It looks as if the remote zabbix_agentd sometimes doesn't answer correctly. Strangely enough, the parameters that fail are never the same. Sometimes it's vm.memory.XXX, then it's tcp.XXX, then it's user-defined parameters... and a couple of seconds later, they all work again.
I also notice that the "status" of this host changes from 0 to 2 and back to 0 every 15 seconds or so. Sometimes it stays "alive" for a couple of minutes before the status value starts to change every couple of seconds again.
Ideas?
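In case it helps to confirm that the failures really are spread across all items, something like the following could tally the "Got empty string" lines per parameter. This is only a rough sketch that assumes the log format shown in the extract above; the function name `tally_empty_replies` and the regular expression are my own and not part of Zabbix.

```python
import re
from collections import Counter

# Matches lines such as:
#   016135:20060302:142804 Got empty string from [HOST] IP [A.B.C.D] Parameter [key[args]]
# The greedy ".*" followed by the final "]" keeps nested brackets inside the key.
EMPTY_REPLY = re.compile(
    r"Got empty string from \[(?P<host>[^\]]+)\] "
    r"IP \[(?P<ip>[^\]]+)\] "
    r"Parameter \[(?P<key>.*)\]\s*$"
)


def tally_empty_replies(lines):
    """Count 'Got empty string' log entries per item key (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = EMPTY_REPLY.search(line)
        if match:
            counts[match.group("key")] += 1
    return counts
```

It can be fed the server log directly, e.g. `with open("zabbix_server.log") as f: print(tally_empty_replies(f).most_common())`. If the counts come out roughly even across keys, that would support the impression that the problem is with the connection to the agent rather than with any single item.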