**Palmertree** · 23-06-2009, 14:04

Unreachable is determined by the zabbix_server daemon. Make sure you have the following settings in your zabbix_server.conf file. You might need to increase them if there is latency in the network. Also, make sure you route only out of one interface if you have a multi-homed box.

zabbix_server.conf file:

Code:

# After how many seconds of unreachability treat a host as unavailable
UnreachablePeriod=60

# How often to check a host for availability during the unavailability period
UnavailableDelay=60
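To confirm that the two values are really what the server reads, the file can be parsed with a few lines of Python. This is only a minimal sketch: the function name `read_settings` is made up for illustration, it assumes plain `Key=Value` lines with `#` comments as shown above, and it does not know about any defaults the server applies when a key is missing.

```python
def read_settings(text, wanted=("UnreachablePeriod", "UnavailableDelay")):
    """Return the wanted integer settings found in zabbix_server.conf text.

    Hypothetical helper: handles only 'Key=Value' lines and '#' comments.
    Keys that are absent are simply left out of the result.
    """
    found = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and anything that is not Key=Value
        key, value = line.split("=", 1)
        key = key.strip()
        if key in wanted:
            found[key] = int(value.strip())
    return found


SAMPLE = """\
# After how many seconds of unreachability treat a host as unavailable
UnreachablePeriod=60

# How often to check a host for availability during the unavailability period
UnavailableDelay=60
"""

settings = read_settings(SAMPLE)
```

In practice the text would come from the real file, for example `read_settings(open(path).read())`, where `path` is wherever your zabbix_server.conf lives.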

**nms_user** · 23-06-2009, 14:24

"Unreachable is determined by the zabbix_server daemon"
Yes, but how exactly? The .status item doesn't generate a constant data flow, so how does the server calculate this?

"Make sure you have the following settings in your zabbix_server.conf file"
I had already played around with these variables, but I have now set them to exactly the values you posted.

"Also, make sure you route only out of one interface if you have a multi-homed box"
Neither the Zabbix machine nor most of the servers have two or more interfaces/IPs.

Thanks

**nms_user** · 26-06-2009, 07:50

Hello,

I can update this case a little bit.

Changing the timeout variables didn't solve anything; the problem occurred again.

Last night two servers from the same subsidiary went "unreachable", but I can ping them from the Zabbix server, and connecting to port 10050 also works fine. And yes, they have only one interface...
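For reference, the "connect to port 10050" test can be scripted along these lines. This is a minimal sketch, not anything Zabbix ships: the function name `can_connect` and the 3-second timeout are assumptions. Note that it only shows that something accepts TCP connections on the port; it does not prove that the agent answers item requests.

```python
import socket


def can_connect(host, port=10050, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: a successful connect says nothing about whether
    the process behind the port actually replies to Zabbix requests.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

Run it from the Zabbix server itself, e.g. `can_connect("srv1.domain1")`, so the check takes the same network path as the poller.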

Here is an excerpt from the zabbix_server log file (anonymized). Maybe the WAN connection had a short outage, but then why doesn't the Zabbix server take the hosts out of unreachable as soon as the connection comes back up?

...lots of "Send list of active checks" all the time*...
26152:20090626:021526 Send list of active checks to [192.168.x.1] failed: host [srv1.domain1] not found
26153:20090626:021726 Send list of active checks to [192.168.x.1] failed: host [srv1.domain1] not found
26121:20090626:021933 Item [Host_srv1:net.tcp.port[192.168.x.1,53]] error: Get value from agent failed: Cannot connect to [srv1.domain1:10050] [Interrupted system call]
26121:20090626:021933 Parameter [net.tcp.port[192.168.x.1,53]] will be checked after 120 seconds on host [Host_srv1]
26122:20090626:021936 Item [Host_srv1:net.tcp.port[192.168.x.1,389]] error: Get value from agent failed: Cannot connect to [srv1.domain1:10050] [Interrupted system call]
26122:20090626:021936 Parameter [net.tcp.port[192.168.x.1,389]] will be checked after 120 seconds on host [Host_srv1]
26156:20090626:022047 Send list of active checks to [192.168.x.1] failed: host [srv1.domain1] not found
26121:20090626:022202 Item [Host_srv1:net.tcp.port[192.168.x.1,53]] error: Get value from agent failed: Cannot connect to [srv1.domain1:10050] [Interrupted system call]
26121:20090626:022202 Parameter [net.tcp.port[192.168.x.1,53]] will be checked after 120 seconds on host [Host_srv1]
26122:20090626:022206 Item [Host_srv1:net.tcp.port[192.168.x.1,389]] error: Get value from agent failed: Cannot connect to [srv1.domain1:10050] [Interrupted system call]
26122:20090626:022206 Parameter [net.tcp.port[192.168.x.1,389]] will be checked after 120 seconds on host [Host_srv1]
26152:20090626:022933 Send list of active checks to [192.168.x.1] failed: host [srv1.domain1] not found
26151:20090626:023233 Send list of active checks to [192.168.x.1] failed: host [srv1.domain1] not found
26151:20090626:023433 Send list of active checks to [192.168.x.1] failed: host [srv1.domain1] not found
...lots of "Send list of active checks" all the time*...

*We don't use active checks right now...

Thanks
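For anyone triaging a similar log, here is a minimal Python sketch that tallies the two kinds of failure seen in the excerpt above: rejected active-check requests and failed passive item polls. The regular expressions and the function name `summarize` are assumptions based only on the line shapes quoted in this post; other server versions may log differently.

```python
import re
from collections import Counter

# Patterns derived only from the log lines quoted in this thread.
ACTIVE_RE = re.compile(
    r"Send list of active checks to \[(?P<ip>[^\]]+)\] failed: "
    r"host \[(?P<host>[^\]]+)\] not found"
)
ITEM_RE = re.compile(
    r"Item \[(?P<host>[^:\]]+):(?P<key>.+?)\] error: Get value from agent failed"
)


def summarize(lines):
    """Count active-check rejections per host and passive failures per (host, key)."""
    active = Counter()
    passive = Counter()
    for line in lines:
        match = ACTIVE_RE.search(line)
        if match:
            active[match.group("host")] += 1
            continue
        match = ITEM_RE.search(line)
        if match:
            passive[(match.group("host"), match.group("key"))] += 1
    return active, passive


SAMPLE = [
    "26152:20090626:021526 Send list of active checks to [192.168.x.1] failed: host [srv1.domain1] not found",
    "26121:20090626:021933 Item [Host_srv1:net.tcp.port[192.168.x.1,53]] error: Get value from agent failed: Cannot connect to [srv1.domain1:10050] [Interrupted system call]",
    "26122:20090626:021936 Item [Host_srv1:net.tcp.port[192.168.x.1,389]] error: Get value from agent failed: Cannot connect to [srv1.domain1:10050] [Interrupted system call]",
]

active, passive = summarize(SAMPLE)
```

Separating the two counters makes it easier to see that the active-check messages name the host as srv1.domain1, while the failed item polls name it as Host_srv1.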

**richlv** · 26-06-2009, 10:37

1. if you do not use active checks, disable them in the agentd config (and restart agentd afterwards);

2. could it be that some checks respond, but some don't? if a host has gone into the unreachable state, zabbix polls one item until data is retrieved. if that item does not respond, zabbix does not attempt to gather data for the other items
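The behaviour described in point 2 can be pictured with a small Python sketch. This is only an illustration of the description above, not the actual server code: the function name, the choice of the first item as the probe, and polling the remaining items in the same pass after recovery are all assumptions.

```python
def poll_unreachable_host(items, fetch):
    """One polling pass for a host that is already flagged unreachable.

    items: list of item keys configured on the host.
    fetch: callable returning a value, or None when the agent gives no answer.
    Returns (values, still_unreachable).

    Assumption: the single probe item is simply the first one in the list.
    """
    probe, rest = items[0], items[1:]
    value = fetch(probe)
    if value is None:
        # The probe did not answer: no other item is even attempted.
        return {}, True
    values = {probe: value}
    for key in rest:
        result = fetch(key)
        if result is not None:
            values[key] = result
    return values, False
```

With this model, a host whose probe item never answers stays unreachable even if every other item would respond, which is the situation point 2 asks about.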

**nms_user** · 26-06-2009, 11:55

Now I'm a little confused.

The dashboard (System status panel) shows the servers as unreachable, by now for 9 hours x mins.

If I look at the latest data for the two machines, every item's check time is current.

Also, the last value of the host's .status item shows the server as up:
2009.Jun.26 02:28:46 Up (0)
2009.Jun.26 02:23:25 Unreachable (2)

So the problem lies with the trigger, which is the installation default:
{srv1.domain1:status.last(0)}=2

Why does the trigger stay in its fired state?
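To spell out the expectation: `last(0)` refers to the most recent value of the item, so once the newest .status value is 0 the expression should no longer be true. A minimal Python sketch of that reasoning, using the two values quoted above (the function name `trigger_fires` is made up, and this models only the expression, not how the server schedules trigger evaluation):

```python
from datetime import datetime

UNREACHABLE = 2  # value of the .status item shown as "Unreachable (2)"


def trigger_fires(history):
    """Evaluate '{host:status.last(0)}=2' against (timestamp, value) samples."""
    if not history:
        return False
    _, latest_value = max(history, key=lambda sample: sample[0])
    return latest_value == UNREACHABLE


history = [
    (datetime(2009, 6, 26, 2, 23, 25), 2),  # Unreachable (2)
    (datetime(2009, 6, 26, 2, 28, 46), 0),  # Up (0)
]

fires_now = trigger_fires(history)
```

By this logic the trigger should have cleared at 02:28:46, which is why the dashboard still showing it hours later looks wrong.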

**nms_user** · 29-06-2009, 12:06

Hello,

No ideas at all?

Here is the same trigger mess again on another machine:
2009.Jun.29 11:39:40 Up (0) --> after I started the agent again, the trigger went away
2009.Jun.29 11:36:39 Unreachable (2) --> here I stopped the agent for a few minutes
2009.Jun.27 22:55:01 Up (0) --> trigger didn't release; the whole escalation, including SMSes, ran through over the following hours
2009.Jun.27 22:52:45 Unreachable (2) --> trigger went to unreachable

I think there are two separate problems:
1) the server-logic sets the host as 'Up', but the trigger doesn't go away
2) the server-logic marks the host as 'Down', but all the items are getting polled

This trigger problem only occurs on the .status check. Apart from this, our Zabbix environment is running fine (no broken installation); only this UNREACHABLE state is giving us headaches, because it produces many false positives that escalate to SMS.

**richlv** · 29-06-2009, 12:20

could you post screenshots of :

1. latest values for .status item, showing it going to 0;
2. event history for that trigger during the same period (with "Show unknown" option marked).

**nms_user** · 29-06-2009, 13:47

Hi,

see the requested screenshots attached.

**richlv** · 29-06-2009, 14:08

hmm. it indeed looks like two of the data changes (27/22:55:01 and 29/11:36:39) haven't generated events as i'd expect. which particular version is this ?
if you still see such a problem with 1.6.5, maybe it's worth reporting on the tracker. looks like a possible bug and i'm out of ideas

ps. one more idea, though - is that trigger depending on anything ?

**nms_user** · 29-06-2009, 14:41

"which particular version is this ?"
We are on 1.6.5 ...

"maybe it's worth reporting on the tracker"
Opened one now.

"one more idea, though - is that trigger depending on anything?"
Yes, most machines depend on their next gateway/infrastructure component. But the dependencies are set up correctly; our terminal servers, for example, depend on the central core switches and have this problem frequently...

Thanks for your assistance!

**flo** · 03-07-2009, 15:08

Hi,

We have the same issue, also running Zabbix 1.6.5; the OS is CentOS 5.3.

**abix_adamj** · 13-07-2009, 15:54

I want to report the same situation with agentd running on Debian Lenny 5.0.1 in an OpenVZ container, against a Zabbix 1.6.5 server. The same agentd running on a normal openSUSE 11.0 box (the same machine as the Zabbix server) works perfectly.

Maybe OpenVZ has something to do with it?

Adam

**bek99** · 21-07-2009, 22:39

Could this be an agent.ping issue? Specifically, if I run a copy of netcat listening on the agentd port, in a looping shell script, I never get any trigger alert from the Zabbix server that the agent is down, and none of the associated default triggers for the various checks (disk space, etc.) fire when no response is received.

The version used here is 1.6.5 as well.

Here's the script I used w/ netcat.

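#!/bin/bash
# Keep something listening on the agent port; nc exits when a
# connection closes, so restart it in an endless loop.
while true
do
    nc -l -p 10050
done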

-b
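For hosts without netcat, roughly the same test can be done in Python. This is a minimal sketch of the idea behind the loop script above (accept connections on the agent port, read whatever arrives, never answer); the function names are made up, and it is not a byte-for-byte equivalent of `nc`.

```python
import socket


def open_listener(port=10050, host="0.0.0.0"):
    """Bind a TCP listening socket on the agent port (10050, as in the script above)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(5)
    return srv


def serve_silently(srv):
    """Accept one connection at a time, discard its data, never reply."""
    while True:
        try:
            conn, _addr = srv.accept()
        except OSError:
            break  # the listening socket was closed
        with conn:
            try:
                while conn.recv(4096):
                    pass  # read and drop the request; send nothing back
            except OSError:
                pass  # peer reset the connection; wait for the next one
```

Stop the real agent first, then run `serve_silently(open_listener())`. From the server's point of view the port is open but no item ever gets a value, which is the condition the unreachable trigger is supposed to catch.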

**bek99** · 21-07-2009, 22:55

Just to follow up on this: it does not occur on 1.6.4. There the behavior is correct and the unreachable trigger does alert.

Two Zabbix servers, Hardy 8.04 64-bit: one 1.6.5 and one 1.6.4.

Zabbix agents: 1.6.4 and one 1:1.4.2-4ubuntu3

Zabbix server 1.6.5:

If the agent was 1.6.4 and was replaced with the loop script above, it does trigger an unreachable with agent.ping and/or the proc count of zabbix_agentd.
If the agent was 1.4.2 and was replaced with the loop script above, it does NOT trigger an unreachable with agent.ping and/or the proc count of zabbix_agentd.

Zabbix server 1.6.4:

Exhibited correct behavior. If the agent is replaced with the loop script, the unreachable trigger does fire with either agent version.

Another thread for nasty UNREACHABLE-problem
