Hi all,
I'm using Zabbix to monitor servers in multiple countries across several time zones. Our existing setup uses only passive checks, with the Zabbix server sitting inside our network. So far there have been no issues, and alerts come out almost instantly. Recently we decided to move Zabbix outside our network. I chose a popular and reliable cloud provider and deployed a new Zabbix server 4.1. I then exported a few hosts from the old Zabbix server and imported them into the new one, created a separate template for active monitoring (I confirmed there is not a single passive item in those templates), and configured the hosts with active agents. Below is my active agent config on the CentOS hosts.
PidFile=/var/run/zabbix/zabbix_agentd.pid
LogFile=/var/log/zabbix/zabbix_agentd.log
LogFileSize=0
DebugLevel=3
Server=x.x.x.x
StartAgents=0
ServerActive=x.x.x.x
#Hostname=
RefreshActiveChecks=60
Include=/etc/zabbix/zabbix_agentd.d/*.conf
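As a rough sanity check on the config above, here is a minimal Python sketch that parses the key=value lines and lists the settings that matter for active checks. The function names `parse_agent_conf` and `active_check_notes` are hypothetical helpers written for this post, not part of Zabbix. The notes reflect documented agent behaviour: with `Hostname` commented out the agent takes its name from `HostnameItem` (by default `system.hostname`), and that name has to match the host name configured on the server for active checks to be accepted.

```python
def parse_agent_conf(text):
    """Parse key=value lines from agent config text, skipping comments and blanks.

    Simplification: a repeated key (e.g. several Include lines) keeps only the last value.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def active_check_notes(settings):
    """Return human-readable notes about settings relevant to active checks."""
    notes = []
    if not settings.get("ServerActive"):
        notes.append("ServerActive is not set, so active checks are disabled")
    if "Hostname" not in settings:
        notes.append(
            "Hostname is not set; the agent uses HostnameItem (system.hostname "
            "by default), which must match the host name configured on the server"
        )
    if settings.get("StartAgents") == "0":
        notes.append("StartAgents=0 disables passive checks on this agent")
    return notes


# The config from this post, with the address placeholder left as-is.
SAMPLE_CONF = """\
PidFile=/var/run/zabbix/zabbix_agentd.pid
LogFile=/var/log/zabbix/zabbix_agentd.log
LogFileSize=0
DebugLevel=3
Server=x.x.x.x
StartAgents=0
ServerActive=x.x.x.x
#Hostname=
RefreshActiveChecks=60
Include=/etc/zabbix/zabbix_agentd.d/*.conf
"""

if __name__ == "__main__":
    for note in active_check_notes(parse_agent_conf(SAMPLE_CONF)):
        print("-", note)
```

This only reads the main file; anything pulled in through the Include line would need to be parsed as well.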
The first two remote servers had no issues. Then I exported a set of servers (about 20), all running CentOS but in different regions, and confirmed that data was arriving on the Latest data page. But after 5 minutes, alerts started firing saying the host had been unreachable for 5 minutes. When I check the Zabbix agent active agent ping data, the last check time is always more than 5 minutes behind the server time (some even 12 minutes), and that is why the alerts fire. I then allowed the Zabbix passive port on our firewalls and configured one of those hosts with a passive check; the agent ping last check is then instant. There is no heavy latency on the network. Even though I'm getting host unreachable alerts, graphs and other data are working flawlessly.
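To make the symptom concrete, here is a small Python sketch of the comparison I believe the unreachable alert boils down to, assuming the trigger fires when no agent ping value has arrived within a 5 minute window. The function name `is_stale` and the sample timestamps are made up for illustration. Timezone-aware datetimes are used so that hosts in different time zones compare correctly against the server clock.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_check, now, max_age=timedelta(minutes=5)):
    """Return True if the last check is older than max_age relative to now.

    Both arguments must be timezone-aware, so the subtraction is done on
    absolute time and the host's local UTC offset does not matter.
    """
    return now - last_check > max_age


if __name__ == "__main__":
    # Hypothetical server time in UTC.
    server_now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)

    # Hypothetical host in a UTC+05:30 zone whose last ping is 12 minutes old.
    host_tz = timezone(timedelta(hours=5, minutes=30))
    old_ping = datetime(2020, 1, 1, 17, 18, tzinfo=host_tz)

    # Hypothetical ping that is only 2 minutes old.
    fresh_ping = datetime(2020, 1, 1, 11, 58, tzinfo=timezone.utc)

    print(is_stale(old_ping, server_now))    # 12 minutes old
    print(is_stale(fresh_ping, server_now))  # 2 minutes old
```

If the real trigger works this way, any delivery delay above the window raises the alert even though the host is up, which matches what I see.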
When I check the Zabbix server queue, there are a lot of delayed items from those hosts, some by more than 10 minutes. I have no idea what the queue window is showing. Does it mean a database bottleneck? That seems highly unlikely, since I'm using a high-performance cloud SQL instance. One main reason we switched to active checks is that they make it easy to monitor hosts behind NAT routers, of which we have many. But with this issue the new deployment is becoming useless, since it produces so many false alerts. At the moment I have more than 30 hosts with an unreachable alert, but those hosts are up and graphing works perfectly.
Is there any way to fix this? I tried setting the active agent buffer to its minimum size and some other options, but none of those worked for me.