**navaho** · 22-07-2016, 23:36

We still have to go back and issue an /etc/init.d/zabbix-agent restart for any VPS server that we reboot, or these two checks will not start returning values. Given the recent Xen vulnerabilities, that's a lot of VPS restarts and then a lot of going back to restart zabbix-agent. Any ideas? Anyone?

**troffasky** · 27-07-2016, 14:10

Troubleshooting steps:
- Turn up logging on agent.
- Use zabbix_get to fetch item.
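
For anyone scripting the second step across many hosts, here is a rough sketch in Python. It assumes `zabbix_get` is on the PATH of the machine running it and that the agent listens on the default passive port 10050; the helper names are made up for illustration. (Turning up logging is normally a matter of raising `DebugLevel` to 4 in zabbix_agentd.conf and restarting the agent.)

```python
import subprocess


def zabbix_get_command(host, key, port=10050):
    """Build the zabbix_get argument list for one passive item request."""
    return ["zabbix_get", "-s", host, "-p", str(port), "-k", key]


def fetch_item(host, key, port=10050, timeout=10):
    """Run zabbix_get and return its stripped stdout.

    Raises CalledProcessError if zabbix_get exits non-zero and
    FileNotFoundError if the binary is not installed.
    """
    result = subprocess.run(
        zabbix_get_command(host, key, port),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout.strip()
```

Calling `fetch_item("192.0.2.10", "system.localtime")` (placeholder address) should print an epoch timestamp if the passive side of the agent is answering.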

**navaho** · 27-07-2016, 16:15

Thank you, I had not thought of zabbix_get.

**navaho** · 28-07-2016, 17:22

Hi,

Thank you for the help. I'm still not sure what is going on here, but I have more information to work with.

The check that is failing is actually system.localtime. It is run by agentd every 60 seconds and has the trigger {C_Template_Linux:system.localtime.fuzzytime(30)}=0
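
For readers unfamiliar with fuzzytime, a small sketch of what that trigger expression amounts to, going by the documented behaviour (1 when the item's timestamp is within N seconds of server time, 0 otherwise). The function names and the `server_now` argument are illustrative only, not Zabbix internals.

```python
def fuzzytime(item_value, server_now, max_diff):
    """Return 1 if the item's timestamp is within max_diff seconds of server time, else 0."""
    return 1 if abs(server_now - item_value) <= max_diff else 0


def trigger_fires(item_value, server_now, max_diff=30):
    """Mirror the expression fuzzytime(30)=0: fire when the two clocks disagree."""
    return fuzzytime(item_value, server_now, max_diff) == 0
```

With the value from the log, `trigger_fires(1469742285, 1469742285 + 31)` is true while a 30-second difference is still tolerated.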

We have zabbix-server 2.4.3, and the clients run the Debian 8 package 1:2.2.7+dfsg-2, which is 2.2.7 (obviously).

When a vps server is rebooted, this fails 100% of the time until we log in and manually restart the client. We've actually taken to adding a short sleep and restart to rc.local.

With a reminder about zabbix_get from troffasky we turned up the logs and tried the check using zabbix_get when it's working and when it's not.

Upon a VPS restart we can see the client request the list of checks:

Code:

   759:20160728:144445.509 active checks #11 [getting list of active checks]
   759:20160728:144445.509 In refresh_active_checks() host:'199.255.144.121' port:10051
   757:20160728:144445.511 got [{"response":"success","data":[{"key":"check_dpkg[]","delay":60,"lastlogsize":0,"mtime":0},{"key":"check_mailq","delay":900,"l
astlogsize":0,"mtime":0},{"key":"check_puppet_rc","delay":900,"lastlogsize":0,"mtime":0},{"key":"net.conn[ESTABLISHED]","delay":60,"lastlogsize":0,"mtime":0}
,{"key":"net.conn[TIME_WAIT]","delay":60,"lastlogsize":0,"mtime":0},{"key":"net.if.in[eth0]","delay":60,"lastlogsize":0,"mtime":0},{"key":"net.if.in[lo]","de
lay":60,"lastlogsize":0,"mtime":0},{"key":"net.if.out[eth0]","delay":60,"lastlogsize":0,"mtime":0},{"key":"net.tcp.service[ssh,,522]","delay":60,"lastlogsize
":0,"mtime":0},{"key":"system.cpu.load","delay":60,"lastlogsize":0,"mtime":0},{"key":"system.cpu.util[,idle,]","delay":30,"lastlogsize":0,"mtime":0},{"key":"
system.cpu.util[,iowait,]","delay":30,"lastlogsize":0,"mtime":0},{"key":"system.cpu.util[,system,]","delay":30,"lastlogsize":0,"mtime":0},{"key":"system.cpu.
util[,user,]","delay":30,"lastlogsize":0,"mtime":0},{"key":"system.localtime","delay":60,"lastlogsize":0,"mtime":0},{"key":"system.swap.in[,pages]","delay":3
0,"lastlogsize":0,"mtime":0},{"key":"system.swap.out[,pages]","delay":30,"lastlogsize":0,"mtime":0},{"key":"system.swap.size[,pused]","delay":60,"lastlogsize
":0,"mtime":0},{"key":"system.swap.size[,total]","delay":60,"lastlogsize":0,"mtime":0},{"key":"system.who","delay":60,"lastlogsize":0,"mtime":0},{"key":"vfs.
fs.size[/,pused]","delay":300,"lastlogsize":0,"mtime":0},{"key":"vm.memory.size[available]","delay":60,"lastlogsize":0,"mtime":0},{"key":"vm.memory.size[free
]","delay":60,"lastlogsize":0,"mtime":0},{"key":"vm.memory.size[total]","delay":3600,"lastlogsize":0,"mtime":0}]}]

Code:

   757:20160728:144445.513 In add_check() key:'system.localtime' refresh:60 lastlogsize:0 mtime:0
   757:20160728:144445.513 End of add_check()
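
A quick way to double-check which keys and delays the server handed back is to parse that response rather than read the wrapped log line by eye. This is only a sketch: it assumes the text between `got [` and the final `]` has been copied out as one unbroken JSON string, and the function name is made up.

```python
import json


def active_check_delays(response_text):
    """Map each item key in an active-checks response to its delay in seconds."""
    payload = json.loads(response_text)
    if payload.get("response") != "success":
        raise ValueError("server did not return a successful check list")
    return {item["key"]: item["delay"] for item in payload["data"]}


# Trimmed sample in the same shape as the logged response.
SAMPLE = (
    '{"response":"success","data":['
    '{"key":"system.cpu.load","delay":60,"lastlogsize":0,"mtime":0},'
    '{"key":"system.localtime","delay":60,"lastlogsize":0,"mtime":0}]}'
)
```

`active_check_delays(SAMPLE)` gives a dict in which `system.localtime` maps to 60, matching the `refresh:60` in the add_check() line.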

And it checks, but doesn't send the value?

Code:

757:20160728:144445.798 End of process_value():SUCCEED
   747:20160728:144445.798 End of process_value():SUCCEED
   757:20160728:144445.798 for key [system.localtime] received value [1469742285]
   747:20160728:144445.798 for key [system.localtime] received value [1469742285]
   757:20160728:144445.798 In process_value() key:'b-db074.dh01.groupee-inc.net:system.localtime' value:'1469742285'
   747:20160728:144445.798 In process_value() key:'b-db074.dh01.groupee-inc.net:system.localtime' value:'1469742285'
   757:20160728:144445.798 In send_buffer() host:'199.255.146.250' port:10051 values:14/100
   747:20160728:144445.798 In send_buffer() host:'mp003.mz01.groupee-inc.net' port:10051 values:14/100
   757:20160728:144445.798 Will not send now. Now 1469742285 lastsent 1469742285 < 5
   747:20160728:144445.798 Will not send now. Now 1469742285 lastsent 1469742285 < 5
   757:20160728:144445.798 End of send_buffer():SUCCEED
   747:20160728:144445.798 End of send_buffer():SUCCEED
   757:20160728:144445.798 buffer: new element 14
   747:20160728:144445.798 buffer: new element 14
   757:20160728:144445.798 End of process_value():SUCCEED
   747:20160728:144445.798 End of process_value():SUCCEED
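
The "Will not send now" lines look like ordinary buffering rather than a failure: the `< 5` appears to be the agent's `BufferSend` setting (default 5 seconds) and `14/100` the fill level against `BufferSize` (default 100). A simplified sketch of that decision, as I read the log and not the agent's actual code:

```python
def should_send(now, lastsent, buffered, buffer_send=5, buffer_size=100):
    """Decide whether buffered values should be flushed to the server now."""
    if buffered == 0:
        return False  # nothing to send
    if buffered >= buffer_size:
        return True  # buffer is full, flush regardless of age
    # Otherwise wait until buffer_send seconds have passed since the last send.
    return now - lastsent >= buffer_send
```

With the numbers above, `should_send(1469742285, 1469742285, 14)` is false, and it becomes true five seconds later, which lines up with the send at 14:44:50.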

Eventually it gets sent?

Code:

 747:20160728:144450.832 JSON before sending [{
        "request":"agent data",
        "data":[
                {
                        "host":"b-db074.dh01.groupee-inc.net",
                        "key":"check_dpkg[]",
                        "value":"AOK",
                        "clock":1469742285,
                        "ns":536084400},
                {
 -snip -
                {
                        "host":"b-db074.dh01.groupee-inc.net",
                        "key":"system.localtime",
                        "value":"1469742285",
                        "clock":1469742285,
                        "ns":798693319},
                {

After that it never appears again in the zabbix-agent log. I'm not actually sure that ANY of the agentd checks are running; I never see the agent send data back again.

That is, until a restart. Once we restart the agent, the check clears and the agent sends data consistently from then on.

Code:

6250:20160728:081712.143 collector [idle 1 sec]
  6255:20160728:081712.210 In send_buffer() host:'mp003.mz01.groupee-inc.net' port:10051 values:19/100
  6255:20160728:081712.210 JSON before sending [{
	"request":"agent data",
	"data":[
		{
			"host":"b-db074.dh01.groupee-inc.net",
			"key":"net.conn[ESTABLISHED]",
			"value":"61",
			"clock":1469719027,
			"ns":168178879},
		{

-- snip --

		{
			"host":"b-db074.dh01.groupee-inc.net",
			"key":"system.localtime",
			"value":"1469719027",
			"clock":1469719027,
			"ns":196562142},
		{

The entire time that the agent is NOT sending data (or doing anything?) it does show up in the process list:

Code:

 8109 ?        S      0:00 /usr/sbin/zabbix_agentd
 8110 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: collector [idle 1 sec]
 8111 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: listener #1 [waiting for connection]
 8112 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: listener #2 [waiting for connection]
 8114 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: listener #3 [waiting for connection]
 8115 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #1 [idle 1 sec]
 8117 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #2 [idle 1 sec]
 8118 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #3 [getting list of active checks]
 8119 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #4 [getting list of active checks]
 8120 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #5 [getting list of active checks]
 8122 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #6 [getting list of active checks]
 8123 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #7 [getting list of active checks]
 8125 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #8 [getting list of active checks]
 8126 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #9 [getting list of active checks]
 8127 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #10 [idle 1 sec]
 8129 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #11 [getting list of active checks]

So, I guess the question here is why does zabbix-agent start on a VPS restart, run momentarily, then fail to send data from there until we restart it?

**navaho** · 11-08-2016, 15:48

Does anyone have any ideas why it is that on a server reboot zabbix_agentd fails to return data to the server until we manually restart it using the init script?

**jonxor** · 16-08-2016, 02:35

Taking some shots in the dark:
Does this item turn to "Not supported" in the web interface during the time when it does not poll data?

That would indicate that it is at least sending some kind of response.

Another thing to check is whether there is any other init script that specifies a different config file for your local agent.

I would try to set up a zabbix_get of this item on the local

**navaho** · 16-08-2016, 14:44

Hi,

Thank you. They do not turn to unsupported. When we manually run zabbix_get on the server against the client, we get correct values. There are no alternate config files. The agentd starts when the system boots, sits in the process list, and does nothing. Any checks on the server that are set up as "Zabbix agent" return values. Any that are set to "Zabbix agent (active)" get no values unless/until we log in to the host and issue systemctl restart zabbix_agent.

As a really poor workaround we've put a script in /usr/local/bin that does a sleep 15 and then a systemctl restart zabbix_agent. In /etc/rc.local we have a call to that script with an &.

It works, but it's really the wrong tool for the job.

We've thought that maybe the order in which things start up on initial boot or reboot might be an issue, perhaps a dependency condition, but some experimenting with that has also been fruitless.

It is frustrating in that it's not one or two hosts, it is all 351 and growing, but looking at these forums and the ticket system it doesn't seem to be a common problem for anyone else.

**jonxor** · 16-08-2016, 16:23

Ah, that is out of my expertise, I only use passive checks. The only thing I could think of besides that, is that perhaps on the proxy/server that the agents are reaching out to, there are too many TCP connections, or for some other reason, the agents can't reach the proxy/server on port 10051.
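
If you want to rule the reachability theory in or out right after boot, something like this could be run from the VPS. It is just a sketch using a plain TCP connect, with a made-up function name; it says nothing about whether the server would accept the agent's data, only whether port 10051 opens.

```python
import socket


def can_reach(host, port=10051, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Running `can_reach("mp003.mz01.groupee-inc.net")` early in the boot sequence and again after the manual restart would show whether the network or DNS is simply not ready when the agent first starts.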

Two checks not surviving a reboot.
